Data Ingestion
Azure-based ingestion
Objective
In this lab, we focus on data ingestion. By the end of this lab, you will be able to:
- [Part A] Ingest real-world data into Azure
This section intentionally stops before processing & model training. The processed output will later be used for:
- Supervised learning (Linear Regression)
- Unsupervised learning (K-Means clustering)
What you need:
Always stop your Azure ML compute instance when it is not in use to avoid consuming Azure credits.
Part A - Getting data into Azure (Ingestion)
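Below are three ingestion pathways:
- Path 1: Use Azure Open Datasets (no manual file downloads)
- Path 2: Manual download from TLC + upload to Blob (if you want to understand ingestion from an external source)
- Path 3: Terraform-provisioned Blob Storage with a manual upload (if you want reproducible infrastructure)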
Path 1 - Ingestion using Azure Open Datasets
This is the easiest way to get started. Azure Open Datasets provides pre-hosted NYC taxi data without manual downloads.
# Verify the environment: Python 3.8 through 3.12 should work
import sys
print(sys.executable)
print(sys.version)
# Install into the kernel's own interpreter (pyarrow handles Parquet files)
!{sys.executable} -m pip install -q --upgrade pip
!{sys.executable} -m pip install -q azureml-opendatasets pandas pyarrow
# Restart the kernel, then confirm the import works
from azureml.opendatasets import NycTlcYellow
print("OK: NycTlcYellow imported")
from dateutil import parser
start_date = parser.parse('2018-05-01')
end_date = parser.parse('2018-06-01')
nyc_tlc = NycTlcYellow(start_date=start_date, end_date=end_date)
nyc_tlc_df = nyc_tlc.to_pandas_dataframe()
nyc_tlc_df.head()
We use a limited date range (1-2 months) to keep compute and storage costs manageable for learning purposes.
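To try other months while keeping the range narrow, the window can be computed rather than typed by hand. The helper below is a minimal sketch using only the standard library; `month_window` is a hypothetical name introduced here, not part of azureml-opendatasets.

```python
from datetime import datetime


def month_window(year, month, n_months=1):
    """Return (start, end) datetimes spanning n_months from the first of year-month."""
    if not 1 <= month <= 12 or n_months < 1:
        raise ValueError("month must be 1-12 and n_months at least 1")
    start = datetime(year, month, 1)
    # Advance by n_months, rolling over into later years when needed.
    total = (month - 1) + n_months
    end = datetime(year + total // 12, total % 12 + 1, 1)
    return start, end


# Same window as the cell above: 2018-05-01 to 2018-06-01
start_date, end_date = month_window(2018, 5)
print(start_date, end_date)
```

The returned values can be passed as `start_date` and `end_date` to `NycTlcYellow` in place of the parsed strings.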
Path 2 - Ingestion using Azure Blob Storage (manual upload)
This ingestion path demonstrates how data typically enters Azure from an external source, such as a public data portal or a partner system. Unlike Azure Open Datasets, this approach requires explicit storage management, which is common in real production pipelines.
Use this approach when:
- Data comes from external partners or systems
- You need full control over storage configuration
- You want to understand the complete ingestion workflow
Download data from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
File name: yellow_tripdata_2025-07.parquet
Search for Storage Account in Azure
In the Azure Portal:
- Search for Storage accounts
- Click Create
- Choose:
  - Subscription: Azure Student
  - Resource group: create a new one (example: rg-mlops-lab)
  - Storage account name: unique name (example: mlopslabstorage)
Click Review + Create → Create
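Storage account names must be 3 to 24 characters long and contain only lowercase letters and digits, so a name can be checked locally before it is typed into the portal. The sketch below covers only that format rule; `is_valid_storage_account_name` is a hypothetical helper, and it cannot tell whether the name is already taken, which Azure checks at creation time.

```python
import re

# 3-24 characters, lowercase letters and digits only
_ACCOUNT_NAME = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name):
    """Check the format rule only; global uniqueness is verified by Azure."""
    return _ACCOUNT_NAME.fullmatch(name) is not None


for candidate in ["mlopslabstorage", "rg-mlops-lab", "MLOpsLab"]:
    print(candidate, is_valid_storage_account_name(candidate))
```

Resource group names are less strict, which is why `rg-mlops-lab` works as a resource group but would be rejected as a storage account name.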
Create a Blob container
Once the storage account is created:
- Open the Storage Account
- Go to Containers
- Click + Container
- Name: nyc-tlc-blob
- Public access level: Private
This container will hold raw data.
Upload the raw Parquet file
- Open the nyc-tlc-blob container
- Click Upload
- Select yellow_tripdata_2025-07.parquet
- Click Upload
Always set container access to Private. Never expose raw data publicly unless absolutely necessary.
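The portal upload can also be done from code. The following is a sketch, assuming the azure-storage-blob package and a `ContainerClient` for the container created above; `upload_file` is a hypothetical helper name, and the commented usage needs a real connection string.

```python
from pathlib import Path


def upload_file(container_client, local_path, blob_name=None, overwrite=False):
    """Upload a local file to a container and return the blob name used."""
    local_path = Path(local_path)
    blob_name = blob_name or local_path.name
    with open(local_path, "rb") as data:
        # ContainerClient.upload_blob takes the blob name, a data stream,
        # and an overwrite flag (False fails if the blob already exists).
        container_client.upload_blob(name=blob_name, data=data, overwrite=overwrite)
    return blob_name


# Usage (not run here; requires azure-storage-blob and valid credentials):
# from azure.storage.blob import BlobServiceClient
# service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
# container = service.get_container_client("nyc-tlc-blob")
# upload_file(container, "yellow_tripdata_2025-07.parquet")
```

Leaving `overwrite` off by default means an accidental second upload fails instead of silently replacing the raw file.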
Read data from Blob Storage
Once uploaded, you can read the file directly for processing in Azure ML Studio.
!pip -q install azure-storage-blob pandas pyarrow
from azure.storage.blob import BlobServiceClient
from pathlib import Path
import pandas as pd
# copy paste from Azure Portal
CONNECTION_STRING = "Enter connection string from Security + networking -> Access keys"
# Blob details
CONTAINER_NAME = "nyc-tlc-blob"
BLOB_NAME = "yellow_tripdata_2025-07.parquet"
# Local destination inside notebook VM
LOCAL_DIR = Path("data_raw")
LOCAL_DIR.mkdir(exist_ok=True)
LOCAL_PATH = LOCAL_DIR / "yellow_tripdata_2025-07.parquet"
# Connect to the storage account and point at the uploaded blob
blob_service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
blob_client = blob_service.get_blob_client(
    container=CONTAINER_NAME,
    blob=BLOB_NAME
)
# Download the blob to the local file
with open(LOCAL_PATH, "wb") as f:
    f.write(blob_client.download_blob().readall())
print("Downloaded to:", LOCAL_PATH)
# Load the Parquet file and inspect it
df = pd.read_parquet(LOCAL_PATH)
print(df.shape)
df.head()
In production, use Managed Identity instead of connection strings. Connection strings should never be committed to code or notebooks.
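One way to keep the secret out of the notebook is to read it from an environment variable, and to switch to identity-based access where the compute has a managed identity. The sketch below assumes the variable name `AZURE_STORAGE_CONNECTION_STRING` and the azure-identity package for the identity path; the three function names are hypothetical.

```python
import os


def connection_string_from_env(var_name="AZURE_STORAGE_CONNECTION_STRING"):
    """Read the connection string from the environment instead of the notebook."""
    value = os.environ.get(var_name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return value


def blob_account_url(account_name):
    """Build the public blob endpoint for a storage account."""
    return f"https://{account_name}.blob.core.windows.net"


def client_with_identity(account_name):
    """Create a client without any connection string (assumes azure-identity)."""
    # Imported lazily so the helpers above work without the Azure SDK installed.
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobServiceClient

    return BlobServiceClient(
        account_url=blob_account_url(account_name),
        credential=DefaultAzureCredential(),
    )
```

With the identity path, the account running the notebook also needs a data-plane role on the storage account, such as Storage Blob Data Reader.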
Path 3 - Ingestion using Azure Blob Storage (Terraform-Automated Infrastructure)
This path demonstrates a production-grade approach where infrastructure is provisioned automatically using Infrastructure as Code (IaC), while data upload remains a manual step.
Use this approach when:
- You need reproducible infrastructure across environments
- You want to automate resource provisioning
- You're following DevOps best practices
- You need consistent configuration management
Infrastructure Setup (Automated)
The Azure Storage infrastructure is provisioned automatically using Terraform.
During deployment, the following resources are created:
- Azure Storage Account (taxizonelookup)
- Blob Container (taxidata)
- Private access configuration for secure data storage
This ensures a consistent, reproducible environment across deployments.
Data Upload (Manual Step)
While infrastructure is automated, the dataset itself must be uploaded manually.
Required Format:
- File type: .parquet
- Uploaded to: taxizonelookup → taxidata container
Steps to Upload:
- Go to Azure Portal
- Open Storage Accounts
- Select storage account name: taxizonelookup
- Navigate to Containers
- Open container name: taxidata
- Click Upload
- Select your .parquet file
- Confirm upload
The dataset is approximately 180 MB in size, which is too large to store in version control. Manual upload:
- Keeps large datasets out of GitHub
- Maintains clean DevOps practices
- Separates infrastructure from data
- Ensures secure handling of production datasets
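The size concern can be checked before a commit. GitHub blocks individual files larger than 100 MiB, so a file of roughly 180 MB would be rejected on push. The snippet below is a small sketch of such a check; `too_large_for_git` is a hypothetical helper name.

```python
from pathlib import Path

# GitHub rejects pushes containing a single file above 100 MiB
GITHUB_FILE_LIMIT_BYTES = 100 * 1024 * 1024


def too_large_for_git(path, limit_bytes=GITHUB_FILE_LIMIT_BYTES):
    """Return True if the file at path exceeds the given size limit."""
    return Path(path).stat().st_size > limit_bytes


# Usage (hypothetical local file name):
# if too_large_for_git("your_dataset.parquet"):
#     print("Upload this file to Blob Storage instead of committing it")
```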
Pipeline Flow
Terraform → Creates Storage Account + Container
User → Uploads .parquet file manually
Azure ML → Reads data via Datastore
Notebook → Performs EDA/Training
Infrastructure is code (automated). Data is content (manual). This separation is a best practice in production MLOps workflows.
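For the "Azure ML reads data via Datastore" step, files in a registered datastore can be addressed with an `azureml://` URI. The helper below only assembles that URI and is a sketch under assumptions: every argument value in the usage comment is a placeholder, the datastore must already be registered against the taxidata container, and reading the URI with pandas assumes the azureml-fsspec package is installed.

```python
def datastore_uri(subscription_id, resource_group, workspace, datastore, path):
    """Assemble the long-form azureml:// URI for a file in a datastore."""
    return (
        f"azureml://subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}"
        f"/workspaces/{workspace}"
        f"/datastores/{datastore}"
        f"/paths/{path.lstrip('/')}"
    )


# Usage (placeholder values; requires pandas, pyarrow, and azureml-fsspec):
# import pandas as pd
# uri = datastore_uri("<subscription-id>", "<resource-group>", "<workspace>",
#                     "<datastore-name>", "<file>.parquet")
# df = pd.read_parquet(uri)
```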