Data Ingestion
Azure-based ingestion
Objective
In this lab, we focus on data ingestion. By the end of this lab, you will be able to:
- [Part A] Ingest real-world data into Azure
This section intentionally stops before processing & model training. The processed output will later be used for:
- Supervised learning (Linear Regression)
- Unsupervised learning (K-Means clustering)
What you need:
Always stop your Azure ML compute instance when not in use to avoid consuming Azure credits.
Part A - Getting data into Azure (Ingestion)
Below are two ingestion pathways:
- Path 1: Use Azure Open Datasets (no manual file downloads)
- Path 2: Manual download from TLC + upload to Blob Storage (if you want to understand ingesting from external sources)
Path 1 - Ingestion using Azure Open Datasets
This is the easiest way to get started. Azure Open Datasets provides pre-hosted NYC taxi data without manual downloads.
# verify environment -- anything between 3.8-3.12 is good
import sys
print(sys.executable)
print(sys.version)
!{sys.executable} -m pip install -q --upgrade pip
!{sys.executable} -m pip install -q azureml-opendatasets
# Restart the kernel after the install above, then re-run from here
!pip install -q azureml-opendatasets pandas pyarrow
from azureml.opendatasets import NycTlcYellow
print("OK: NycTlcYellow imported")
# pyarrow handles parquet files
from dateutil import parser
start_date = parser.parse('2018-05-01')
end_date = parser.parse('2018-06-01')
nyc_tlc = NycTlcYellow(start_date=start_date, end_date=end_date)
nyc_tlc_df = nyc_tlc.to_pandas_dataframe()
nyc_tlc_df.head()

We use a limited date range (one to two months) to keep compute and storage costs manageable for learning purposes.
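Before moving on, it helps to confirm that the pull actually matches the requested window. A minimal sanity-check sketch; the column name `tpepPickupDateTime` is assumed to be the pickup-timestamp column in the Open Datasets schema, so adjust it if your dataframe differs:

```python
import pandas as pd

def sanity_check(df: pd.DataFrame, date_col: str = "tpepPickupDateTime",
                 start: str = "2018-05-01", end: str = "2018-06-01"):
    """Report basic shape and verify all pickups fall inside the requested window."""
    stats = {
        "rows": len(df),
        "cols": df.shape[1],
        "min_pickup": df[date_col].min(),
        "max_pickup": df[date_col].max(),
    }
    in_window = bool(df[date_col].between(start, end).all())
    return stats, in_window

# Example: stats, ok = sanity_check(nyc_tlc_df)
```

If `ok` is False, re-check the `start_date`/`end_date` you passed to `NycTlcYellow` before continuing.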
Path 2 – Ingestion using Azure Blob Storage (manual upload)
This ingestion path demonstrates how data typically enters Azure from an external source, such as a public data portal or a partner system. Unlike Azure Open Datasets, this approach requires explicit storage management, which is common in real production pipelines.
Use this approach when:
- Data comes from external partners or systems
- You need full control over storage configuration
- You want to understand the complete ingestion workflow
Download data from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
File name: yellow_tripdata_2025-07.parquet
Create a Storage Account
In the Azure Portal:
- Search for Storage accounts
- Click Create
- Choose:
  - Subscription: Azure Student
  - Resource group: create a new one (example: rg-mlops-lab)
  - Storage account name: a globally unique name (example: mlopslabstorage)
- Click Review + Create → Create
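The portal will reject a storage account name that doesn't follow Azure's naming rules: 3-24 characters, lowercase letters and digits only, globally unique. A small convenience check you can run before clicking Create (this helper is not part of any Azure SDK):

```python
import re

def is_valid_storage_account_name(name: str) -> bool:
    """Azure storage account names: 3-24 chars, lowercase letters and digits only."""
    return bool(re.fullmatch(r"[a-z0-9]{3,24}", name))

print(is_valid_storage_account_name("mlopslabstorage"))  # lowercase, 15 chars: valid
print(is_valid_storage_account_name("rg-mlops-lab"))     # hyphens are not allowed
```

Note this only checks the format; global uniqueness is still verified by Azure when you create the account.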
Create a Blob container
Once the storage account is created:
- Open the Storage Account
- Go to Containers
- Click + Container
- Name: nyc-tlc-blob
- Public access level: Private
This container will hold raw data.
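The same container can also be created programmatically with the `azure-storage-blob` SDK, which is useful when you script the whole pipeline. A hedged sketch, where `blob_service` is a `BlobServiceClient` built from your connection string (as in the read cell later in this lab); `create_container` raises if the container already exists, which the sketch treats as success:

```python
def ensure_container(blob_service, name: str) -> str:
    """Create the container if it doesn't exist; SDK containers are private by default."""
    try:
        blob_service.create_container(name)
        return "created"
    except Exception:
        # Broad catch for sketch purposes; in practice this is ResourceExistsError
        return "exists"

# Example: ensure_container(blob_service, "nyc-tlc-blob")
```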
Upload the raw Parquet file
- Open the nyc-tlc-blob container
- Click Upload
- Select yellow_tripdata_2025-07.parquet
- Click Upload
Always set container access to Private. Never expose raw data publicly unless absolutely necessary.
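The portal upload can likewise be scripted. A minimal sketch using the `azure-storage-blob` client objects, where `blob_service` is again a `BlobServiceClient` built from your connection string; the helper name `upload_file` is ours, not an SDK function:

```python
from pathlib import Path

def upload_file(blob_service, container: str, local_path) -> str:
    """Upload a local file to the container, naming the blob after the file."""
    local_path = Path(local_path)
    client = blob_service.get_blob_client(container=container, blob=local_path.name)
    with open(local_path, "rb") as f:
        client.upload_blob(f, overwrite=True)  # overwrite lets you re-run the cell safely
    return local_path.name

# Example: upload_file(blob_service, "nyc-tlc-blob", "yellow_tripdata_2025-07.parquet")
```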
Read data from Blob Storage
Once uploaded, you can read the file directly for processing in Azure ML Studio.
!pip -q install azure-storage-blob pandas pyarrow
from azure.storage.blob import BlobServiceClient
from pathlib import Path
import pandas as pd
# Paste from the Azure Portal: storage account -> Security + networking -> Access keys
CONNECTION_STRING = "Enter connection string from Security + networking -> Access keys"
# Blob details
CONTAINER_NAME = "nyc-tlc-blob"
BLOB_NAME = "yellow_tripdata_2025-07.parquet"
# Local destination inside notebook VM
LOCAL_DIR = Path("data_raw")
LOCAL_DIR.mkdir(exist_ok=True)
LOCAL_PATH = LOCAL_DIR / "yellow_tripdata_2025-07.parquet"
blob_service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
blob_client = blob_service.get_blob_client(
container=CONTAINER_NAME,
blob=BLOB_NAME
)
with open(LOCAL_PATH, "wb") as f:
f.write(blob_client.download_blob().readall())
print("Downloaded to:", LOCAL_PATH)
df = pd.read_parquet(LOCAL_PATH)
df.shape
df.head()

In production, use Managed Identity instead of connection strings. Connection strings should never be committed to code or notebooks.
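As a sketch of the Managed Identity approach mentioned above: instead of a connection string, you authenticate with `DefaultAzureCredential` from the `azure-identity` package. This is a configuration fragment, not a runnable demo here; it assumes the compute's identity has been granted a role such as Storage Blob Data Reader on the account, and the account URL below is the hypothetical one from this lab's example name:

```python
# !pip install -q azure-identity azure-storage-blob
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Hypothetical endpoint -- replace with your storage account's blob URL
ACCOUNT_URL = "https://mlopslabstorage.blob.core.windows.net"

# DefaultAzureCredential picks up the compute instance's managed identity
# (or your local az login); no secret ever appears in the notebook
blob_service = BlobServiceClient(account_url=ACCOUNT_URL,
                                 credential=DefaultAzureCredential())
```

From here, `get_blob_client` and the download code above work unchanged.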