Data Ingestion

Azure-based ingestion

Objective

In this lab, we focus on data ingestion. By the end of this lab, you will be able to:

[Part A] Ingest real-world data into Azure

Scope

This section intentionally stops before processing and model training. The processed output will be used later for:

Supervised learning (Linear Regression)
Unsupervised learning (K-Means clustering)

What you need:

An Azure student account

Cost Management

Always stop your Azure ML compute instance when it is not in use so that it does not consume Azure credits.

Part A - Getting data into Azure (Ingestion)

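Below are three ingestion paths:

Path 1: Use Azure Open Datasets (no manual file downloads)
Path 2: Download manually from TLC and upload to Blob Storage (to understand how external data is ingested)
Path 3: Upload to Blob Storage provisioned with Terraform (automated infrastructure, manual upload)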

Path 1 - Ingestion using Azure Open Datasets

Recommended Approach

This is the easiest way to get started. Azure Open Datasets provides pre-hosted NYC taxi data without manual downloads.

TODO: Add screenshots showing how to create a compute instance, start it, and open Azure ML Studio.

# Verify the environment: any Python version from 3.8 through 3.12 should work
import sys

print(sys.executable)
print(sys.version)

# Install into the interpreter that backs this kernel
# (pyarrow is needed to read Parquet files)
!{sys.executable} -m pip install -q --upgrade pip
!{sys.executable} -m pip install -q azureml-opendatasets pandas pyarrow

# Restart the kernel so the newly installed packages are picked up

from azureml.opendatasets import NycTlcYellow
print("OK: NycTlcYellow imported")

from dateutil import parser

start_date = parser.parse('2018-05-01')
end_date = parser.parse('2018-06-01')

nyc_tlc = NycTlcYellow(start_date=start_date, end_date=end_date)
nyc_tlc_df = nyc_tlc.to_pandas_dataframe()

nyc_tlc_df.head()

Date Range Selection

We use a limited date range (1-2 months) to keep compute and storage costs manageable for learning purposes.
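A small guard can make the date-range limit explicit before any data is requested. The sketch below is not part of `azureml-opendatasets`; the helper name `check_date_range` and the 62-day cap are hypothetical choices that mirror the 1-2 month guidance above.

```python
from datetime import datetime

# Hypothetical cap: roughly two months, matching the 1-2 month guidance.
MAX_RANGE_DAYS = 62


def check_date_range(start: datetime, end: datetime, max_days: int = MAX_RANGE_DAYS) -> int:
    """Return the span in days, or raise ValueError if the range is empty or too long."""
    if end <= start:
        raise ValueError("end_date must be after start_date")
    span_days = (end - start).days
    if span_days > max_days:
        raise ValueError(f"Range of {span_days} days exceeds the {max_days}-day limit")
    return span_days


# Same range as the lab: May 2018
span = check_date_range(datetime(2018, 5, 1), datetime(2018, 6, 1))
print(f"Requesting {span} days of trips")
```

Calling the guard just before constructing `NycTlcYellow` would stop an accidental multi-year request before it uses compute time.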

Path 2 - Ingestion using Azure Blob Storage (manual upload)

This ingestion path demonstrates how data typically enters Azure from an external source, such as a public data portal or a partner system. Unlike Azure Open Datasets, this approach requires explicit storage management, which is common in real production pipelines.

When to Use This Path

Use this approach when:
- Data comes from external partners or systems
- You need full control over storage configuration
- You want to understand the complete ingestion workflow

Download data from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

File name: yellow_tripdata_2025-07.parquet
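If you later want a different month, it helps to build the file name in code rather than retyping it. The sketch below assumes the `yellow_tripdata_YYYY-MM.parquet` pattern, which is inferred from the single file name used in this lab; the helper names are hypothetical.

```python
import re
from typing import Tuple

# Pattern inferred from the one file name used in this lab (an assumption).
_PATTERN = re.compile(r"yellow_tripdata_(\d{4})-(\d{2})\.parquet")


def tlc_yellow_filename(year: int, month: int) -> str:
    """Build the monthly yellow-taxi file name for a given year and month."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be between 1 and 12, got {month}")
    return f"yellow_tripdata_{year:04d}-{month:02d}.parquet"


def parse_tlc_yellow_filename(name: str) -> Tuple[int, int]:
    """Return (year, month) from a file name, or raise ValueError if it does not match."""
    match = _PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"Unexpected file name: {name}")
    return int(match.group(1)), int(match.group(2))


print(tlc_yellow_filename(2025, 7))
```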

Create a Storage Account in Azure

In the Azure Portal:

Search for Storage accounts
Click Create
Choose:
- Subscription: Azure Student
- Resource group: create a new one (example: rg-mlops-lab)
- Storage account name: unique name (example: mlopslabstorage)

Click Review + Create → Create
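Azure rejects storage account names that are not 3 to 24 characters long or that contain anything other than lowercase letters and digits. A quick local check, sketched below with a hypothetical helper name, catches format errors before you reach the portal. It cannot check global uniqueness, which only Azure can confirm.

```python
import re

# Azure storage account names: 3-24 characters, lowercase letters and digits only.
_NAME_RE = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name: str) -> bool:
    """Check the name format only; uniqueness must still be confirmed by Azure."""
    return _NAME_RE.fullmatch(name) is not None


for candidate in ["mlopslabstorage", "rg-mlops-lab", "MLOpsLab"]:
    print(candidate, is_valid_storage_account_name(candidate))
```

The resource group example (`rg-mlops-lab`) fails this check because hyphens are allowed in resource group names but not in storage account names.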

Create a Blob container

Once the storage account is created:

Open the Storage Account
Go to Containers
Click + Container
Name: nyc-tlc-blob
Public access level: Private

This container will hold raw data.

Upload the raw Parquet file

Open the nyc-tlc-blob container
Click Upload
Select yellow_tripdata_2025-07.parquet
Click Upload

Security Note

Always set container access to Private. Never expose raw data publicly unless absolutely necessary.

Read data from Blob Storage

Once the file is uploaded, you can download it to your compute instance from a notebook in Azure ML Studio and load it with pandas.

%pip install -q azure-storage-blob pandas pyarrow

from azure.storage.blob import BlobServiceClient
from pathlib import Path
import pandas as pd

# Paste the connection string from the Azure Portal:
# Storage account -> Security + networking -> Access keys
# Do not commit the real value to version control
CONNECTION_STRING = "<your-connection-string>"

# Blob details
CONTAINER_NAME = "nyc-tlc-blob"
BLOB_NAME = "yellow_tripdata_2025-07.parquet"

# Local destination on the compute instance
LOCAL_DIR = Path("data_raw")
LOCAL_DIR.mkdir(parents=True, exist_ok=True)

LOCAL_PATH = LOCAL_DIR / BLOB_NAME

blob_service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
blob_client = blob_service.get_blob_client(
    container=CONTAINER_NAME,
    blob=BLOB_NAME
)

# Stream the blob to disk instead of holding the whole file in memory
with open(LOCAL_PATH, "wb") as f:
    blob_client.download_blob().readinto(f)

print("Downloaded to:", LOCAL_PATH)

df = pd.read_parquet(LOCAL_PATH)

df.shape

df.head()

Production Best Practice

In production, use Managed Identity instead of connection strings. Connection strings should never be committed to code or notebooks.
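One way to follow this advice is sketched below. It reads the connection string from an environment variable rather than from the notebook, and shows a keyless alternative built on `DefaultAzureCredential`. The variable name `AZURE_STORAGE_CONNECTION_STRING` is the one Azure tooling conventionally uses, but treat it as an assumption for your setup. The helper names are hypothetical, and the identity-based client only works if the compute's identity has been granted a suitable role (for example Storage Blob Data Reader) on the storage account.

```python
import os
from typing import Mapping

# Conventional variable name; an assumption for your environment.
ENV_VAR = "AZURE_STORAGE_CONNECTION_STRING"


def get_connection_string(env: Mapping[str, str] = os.environ) -> str:
    """Read the connection string from the environment instead of the notebook."""
    value = env.get(ENV_VAR, "").strip()
    if not value:
        raise RuntimeError(f"Set {ENV_VAR} before running this notebook")
    return value


def account_name_from(connection_string: str) -> str:
    """Extract AccountName so it can be logged without exposing the key."""
    parts = dict(
        segment.split("=", 1)
        for segment in connection_string.split(";")
        if "=" in segment
    )
    return parts["AccountName"]


def blob_service_with_identity(account_name: str):
    """Keyless client; requires `pip install azure-identity azure-storage-blob`."""
    # Imported lazily so the helpers above work without the Azure SDK installed.
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobServiceClient

    account_url = f"https://{account_name}.blob.core.windows.net"
    return BlobServiceClient(account_url=account_url, credential=DefaultAzureCredential())
```

With the first approach the notebook line becomes `CONNECTION_STRING = get_connection_string()`. With the second, no secret is stored at all.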

Path 3 - Ingestion using Azure Blob Storage (Terraform-Automated Infrastructure)

This path demonstrates a production-grade approach where infrastructure is provisioned automatically using Infrastructure as Code (IaC), while data upload remains a manual step.

When to Use This Path

Use this approach when:
- You need reproducible infrastructure across environments
- You want to automate resource provisioning
- You’re following DevOps best practices
- You need consistent configuration management

Infrastructure Setup (Automated)

The Azure Storage infrastructure is provisioned automatically using Terraform.

During deployment, the following resources are created:

Azure Storage Account (taxizonelookup)
Blob Container (taxidata)
Private access configuration for secure data storage

This ensures a consistent, reproducible environment across deployments.

Data Upload (Manual Step)

While infrastructure is automated, the dataset itself must be uploaded manually.

Required Format:

File type: .parquet
Uploaded to: taxizonelookup → taxidata container
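Before uploading, a quick local check can confirm that the file really is Parquet and report its size. The sketch below relies on the fact that unencrypted Parquet files start and end with the 4-byte marker `PAR1`. The helper names and the example path are hypothetical.

```python
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # marker at the start and end of an unencrypted Parquet file


def looks_like_parquet(path: Path) -> bool:
    """Return True if the file has a .parquet suffix and the Parquet magic bytes."""
    path = Path(path)
    if path.suffix.lower() != ".parquet" or not path.is_file():
        return False
    # Smallest possible layout: magic (4) + footer length (4) + magic (4)
    if path.stat().st_size < 12:
        return False
    with path.open("rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # 2 = seek relative to the end of the file
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def size_mb(path: Path) -> float:
    """File size in megabytes (1 MB = 1,000,000 bytes)."""
    return Path(path).stat().st_size / 1_000_000


candidate = Path("your_file.parquet")  # hypothetical path; replace with your file
if looks_like_parquet(candidate):
    print(f"{candidate} looks like Parquet, {size_mb(candidate):.1f} MB")
else:
    print(f"{candidate} is missing or is not a Parquet file")
```

This only inspects the marker bytes; it does not validate the schema or the contents.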

Steps to Upload:

Go to Azure Portal
Open Storage Accounts
Select storage account name: taxizonelookup
Navigate to Containers
Open container name: taxidata
Click Upload
Select your .parquet file
Confirm upload

Why Manual Upload?

The dataset is approximately 180 MB, which is too large to keep in version control (GitHub, for example, blocks files larger than 100 MB). Manual upload:

Keeps large datasets out of GitHub
Maintains clean DevOps practices
Separates infrastructure from data
Ensures secure handling of production datasets

Pipeline Flow

Terraform → Creates Storage Account + Container
User → Uploads .parquet file manually
Azure ML → Reads data via Datastore
Notebook → Performs EDA/Training
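The "Azure ML reads data via Datastore" step can be sketched as building a datastore URI and handing it to pandas. The helper name and all values below are placeholders: the lab does not name the workspace or the datastore, so `taxidata_store` and `rg-mlops-lab` are hypothetical here. Reading the URI with pandas assumes the `azureml-fsspec` package is installed in the notebook environment.

```python
def datastore_uri(subscription_id: str, resource_group: str, workspace: str,
                  datastore: str, path: str) -> str:
    """Build the long-form azureml:// URI for a file in a registered datastore."""
    return (
        f"azureml://subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}"
        f"/workspaces/{workspace}"
        f"/datastores/{datastore}"
        f"/paths/{path.lstrip('/')}"
    )


uri = datastore_uri(
    subscription_id="<subscription-id>",   # placeholder
    resource_group="rg-mlops-lab",         # hypothetical; use your resource group
    workspace="<workspace-name>",          # placeholder
    datastore="taxidata_store",            # hypothetical datastore name
    path="your_file.parquet",              # the file uploaded to the taxidata container
)
print(uri)

# In a notebook with azureml-fsspec installed (an assumption), pandas can read the URI:
# import pandas as pd
# df = pd.read_parquet(uri)
```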

Infrastructure vs Data

Infrastructure is code (automated). Data is content (manual). This separation is a best practice in production MLOps workflows.