Data Ingestion

Azure-based ingestion

Objective

This lab focuses on data ingestion. By the end of it, you will be able to:

  • [Part A] Ingest real-world data into Azure

Scope

This section intentionally stops before processing and model training. The processed output will later be used for:

  • Supervised learning (Linear Regression)
  • Unsupervised learning (K-Means clustering)

What you need:

Cost Management

Always stop your Azure ML compute instance when it is not in use, so that it does not keep consuming Azure credits.

Part A - Getting data into Azure (Ingestion)

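Below are three ingestion pathways:

  • Path 1: Use Azure Open Datasets (no manual file downloads)
  • Path 2: Manually download from TLC and upload to Blob Storage (to understand ingestion from an external source)
  • Path 3: Upload to Blob Storage provisioned with Terraform (automated infrastructure, manual upload)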

Path 1 - Ingestion using Azure Open Datasets

Recommended Approach

This is the easiest way to get started. Azure Open Datasets provides pre-hosted NYC taxi data without manual downloads.


# Verify the environment: any Python version from 3.8 to 3.12 should work
import sys

print(sys.executable)
print(sys.version)

!{sys.executable} -m pip install -q --upgrade pip
!{sys.executable} -m pip install -q azureml-opendatasets

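# Restart the kernel so the newly installed packages are picked up, then continue below.

# pandas provides the DataFrame; pyarrow handles Parquet files.
# %pip installs into the environment of the running kernel.
%pip install -q azureml-opendatasets pandas pyarrow

from azureml.opendatasets import NycTlcYellow
print("OK: NycTlcYellow imported")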

from dateutil import parser

start_date = parser.parse('2018-05-01')
end_date = parser.parse('2018-06-01')

nyc_tlc = NycTlcYellow(start_date=start_date, end_date=end_date)
nyc_tlc_df = nyc_tlc.to_pandas_dataframe()

nyc_tlc_df.head()

Date Range Selection

We use a limited date range (1-2 months) to keep compute and storage costs manageable for learning purposes.
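A quick sanity check can confirm that the loaded rows actually fall inside the window you asked for. The sketch below is a minimal, standard-library-only version. The helper name count_outside_window is hypothetical, and the pickup column name tpepPickupDateTime is an assumption about the Open Datasets schema, so confirm it against nyc_tlc_df.columns first.

```python
from datetime import datetime
from typing import Iterable


def count_outside_window(timestamps: Iterable[datetime], start: datetime, end: datetime) -> int:
    """Count timestamps earlier than start or later than end."""
    return sum(1 for ts in timestamps if not (start <= ts <= end))


start_date = datetime(2018, 5, 1)
end_date = datetime(2018, 6, 1)

# Stand-in values; in the lab, pass nyc_tlc_df["tpepPickupDateTime"] instead
# (column name assumed, see the note above).
sample_pickups = [
    datetime(2018, 5, 1, 0, 13, 56),
    datetime(2018, 5, 31, 23, 59, 10),
    datetime(2018, 7, 4, 12, 0, 0),  # deliberately outside the window
]
outside = count_outside_window(sample_pickups, start_date, end_date)
print(f"{outside} of {len(sample_pickups)} rows fall outside the requested window")
```

Looping in Python is slow on millions of rows. On the real DataFrame, a vectorized comparison such as pandas Series.between is likely the better choice.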

Path 2 – Ingestion using Azure Blob Storage (manual upload)

This ingestion path demonstrates how data typically enters Azure from an external source, such as a public data portal or a partner system. Unlike Azure Open Datasets, this approach requires explicit storage management, which is common in real production pipelines.

When to Use This Path

Use this approach when:

  • Data comes from external partners or systems
  • You need full control over storage configuration
  • You want to understand the complete ingestion workflow

Download data from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

File name: yellow_tripdata_2025-07.parquet
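Before uploading, it can help to confirm that the download completed and is a Parquet file rather than, say, a saved error page. Parquet files start and end with the four bytes PAR1, so a cheap check is possible without reading the data. The helper name looks_like_parquet is hypothetical, and this is only a sketch: it does not validate the file contents.

```python
import os
import tempfile
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # a Parquet file starts and ends with these four bytes


def looks_like_parquet(path: Path) -> bool:
    """Return True if the file starts and ends with the Parquet magic bytes."""
    if path.stat().st_size < 2 * len(PARQUET_MAGIC):
        return False
    with path.open("rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


# Demo on two throwaway files with placeholder payloads (not real Parquet data).
# In the lab, call looks_like_parquet(Path("yellow_tripdata_2025-07.parquet")).
with tempfile.TemporaryDirectory() as tmp:
    complete = Path(tmp) / "complete.parquet"
    complete.write_bytes(PARQUET_MAGIC + b"placeholder payload" + PARQUET_MAGIC)
    truncated = Path(tmp) / "truncated.parquet"
    truncated.write_bytes(PARQUET_MAGIC + b"placeholder payload")
    complete_ok = looks_like_parquet(complete)
    truncated_ok = looks_like_parquet(truncated)

print(complete_ok, truncated_ok)  # prints: True False
```

A truncated download usually loses the trailing marker, which is why the check looks at both ends of the file.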

Search for Storage Account in Azure

In the Azure Portal:

  • Search for Storage accounts
  • Click Create
  • Choose:
    • Subscription: Azure for Students
    • Resource group: create a new one (example: rg-mlops-lab)
    • Storage account name: a globally unique name of 3-24 lowercase letters and digits (example: mlopslabstorage)

Click Review + Create → Create

Create a Blob container

Once the storage account is created:

  • Open the Storage Account
  • Go to Containers
  • Click + Container
  • Name: nyc-tlc-blob
  • Public access level: Private

This container will hold raw data.
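The portal rejects names that break Azure's naming rules, so it can save a round trip to check them first. The sketch below encodes the documented rules: storage account names are 3-24 lowercase letters and digits, and container names are 3-63 characters of lowercase letters, digits, and single hyphens, starting and ending with a letter or digit. The function names are hypothetical, and special containers such as $root are deliberately not handled.

```python
import re

# Storage account: 3-24 characters, lowercase letters and digits only.
_ACCOUNT_RE = re.compile(r"[a-z0-9]{3,24}")
# Container: alphanumeric at both ends, with no consecutive hyphens in between.
_CONTAINER_RE = re.compile(r"[a-z0-9](?:-?[a-z0-9])*")


def is_valid_storage_account_name(name: str) -> bool:
    """Check a name against the storage account naming rules."""
    return _ACCOUNT_RE.fullmatch(name) is not None


def is_valid_container_name(name: str) -> bool:
    """Check a name against the blob container naming rules (length 3-63)."""
    return 3 <= len(name) <= 63 and _CONTAINER_RE.fullmatch(name) is not None


print(is_valid_storage_account_name("mlopslabstorage"))  # True
print(is_valid_storage_account_name("rg-mlops-lab"))     # False: hyphens are not allowed
print(is_valid_container_name("nyc-tlc-blob"))           # True
```

Passing this check does not guarantee the storage account name is free: account names are global, so someone else may already hold it.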

Upload the raw Parquet file

  • Open the nyc-tlc-blob container
  • Click Upload
  • Select yellow_tripdata_2025-07.parquet
  • Click Upload

Security Note

Always set container access to Private. Never expose raw data publicly unless absolutely necessary.

Read data from Blob Storage

Once uploaded, you can read the file directly for processing in Azure ML Studio.

!pip -q install azure-storage-blob pandas pyarrow

from azure.storage.blob import BlobServiceClient
from pathlib import Path
import pandas as pd

# Paste the connection string from the Azure Portal:
# Storage account -> Security + networking -> Access keys
CONNECTION_STRING = "Enter connection string from Security + networking -> Access keys"

# Blob details
CONTAINER_NAME = "nyc-tlc-blob"
BLOB_NAME = "yellow_tripdata_2025-07.parquet"

# Local destination inside notebook VM
LOCAL_DIR = Path("data_raw")
LOCAL_DIR.mkdir(exist_ok=True)

LOCAL_PATH = LOCAL_DIR / BLOB_NAME

blob_service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
blob_client = blob_service.get_blob_client(
    container=CONTAINER_NAME,
    blob=BLOB_NAME
)

with open(LOCAL_PATH, "wb") as f:
    f.write(blob_client.download_blob().readall())

print("Downloaded to:", LOCAL_PATH)

df = pd.read_parquet(LOCAL_PATH)

df.shape

df.head()

Production Best Practice

In production, use Managed Identity instead of connection strings. Connection strings should never be committed to code or notebooks.
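Two ways to act on this advice are sketched below, under stated assumptions. The first helper reads the connection string from an environment variable instead of a notebook cell. The variable name AZURE_STORAGE_CONNECTION_STRING is a common convention, not something this lab defines. The second builds a client from an identity-based credential. It assumes the azure-identity package is installed, that the account lives in the public Azure cloud, and that the identity has been granted a data role such as Storage Blob Data Reader on the account. All function names are hypothetical.

```python
import os


def load_connection_string(var_name: str = "AZURE_STORAGE_CONNECTION_STRING") -> str:
    """Read the connection string from the environment and fail loudly if it is missing."""
    value = os.environ.get(var_name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {var_name} environment variable before running this cell.")
    return value


def blob_account_url(account_name: str) -> str:
    """Build the blob endpoint URL (public Azure cloud assumed)."""
    return f"https://{account_name}.blob.core.windows.net"


def make_identity_client(account_name: str):
    """Create a BlobServiceClient that authenticates without a stored secret."""
    # Imported lazily so the helpers above work without the Azure SDK installed.
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobServiceClient

    return BlobServiceClient(
        account_url=blob_account_url(account_name),
        credential=DefaultAzureCredential(),
    )


print(blob_account_url("mlopslabstorage"))
```

DefaultAzureCredential tries several sources in turn, including a managed identity when one is attached to the compute, so the same code can run on a compute instance and on a developer machine signed in through the Azure CLI.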

Path 3 – Ingestion using Azure Blob Storage (Terraform-Automated Infrastructure)

This path demonstrates a production-grade approach where infrastructure is provisioned automatically using Infrastructure as Code (IaC), while data upload remains a manual step.

When to Use This Path

Use this approach when:

  • You need reproducible infrastructure across environments
  • You want to automate resource provisioning
  • You’re following DevOps best practices
  • You need consistent configuration management

Infrastructure Setup (Automated)

The Azure Storage infrastructure is provisioned automatically using Terraform.

During deployment, the following resources are created:

  • Azure Storage Account (taxizonelookup)
  • Blob Container (taxidata)
  • Private access configuration for secure data storage

This ensures a consistent, reproducible environment across deployments.

Data Upload (Manual Step)

While infrastructure is automated, the dataset itself must be uploaded manually.

Required Format:

  • File type: .parquet
  • Uploaded to: taxizonelookup → taxidata container

Steps to Upload:

  1. Go to Azure Portal
  2. Open Storage Accounts
  3. Select storage account name: taxizonelookup
  4. Navigate to Containers
  5. Open container name: taxidata
  6. Click Upload
  7. Select your .parquet file
  8. Confirm upload

Why Manual Upload?

The dataset is approximately 180 MB in size, which is too large to store in version control. Manual upload:

  • Keeps large datasets out of GitHub
  • Maintains clean DevOps practices
  • Separates infrastructure from data
  • Ensures secure handling of production datasets
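To make the size argument concrete: GitHub rejects individual files larger than 100 MiB, so a file of roughly 180 MB cannot be pushed to a regular repository at all. The small sketch below compares a size in bytes against that limit. The function name is hypothetical, and the 180 MB figure is the approximate size quoted above.

```python
GITHUB_FILE_LIMIT_BYTES = 100 * 1024 * 1024  # GitHub rejects files larger than 100 MiB


def exceeds_github_limit(size_bytes: int, limit_bytes: int = GITHUB_FILE_LIMIT_BYTES) -> bool:
    """Return True if a file of this size is too large to push to GitHub."""
    return size_bytes > limit_bytes


dataset_bytes = 180 * 1000 * 1000  # "approximately 180 MB", as stated above
print(exceeds_github_limit(dataset_bytes))  # True
```

For a real file, pass Path("your_file.parquet").stat().st_size from pathlib as size_bytes.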

Pipeline Flow

Terraform → Creates Storage Account + Container
User → Uploads .parquet file manually
Azure ML → Reads data via Datastore
Notebook → Performs EDA/Training

Infrastructure vs Data

Infrastructure is code (automated). Data is content (manual). This separation is a best practice in production MLOps workflows.
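The "Azure ML reads data via Datastore" step in the flow can be addressed with a datastore URI. The sketch below only assembles that URI in its long form. Every value passed in is a placeholder, and the datastore name taxidata_store is hypothetical: it assumes a datastore has already been registered in the workspace and points at the taxidata container.

```python
def datastore_uri(subscription_id: str, resource_group: str, workspace: str,
                  datastore: str, path: str) -> str:
    """Assemble a long-form azureml:// URI for a file inside a registered datastore."""
    return (
        f"azureml://subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}"
        f"/workspaces/{workspace}"
        f"/datastores/{datastore}"
        f"/paths/{path.lstrip('/')}"
    )


# All arguments below are placeholders to replace with your own values.
uri = datastore_uri(
    subscription_id="<subscription-id>",
    resource_group="<resource-group>",
    workspace="<workspace-name>",
    datastore="taxidata_store",  # hypothetical datastore name
    path="yellow_tripdata_2025-07.parquet",
)
print(uri)
```

With the azureml-fsspec package installed in the notebook environment, pandas should be able to open such a URI directly, for example with pd.read_parquet(uri). Treat that as an assumption to verify against the Azure ML documentation for your SDK version.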
