Data Ingestion

Azure-based ingestion

Objective

In this lab, we focus on data ingestion. By the end, you will be able to:

  • [Part A] Ingest real-world data into Azure
Note: Scope

This section intentionally stops before processing & model training. The processed output will later be used for:

  • Supervised learning (Linear Regression)
  • Unsupervised learning (K-Means clustering)

What you need:

Warning: Cost Management

Always stop your Azure ML compute instance when not in use to avoid Azure credits consumption.

Part A - Getting data into Azure (Ingestion)

Below are two ingestion pathways:

  • Path 1: Use Azure Open Datasets (no manual file downloads)
  • Path 2: Manual download from TLC + upload to Blob Storage (if you want to understand ingestion from external sources)

Path 1 - Ingestion using Azure Open Datasets

Tip: Recommended Approach

This is the easiest way to get started. Azure Open Datasets provides pre-hosted NYC taxi data without manual downloads.

TODO: Add screenshots showing how to create a compute instance, start it, and open Azure ML Studio.

# verify environment -- anything between 3.8-3.12 is good
import sys

print(sys.executable)
print(sys.version)

!{sys.executable} -m pip install -q --upgrade pip
!{sys.executable} -m pip install -q azureml-opendatasets

# Restart the kernel after the install above, then run this cell

!pip install -q azureml-opendatasets pandas pyarrow

from azureml.opendatasets import NycTlcYellow
print("OK: NycTlcYellow imported")

# pyarrow handles parquet files

from dateutil import parser

start_date = parser.parse('2018-05-01')
end_date = parser.parse('2018-06-01')

nyc_tlc = NycTlcYellow(start_date=start_date, end_date=end_date)
nyc_tlc_df = nyc_tlc.to_pandas_dataframe()

nyc_tlc_df.head()
Note: Date Range Selection

We use a limited date range (1-2 months) to keep compute and storage costs manageable for learning purposes.
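If the chosen window still yields more rows than you need for learning, you can filter and downsample the DataFrame further before saving it. Below is a minimal sketch on synthetic data, assuming the camelCase column names (`tpepPickupDateTime`, `fareAmount`) that Azure Open Datasets returns for yellow taxi trips:

```python
import pandas as pd

# Synthetic stand-in for nyc_tlc_df (the real data has many more columns)
df = pd.DataFrame({
    "tpepPickupDateTime": pd.to_datetime(
        ["2018-05-03 08:15", "2018-05-20 17:40", "2018-06-02 09:00"]
    ),
    "fareAmount": [12.5, 8.0, 15.0],
})

# Keep only pickups inside the requested window (end date exclusive)
mask = (df["tpepPickupDateTime"] >= "2018-05-01") & (
    df["tpepPickupDateTime"] < "2018-06-01"
)
may_df = df.loc[mask]
print(len(may_df))  # 2 of the 3 synthetic trips fall in May 2018

# Downsample to a fixed fraction to cap compute and storage costs
sample = may_df.sample(frac=0.5, random_state=42)
print(len(sample))  # 1
```

The fixed `random_state` keeps the sample reproducible across kernel restarts, which matters once the sample feeds the later regression and clustering parts.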

Path 2 - Ingestion using Azure Blob Storage (manual upload)

This ingestion path demonstrates how data typically enters Azure from an external source, such as a public data portal or a partner system. Unlike Azure Open Datasets, this approach requires explicit storage management, which is common in real production pipelines.

Tip: When to Use This Path

Use this approach when:

  • Data comes from external partners or systems
  • You need full control over storage configuration
  • You want to understand the complete ingestion workflow

Download data from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

File name: yellow_tripdata_2025-07.parquet

Search for Storage Account in Azure

In the Azure Portal:

  • Search for Storage accounts
  • Click Create
  • Choose:
    • Subscription: Azure Student
    • Resource group: create a new one (example: rg-mlops-lab)
    • Storage account name: unique name (example: mlopslabstorage)

Click Review + Create, then Create

Create a Blob container

Once the storage account is created:

  • Open the Storage Account
  • Go to Containers
  • Click + Container
  • Name: nyc-tlc-blob
  • Public access level: Private

This container will hold raw data.

Upload the raw Parquet file

  • Open the nyc-tlc-blob container
  • Click Upload
  • Select yellow_tripdata_2025-07.parquet
  • Click Upload
Warning: Security Note

Always set container access to Private. Never expose raw data publicly unless absolutely necessary.

Read data from Blob Storage

Once uploaded, you can read the file directly for processing in Azure ML Studio.

!pip -q install azure-storage-blob pandas pyarrow

from azure.storage.blob import BlobServiceClient
from pathlib import Path
import pandas as pd

# Paste the connection string from the Azure Portal:
# Storage Account -> Security + networking -> Access keys
CONNECTION_STRING = "Enter connection string from Security + networking -> Access keys"

# Blob details
CONTAINER_NAME = "nyc-tlc-blob"
BLOB_NAME = "yellow_tripdata_2025-07.parquet"

# Local destination inside notebook VM
LOCAL_DIR = Path("data_raw")
LOCAL_DIR.mkdir(exist_ok=True)

LOCAL_PATH = LOCAL_DIR / "yellow_tripdata_2025-07.parquet"

blob_service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
blob_client = blob_service.get_blob_client(
    container=CONTAINER_NAME,
    blob=BLOB_NAME
)

with open(LOCAL_PATH, "wb") as f:
    f.write(blob_client.download_blob().readall())

print("Downloaded to:", LOCAL_PATH)

df = pd.read_parquet(LOCAL_PATH)

df.shape

df.head()
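Before moving on to processing, it is worth running a few sanity checks on whatever the file loaded. A minimal sketch on a synthetic frame, assuming the snake_case column names (`passenger_count`, `fare_amount`) used in the TLC parquet files:

```python
import pandas as pd

# Synthetic stand-in for the loaded TLC data
df = pd.DataFrame({
    "passenger_count": [1.0, 2.0, None],
    "fare_amount": [10.0, 7.5, 52.0],
})

# Ingestion sanity checks: non-empty, expected columns, missing values
assert len(df) > 0, "file loaded but contains no rows"
assert {"passenger_count", "fare_amount"} <= set(df.columns)

null_counts = df.isna().sum()
print(null_counts["passenger_count"])  # 1 missing passenger count
print(df["fare_amount"].min(), df["fare_amount"].max())  # 7.5 52.0
```

Checks like these catch a truncated upload or a wrong blob name now, rather than during model training in the later parts of the lab.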
Important: Production Best Practice

In production, use Managed Identity instead of connection strings. Connection strings should never be committed to code or notebooks.
