Dataset choice and source

What dataset will we use?

NYC Taxi & Limousine Commission (TLC) Trip Record Data

We’ll use Yellow taxi trips from a few recent months (for example, 3 months) to keep compute and storage on Azure manageable. This is a real-world, large-scale dataset with timestamps, locations, distances, passenger counts, and fare components.
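As a rough sketch of how "a few recent months" could be pinned down, the helper below builds one file name per month. The `yellow_tripdata_YYYY-MM.parquet` naming pattern, the start month, and the function name `monthly_file_names` are assumptions made for illustration; confirm the actual file names on the TLC portal before downloading.

```python
from datetime import date


def monthly_file_names(start: date, n_months: int) -> list[str]:
    """Return one assumed file name per month, starting at `start`."""
    names = []
    year, month = start.year, start.month
    for _ in range(n_months):
        # Assumed naming pattern: yellow_tripdata_YYYY-MM.parquet
        names.append(f"yellow_tripdata_{year:04d}-{month:02d}.parquet")
        month += 1
        if month > 12:  # roll over into the next year
            month = 1
            year += 1
    return names


# Example: three consecutive months that cross a year boundary.
files = monthly_file_names(date(2023, 11, 1), 3)
print(files)
```

Keeping the month range in one place makes it easy to shrink or grow the sample if compute or storage becomes a constraint.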

This dataset represents actual taxi trips in New York City and contains rich information such as:

Trip start and end timestamps
Pickup and drop-off locations
Trip distance
Passenger count
Fare and payment details

Why this dataset?

Because of this rich information, the dataset is well-suited for both:

Supervised learning (predicting trip duration or fare)
Unsupervised learning (identifying taxi demand hotspots)

Official source:

NYC TLC Trip Record Data portal

Data dictionary for Yellow Taxi trip records

Understanding the columns:

The Yellow Taxi dataset contains many fields. For this lab, we focus on the most relevant columns needed for EDA, ETL, and modeling.

Time-related columns:

tpep_pickup_datetime: Timestamp when the passenger was picked up
tpep_dropoff_datetime: Timestamp when the passenger was dropped off

Location columns:

PULocationID: Taxi Zone ID where the trip started
DOLocationID: Taxi Zone ID where the trip ended
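The time and location columns above are enough to derive the two quantities this lab is built around: a trip-duration label for supervised learning and per-zone pickup counts as a first look at demand hotspots. The sketch below uses plain Python on three made-up rows so the logic is easy to follow. The sample values and the timestamp string format are assumptions, not real TLC records.

```python
from collections import Counter
from datetime import datetime

# Made-up sample rows for illustration only (not real TLC data).
trips = [
    {"tpep_pickup_datetime": "2024-01-05 08:00:00",
     "tpep_dropoff_datetime": "2024-01-05 08:15:30",
     "PULocationID": 161, "DOLocationID": 237},
    {"tpep_pickup_datetime": "2024-01-05 09:10:00",
     "tpep_dropoff_datetime": "2024-01-05 09:40:00",
     "PULocationID": 161, "DOLocationID": 132},
    {"tpep_pickup_datetime": "2024-01-05 23:50:00",
     "tpep_dropoff_datetime": "2024-01-06 00:05:00",
     "PULocationID": 237, "DOLocationID": 161},
]

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed string format


def duration_minutes(trip: dict) -> float:
    """Trip duration in minutes: drop-off time minus pickup time."""
    start = datetime.strptime(trip["tpep_pickup_datetime"], TIMESTAMP_FORMAT)
    end = datetime.strptime(trip["tpep_dropoff_datetime"], TIMESTAMP_FORMAT)
    return (end - start).total_seconds() / 60


# Supervised target: one duration per trip.
durations = [duration_minutes(t) for t in trips]

# Hotspot signal: number of pickups per Taxi Zone ID.
pickups_per_zone = Counter(t["PULocationID"] for t in trips)
busiest = pickups_per_zone.most_common(1)

print(durations)
print(busiest)
```

Subtracting full timestamps rather than clock times handles trips that cross midnight, as in the third sample row.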

Trip characteristics:

trip_distance: Distance traveled during the trip (in miles)
passenger_count: Number of passengers in the taxi

Fare-related columns:

fare_amount: Time-and-distance fare calculated by the meter (excludes taxes, tips, tolls, and surcharges)
total_amount: Total amount charged, including fare, taxes, tolls, tips, and surcharges
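To make the column focus concrete, the sketch below keeps only the eight columns listed above and drops rows that fail a few basic plausibility checks. The check thresholds, the helper names `keep_lab_columns` and `is_plausible`, and the sample rows are assumptions for illustration; the lab may choose different cleaning rules.

```python
from datetime import datetime

# The eight columns this lab focuses on.
LAB_COLUMNS = [
    "tpep_pickup_datetime", "tpep_dropoff_datetime",
    "PULocationID", "DOLocationID",
    "trip_distance", "passenger_count",
    "fare_amount", "total_amount",
]


def keep_lab_columns(row: dict) -> dict:
    """Return a copy of the row restricted to LAB_COLUMNS."""
    return {name: row[name] for name in LAB_COLUMNS}


def is_plausible(row: dict) -> bool:
    """Assumed sanity rules; adjust them to the lab's own cleaning policy."""
    return (
        row["tpep_dropoff_datetime"] > row["tpep_pickup_datetime"]
        and row["trip_distance"] > 0
        and row["passenger_count"] > 0
        and row["fare_amount"] >= 0
        # total_amount includes the fare, so it should not be smaller
        and row["total_amount"] >= row["fare_amount"]
    )


def make_row(pickup, dropoff, distance):
    """Build a made-up row; VendorID stands in for a column we do not keep."""
    return {
        "VendorID": 1,
        "tpep_pickup_datetime": pickup, "tpep_dropoff_datetime": dropoff,
        "PULocationID": 161, "DOLocationID": 237,
        "trip_distance": distance, "passenger_count": 1,
        "fare_amount": 12.5, "total_amount": 18.3,
    }


raw_rows = [
    make_row(datetime(2024, 1, 5, 8, 0), datetime(2024, 1, 5, 8, 15), 2.4),
    make_row(datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 20), 0.0),   # zero distance
    make_row(datetime(2024, 1, 5, 10, 30), datetime(2024, 1, 5, 10, 0), 3.1), # ends before start
]

clean_rows = [keep_lab_columns(r) for r in raw_rows if is_plausible(r)]
print(len(clean_rows), "of", len(raw_rows), "rows kept")
```

Only the first sample row passes every check, so the other two are dropped before any EDA or modeling step sees them.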

Data Dictionary

For complete column definitions and data types, refer to the official data dictionary.