Dataset choice and source
What dataset will we use?
NYC Taxi & Limousine Commission (TLC) Trip Record Data
We’ll use Yellow taxi trips for a few recent months (for example, 3 months) to keep compute and storage on Azure manageable. This is real-world, large-scale data with timestamps, locations, distance, passenger count, and fare components.
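To make the "few months" scope concrete, here is a minimal sketch that lists one file name per month for a 3-month window. The `yellow_tripdata_YYYY-MM.parquet` naming pattern, the helper `monthly_file_names`, and the example start month are all assumptions made for illustration; check the official source for the actual file names before relying on them.

```python
from datetime import date


def monthly_file_names(start: date, n_months: int) -> list[str]:
    """Return one assumed file name per month, starting at `start`."""
    names = []
    year, month = start.year, start.month
    for _ in range(n_months):
        # Assumed naming pattern; verify against the official source.
        names.append(f"yellow_tripdata_{year:04d}-{month:02d}.parquet")
        # Roll over to January of the next year after December.
        month += 1
        if month > 12:
            month = 1
            year += 1
    return names


# Example window: 3 months starting in November 2023 (placeholder dates).
files = monthly_file_names(date(2023, 11, 1), 3)
print(files)
```

Keeping the window as a parameter makes it easy to shrink or grow the sample if compute or storage becomes a constraint.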
This dataset represents actual taxi trips in New York City and contains rich information such as:
- Trip start and end timestamps
- Pickup and drop-off locations
- Trip distance
- Passenger count
- Fare and payment details
Tip: Why This Dataset?
Because of this rich information, the dataset is well-suited for both:
- Supervised learning (predicting trip duration or fare)
- Unsupervised learning (identifying taxi demand hotspots)
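The two use cases above can be sketched on a few hand-written rows. This is only an illustration in plain Python: the sample trips are made up, the helper `trip_duration_minutes` is hypothetical, and counting pickups per zone is a simple stand-in for hotspot detection, not a clustering method.

```python
from collections import Counter
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Made-up sample rows using the dataset's column names.
trips = [
    {"tpep_pickup_datetime": "2024-01-01 08:00:00",
     "tpep_dropoff_datetime": "2024-01-01 08:15:00",
     "PULocationID": 161},
    {"tpep_pickup_datetime": "2024-01-01 08:05:00",
     "tpep_dropoff_datetime": "2024-01-01 08:35:00",
     "PULocationID": 161},
    {"tpep_pickup_datetime": "2024-01-01 09:00:00",
     "tpep_dropoff_datetime": "2024-01-01 09:10:00",
     "PULocationID": 237},
]


def trip_duration_minutes(trip: dict) -> float:
    """Supervised target: minutes between pickup and drop-off."""
    pickup = datetime.strptime(trip["tpep_pickup_datetime"], FMT)
    dropoff = datetime.strptime(trip["tpep_dropoff_datetime"], FMT)
    return (dropoff - pickup).total_seconds() / 60.0


# Target values a regression model could learn to predict.
durations = [trip_duration_minutes(t) for t in trips]

# Pickup counts per zone: a rough first look at where demand concentrates.
pickups_per_zone = Counter(t["PULocationID"] for t in trips)

print(durations)
print(pickups_per_zone.most_common(1))
```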
Official source:
Understanding the columns:
The Yellow Taxi dataset contains many fields. For this lab, we focus on the most relevant columns needed for EDA, ETL, and modeling.
Time-related columns:
- tpep_pickup_datetime: Timestamp when the passenger was picked up
- tpep_dropoff_datetime: Timestamp when the passenger was dropped off
Location columns:
- PULocationID: Taxi Zone ID where the trip started
- DOLocationID: Taxi Zone ID where the trip ended
Trip characteristics:
- trip_distance: Distance traveled during the trip (in miles)
- passenger_count: Number of passengers in the taxi
Fare-related columns:
- fare_amount: Base fare charged for the trip (excluding taxes, tips, tolls)
- total_amount: Total amount charged, including fare, taxes, tolls, tips, and surcharges
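As a minimal sketch of narrowing a record to the columns listed above, the snippet below keeps only those eight fields and fails loudly if one is absent. The helper `select_lab_columns`, the constant `LAB_COLUMNS`, and the sample values are hypothetical; the two extra fields in the sample row stand in for the other columns the dataset carries.

```python
# The eight columns this lab focuses on, in the order listed above.
LAB_COLUMNS = [
    "tpep_pickup_datetime",
    "tpep_dropoff_datetime",
    "PULocationID",
    "DOLocationID",
    "trip_distance",
    "passenger_count",
    "fare_amount",
    "total_amount",
]


def select_lab_columns(row: dict) -> dict:
    """Return a copy of `row` restricted to LAB_COLUMNS."""
    missing = [c for c in LAB_COLUMNS if c not in row]
    if missing:
        # Surface schema problems early instead of silently dropping fields.
        raise KeyError(f"missing columns: {missing}")
    return {c: row[c] for c in LAB_COLUMNS}


# Made-up sample record with two extra fields that get dropped.
raw_row = {
    "VendorID": 1,
    "tpep_pickup_datetime": "2024-01-01 08:00:00",
    "tpep_dropoff_datetime": "2024-01-01 08:15:00",
    "PULocationID": 161,
    "DOLocationID": 237,
    "trip_distance": 2.4,
    "passenger_count": 1,
    "fare_amount": 14.2,
    "tip_amount": 3.0,
    "total_amount": 21.7,
}

slim = select_lab_columns(raw_row)
print(slim)
```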
Note: Data Dictionary
For complete column definitions and data types, refer to the official data dictionary.