Dataset choice and source

What dataset will we use?

NYC Taxi & Limousine Commission (TLC) Trip Record Data

We’ll use Yellow Taxi trip records for a few recent months (for example, three months) to keep compute and storage costs on Azure manageable. This is real-world, large-scale data with timestamps, locations, distances, passenger counts, and fare components.
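
To make the "few months" selection concrete, here is a minimal sketch that lists the monthly files to fetch. It assumes the files follow the `yellow_tripdata_YYYY-MM.parquet` naming pattern seen on the TLC portal (verify this against the portal before relying on it). The helper name `month_file_names` and the start month are hypothetical choices for illustration.

```python
from datetime import date


def month_file_names(start: date, n_months: int) -> list[str]:
    """Build file names for n_months consecutive months, beginning at start.

    Assumption: monthly Yellow Taxi files are named
    yellow_tripdata_YYYY-MM.parquet (check the TLC portal).
    """
    names = []
    year, month = start.year, start.month
    for _ in range(n_months):
        names.append(f"yellow_tripdata_{year:04d}-{month:02d}.parquet")
        # Advance one month, rolling over into the next year after December.
        month += 1
        if month > 12:
            month = 1
            year += 1
    return names


# Example: three months starting in November 2023 (an arbitrary start month).
files = month_file_names(date(2023, 11, 1), 3)
print(files)
```

Limiting the list to three entries keeps the download and storage footprint small while still spanning a year boundary in this example.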

This dataset represents actual taxi trips in New York City and contains rich information such as:

  • Trip start and end timestamps
  • Pickup and drop-off locations
  • Trip distance
  • Passenger count
  • Fare and payment details

Why This Dataset?

Because of this rich information, the dataset is well-suited for both:

  • Supervised learning (predicting trip duration or fare)
  • Unsupervised learning (identifying taxi demand hotspots)

Official source:

NYC TLC Trip Record Data portal

Data dictionary for Yellow Taxi trip records

Understanding the columns:

The Yellow Taxi dataset contains many fields. For this lab, we focus on the most relevant columns needed for EDA, ETL, and modeling.

Time-related columns:

  • tpep_pickup_datetime: Timestamp when the passenger was picked up
  • tpep_dropoff_datetime: Timestamp when the passenger was dropped off
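
The two timestamps are enough to derive a trip duration, which could serve as a supervised-learning target. The sketch below assumes pandas is available and uses a tiny hand-made sample rather than real TLC records. The column name `trip_duration_min` is a hypothetical choice, not part of the dataset.

```python
import pandas as pd

# Tiny hand-made sample (illustrative values, not real TLC records).
trips = pd.DataFrame(
    {
        "tpep_pickup_datetime": pd.to_datetime(
            ["2024-01-01 08:00:00", "2024-01-01 09:15:00"]
        ),
        "tpep_dropoff_datetime": pd.to_datetime(
            ["2024-01-01 08:12:30", "2024-01-01 09:45:00"]
        ),
    }
)

# Duration = drop-off minus pickup, converted from a timedelta to minutes.
# "trip_duration_min" is a hypothetical column name introduced here.
trips["trip_duration_min"] = (
    trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]
).dt.total_seconds() / 60

print(trips["trip_duration_min"].tolist())
```

On real data, it is worth checking for zero or negative durations before modeling, since they usually indicate recording errors.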

Location columns:

  • PULocationID: Taxi Zone ID where the trip started
  • DOLocationID: Taxi Zone ID where the trip ended
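
As a first look at demand per zone, pickups can simply be counted by `PULocationID`. This is a minimal sketch assuming pandas. The zone IDs below are made-up sample values, and a plain count is only a starting point for hotspot analysis, not a clustering method.

```python
import pandas as pd

# Made-up sample of pickup and drop-off zone IDs (not real TLC records).
trips = pd.DataFrame(
    {
        "PULocationID": [161, 237, 161, 132, 161, 237],
        "DOLocationID": [237, 161, 132, 161, 236, 141],
    }
)

# Count trips per pickup zone; value_counts sorts from most to least frequent.
pickups_per_zone = trips["PULocationID"].value_counts()

# The zone with the highest pickup count in this sample.
busiest_zone = pickups_per_zone.idxmax()

print(pickups_per_zone)
print("Busiest pickup zone:", busiest_zone)
```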

Trip characteristics:

  • trip_distance: Distance traveled during the trip (in miles)
  • passenger_count: Number of passengers in the taxi

Fare-related columns:

  • fare_amount: Base fare charged for the trip (excluding taxes, tips, tolls)
  • total_amount: Total amount charged, including fare, taxes, tolls, tips, and surcharges
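
The eight columns listed above can be kept in a single list and used to trim each monthly table. The following is a sketch under the assumption that pandas is used. `LAB_COLUMNS` and `select_lab_columns` are hypothetical names, and the sample row contains illustrative values only.

```python
import pandas as pd

# Columns this lab focuses on (names taken from the lists above).
LAB_COLUMNS = [
    "tpep_pickup_datetime",
    "tpep_dropoff_datetime",
    "PULocationID",
    "DOLocationID",
    "trip_distance",
    "passenger_count",
    "fare_amount",
    "total_amount",
]


def select_lab_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df restricted to LAB_COLUMNS, failing loudly if any are absent."""
    missing = [col for col in LAB_COLUMNS if col not in df.columns]
    if missing:
        raise KeyError(f"Missing expected columns: {missing}")
    return df[LAB_COLUMNS].copy()


# One illustrative row with an extra column that should be dropped.
raw = pd.DataFrame(
    {
        "tpep_pickup_datetime": pd.to_datetime(["2024-01-01 08:00:00"]),
        "tpep_dropoff_datetime": pd.to_datetime(["2024-01-01 08:12:30"]),
        "PULocationID": [161],
        "DOLocationID": [237],
        "trip_distance": [1.8],
        "passenger_count": [1],
        "fare_amount": [12.1],
        "total_amount": [18.5],
        "tip_amount": [2.0],
    }
)

lab_df = select_lab_columns(raw)
print(list(lab_df.columns))
```

When reading from Parquet, the same list can be passed as `pd.read_parquet(path, columns=LAB_COLUMNS)` so that unused columns are never loaded into memory.
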

Data Dictionary

For complete column definitions and data types, refer to the official data dictionary.
