Supervised Learning

Linear Regression for fare prediction

Objective

In this lab, we apply supervised learning to the cleaned NYC taxi dataset. By the end, you will be able to:

  • Train a Linear Regression model to predict taxi fare amounts
  • Evaluate model performance using R², MAE, and RMSE
  • Visualize prediction accuracy and residual distribution
Note: Prerequisites

This lab requires the cleaned dataset from the Data Processing lab. Make sure you have yellow_tripdata_cleaned.parquet saved from Part B.


Load Cleaned Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

sns.set_style("whitegrid")

df = pd.read_parquet("data_cleaned/yellow_tripdata_cleaned.parquet")
print("Shape:", df.shape)
df.head()

Expected output: 11,320,592 rows × 14 columns


Feature Selection and Train-Test Split

We select features that are known at the time of pickup and can reasonably predict the fare. We use an 80/20 train-test split.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

features = ['trip_distance', 'trip_duration_min', 'pickup_hour', 'passenger_count']
target = 'fare_amount'

X = df[features]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set: {X_train.shape[0]:,} rows")
print(f"Test set: {X_test.shape[0]:,} rows")

Feature Rationale

trip_distance      Primary driver of fare (0.90 correlation from EDA)
trip_duration_min  Longer trips cost more (0.79 correlation)
pickup_hour        Time-of-day pricing effects
passenger_count    Minor effect, included for completeness
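The correlation values cited above come from the EDA lab; you can recompute them yourself with `DataFrame.corr`. A minimal sketch on a synthetic stand-in frame (in the lab you would call this on `df` directly; the column names match the lab's, but the generating coefficients here are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

# Synthetic stand-in for the cleaned taxi frame: fare roughly linear
# in distance and duration, with noise (coefficients are illustrative)
dist = rng.uniform(0.5, 20, n)
dur = dist * 3 + rng.normal(0, 5, n)
fare = 3.0 + 2.8 * dist + 0.28 * dur + rng.normal(0, 3, n)
df = pd.DataFrame({"trip_distance": dist,
                   "trip_duration_min": dur,
                   "fare_amount": fare})

# Pearson correlation of each candidate feature with the target
print(df.corr(numeric_only=True)["fare_amount"].sort_values(ascending=False))
```

On the real data this is how the 0.90 (distance) and 0.79 (duration) figures were obtained.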

Train and Evaluate

# Train Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluation metrics
print("=== Linear Regression Results ===")
print(f"R² Score:           {r2_score(y_test, y_pred):.4f}")
print(f"Mean Absolute Error: ${mean_absolute_error(y_test, y_pred):.2f}")
print(f"Root Mean Sq Error:  ${np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")

print("\nFeature Coefficients:")
for feat, coef in zip(features, model.coef_):
    print(f"  {feat}: {coef:.4f}")
print(f"  Intercept: {model.intercept_:.4f}")

Results

Metric                    Value
R² Score                  0.8260
Mean Absolute Error       $7.27
Root Mean Squared Error   $7.27

Feature Coefficients

Feature            Coefficient  Interpretation
trip_distance      2.8104       Each additional mile adds ~$2.81 to fare
trip_duration_min  0.2830       Each additional minute adds ~$0.28
pickup_hour        0.0475       Minimal time-of-day effect
passenger_count    0.8558       Slight increase per passenger
Intercept          3.1394       Base fare ~$3.14
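With a linear model, a prediction is just the dot product of the coefficients with the feature values, plus the intercept. A quick hand check using the table's coefficients on a hypothetical trip (5 miles, 18 minutes, 2 pm pickup, one passenger):

```python
# Coefficients and intercept taken from the table above
coefs = {"trip_distance": 2.8104, "trip_duration_min": 0.2830,
         "pickup_hour": 0.0475, "passenger_count": 0.8558}
intercept = 3.1394

# Hypothetical trip: 5 miles, 18 minutes, picked up at 14:00, 1 passenger
trip = {"trip_distance": 5.0, "trip_duration_min": 18.0,
        "pickup_hour": 14.0, "passenger_count": 1.0}

fare = intercept + sum(coefs[f] * trip[f] for f in coefs)
print(f"Predicted fare: ${fare:.2f}")  # → Predicted fare: $23.81
```

This matches what `model.predict` would return for the same row, since `LinearRegression` computes exactly this linear combination.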
Tip: Interpreting R²

An R² of 0.826 means the model explains ~83% of the variance in fare amounts. This is a strong result for a simple linear model, driven primarily by trip distance and duration.
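R² is defined as 1 - SS_res/SS_tot, i.e. one minus the fraction of variance left unexplained by the model. A tiny sketch verifying `r2_score` against that definition on toy arrays (not the lab's data):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat  = np.array([12.0, 18.0, 33.0, 38.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

assert np.isclose(r2_manual, r2_score(y_true, y_hat))
print(f"R² = {r2_manual:.4f}")  # → R² = 0.9580
```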


Visualize Model Performance

# Visualize actual vs predicted fares
# Sample 5,000 test rows for plotting (scattering millions of points is slow)
sample_idx = X_test.sample(n=5000, random_state=42).index
y_test_sample = y_test.loc[sample_idx]
# y_pred is a plain array; align it to y_test's index before sampling
y_pred_sample = pd.Series(y_pred, index=y_test.index).loc[sample_idx]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Actual vs Predicted scatter
axes[0].scatter(y_test_sample, y_pred_sample, alpha=0.3, s=5)
axes[0].plot([0, 100], [0, 100], 'r--', label='Perfect prediction')
axes[0].set_xlim(0, 100)
axes[0].set_ylim(0, 100)
axes[0].set_xlabel('Actual Fare ($)')
axes[0].set_ylabel('Predicted Fare ($)')
axes[0].set_title('Actual vs Predicted Fare')
axes[0].legend()

# Residual distribution
residuals = y_test_sample - y_pred_sample
axes[1].hist(residuals, bins=50, edgecolor='black')
axes[1].set_xlabel('Residual ($)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Residual Distribution')
axes[1].axvline(x=0, color='r', linestyle='--')

plt.tight_layout()
plt.show()
Note: Observations
  • Actual vs Predicted: Points cluster tightly along the red diagonal line, indicating good predictions. A vertical cluster at ~$70 actual fare shows the model struggles with flat-rate airport trips.
  • Residuals: Centered near zero with a slight right skew. Most predictions are within ±$10 of the actual fare.
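One way to confirm the flat-rate hypothesis is to pull the rows with the largest absolute residuals and inspect their actual fares. A sketch of the pattern on a synthetic stand-in (a fare series with a $70 flat-rate cluster; in the lab you would use your real `y_test` and `y_pred`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in: 95 metered fares, plus 5 flat-rate trips at $70
y_test = pd.Series(np.r_[rng.uniform(5, 40, 95), np.full(5, 70.0)])
# Pretend model: accurate on metered trips, underpredicts the flat fares
y_pred = pd.Series(np.r_[y_test.iloc[:95] + rng.normal(0, 3, 95),
                         np.full(5, 45.0)])

residuals = y_test - y_pred
worst = residuals.abs().nlargest(5)
print(y_test.loc[worst.index])  # the largest errors are all the $70 flat fares
```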

Summary

Aspect            Detail
Algorithm         Linear Regression (scikit-learn)
Training rows     9,056,473
Test rows         2,264,119
R² Score          0.826
Best predictor    Trip distance (~$2.81 per mile)
Known limitation  Flat-rate airport fares (~$70) are poorly predicted
Tip: Potential Improvements
  • Add PULocationID / DOLocationID as categorical features to capture zone-based pricing
  • Use a non-linear model (e.g., Random Forest, Gradient Boosting) to handle flat-rate fares
  • Engineer an is_airport binary feature for JFK/LaGuardia/Newark trips
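The `is_airport` idea can be sketched from `RatecodeID`, assuming your cleaned frame kept that column (in the TLC yellow-taxi data dictionary, rate code 2 is the JFK flat fare and 3 is Newark; verify against your own data). Shown on a tiny hand-made frame so it runs standalone:

```python
import pandas as pd

# Toy frame standing in for the cleaned dataset
df = pd.DataFrame({"RatecodeID": [1, 2, 1, 3, 2]})

# Flag flat-rate airport trips (rate codes 2 = JFK, 3 = Newark)
df["is_airport"] = df["RatecodeID"].isin([2, 3]).astype(int)
print(df)
```

Appending `"is_airport"` to the `features` list and refitting lets the model learn a separate offset for these trips, which should shrink the ~$70 vertical cluster seen in the scatter plot.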