Supervised Learning

Linear Regression for fare prediction

Objective

In this lab, we apply supervised learning to the cleaned NYC taxi dataset. By the end, you will be able to:

  • Train a Linear Regression model to predict taxi fare amounts
  • Evaluate model performance using R², MAE, and RMSE
  • Visualize prediction accuracy and residual distribution

Prerequisites

This lab requires the cleaned dataset from the Data Processing lab. Make sure you have yellow_tripdata_cleaned.parquet saved from Part B.


Load Cleaned Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

sns.set_style("whitegrid")

df = pd.read_parquet("data_cleaned/yellow_tripdata_cleaned.parquet")
print("Shape:", df.shape)
df.head()

Expected output: 11,320,592 rows × 14 columns


Feature Selection, Normalization, and Train-Test Split

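We select four trip-level features that can reasonably predict the fare. Note that trip_distance and trip_duration_min are measured over the completed trip, so at pickup time they would have to be estimated. We split the data 80/20 into training and test sets, then standardize the features with statistics computed from the training set only, so that all features are on the same scale and no information from the test set leaks into the scaler.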

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

features = ['trip_distance', 'trip_duration_min', 'pickup_hour', 'passenger_count']
target = 'fare_amount'

X = df[features]
y = df[target]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalize features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set: {X_train.shape[0]:,} rows")
print(f"Test set: {X_test.shape[0]:,} rows")
print("\nNormalization verification (training set):")
print(f"Mean: {X_train_scaled.mean(axis=0).round(4)}")
print(f"Std: {X_train_scaled.std(axis=0).round(4)}")

Feature             Rationale
trip_distance       Primary driver of fare (0.90 correlation from EDA)
trip_duration_min   Longer trips cost more (0.79 correlation)
pickup_hour         Time-of-day pricing effects
passenger_count     Minor effect, included for completeness

Why Normalize for Linear Regression?

Linear Regression does not mathematically require normalization, but standardizing the features provides several benefits:

  1. Coefficient Interpretation: All coefficients are on the same scale, making it easier to compare feature importance
  2. Numerical Stability: Prevents issues with very large or very small numbers
  3. Regularization: Essential when using Ridge or Lasso regression (regularization penalties are scale-dependent)
  4. Convergence: Faster convergence if using gradient descent optimization

Formula: z = (x - μ) / σ, where μ is the mean and σ is the standard deviation of the feature
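
The z-score formula can be checked by hand on a few rows. The sketch below is illustrative only: the toy arrays are hypothetical, and it assumes the population standard deviation (ddof=0), which is what StandardScaler uses. The statistics are learned from the training rows and then reused on the test rows.

```python
import numpy as np

# Hypothetical toy data: columns are trip_distance (miles) and trip_duration_min.
X_train = np.array([[1.0, 5.0], [2.5, 12.0], [4.0, 20.0], [10.0, 35.0]])
X_test = np.array([[3.0, 15.0], [8.0, 30.0]])

# Learn the mean and standard deviation from the training rows only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # population std (ddof=0), matching StandardScaler

# Apply z = (x - mu) / sigma to both splits using the training statistics.
X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma

print("Train mean:", X_train_scaled.mean(axis=0).round(4))
print("Train std: ", X_train_scaled.std(axis=0).round(4))
print("Test mean: ", X_test_scaled.mean(axis=0).round(4))
```

The training columns come out with mean 0 and standard deviation 1. The test columns generally do not, which is expected because they are scaled with the training statistics.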


Train and Evaluate (Without Regularization)

First, let’s train a standard Linear Regression model without regularization:

# Train Linear Regression model (no regularization)
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Predict on test set
y_pred = model.predict(X_test_scaled)

# Evaluation metrics
print("=== Linear Regression Results (No Regularization) ===")
print(f"R² Score:           {r2_score(y_test, y_pred):.4f}")
print(f"Mean Absolute Error: ${mean_absolute_error(y_test, y_pred):.2f}")
print(f"Root Mean Sq Error:  ${np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")

print("\nFeature Coefficients (Normalized Scale):")
for feat, coef in zip(features, model.coef_):
    print(f"  {feat}: {coef:.4f}")
print(f"  Intercept: {model.intercept_:.4f}")
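
To make the three metrics concrete, here is a minimal sketch that computes them directly from their definitions. The helper name regression_metrics and the four example fares are hypothetical. In the lab itself, the scikit-learn functions imported earlier do this work.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (r2, mae, rmse) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))                    # average absolute miss
    rmse = np.sqrt(np.mean(errors ** 2))             # weights large misses more heavily
    ss_res = np.sum(errors ** 2)                     # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    return r2, mae, rmse

# Hypothetical fares in dollars, chosen only for illustration.
r2, mae, rmse = regression_metrics([10, 20, 30, 40], [12, 18, 33, 37])
print(f"R2={r2:.4f}  MAE=${mae:.2f}  RMSE=${rmse:.2f}")
# R2=0.9480  MAE=$2.50  RMSE=$2.55
```

RMSE is never smaller than MAE. The two are equal only when every prediction misses by exactly the same absolute amount.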

Train with Regularization (Ridge Regression)

Now let’s apply Ridge Regression (L2 regularization) to prevent overfitting:

# Train Ridge Regression model (L2 regularization)
ridge_model = Ridge(alpha=1.0, random_state=42)
ridge_model.fit(X_train_scaled, y_train)

# Predict on test set
y_pred_ridge = ridge_model.predict(X_test_scaled)

# Evaluation metrics
print("=== Ridge Regression Results (With Regularization) ===")
print(f"R² Score:           {r2_score(y_test, y_pred_ridge):.4f}")
print(f"Mean Absolute Error: ${mean_absolute_error(y_test, y_pred_ridge):.2f}")
print(f"Root Mean Sq Error:  ${np.sqrt(mean_squared_error(y_test, y_pred_ridge)):.2f}")

print("\nFeature Coefficients (Normalized Scale):")
for feat, coef in zip(features, ridge_model.coef_):
    print(f"  {feat}: {coef:.4f}")
print(f"  Intercept: {ridge_model.intercept_:.4f}")

# Compare coefficient magnitudes
print("\n=== Coefficient Comparison ===")
coef_comparison = pd.DataFrame({
    'Feature': features,
    'Linear Regression': model.coef_,
    'Ridge Regression': ridge_model.coef_,
    'Difference': np.abs(model.coef_ - ridge_model.coef_)
})
print(coef_comparison.round(4))

Understanding Regularization

Ridge Regression (L2) adds a penalty term to the loss function:

Loss = Σ(residuals²) + α × Σ(coefficients²)

Where:
  • α (alpha) is the regularization strength (higher = more penalty)
  • The penalty shrinks coefficients toward zero, which helps prevent overfitting
  • Unlike Lasso (L1), Ridge doesn’t eliminate features entirely

scikit-learn’s Ridge uses the sum of squared residuals as the error term, not the mean. The error term therefore grows with the number of rows while the penalty does not, so α = 1.0 is a very weak penalty on millions of rows.

Benefits:
  • Reduces overfitting by penalizing large coefficients
  • Handles multicollinearity better than standard Linear Regression
  • More stable predictions on new data
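
The shrinkage effect can be seen with the closed-form ridge solution. This is a sketch under stated assumptions, not the solver scikit-learn actually runs: it uses the sum-of-squared-errors objective, SSE + α Σw² (the form scikit-learn documents for Ridge), centers the data so the intercept is not penalized, and works on hypothetical synthetic data. The helper ridge_coefficients is an illustrative name.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution on centered data (intercept left unpenalized)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_features = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

# Hypothetical synthetic data: 200 rows, 3 roughly standardized features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

for alpha in [0.0, 1.0, 100.0, 10000.0]:
    w = ridge_coefficients(X, y, alpha)
    print(f"alpha={alpha:>8}: coefs={w.round(3)}  L2 norm={np.linalg.norm(w):.3f}")
```

As alpha grows, the coefficient vector's norm shrinks toward zero, but no individual coefficient becomes exactly zero. That is the practical difference from Lasso.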


Model Comparison

Results Comparison

Metric                    Linear Regression   Ridge Regression
R² Score                  0.8260              0.8260
Mean Absolute Error       $7.27               $7.27
Root Mean Squared Error   $7.27               $7.27

Why Similar Results?

With this large dataset (9M+ training samples) and only 4 features, both models perform similarly because:
  • The dataset is large enough that overfitting is minimal
  • The features are not highly correlated (no severe multicollinearity)
  • The relationship is relatively linear

Regularization becomes more important with:
  • Smaller datasets
  • Many features (high dimensionality)
  • Highly correlated features
  • Non-linear relationships modeled with many engineered terms (e.g., polynomial features)
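
The effect of dataset size can be illustrated with a small experiment. This is a sketch on hypothetical synthetic data, assuming the sum-of-squared-errors ridge objective that scikit-learn documents. The helper fit and the coefficient values are made up for illustration. It measures how far the ridge coefficients (alpha = 1) move from the ordinary least squares coefficients as the number of rows grows.

```python
import numpy as np

def fit(X, y, alpha):
    """Least squares (alpha=0) or ridge (alpha>0) on centered data."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)

rng = np.random.default_rng(42)
true_w = np.array([2.8, 0.3, 0.05, 0.9])  # hypothetical coefficients

gaps = {}
for n in [20, 2_000, 200_000]:
    X = rng.normal(size=(n, 4))
    y = X @ true_w + rng.normal(scale=1.0, size=n)
    # Largest coefficient change caused by the alpha=1 penalty at this sample size.
    gaps[n] = np.max(np.abs(fit(X, y, 0.0) - fit(X, y, 1.0)))
    print(f"n={n:>7,}: max |OLS - Ridge(alpha=1)| = {gaps[n]:.6f}")
```

The gap shrinks as n grows. With a fixed alpha, the penalty is increasingly dominated by the data term, which is consistent with the two models reporting the same metrics on millions of rows.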

Feature Coefficients (Original Scale)

To interpret coefficients in terms of the original features, we need to reverse the normalization:

# Convert normalized coefficients back to original scale
original_coefs = ridge_model.coef_ / scaler.scale_
original_intercept = ridge_model.intercept_ - np.sum(ridge_model.coef_ * scaler.mean_ / scaler.scale_)

print("=== Coefficients in Original Scale ===")
for feat, coef in zip(features, original_coefs):
    print(f"  {feat}: {coef:.4f}")
print(f"  Intercept: {original_intercept:.4f}")
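
The conversion follows from substituting z = (x - μ) / σ into the scaled model: ŷ = b + Σ cⱼ (xⱼ - μⱼ) / σⱼ = (b - Σ cⱼ μⱼ / σⱼ) + Σ (cⱼ / σⱼ) xⱼ. The sketch below checks that algebra numerically. All numbers are hypothetical stand-ins for scaler.mean_, scaler.scale_, and the fitted coefficients, not values from the lab.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins for scaler.mean_, scaler.scale_, coef_ and intercept_.
mean_ = np.array([3.2, 15.0, 13.5, 1.4])
scale_ = np.array([4.0, 11.0, 6.0, 0.9])
coef_scaled = np.array([11.0, 3.1, 0.3, 0.8])
intercept_scaled = 19.5

# Same algebra as in the lab code above.
coef_orig = coef_scaled / scale_
intercept_orig = intercept_scaled - np.sum(coef_scaled * mean_ / scale_)

# Predict a few random raw rows both ways.
X_raw = rng.uniform(low=0.0, high=30.0, size=(5, 4))
pred_scaled = ((X_raw - mean_) / scale_) @ coef_scaled + intercept_scaled
pred_orig = X_raw @ coef_orig + intercept_orig

print(np.allclose(pred_scaled, pred_orig))  # True
```

Both routes give the same predictions, so the original-scale coefficients describe the same model expressed in dollars per unit of each raw feature.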

Feature             Coefficient   Interpretation
trip_distance       2.8104        Each additional mile adds ~$2.81 to fare
trip_duration_min   0.2830        Each additional minute adds ~$0.28
pickup_hour         0.0475        Minimal time-of-day effect
passenger_count     0.8558        Slight increase per passenger
Intercept           3.1394        Base fare ~$3.14
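
As a worked example of reading these coefficients, the snippet below plugs one trip into the linear formula. The coefficients are copied from the table above. The trip itself (5 miles, 20 minutes, 14:00 pickup, one passenger) is hypothetical.

```python
# Coefficients copied from the table above (original scale, Ridge model).
coefs = {
    "trip_distance": 2.8104,
    "trip_duration_min": 0.2830,
    "pickup_hour": 0.0475,
    "passenger_count": 0.8558,
}
intercept = 3.1394

# Hypothetical trip: 5 miles, 20 minutes, picked up at 14:00, one passenger.
trip = {
    "trip_distance": 5.0,
    "trip_duration_min": 20.0,
    "pickup_hour": 14,
    "passenger_count": 1,
}

# fare = intercept + sum of (coefficient x feature value)
predicted_fare = intercept + sum(coefs[name] * trip[name] for name in coefs)
print(f"Predicted fare: ${predicted_fare:.2f}")  # Predicted fare: $24.37
```

Distance contributes about $14.05 and duration about $5.66 of that total, which shows why those two features dominate the prediction.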

Interpreting R²

An R² of 0.826 means the model explains ~83% of the variance in fare amounts. This is a strong result for a simple linear model, driven primarily by trip distance and duration.


Visualize Model Performance

# Use Ridge model predictions for visualization
sample_idx = X_test.sample(n=5000, random_state=42).index
y_test_sample = y_test.loc[sample_idx]
y_pred_sample = pd.Series(y_pred_ridge, index=y_test.index).loc[sample_idx]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Actual vs Predicted scatter
axes[0].scatter(y_test_sample, y_pred_sample, alpha=0.3, s=5)
axes[0].plot([0, 100], [0, 100], 'r--', label='Perfect prediction')
axes[0].set_xlim(0, 100)
axes[0].set_ylim(0, 100)
axes[0].set_xlabel('Actual Fare ($)')
axes[0].set_ylabel('Predicted Fare ($)')
axes[0].set_title('Actual vs Predicted Fare (Ridge Regression)')
axes[0].legend()

# Residual distribution
residuals = y_test_sample - y_pred_sample
axes[1].hist(residuals, bins=50, edgecolor='black')
axes[1].set_xlabel('Residual ($)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Residual Distribution')
axes[1].axvline(x=0, color='r', linestyle='--')

plt.tight_layout()
plt.show()

Observations
  • Actual vs Predicted: Points cluster tightly along the red diagonal line, indicating good predictions. A vertical cluster at ~$70 actual fare shows the model struggles with flat-rate airport trips.
  • Residuals: Centered near zero with a slight right skew. Most predictions are within ±$10 of the actual fare.
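
A claim such as "most predictions are within ±$10" can be quantified rather than read off the histogram. The following is a minimal sketch. The helper share_within and the example residuals are hypothetical. In the lab you would pass y_test - y_pred_ridge instead.

```python
import numpy as np

def share_within(residuals, tolerance):
    """Fraction of residuals whose absolute value is at most `tolerance`."""
    residuals = np.asarray(residuals, dtype=float)
    return np.mean(np.abs(residuals) <= tolerance)

# Hypothetical residuals in dollars, chosen only for illustration.
example_residuals = np.array([-3.2, 0.5, 1.8, -0.7, 12.4, 4.1, -9.9, 25.0, 2.2, -1.1])

print(f"Within $10: {share_within(example_residuals, 10):.0%}")
print(f"Mean residual: {example_residuals.mean():.2f}")
```

A mean residual above zero alongside a few large positive values is the numerical counterpart of a right-skewed histogram, where the model underpredicts some expensive trips.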

Summary

Aspect             Detail
Algorithm          Ridge Regression (L2 regularization)
Training rows      9,056,473
Test rows          2,264,119
Normalization      StandardScaler (Z-score normalization)
Regularization     Ridge (α=1.0)
R² Score           0.826
Best predictor     Trip distance ($2.81 per mile)
Known limitation   Flat-rate airport fares (~$70) are poorly predicted

Potential Improvements
  • Add PULocationID / DOLocationID as categorical features to capture zone-based pricing
  • Use a non-linear model (e.g., Random Forest, Gradient Boosting) to handle flat-rate fares
  • Engineer an is_airport binary feature for JFK/LaGuardia/Newark trips
  • Experiment with different regularization strengths (alpha values) using cross-validation
  • Try Lasso (L1) regularization for automatic feature selection
  • Use ElasticNet (combination of L1 and L2) for balanced regularization
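
One way to try the cross-validation suggestion is scikit-learn's RidgeCV, which fits the model for each candidate alpha and keeps the best one. The sketch below runs on hypothetical synthetic data with made-up coefficients. In the lab you would fit on X_train_scaled and y_train, and the alpha grid shown is only an assumption about a reasonable search range.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Hypothetical synthetic data standing in for the scaled taxi features.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = X @ np.array([2.8, 0.3, 0.05, 0.9]) + rng.normal(scale=1.0, size=500)

# Candidate strengths on a log scale; RidgeCV selects one by 5-fold cross-validation.
alphas = np.logspace(-3, 3, 13)
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X, y)

print(f"Selected alpha: {ridge_cv.alpha_:.4g}")
print("Coefficients:", ridge_cv.coef_.round(4))
```

If the selected alpha lands on either end of the grid, it is usually worth widening the range and rerunning.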

Knowledge Check

1. What algorithm is used in this lab to predict taxi fare amounts?

2. What is the R² score achieved by the Ridge Regression model?

3. Which feature is the strongest predictor of taxi fare?

4. What does Ridge Regression (L2) add to prevent overfitting?
