Supervised Learning

Linear Regression for fare prediction

Objective

In this lab, we apply supervised learning to the cleaned NYC taxi dataset. By the end, you will be able to:

Train a Linear Regression model to predict taxi fare amounts
Evaluate model performance using R², MAE, and RMSE
Visualize prediction accuracy and residual distribution

Prerequisites

This lab requires the cleaned dataset from the Data Processing lab. Make sure yellow_tripdata_cleaned.parquet from Part B is saved in the data_cleaned/ directory, which is where the loading code below expects it.

Load Cleaned Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

sns.set_style("whitegrid")

df = pd.read_parquet("data_cleaned/yellow_tripdata_cleaned.parquet")
print("Shape:", df.shape)
df.head()

Expected output: 11,320,592 rows × 14 columns

Feature Selection, Normalization, and Train-Test Split

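We select features that can reasonably predict the fare. Note that trip_distance and trip_duration_min are only recorded when the trip ends, so a model used at pickup time would need estimates of both. We use an 80/20 train-test split, then standardize the features so they’re on the same scale, fitting the scaler on the training set only so that no test-set statistics leak into training.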

from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

features = ['trip_distance', 'trip_duration_min', 'pickup_hour', 'passenger_count']
target = 'fare_amount'

X = df[features]
y = df[target]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalize features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set: {X_train.shape[0]:,} rows")
print(f"Test set: {X_test.shape[0]:,} rows")
print("\nNormalization verification (training set):")
print(f"Mean: {X_train_scaled.mean(axis=0).round(4)}")
print(f"Std: {X_train_scaled.std(axis=0).round(4)}")

Feature	Rationale
`trip_distance`	Primary driver of fare (0.90 correlation from EDA)
`trip_duration_min`	Longer trips cost more (0.79 correlation)
`pickup_hour`	Time-of-day pricing effects
`passenger_count`	Minor effect, included for completeness

Why Normalize for Linear Regression?

While Linear Regression doesn’t require normalization mathematically, it provides several benefits:

Coefficient Interpretation: All coefficients are on the same scale, making it easier to compare feature importance
Numerical Stability: Prevents issues with very large or very small numbers
Regularization: Essential when using Ridge or Lasso regression (regularization penalties are scale-dependent)
Convergence: Faster convergence if using gradient descent optimization

Formula: z = (x - μ) / σ where μ is mean and σ is standard deviation
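The formula above can be applied by hand to see what the scaler does. The following is a minimal sketch with made-up values (the toy arrays are not from the taxi dataset); it learns μ and σ from the training rows only and reuses them on the test rows, which is the same fit-then-transform pattern used in this lab.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales
# (think miles and minutes); the values are made up for illustration.
X_train = np.array([[1.0, 5.0], [2.0, 12.0], [3.0, 20.0], [10.0, 45.0]])
X_test = np.array([[4.0, 18.0], [0.5, 3.0]])

# Learn mu and sigma from the training rows only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Apply z = (x - mu) / sigma to both splits with the same statistics.
X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma

print("Train mean:", X_train_scaled.mean(axis=0).round(4))
print("Train std: ", X_train_scaled.std(axis=0).round(4))
print("Test mean: ", X_test_scaled.mean(axis=0).round(4))
```

The training columns come out with mean 0 and standard deviation 1. The test columns generally do not, because they are scaled with the training statistics rather than their own.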

Train and Evaluate (Without Regularization)

First, let’s train a standard Linear Regression model without regularization:

# Train Linear Regression model (no regularization)
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Predict on test set
y_pred = model.predict(X_test_scaled)

# Evaluation metrics
print("=== Linear Regression Results (No Regularization) ===")
print(f"R² Score:           {r2_score(y_test, y_pred):.4f}")
print(f"Mean Absolute Error: ${mean_absolute_error(y_test, y_pred):.2f}")
print(f"Root Mean Sq Error:  ${np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")

print("\nFeature Coefficients (Normalized Scale):")
for feat, coef in zip(features, model.coef_):
    print(f"  {feat}: {coef:.4f}")
print(f"  Intercept: {model.intercept_:.4f}")
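To make the three metrics concrete, here is a small sketch that computes them directly from their definitions using NumPy. The helper name regression_metrics and the four fares are made up for illustration; on real data the scikit-learn functions used above should give the same values.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (R², MAE, RMSE) computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))            # average absolute error
    rmse = np.sqrt(np.mean(errors ** 2))     # square root of the mean squared error

    ss_res = np.sum(errors ** 2)                      # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    return r2, mae, rmse

# Made-up fares, not taken from the taxi dataset
r2, mae, rmse = regression_metrics([10, 20, 30, 40], [12, 18, 33, 37])
print(f"R²: {r2:.4f}  MAE: ${mae:.2f}  RMSE: ${rmse:.2f}")
# → R²: 0.9480  MAE: $2.50  RMSE: $2.55
```

Because squaring weights large errors more heavily, RMSE is never smaller than MAE; the two coincide only when every absolute error is identical.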

Train with Regularization (Ridge Regression)

Now let’s apply Ridge Regression (L2 regularization) to prevent overfitting:

# Train Ridge Regression model (L2 regularization)
ridge_model = Ridge(alpha=1.0, random_state=42)
ridge_model.fit(X_train_scaled, y_train)

# Predict on test set
y_pred_ridge = ridge_model.predict(X_test_scaled)

# Evaluation metrics
print("=== Ridge Regression Results (With Regularization) ===")
print(f"R² Score:           {r2_score(y_test, y_pred_ridge):.4f}")
print(f"Mean Absolute Error: ${mean_absolute_error(y_test, y_pred_ridge):.2f}")
print(f"Root Mean Sq Error:  ${np.sqrt(mean_squared_error(y_test, y_pred_ridge)):.2f}")

print("\nFeature Coefficients (Normalized Scale):")
for feat, coef in zip(features, ridge_model.coef_):
    print(f"  {feat}: {coef:.4f}")
print(f"  Intercept: {ridge_model.intercept_:.4f}")

# Compare coefficient magnitudes
print("\n=== Coefficient Comparison ===")
coef_comparison = pd.DataFrame({
    'Feature': features,
    'Linear Regression': model.coef_,
    'Ridge Regression': ridge_model.coef_,
    'Difference': np.abs(model.coef_ - ridge_model.coef_)
})
print(coef_comparison.round(4))

Understanding Regularization

Ridge Regression (L2) adds a penalty term to the loss function:

Loss = MSE + α × Σ(coefficients²)

Where:
- α (alpha) is the regularization strength (higher = more penalty)
- The penalty shrinks coefficients toward zero, preventing overfitting
- Unlike Lasso (L1), Ridge doesn’t eliminate features entirely

Benefits:
- Reduces overfitting by penalizing large coefficients
- Handles multicollinearity better than standard Linear Regression
- More stable predictions on new data
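The shrinking effect of the penalty can be seen with a closed-form solution. The sketch below is an illustration under stated assumptions, not the solver scikit-learn uses internally: it minimizes the sum of squared residuals plus α times the sum of squared coefficients on centered data, and the helper ridge_closed_form and the synthetic data are hypothetical.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Minimize ||y - Xw||² + alpha * ||w||² on centered data (intercept not penalized)."""
    X_c = X - X.mean(axis=0)
    y_c = y - y.mean()
    n_features = X.shape[1]
    # Normal equations with alpha added to the diagonal
    return np.linalg.solve(X_c.T @ X_c + alpha * np.eye(n_features), X_c.T @ y_c)

# Synthetic data with known coefficients (made up for illustration)
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=50)

for alpha in [0.0, 1.0, 10.0, 100.0]:
    w = ridge_closed_form(X, y, alpha)
    print(f"alpha={alpha:>6}: coef={w.round(3)}, L2 norm={np.linalg.norm(w):.3f}")
```

The L2 norm of the coefficient vector falls as alpha grows, but the individual coefficients move toward zero without reaching it, which is the contrast with Lasso noted above.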

Model Comparison

Results Comparison

Metric	Linear Regression	Ridge Regression
R² Score	0.8260	0.8260
Mean Absolute Error	$7.27	$7.27
Root Mean Squared Error	$7.27	$7.27

Why Similar Results?

With this large dataset (9M+ training samples) and only 4 features, both models perform similarly because:
- The dataset is large enough that overfitting is minimal
- The features are not highly correlated (no severe multicollinearity)
- The relationship is relatively linear

Regularization becomes more important with:
- Smaller datasets
- Many features (high dimensionality)
- Highly correlated features
- Non-linear relationships
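The effect of dataset size can be illustrated on synthetic data. This is a rough sketch, assuming the sum-of-squares form of the Ridge objective; the helper fit_ols_and_ridge, the true coefficients, and the sample sizes are all made up for illustration. It fits both models at a fixed α=1.0 and reports how far apart the coefficients end up.

```python
import numpy as np

def fit_ols_and_ridge(n_samples, alpha=1.0, seed=0):
    """Fit OLS and Ridge (closed form, centered data) on synthetic 4-feature data."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, 4))
    y = X @ np.array([5.0, 2.0, 0.1, 0.3]) + rng.normal(scale=2.0, size=n_samples)
    X_c, y_c = X - X.mean(axis=0), y - y.mean()
    gram = X_c.T @ X_c
    ols = np.linalg.solve(gram, X_c.T @ y_c)
    ridge = np.linalg.solve(gram + alpha * np.eye(4), X_c.T @ y_c)
    return ols, ridge

for n in [20, 2_000, 200_000]:
    ols, ridge = fit_ols_and_ridge(n)
    gap = np.max(np.abs(ols - ridge))
    print(f"n={n:>7,}: largest |OLS - Ridge| coefficient gap = {gap:.6f}")
```

As the number of rows grows, the data term dominates the fixed penalty and the two sets of coefficients converge, which is consistent with the near-identical metrics reported for the taxi data.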

Feature Coefficients (Original Scale)

To interpret coefficients in terms of the original features, we need to reverse the normalization:

# Convert normalized coefficients back to original scale
original_coefs = ridge_model.coef_ / scaler.scale_
original_intercept = ridge_model.intercept_ - np.sum(ridge_model.coef_ * scaler.mean_ / scaler.scale_)

print("=== Coefficients in Original Scale ===")
for feat, coef in zip(features, original_coefs):
    print(f"  {feat}: {coef:.4f}")
print(f"  Intercept: {original_intercept:.4f}")

Feature	Coefficient	Interpretation
`trip_distance`	2.8104	Each additional mile adds ~$2.81 to fare
`trip_duration_min`	0.2830	Each additional minute adds ~$0.28
`pickup_hour`	0.0475	Minimal time-of-day effect
`passenger_count`	0.8558	Slight increase per passenger
Intercept	3.1394	Base fare ~$3.14
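The conversion follows from substituting z = (x - μ) / σ into the fitted equation: each coefficient is divided by its feature's σ, and the intercept absorbs the μ terms. The sketch below checks this numerically on synthetic data (the two features and their coefficients are made up, loosely resembling distance and duration) by confirming that both parameterizations give the same predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up stand-ins for distance (miles) and duration (minutes)
X = np.column_stack([rng.uniform(0.5, 20, 200), rng.uniform(2, 60, 200)])
y = 3.0 + 2.8 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=1.0, size=200)

# Standardize, keeping the statistics for the back-conversion
mean, scale = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mean) / scale

# Least-squares fit on standardized features, with an explicit intercept column
design = np.column_stack([np.ones(len(X_scaled)), X_scaled])
params, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept_scaled, coef_scaled = params[0], params[1:]

# y = b0 + sum(w_j * (x_j - mean_j) / scale_j)
#   = (b0 - sum(w_j * mean_j / scale_j)) + sum((w_j / scale_j) * x_j)
coef_original = coef_scaled / scale
intercept_original = intercept_scaled - np.sum(coef_scaled * mean / scale)

pred_scaled = intercept_scaled + X_scaled @ coef_scaled
pred_original = intercept_original + X @ coef_original
print("Max prediction difference:", np.max(np.abs(pred_scaled - pred_original)))
print("Original-scale coefficients:", coef_original.round(3))
```

The prediction difference is at floating-point noise level, and the recovered coefficients land close to the values used to generate the data.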

Interpreting R²

An R² of 0.826 means the model explains ~83% of the variance in fare amounts. This is a strong result for a simple linear model, driven primarily by trip distance and duration.

Visualize Model Performance

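# Plot Ridge predictions on a random sample of 5,000 test rows to keep the scatter readable
sample_idx = X_test.sample(n=5000, random_state=42).index
y_test_sample = y_test.loc[sample_idx]
# Re-attach the test index to the prediction array so it can be sampled by label
y_pred_sample = pd.Series(y_pred_ridge, index=y_test.index).loc[sample_idx]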

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Actual vs Predicted scatter
axes[0].scatter(y_test_sample, y_pred_sample, alpha=0.3, s=5)
axes[0].plot([0, 100], [0, 100], 'r--', label='Perfect prediction')
axes[0].set_xlim(0, 100)
axes[0].set_ylim(0, 100)
axes[0].set_xlabel('Actual Fare ($)')
axes[0].set_ylabel('Predicted Fare ($)')
axes[0].set_title('Actual vs Predicted Fare (Ridge Regression)')
axes[0].legend()

# Residual distribution
residuals = y_test_sample - y_pred_sample
axes[1].hist(residuals, bins=50, edgecolor='black')
axes[1].set_xlabel('Residual ($)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Residual Distribution')
axes[1].axvline(x=0, color='r', linestyle='--')

plt.tight_layout()
plt.show()

Observations

Actual vs Predicted: Points cluster tightly along the red diagonal line, indicating good predictions. A vertical cluster at ~$70 actual fare shows the model struggles with flat-rate airport trips.
Residuals: Centered near zero with a slight right skew. Most predictions are within ±$10 of the actual fare.
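The visual observations above can be backed with a few numbers. The following is a minimal sketch; the helper residual_summary and the five fares are hypothetical, with the last trip mimicking a flat-rate fare that the model underpredicts. It reports the mean residual, the share of predictions within a dollar tolerance, and the skewness of the residuals.

```python
import numpy as np

def residual_summary(y_true, y_pred, tolerance=10.0):
    """Summarize residuals (actual minus predicted) with three simple statistics."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    centered = residuals - residuals.mean()
    std = residuals.std()
    # Third standardized moment: positive values indicate a right skew
    skewness = np.mean(centered ** 3) / std ** 3 if std > 0 else 0.0
    return {
        "mean": residuals.mean(),
        "share_within_tol": np.mean(np.abs(residuals) <= tolerance),
        "skewness": skewness,
    }

# Made-up fares: the last trip mimics a flat-rate fare the model underpredicts
actual = [8.0, 12.5, 20.0, 15.0, 70.0]
predicted = [9.0, 11.0, 22.0, 14.0, 48.0]
print(residual_summary(actual, predicted))
```

In this toy example four of the five residuals fall within ±$10, and the single large positive residual produces a positive skewness, the same pattern a right-skewed histogram shows.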

Summary

Aspect	Detail
Algorithm	Ridge Regression (L2 regularization)
Training rows	9,056,473
Test rows	2,264,119
Normalization	StandardScaler (Z-score normalization)
Regularization	Ridge (α=1.0)
R² Score	0.826
Best predictor	Trip distance ($2.81 per mile)
Known limitation	Flat-rate airport fares (~$70) are poorly predicted

Potential Improvements

Add PULocationID / DOLocationID as categorical features to capture zone-based pricing
Use a non-linear model (e.g., Random Forest, Gradient Boosting) to handle flat-rate fares
Engineer an is_airport binary feature for JFK/LaGuardia/Newark trips
Experiment with different regularization strengths (alpha values) using cross-validation
Try Lasso (L1) regularization for automatic feature selection
Use ElasticNet (combination of L1 and L2) for balanced regularization
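As one example of the improvements listed above, an is_airport flag could be derived from the pickup and drop-off zone IDs. This is only a sketch: the zone IDs below are assumed from the TLC taxi zone lookup table and should be verified against it, the helper is_airport_trip is hypothetical, and it assumes the PULocationID and DOLocationID columns are still present in the cleaned dataset.

```python
import numpy as np

# Assumed TLC taxi zone IDs (verify against the official taxi zone lookup table):
# 1 = Newark Airport, 132 = JFK Airport, 138 = LaGuardia Airport
AIRPORT_ZONE_IDS = [1, 132, 138]

def is_airport_trip(pickup_zone_ids, dropoff_zone_ids):
    """Return 1 where either end of the trip is an airport zone, else 0."""
    pickup = np.isin(pickup_zone_ids, AIRPORT_ZONE_IDS)
    dropoff = np.isin(dropoff_zone_ids, AIRPORT_ZONE_IDS)
    return (pickup | dropoff).astype(int)

# Made-up zone IDs for four trips
pu = np.array([132, 48, 237, 138])
do = np.array([230, 68, 1, 138])
print(is_airport_trip(pu, do))  # → [1 0 1 1]
```

On a DataFrame, the same flag could be built with the pandas isin method on the two zone columns and then appended to the feature list before the train-test split.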
