Supervised Learning
Linear Regression for fare prediction
Objective
In this lab, we apply supervised learning to the cleaned NYC taxi dataset. By the end, you will be able to:
- Train a Linear Regression model to predict taxi fare amounts
- Evaluate model performance using R², MAE, and RMSE
- Visualize prediction accuracy and residual distribution
This lab requires the cleaned dataset from the Data Processing lab. Make sure you have yellow_tripdata_cleaned.parquet saved from Part B.
Load Cleaned Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
sns.set_style("whitegrid")
df = pd.read_parquet("data_cleaned/yellow_tripdata_cleaned.parquet")
print("Shape:", df.shape)
df.head()
Expected output: 11,320,592 rows × 14 columns
Feature Selection, Normalization, and Train-Test Split
We select features that are known (or can be estimated) at the time of pickup and can reasonably predict the fare. We use an 80/20 train-test split, then standardize the features so they're on the same scale, fitting the scaler on the training set only.
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
features = ['trip_distance', 'trip_duration_min', 'pickup_hour', 'passenger_count']
target = 'fare_amount'
X = df[features]
y = df[target]
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Normalize features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(f"Training set: {X_train.shape[0]:,} rows")
print(f"Test set: {X_test.shape[0]:,} rows")
print("\nNormalization verification (training set):")
print(f"Mean: {X_train_scaled.mean(axis=0).round(4)}")
print(f"Std: {X_train_scaled.std(axis=0).round(4)}")
| Feature | Rationale |
|---|---|
| trip_distance | Primary driver of fare (0.90 correlation from EDA) |
| trip_duration_min | Longer trips cost more (0.79 correlation) |
| pickup_hour | Time-of-day pricing effects |
| passenger_count | Minor effect, included for completeness |
While Linear Regression doesn't require normalization mathematically, scaling the features provides several benefits:
- Coefficient Interpretation: All coefficients are on the same scale, making it easier to compare feature importance
- Numerical Stability: Prevents issues with very large or very small numbers
- Regularization: Essential when using Ridge or Lasso regression (regularization penalties are scale-dependent)
- Convergence: Faster convergence if using gradient descent optimization
Formula: z = (x - μ) / σ, where μ is the mean and σ is the standard deviation
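The formula above can be checked by hand against StandardScaler. The sketch below uses a handful of made-up trip distances (not values from the taxi dataset) and assumes the population standard deviation (ddof=0), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy trip distances in miles (invented for illustration)
x = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

# Manual z-score: subtract the mean, divide by the population std (ddof=0)
mu = x.mean(axis=0)
sigma = x.std(axis=0)  # NumPy's default ddof=0 matches StandardScaler
z_manual = (x - mu) / sigma

# StandardScaler learns the same mean and scale during fit
scaler = StandardScaler().fit(x)
z_sklearn = scaler.transform(x)

print("mean:", scaler.mean_, "scale:", scaler.scale_)
print("max abs difference:", np.abs(z_manual - z_sklearn).max())
```

After scaling, the column has mean 0 and standard deviation 1, which is what the normalization verification printout in the lab code checks on the training set.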
Train and Evaluate (Without Regularization)
First, let's train a standard Linear Regression model without regularization:
# Train Linear Regression model (no regularization)
model = LinearRegression()
model.fit(X_train_scaled, y_train)
# Predict on test set
y_pred = model.predict(X_test_scaled)
# Evaluation metrics
print("=== Linear Regression Results (No Regularization) ===")
print(f"R² Score: {r2_score(y_test, y_pred):.4f}")
print(f"Mean Absolute Error: ${mean_absolute_error(y_test, y_pred):.2f}")
print(f"Root Mean Sq Error: ${np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")
print("\nFeature Coefficients (Normalized Scale):")
for feat, coef in zip(features, model.coef_):
    print(f" {feat}: {coef:.4f}")
print(f" Intercept: {model.intercept_:.4f}")
Train with Regularization (Ridge Regression)
Now let's apply Ridge Regression (L2 regularization) to prevent overfitting:
# Train Ridge Regression model (L2 regularization)
ridge_model = Ridge(alpha=1.0, random_state=42)
ridge_model.fit(X_train_scaled, y_train)
# Predict on test set
y_pred_ridge = ridge_model.predict(X_test_scaled)
# Evaluation metrics
print("=== Ridge Regression Results (With Regularization) ===")
print(f"R² Score: {r2_score(y_test, y_pred_ridge):.4f}")
print(f"Mean Absolute Error: ${mean_absolute_error(y_test, y_pred_ridge):.2f}")
print(f"Root Mean Sq Error: ${np.sqrt(mean_squared_error(y_test, y_pred_ridge)):.2f}")
print("\nFeature Coefficients (Normalized Scale):")
for feat, coef in zip(features, ridge_model.coef_):
    print(f" {feat}: {coef:.4f}")
print(f" Intercept: {ridge_model.intercept_:.4f}")
# Compare coefficient magnitudes
print("\n=== Coefficient Comparison ===")
coef_comparison = pd.DataFrame({
    'Feature': features,
    'Linear Regression': model.coef_,
    'Ridge Regression': ridge_model.coef_,
    'Difference': np.abs(model.coef_ - ridge_model.coef_)
})
print(coef_comparison.round(4))
Ridge Regression (L2) adds a penalty term to the loss function:
Loss = MSE + α × Σ(coefficients²)
Where:
- α (alpha) is the regularization strength (higher = more penalty)
- The penalty shrinks coefficients toward zero, preventing overfitting
- Unlike Lasso (L1), Ridge doesn't eliminate features entirely
Benefits:
- Reduces overfitting by penalizing large coefficients
- Handles multicollinearity better than standard Linear Regression
- More stable predictions on new data
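To make the penalty concrete, here is a minimal sketch comparing a closed-form ridge solution with scikit-learn's Ridge on synthetic data. It assumes the objective in scikit-learn's documentation, the sum of squared residuals plus α times the sum of squared coefficients, which is the same idea as the loss above with a rescaled α. The data, true coefficients, and alpha value are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem (not the taxi data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=200)

alpha = 10.0

# Center X and y so the intercept is handled separately and is not penalized
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Closed-form ridge solution: w = (X^T X + alpha * I)^-1 X^T y
w_closed = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
b_closed = y.mean() - X.mean(axis=0) @ w_closed

# Ordinary least squares on the same centered data, for comparison
w_ols = np.linalg.lstsq(Xc, yc, rcond=None)[0]

ridge = Ridge(alpha=alpha).fit(X, y)

print("closed form :", w_closed.round(4), round(float(b_closed), 4))
print("sklearn     :", ridge.coef_.round(4), round(float(ridge.intercept_), 4))
print("OLS norm    :", np.linalg.norm(w_ols).round(4))
print("ridge norm  :", np.linalg.norm(w_closed).round(4))
```

The ridge coefficient vector has a smaller norm than the least-squares one, which is the shrinkage toward zero described above.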
Model Comparison
Results Comparison
| Metric | Linear Regression | Ridge Regression |
|---|---|---|
| R² Score | 0.8260 | 0.8260 |
| Mean Absolute Error | $7.27 | $7.27 |
| Root Mean Squared Error | $7.27 | $7.27 |
With this large dataset (9M+ training samples) and only 4 features, both models perform similarly because:
- The dataset is large enough that overfitting is minimal
- The features are not highly correlated (no severe multicollinearity)
- The relationship is relatively linear
Regularization becomes more important with:
- Smaller datasets
- Many features (high dimensionality)
- Highly correlated features
- Non-linear relationships modeled with many engineered terms (e.g., polynomial features)
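One way to see the effect of dataset size is to measure how far the Ridge coefficients move away from the plain Linear Regression coefficients as the number of rows grows. The sketch below uses synthetic, uncorrelated features; the helper name coef_gap, the true coefficients, and the noise level are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

def coef_gap(n_rows, alpha=1.0, seed=0):
    """Largest absolute difference between OLS and Ridge coefficients."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, 4))  # 4 uncorrelated synthetic features
    y = X @ np.array([10.0, 3.0, 0.5, 0.2]) + rng.normal(scale=2.0, size=n_rows)
    ols = LinearRegression().fit(X, y)
    ridge = Ridge(alpha=alpha).fit(X, y)
    return float(np.abs(ols.coef_ - ridge.coef_).max())

# Same alpha, increasing amounts of data
gaps = {n: coef_gap(n) for n in (50, 5_000, 500_000)}
for n, gap in gaps.items():
    print(f"n = {n:>7,}: max |OLS - Ridge| coefficient gap = {gap:.6f}")
```

With a fixed alpha the gap shrinks as rows are added, which is consistent with the two models producing near-identical results on millions of trips.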
Feature Coefficients (Original Scale)
To interpret coefficients in terms of the original features, we need to reverse the normalization:
# Convert normalized coefficients back to original scale
original_coefs = ridge_model.coef_ / scaler.scale_
original_intercept = ridge_model.intercept_ - np.sum(ridge_model.coef_ * scaler.mean_ / scaler.scale_)
print("=== Coefficients in Original Scale ===")
for feat, coef in zip(features, original_coefs):
    print(f" {feat}: {coef:.4f}")
print(f" Intercept: {original_intercept:.4f}")
| Feature | Coefficient | Interpretation |
|---|---|---|
| trip_distance | 2.8104 | Each additional mile adds ~$2.81 to fare |
| trip_duration_min | 0.2830 | Each additional minute adds ~$0.28 |
| pickup_hour | 0.0475 | Minimal time-of-day effect |
| passenger_count | 0.8558 | Slight increase per passenger |
| Intercept | 3.1394 | Base fare ~$3.14 |
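As a rough illustration of how to read these coefficients, the sketch below computes a prediction by hand using the values from the table. The trip itself (5 miles, 20 minutes, 2 pm pickup, one passenger) and the helper name predict_fare are hypothetical.

```python
# Coefficients copied from the table above (original, unscaled units)
COEFS = {
    "trip_distance": 2.8104,      # dollars per mile
    "trip_duration_min": 0.2830,  # dollars per minute
    "pickup_hour": 0.0475,        # dollars per hour-of-day unit
    "passenger_count": 0.8558,    # dollars per passenger
}
INTERCEPT = 3.1394

def predict_fare(trip):
    """Linear prediction: intercept + sum(coefficient * feature value)."""
    return INTERCEPT + sum(COEFS[name] * trip[name] for name in COEFS)

# Hypothetical trip, invented for illustration
trip = {
    "trip_distance": 5.0,
    "trip_duration_min": 20.0,
    "pickup_hour": 14,
    "passenger_count": 1,
}
fare = predict_fare(trip)
print(f"Predicted fare: ${fare:.2f}")  # prints: Predicted fare: $24.37
```

Distance contributes about $14.05 and duration about $5.66 of that total, which matches the observation that these two features drive the prediction.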
An R² of 0.826 means the model explains ~83% of the variance in fare amounts. This is a strong result for a simple linear model, driven primarily by trip distance and duration.
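For reference, the three metrics can be computed directly from their definitions. This is a small sketch on four made-up fares (not results from the lab), cross-checked against the scikit-learn functions used earlier.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted fares, for illustration only
y_true = np.array([10.0, 20.0, 30.0, 70.0])
y_hat = np.array([12.0, 18.0, 33.0, 55.0])

errors = y_true - y_hat

# MAE: average size of the error, in dollars
mae = np.mean(np.abs(errors))

# RMSE: square root of the average squared error; large misses weigh more
rmse = np.sqrt(np.mean(errors ** 2))

# R²: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE:  {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R²:   {r2:.4f}")
```

RMSE is never smaller than MAE, and the two coincide only when every absolute error is the same size, so a large gap between them points to a few big misses.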
Visualize Model Performance
# Use Ridge model predictions for visualization
sample_idx = X_test.sample(n=5000, random_state=42).index
y_test_sample = y_test.loc[sample_idx]
y_pred_sample = pd.Series(y_pred_ridge, index=y_test.index).loc[sample_idx]
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Actual vs Predicted scatter
axes[0].scatter(y_test_sample, y_pred_sample, alpha=0.3, s=5)
axes[0].plot([0, 100], [0, 100], 'r--', label='Perfect prediction')
axes[0].set_xlim(0, 100)
axes[0].set_ylim(0, 100)
axes[0].set_xlabel('Actual Fare ($)')
axes[0].set_ylabel('Predicted Fare ($)')
axes[0].set_title('Actual vs Predicted Fare (Ridge Regression)')
axes[0].legend()
# Residual distribution
residuals = y_test_sample - y_pred_sample
axes[1].hist(residuals, bins=50, edgecolor='black')
axes[1].set_xlabel('Residual ($)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Residual Distribution')
axes[1].axvline(x=0, color='r', linestyle='--')
plt.tight_layout()
plt.show()
- Actual vs Predicted: Points cluster tightly along the red diagonal line, indicating good predictions. A vertical cluster at ~$70 actual fare shows the model struggles with flat-rate airport trips.
- Residuals: Centered near zero with a slight right skew. Most predictions are within ±$10 of the actual fare.
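The visual impressions above can be backed with a few numbers. The sketch below computes the residual mean, the share of predictions within ±$10, and the sample skewness. The arrays actual and predicted are synthetic stand-ins for y_test_sample and y_pred_sample, so the printed values say nothing about the real model.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-ins for y_test_sample and y_pred_sample
actual = rng.gamma(shape=2.0, scale=10.0, size=5000)
predicted = actual + rng.normal(loc=0.0, scale=5.0, size=5000)

residuals = actual - predicted

# Share of predictions within $10 of the actual fare
within_10 = np.mean(np.abs(residuals) <= 10)

# Sample skewness: positive values indicate a longer right tail
centered = residuals - residuals.mean()
skewness = np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

print(f"Mean residual:   {residuals.mean():.3f}")
print(f"Within +/- $10:  {within_10:.1%}")
print(f"Skewness:        {skewness:.3f}")
```

In the lab notebook, the same three lines could be applied to the residuals variable from the plotting code.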
Summary
| Aspect | Detail |
|---|---|
| Algorithm | Ridge Regression (L2 regularization) |
| Training rows | 9,056,473 |
| Test rows | 2,264,119 |
| Normalization | StandardScaler (Z-score normalization) |
| Regularization | Ridge (α=1.0) |
| R² Score | 0.826 |
| Best predictor | Trip distance ($2.81 per mile) |
| Known limitation | Flat-rate airport fares (~$70) are poorly predicted |
- Add PULocationID/DOLocationID as categorical features to capture zone-based pricing
- Use a non-linear model (e.g., Random Forest, Gradient Boosting) to handle flat-rate fares
- Engineer an is_airport binary feature for JFK/LaGuardia/Newark trips
- Experiment with different regularization strengths (alpha values) using cross-validation
- Try Lasso (L1) regularization for automatic feature selection
- Use ElasticNet (combination of L1 and L2) for balanced regularization
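Two of these ideas, the is_airport feature and cross-validated alpha selection, could be sketched as follows. This is not part of the lab's results: the data is synthetic, the helper add_is_airport is hypothetical, and the zone IDs (1 for Newark, 132 for JFK, 138 for LaGuardia) are an assumption that should be checked against the TLC taxi zone lookup table. It also assumes the cleaned dataset still contains PULocationID and DOLocationID.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed TLC zone IDs for Newark (1), JFK (132), LaGuardia (138)
AIRPORT_ZONE_IDS = {1, 132, 138}

def add_is_airport(frame):
    """Return a copy with is_airport = 1 if pickup or drop-off is an airport zone."""
    out = frame.copy()
    out["is_airport"] = (
        out["PULocationID"].isin(AIRPORT_ZONE_IDS)
        | out["DOLocationID"].isin(AIRPORT_ZONE_IDS)
    ).astype(int)
    return out

# Small synthetic frame standing in for the cleaned taxi data
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "trip_distance": rng.uniform(0.5, 20, n),
    "trip_duration_min": rng.uniform(3, 60, n),
    "PULocationID": rng.choice([132, 138, 161, 236, 237], n),
    "DOLocationID": rng.choice([1, 48, 79, 142, 230], n),
})
toy = add_is_airport(toy)
toy["fare_amount"] = (
    3 + 2.8 * toy["trip_distance"] + 0.3 * toy["trip_duration_min"]
    + 10 * toy["is_airport"] + rng.normal(0, 2, n)
)

# Scale inside the pipeline, then let RidgeCV pick alpha by 5-fold CV
features = ["trip_distance", "trip_duration_min", "is_airport"]
alphas = np.logspace(-3, 3, 13)
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
model.fit(toy[features], toy["fare_amount"])

best_alpha = model.named_steps["ridgecv"].alpha_
print("Selected alpha:", best_alpha)
```

Putting the scaler in the same pipeline as the model keeps the scaling parameters tied to whatever data the final model is fit on.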