# Supervised Learning: Linear Regression for Fare Prediction

## Objective
In this lab, we apply supervised learning to the cleaned NYC taxi dataset. By the end, you will be able to:
- Train a Linear Regression model to predict taxi fare amounts
- Evaluate model performance using R², MAE, and RMSE
- Visualize prediction accuracy and residual distribution
> **Note: Prerequisites**
>
> This lab requires the cleaned dataset from the Data Processing lab. Make sure you have `yellow_tripdata_cleaned.parquet` saved from Part B.
## Load Cleaned Data

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

sns.set_style("whitegrid")

df = pd.read_parquet("data_cleaned/yellow_tripdata_cleaned.parquet")
print("Shape:", df.shape)
df.head()
```

Expected output: 11,320,592 rows × 14 columns
## Feature Selection and Train-Test Split
We select features that are known at the time of pickup and can reasonably predict the fare. We use an 80/20 train-test split.
```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

features = ['trip_distance', 'trip_duration_min', 'pickup_hour', 'passenger_count']
target = 'fare_amount'

X = df[features]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set: {X_train.shape[0]:,} rows")
print(f"Test set: {X_test.shape[0]:,} rows")
```

| Feature | Rationale |
|---|---|
| `trip_distance` | Primary driver of fare (0.90 correlation from EDA) |
| `trip_duration_min` | Longer trips cost more (0.79 correlation) |
| `pickup_hour` | Time-of-day pricing effects |
| `passenger_count` | Minor effect, included for completeness |
## Train and Evaluate
```python
# Train Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluation metrics
print("=== Linear Regression Results ===")
print(f"R² Score: {r2_score(y_test, y_pred):.4f}")
print(f"Mean Absolute Error: ${mean_absolute_error(y_test, y_pred):.2f}")
print(f"Root Mean Sq Error: ${np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")

print("\nFeature Coefficients:")
for feat, coef in zip(features, model.coef_):
    print(f"  {feat}: {coef:.4f}")
print(f"  Intercept: {model.intercept_:.4f}")
```

### Results

| Metric | Value |
|---|---|
| R² Score | 0.8260 |
| Mean Absolute Error | $7.27 |
| Root Mean Squared Error | $7.27 |
### Feature Coefficients

| Feature | Coefficient | Interpretation |
|---|---|---|
| `trip_distance` | 2.8104 | Each additional mile adds ~$2.81 to the fare |
| `trip_duration_min` | 0.2830 | Each additional minute adds ~$0.28 |
| `pickup_hour` | 0.0475 | Minimal time-of-day effect |
| `passenger_count` | 0.8558 | Slight increase per passenger |
| Intercept | 3.1394 | Base fare ~$3.14 |
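As a sanity check, the fitted equation can be applied by hand. The trip below is hypothetical; the coefficients and intercept are copied from the table above:

```python
# Coefficients and intercept copied from the fitted model above
coefs = {"trip_distance": 2.8104, "trip_duration_min": 0.2830,
         "pickup_hour": 0.0475, "passenger_count": 0.8558}
intercept = 3.1394

# Hypothetical trip: 5 miles, 20 minutes, 9 AM pickup, 1 passenger
trip = {"trip_distance": 5, "trip_duration_min": 20,
        "pickup_hour": 9, "passenger_count": 1}

fare = intercept + sum(coefs[f] * trip[f] for f in coefs)
print(f"Predicted fare: ${fare:.2f}")  # → $24.13
```

This matches what `model.predict` would return for the same inputs, since linear regression is just this weighted sum.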
> **Tip: Interpreting R²**
>
> An R² of 0.826 means the model explains ~83% of the variance in fare amounts. This is a strong result for a simple linear model, driven primarily by trip distance and duration.
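The metric itself is simple to compute by hand: R² = 1 − SS_res/SS_tot. A minimal sketch on made-up numbers (not the taxi data):

```python
import numpy as np

# R² = 1 - SS_res / SS_tot, on a tiny made-up example
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 31.0, 41.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained (residual) variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2 = 1 - ss_res / ss_tot
print(f"R²: {r2:.4f}")  # → 0.9800
```

This is exactly what `sklearn.metrics.r2_score` computes for us above.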
## Visualize Model Performance
```python
# Visualize actual vs predicted fares
sample_idx = X_test.sample(n=5000, random_state=42).index
y_test_sample = y_test.loc[sample_idx]
y_pred_sample = pd.Series(y_pred, index=y_test.index).loc[sample_idx]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Actual vs Predicted scatter
axes[0].scatter(y_test_sample, y_pred_sample, alpha=0.3, s=5)
axes[0].plot([0, 100], [0, 100], 'r--', label='Perfect prediction')
axes[0].set_xlim(0, 100)
axes[0].set_ylim(0, 100)
axes[0].set_xlabel('Actual Fare ($)')
axes[0].set_ylabel('Predicted Fare ($)')
axes[0].set_title('Actual vs Predicted Fare')
axes[0].legend()

# Residual distribution
residuals = y_test_sample - y_pred_sample
axes[1].hist(residuals, bins=50, edgecolor='black')
axes[1].set_xlabel('Residual ($)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Residual Distribution')
axes[1].axvline(x=0, color='r', linestyle='--')

plt.tight_layout()
plt.show()
```
> **Note: Observations**
>
> - **Actual vs Predicted:** Points cluster tightly along the red diagonal line, indicating good predictions. A vertical cluster at ~$70 actual fare shows the model struggles with flat-rate airport trips.
> - **Residuals:** Centered near zero with a slight right skew. Most predictions fall within ±$10 of the actual fare.
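One way to quantify the flat-rate issue is to compare the error on trips near the ~$70 fare against everything else. The arrays below are synthetic stand-ins built to mimic that failure mode; on the real data you would reuse `y_test` and `y_pred` directly:

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for y_test / y_pred (on the real data, reuse those arrays):
# 900 metered trips predicted well, 100 flat-rate $70 trips predicted poorly.
rng = np.random.default_rng(42)
actual = pd.Series(np.concatenate([rng.uniform(5, 50, 900), np.full(100, 70.0)]))
predicted = actual.where(actual < 60, 45.0) + rng.normal(0, 2, 1000)

near_flat = actual.between(65, 75)  # trips near the flat fare
mae_flat = (actual[near_flat] - predicted[near_flat]).abs().mean()
mae_rest = (actual[~near_flat] - predicted[~near_flat]).abs().mean()
print(f"MAE near $70 flat fare: ${mae_flat:.2f}")
print(f"MAE elsewhere:          ${mae_rest:.2f}")
```

A gap like this between the two groups confirms that the flat-fare cluster, not the metered trips, dominates the worst errors.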
## Summary
| Aspect | Detail |
|---|---|
| Algorithm | Linear Regression (scikit-learn) |
| Training rows | 9,056,473 |
| Test rows | 2,264,119 |
| R² Score | 0.826 |
| Best predictor | Trip distance ($2.81 per mile) |
| Known limitation | Flat-rate airport fares (~$70) are poorly predicted |
> **Tip: Potential Improvements**
>
> - Add `PULocationID`/`DOLocationID` as categorical features to capture zone-based pricing
> - Use a non-linear model (e.g., Random Forest, Gradient Boosting) to handle flat-rate fares
> - Engineer an `is_airport` binary feature for JFK/LaGuardia/Newark trips
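The `is_airport` idea can be sketched as follows. The zone IDs assume the standard TLC taxi zone lookup table (1 = Newark, 132 = JFK, 138 = LaGuardia), and the small DataFrame is a stand-in for the real trip data:

```python
import pandas as pd

# Airport zone IDs per the TLC taxi zone lookup table
# (1 = Newark, 132 = JFK, 138 = LaGuardia)
AIRPORT_ZONES = {1, 132, 138}

# Tiny stand-in for the real trip data
demo = pd.DataFrame({"PULocationID": [132, 48, 90],
                     "DOLocationID": [230, 90, 138]})

# Flag trips that start or end at any airport zone
demo["is_airport"] = (demo["PULocationID"].isin(AIRPORT_ZONES)
                      | demo["DOLocationID"].isin(AIRPORT_ZONES)).astype(int)
print(demo["is_airport"].tolist())  # → [1, 0, 1]
```

A tree-based model could split on this flag directly, which is one reason Random Forests tend to handle the flat-fare cluster better than a purely linear fit.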