Unsupervised Learning
K-Means clustering for taxi demand patterns
Objective
In this lab, we apply unsupervised learning to the cleaned NYC taxi dataset. By the end, you will be able to:
- Use K-Means clustering to identify distinct taxi trip patterns
- Apply the Elbow Method to determine the optimal number of clusters
- Interpret and visualize cluster characteristics
This lab requires the cleaned dataset from the Data Processing lab. Make sure you have yellow_tripdata_cleaned.parquet saved from Part B.
Load Cleaned Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
sns.set_style("whitegrid")
df = pd.read_parquet("data_cleaned/yellow_tripdata_cleaned.parquet")
print("Shape:", df.shape)
Feature Selection and Normalization
We select features that capture trip behavior: where, when, how far, and how much. We sample 100,000 rows for performance and apply normalization to ensure all features contribute equally to the clustering algorithm.
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler
# Select features for clustering
cluster_features = ['PULocationID', 'pickup_hour', 'trip_distance', 'fare_amount']
# Sample for performance
df_sample = df[cluster_features].sample(n=100000, random_state=42).copy()  # explicit copy: a 'cluster' column is added later
# Normalize features using StandardScaler (Z-score normalization)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_sample)
print("Sample shape:", df_sample.shape)
print("Scaled data ready for clustering")
# Verify normalization: mean should be ~0, std should be ~1
print("\nNormalization verification:")
print(f"Mean of scaled features: {X_scaled.mean(axis=0).round(4)}")
print(f"Std of scaled features: {X_scaled.std(axis=0).round(4)}")
K-Means uses Euclidean distance to assign points to clusters. Without normalization, features with large ranges (like PULocationID: 1–265) would dominate features with small ranges (like pickup_hour: 0–23).
StandardScaler (Z-score normalization) transforms each feature to have:
- Mean (μ) = 0
- Standard deviation (σ) = 1
Formula: z = (x - μ) / σ
This ensures all features contribute equally to the distance calculations, preventing bias toward high-magnitude features.
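To see the effect of the formula on distances, here is a minimal sketch using NumPy only. The four rows are made-up values for two features (a zone ID and a pickup hour), not rows from the taxi dataset, and the helper `euclidean` is a hypothetical name introduced for illustration. The population standard deviation (`ddof=0`, NumPy's default) is used, which matches what StandardScaler computes.

```python
import numpy as np

# Hypothetical toy rows: [PULocationID, pickup_hour]. Illustrative values only.
X = np.array([
    [10.0, 8.0],
    [250.0, 9.0],
    [20.0, 20.0],
    [240.0, 21.0],
])

# Z-score each column: z = (x - mu) / sigma, using the population std (ddof=0).
mu = X.mean(axis=0)
sigma = X.std(axis=0)
Z = (X - mu) / sigma


def euclidean(a, b):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


# Rows 0 and 1: almost the same hour, very different zone IDs.
# Rows 0 and 2: almost the same zone ID, very different hours.
raw_same_hour = euclidean(X[0], X[1])
raw_same_zone = euclidean(X[0], X[2])
scaled_same_hour = euclidean(Z[0], Z[1])
scaled_same_zone = euclidean(Z[0], Z[2])

print(f"Raw:    same-hour pair {raw_same_hour:.2f}, same-zone pair {raw_same_zone:.2f}")
print(f"Scaled: same-hour pair {scaled_same_hour:.2f}, same-zone pair {scaled_same_zone:.2f}")
```

On the raw values the zone ID difference swamps the hour difference, so the two distances differ by more than a factor of ten. After scaling, both pairs are roughly the same distance apart, which is the behavior we want before running K-Means.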
Running K-Means on 11M+ rows is computationally expensive. A random sample of 100,000 rows captures the same patterns while keeping runtime manageable. We use MiniBatchKMeans for additional speed.
Regularization in Clustering
While regularization is more commonly associated with supervised learning, it plays an important role in unsupervised learning to prevent overfitting and improve cluster quality.
Regularization Techniques Applied
1. Sample Size Regularization
   - We use 100,000 samples instead of the full 11M+ dataset
   - This prevents overfitting to noise in the data
   - Ensures clusters represent genuine patterns, not random variations
2. Feature Selection
   - We carefully select 4 meaningful features instead of using all available columns
   - This reduces dimensionality and prevents the “curse of dimensionality”
   - Focuses the algorithm on relevant patterns
3. MiniBatchKMeans Parameters
kmeans = MiniBatchKMeans(
    n_clusters=4,
    random_state=42,   # Ensures reproducibility
    batch_size=10000,  # Each update uses a random mini-batch of 10,000 rows
    max_iter=100,      # Caps the number of passes over the data
    n_init=3           # Initializations tried (best by inertia is kept); set explicitly because the default depends on the scikit-learn version
)
4. Elbow Method for K Selection
   - Prevents overfitting by choosing the optimal number of clusters
   - Balances model complexity (number of clusters) with performance (inertia)
   - Avoids creating too many clusters that fit noise rather than patterns
In clustering, overfitting occurs when:
- Too many clusters are created (high K)
- Clusters fit noise rather than genuine patterns
- The model doesn’t generalize to new data
Our regularization approach ensures clusters are:
- Meaningful and interpretable
- Generalizable to the full dataset
- Robust to random variations
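One way to check the "generalizable to the full dataset" point is to fit on a sample and then reuse the same scaler and centroids to label every row. The sketch below does this on synthetic data, since the lab itself only clusters the sample. The DataFrame `full`, the sample size, and the two-cluster setup are assumptions made for illustration, not part of the lab's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for the full dataset (not real taxi data):
# many short trips plus a smaller group of long trips.
full = pd.DataFrame({
    "trip_distance": np.concatenate([rng.normal(2, 0.5, 9000), rng.normal(15, 2, 1000)]),
    "fare_amount": np.concatenate([rng.normal(15, 3, 9000), rng.normal(60, 8, 1000)]),
})

# Fit the scaler and the clustering model on a random sample only.
sample = full.sample(n=2000, random_state=42)
scaler = StandardScaler().fit(sample)
kmeans = MiniBatchKMeans(n_clusters=2, random_state=42, batch_size=500, n_init=3)
kmeans.fit(scaler.transform(sample))

# Reuse the fitted scaler and centroids to assign a cluster to every row.
full_labels = kmeans.predict(scaler.transform(full))

# Compare the share of rows per cluster in the sample and in the full data.
sample_share = np.bincount(kmeans.labels_, minlength=2) / len(sample)
full_share = np.bincount(full_labels, minlength=2) / len(full)
print("Cluster shares in sample:   ", sample_share.round(3))
print("Cluster shares in full data:", full_share.round(3))
```

If the cluster shares in the full data are close to those in the sample, that is some evidence the sample-based centroids describe the whole dataset. Large differences would suggest the sample missed a pattern.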
Elbow Method — Finding Optimal K
The Elbow Method plots inertia (within-cluster sum of squares) against the number of clusters. The “elbow” — where the curve starts flattening — indicates the optimal K.
# Elbow Method to find optimal K
inertias = []
K_range = range(2, 11)
for k in K_range:
    kmeans = MiniBatchKMeans(n_clusters=k, random_state=42, batch_size=10000)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)
    print(f"K={k}, Inertia={kmeans.inertia_:.0f}")
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(K_range, inertias, marker='o', color='steelblue')
ax.set_title('Elbow Method - Optimal Number of Clusters')
ax.set_xlabel('Number of Clusters (K)')
ax.set_ylabel('Inertia')
plt.tight_layout()
plt.show()
The elbow occurs around K=4, where the rate of inertia decrease slows significantly. Beyond K=4, adding more clusters yields diminishing returns.
| K | Inertia |
|---|---|
| 2 | 295,310 |
| 3 | 208,845 |
| 4 | 164,797 |
| 5 | 143,856 |
| 6 | 125,351 |
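The "diminishing returns" reading can be made concrete by computing the relative drop in inertia at each step. This is a small sketch using only the five values from the table above; the dictionary name `inertia` and the helper dictionary `rel_drop` are introduced here for illustration.

```python
# Inertia values copied from the table above.
inertia = {2: 295_310, 3: 208_845, 4: 164_797, 5: 143_856, 6: 125_351}

ks = sorted(inertia)

# Relative drop in inertia when moving from the previous K to K.
rel_drop = {
    k: (inertia[prev] - inertia[k]) / inertia[prev]
    for prev, k in zip(ks, ks[1:])
}

for k, drop in rel_drop.items():
    print(f"K={k - 1} -> K={k}: inertia falls by {drop:.1%}")
```

The drops are about 29% (K=2 to 3) and 21% (K=3 to 4), then about 13% for each of the next two steps. The improvement levels off after K=4, which is consistent with placing the elbow there.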
Fit K-Means (K=4)
# Fit K-Means with K=4
kmeans = MiniBatchKMeans(n_clusters=4, random_state=42, batch_size=10000)
df_sample['cluster'] = kmeans.fit_predict(X_scaled)
# Analyze cluster characteristics
print("=== Cluster Summary ===\n")
print(df_sample.groupby('cluster')[cluster_features].mean().round(2))
print("\n=== Cluster Sizes ===\n")
print(df_sample['cluster'].value_counts().sort_index())
Cluster Profiles
| Cluster | Avg Location ID | Avg Pickup Hour | Avg Distance | Avg Fare | Size | Interpretation |
|---|---|---|---|---|---|---|
| 0 | 102 | 8:00 AM | 3.00 mi | $17.36 | 21,146 | Morning commuters, mid-distance |
| 1 | 136 | 7:00 PM | 2.37 mi | $15.78 | 33,543 | Evening rush, short urban trips |
| 2 | 230 | 1:00 PM | 2.05 mi | $14.54 | 35,498 | Midday, short trips, outer zones |
| 3 | 141 | 2:00 PM | 14.52 mi | $61.40 | 9,813 | Long-distance / airport trips |
The algorithm clearly separates airport/long-distance trips (Cluster 3) from regular urban trips (Clusters 0, 1, 2). The urban clusters are further differentiated by time of day — morning commuters (Cluster 0), evening rush (Cluster 1), and midday rides (Cluster 2).
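The profiles above come from averaging the original features per cluster. A related option is to read the centroids directly: they live in z-score space, so they need to be mapped back with the scaler's `inverse_transform`. The sketch below shows this on synthetic data with full-batch `KMeans`, which is swapped in here because its centroids are exactly the per-cluster means. With MiniBatchKMeans the centroids only approximate those means. The DataFrame `toy` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (not real taxi trips): short trips and long trips.
toy = pd.DataFrame({
    "trip_distance": np.concatenate([rng.normal(2, 0.3, 300), rng.normal(15, 1.0, 100)]),
    "fare_amount": np.concatenate([rng.normal(15, 2, 300), rng.normal(60, 5, 100)]),
})

scaler = StandardScaler()
X = scaler.fit_transform(toy)
model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

# Centroids are in z-score units; map them back to miles and dollars.
centers = pd.DataFrame(
    scaler.inverse_transform(model.cluster_centers_),
    columns=toy.columns,
)
print("Centroids in original units:")
print(centers.round(2))

# Per-cluster means of the raw features, as in the groupby summary.
group_means = toy.groupby(model.labels_).mean()
print("\nPer-cluster means:")
print(group_means.round(2))
```

Both tables should agree for full-batch K-Means. In the lab's MiniBatchKMeans setting, the groupby summary is the more reliable source for the reported averages.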
Visualize Clusters
# Visualize clusters
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Scatter: pickup hour vs fare, colored by cluster
colors = ['#2196F3', '#FF9800', '#4CAF50', '#E91E63']
for c in range(4):
    mask = df_sample['cluster'] == c
    axes[0].scatter(df_sample.loc[mask, 'pickup_hour'],
                    df_sample.loc[mask, 'fare_amount'],
                    alpha=0.3, s=5, c=colors[c], label=f'Cluster {c}')
axes[0].set_xlabel('Pickup Hour')
axes[0].set_ylabel('Fare Amount ($)')
axes[0].set_title('Clusters: Pickup Hour vs Fare')
axes[0].legend()
# Bar chart: average fare by cluster
cluster_stats = df_sample.groupby('cluster').agg(
    count=('fare_amount', 'size'),
    avg_fare=('fare_amount', 'mean'),
    avg_distance=('trip_distance', 'mean')
).round(2)
cluster_stats['avg_fare'].plot(kind='bar', ax=axes[1], color=colors)
axes[1].set_title('Average Fare by Cluster')
axes[1].set_xlabel('Cluster')
axes[1].set_ylabel('Average Fare ($)')
axes[1].set_xticklabels([f'C{i}' for i in range(4)], rotation=0)
plt.tight_layout()
plt.show()
- Left plot: Cluster 3 (pink) clearly stands out with high fares across all hours; these are the long-distance/airport trips. Clusters 0, 1, 2 overlap in fare range but separate by time of day.
- Right plot: Cluster 3 has an average fare of ~$61, roughly 4x the average of the other clusters ($14–$17).
Summary
| Aspect | Detail |
|---|---|
| Algorithm | MiniBatchKMeans (scikit-learn) |
| Sample size | 100,000 rows |
| Features | PULocationID, pickup_hour, trip_distance, fare_amount |
| Normalization | StandardScaler (Z-score normalization) |
| Regularization | Sample size control, feature selection, optimal K selection |
| Optimal K | 4 (Elbow Method) |
| Key finding | Clear separation between airport/long-distance trips and urban trips |
- Add DOLocationID to capture destination-based patterns
- Use pickup_dayofweek to differentiate weekday vs weekend clusters
- Try DBSCAN for density-based clustering that doesn’t require specifying K
- Apply clustering results as a new feature in the supervised model to improve fare prediction
- Experiment with different normalization techniques (MinMaxScaler, RobustScaler)
- Use silhouette score as an additional metric for cluster quality
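As a starting point for the silhouette suggestion, here is a minimal sketch on synthetic data with three well-separated groups. It uses scikit-learn's `silhouette_score` with full-batch `KMeans`. The data, the range of K, and the variable names are assumptions for illustration, and the results say nothing about the taxi dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in data with three well-separated groups (not real taxi data).
blob_centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
X = np.vstack([c + rng.normal(0, 0.5, size=(200, 2)) for c in blob_centers])
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
    scores[k] = silhouette_score(X_scaled, labels)
    print(f"K={k}, silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("Best K by silhouette:", best_k)
```

Unlike inertia, the silhouette score does not keep improving as K grows, so it gives a second opinion on the elbow. Computing it on 100,000 rows is expensive because it relies on pairwise distances; the `sample_size` argument of `silhouette_score` can be used to evaluate it on a random subset.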