MLSecOps Lab: Boeing x WIC x UW - Unsupervised Learning

Objective

In this lab, we apply unsupervised learning to the cleaned NYC taxi dataset. By the end, you will be able to:

Use K-Means clustering to identify distinct taxi trip patterns
Apply the Elbow Method to determine the optimal number of clusters
Interpret and visualize cluster characteristics

Prerequisites

This lab requires the cleaned dataset from the Data Processing lab. Make sure yellow_tripdata_cleaned.parquet from Part B is saved in the data_cleaned/ directory, which is where the loading code in the next section expects it.

Load Cleaned Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

sns.set_style("whitegrid")

df = pd.read_parquet("data_cleaned/yellow_tripdata_cleaned.parquet")
print("Shape:", df.shape)

Feature Selection and Normalization

We select features that capture trip behavior: where, when, how far, and how much. We sample 100,000 rows for performance and apply normalization to ensure all features contribute equally to the clustering algorithm.

from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler

# Select features for clustering
cluster_features = ['PULocationID', 'pickup_hour', 'trip_distance', 'fare_amount']

# Sample for performance
df_sample = df[cluster_features].sample(n=100000, random_state=42)

# Normalize features using StandardScaler (Z-score normalization)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_sample)

print("Sample shape:", df_sample.shape)
print("Scaled data ready for clustering")

# Verify normalization: mean should be ~0, std should be ~1
print("\nNormalization verification:")
print(f"Mean of scaled features: {X_scaled.mean(axis=0).round(4)}")
print(f"Std of scaled features: {X_scaled.std(axis=0).round(4)}")

Why Normalize?

K-Means uses Euclidean distance to assign points to clusters. Without normalization, features with large ranges (like PULocationID: 1–265) would dominate the distance calculation, and features with small ranges (like pickup_hour: 0–23) would have little influence on the cluster assignments.

StandardScaler (Z-score normalization) transforms each feature to have:
- Mean (μ) = 0
- Standard deviation (σ) = 1

Formula: z = (x - μ) / σ

This ensures all features contribute equally to the distance calculations, preventing bias toward high-magnitude features.
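To see the effect concretely, the sketch below compares Euclidean distances before and after z-scoring for three made-up trips. The values and the helper names (zscore, dist) are illustrative assumptions, not drawn from the dataset; the z-score uses the population standard deviation, which is what StandardScaler computes.

```python
import numpy as np

# Hypothetical trips: columns are PULocationID, pickup_hour (illustrative values only)
trips = np.array([
    [10.0,  8.0],   # trip A
    [12.0, 20.0],   # trip B: similar zone ID, very different hour
    [60.0,  8.0],   # trip C: same hour, different zone ID
])

def zscore(x):
    """Column-wise z-score using the population std (ddof=0), as StandardScaler does."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

raw_ab, raw_ac = dist(trips[0], trips[1]), dist(trips[0], trips[2])
scaled = zscore(trips)
z_ab, z_ac = dist(scaled[0], scaled[1]), dist(scaled[0], scaled[2])

print(f"raw:    d(A,B)={raw_ab:.2f}  d(A,C)={raw_ac:.2f}")
print(f"scaled: d(A,B)={z_ab:.2f}  d(A,C)={z_ac:.2f}")
```

On the raw values, the 12-hour gap between A and B counts for far less than the zone-ID gap between A and C. After scaling, the two distances are roughly equal, so neither feature dominates.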

Why Sample?

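Running K-Means on 11M+ rows is computationally expensive. A random sample of 100,000 rows is large enough to capture the dominant patterns while keeping runtime manageable. We also use MiniBatchKMeans, which updates centroids from small random batches instead of the full data on every iteration, for additional speed.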

Regularization in Clustering

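While regularization is more commonly associated with supervised learning, the same goal applies to clustering: limit model complexity so that clusters reflect genuine structure rather than noise. K-Means has no explicit penalty term, so complexity is controlled through the design choices listed here.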

Regularization Techniques Applied

1. Sample Size Regularization
- We use 100,000 samples instead of the full 11M+ dataset
- This mainly controls compute cost; a smaller sample does not by itself reduce overfitting
- Clusters that reappear across different random samples are more likely to represent genuine patterns than random variations

2. Feature Selection
- We carefully select 4 meaningful features instead of using all available columns
- This keeps dimensionality low and mitigates the “curse of dimensionality”
- Focuses the algorithm on relevant patterns

3. MiniBatchKMeans Parameters

kmeans = MiniBatchKMeans(
    n_clusters=4,
    random_state=42,      # Ensures reproducibility
    batch_size=10000,     # Points used per mini-batch centroid update
    max_iter=100,         # Upper bound on passes over the data
    n_init=3              # Initializations tried (default was 3 before scikit-learn 1.4, now 'auto')
)

4. Elbow Method for K Selection
- Prevents overfitting by choosing the optimal number of clusters
- Balances model complexity (number of clusters) with performance (inertia)
- Avoids creating too many clusters that fit noise rather than patterns

Regularization vs Overfitting

In clustering, overfitting occurs when:
- Too many clusters are created (high K)
- Clusters fit noise rather than genuine patterns
- The model doesn’t generalize to new data

Our regularization approach is intended to keep clusters:
- Meaningful and interpretable
- Generalizable to the full dataset
- Robust to random variations
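Robustness to random variations can be checked rather than assumed. The sketch below fits MiniBatchKMeans on two different random subsamples and compares the resulting assignments with the adjusted Rand index (1.0 means identical partitions). It runs on synthetic blobs as a stand-in for X_scaled; the data, the helper fit_on_sample, and the sample sizes are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Synthetic stand-in for X_scaled: three well-separated 2-D blobs (assumption)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(2000, 2)) for c in centers])

def fit_on_sample(X, seed, n_sample=1500, k=3):
    """Fit MiniBatchKMeans on a random subsample and return the fitted model."""
    idx = np.random.default_rng(seed).choice(len(X), size=n_sample, replace=False)
    model = MiniBatchKMeans(n_clusters=k, random_state=seed, batch_size=256, n_init=3)
    return model.fit(X[idx])

# Two models trained on different subsamples, then compared on the same points
labels_a = fit_on_sample(X, seed=1).predict(X)
labels_b = fit_on_sample(X, seed=2).predict(X)

# The adjusted Rand index ignores label numbering, so cluster IDs need not match
ari = adjusted_rand_score(labels_a, labels_b)
print(f"Adjusted Rand index between the two fits: {ari:.3f}")
```

A value close to 1 suggests the partition does not depend on which rows were sampled. Applying the same check to the taxi data would mean drawing two samples from df with different random_state values.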

Elbow Method — Finding Optimal K

The Elbow Method plots inertia (within-cluster sum of squares) against the number of clusters. The “elbow”, where the curve starts flattening, suggests a good choice of K: beyond it, each extra cluster reduces inertia only slightly.
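As a reminder of what inertia measures, here is a minimal sketch that computes it by hand for six made-up points: the sum of squared distances from each point to the mean of its cluster. The data and the helper name inertia are illustrative assumptions. scikit-learn reports the same kind of quantity as inertia_, measured against its fitted centroids.

```python
import numpy as np

# Six illustrative 2-D points forming two obvious groups (made-up data)
X = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 0.0],
              [10.0, 10.0], [10.0, 12.0], [12.0, 10.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

def inertia(X, labels):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        centroid = members.mean(axis=0)
        total += float(((members - centroid) ** 2).sum())
    return total

one_cluster = inertia(X, np.zeros(len(X), dtype=int))  # everything in one cluster
two_clusters = inertia(X, labels)                      # the natural split

print(f"K=1 inertia: {one_cluster:.2f}")
print(f"K=2 inertia: {two_clusters:.2f}")
```

Inertia always falls as K grows, reaching zero when every point is its own cluster, which is why the elbow is read from the shape of the curve and not from its minimum.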

# Elbow Method to find optimal K
inertias = []
K_range = range(2, 11)

for k in K_range:
    kmeans = MiniBatchKMeans(n_clusters=k, random_state=42, batch_size=10000)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)
    print(f"K={k}, Inertia={kmeans.inertia_:.0f}")

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(K_range, inertias, marker='o', color='steelblue')
ax.set_title('Elbow Method - Optimal Number of Clusters')
ax.set_xlabel('Number of Clusters (K)')
ax.set_ylabel('Inertia')
plt.tight_layout()
plt.show()

Observation

The elbow occurs around K=4, where the rate of inertia decrease slows significantly. Beyond K=4, adding more clusters yields diminishing returns.

K	Inertia
2	295,310
3	208,845
4	164,797
5	143,856
6	125,351
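One way to make the diminishing-returns claim concrete is to compute how much inertia each additional cluster removes. The sketch below uses only the inertia values reported in the table; the variable names are illustrative.

```python
# Inertia values copied from the table above
inertia = {2: 295_310, 3: 208_845, 4: 164_797, 5: 143_856, 6: 125_351}

ks = sorted(inertia)
# Absolute and relative reduction in inertia gained by moving from K-1 to K
drops = {k: inertia[k - 1] - inertia[k] for k in ks[1:]}
rel_drops = {k: drops[k] / inertia[k - 1] for k in drops}

for k in drops:
    print(f"K={k - 1} -> K={k}: drop={drops[k]:,} ({rel_drops[k]:.1%})")
```

Moving from K=3 to K=4 removes 44,048 units of inertia, while K=4 to K=5 removes 20,941 and K=5 to K=6 removes 18,505. The gain from a fifth cluster is less than half the gain from the fourth, which is consistent with reading the elbow at K=4.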

Fit K-Means (K=4)

# Fit K-Means with K=4
kmeans = MiniBatchKMeans(n_clusters=4, random_state=42, batch_size=10000)
df_sample['cluster'] = kmeans.fit_predict(X_scaled)

# Analyze cluster characteristics
print("=== Cluster Summary ===\n")
print(df_sample.groupby('cluster')[cluster_features].mean().round(2))
print("\n=== Cluster Sizes ===\n")
print(df_sample['cluster'].value_counts().sort_index())

Cluster Profiles

Cluster	Avg Location ID	Avg Pickup Hour	Avg Distance	Avg Fare	Size	Interpretation
0	102	8:00 AM	3.00 mi	$17.36	21,146	Morning commuters, mid-distance
1	136	7:00 PM	2.37 mi	$15.78	33,543	Evening rush, short urban trips
2	230	1:00 PM	2.05 mi	$14.54	35,498	Midday, short trips, high-numbered pickup zones
3	141	2:00 PM	14.52 mi	$61.40	9,813	Long-distance / airport trips

Key Insight

The algorithm clearly separates airport/long-distance trips (Cluster 3) from regular urban trips (Clusters 0, 1, 2). The urban clusters are further differentiated by time of day — morning commuters (Cluster 0), evening rush (Cluster 1), and midday rides (Cluster 2).
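The profiles above come from per-cluster means of the raw features. A complementary view is to take the fitted centroids, which live in z-score space, and map them back to original units with the scaler's inverse_transform. The sketch below does this on synthetic two-feature data standing in for trip_distance and fare_amount; the data and group sizes are made up for illustration.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in: columns are trip_distance (mi) and fare_amount ($).
# A large "short urban" group and a small "long-distance" group (made-up values).
short = rng.normal(loc=[2.0, 15.0], scale=[0.5, 2.0], size=(900, 2))
long_ = rng.normal(loc=[15.0, 60.0], scale=[1.5, 5.0], size=(100, 2))
X = np.vstack([short, long_])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

kmeans = MiniBatchKMeans(n_clusters=2, random_state=42, batch_size=256, n_init=3)
kmeans.fit(X_scaled)

# Centroids are in z-score space; map them back to miles and dollars
centroids = scaler.inverse_transform(kmeans.cluster_centers_)
centroids = centroids[np.argsort(centroids[:, 0])]  # order by distance for readability

for dist_mi, fare in centroids:
    print(f"centroid: {dist_mi:5.2f} mi, ${fare:6.2f}")
```

Because MiniBatchKMeans updates centroids incrementally, they will be close to, but not exactly equal to, the per-cluster means that a groupby on the labels would give.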

Visualize Clusters

# Visualize clusters
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Scatter: pickup hour vs fare, colored by cluster
colors = ['#2196F3', '#FF9800', '#4CAF50', '#E91E63']
for c in range(4):
    mask = df_sample['cluster'] == c
    axes[0].scatter(df_sample.loc[mask, 'pickup_hour'],
                    df_sample.loc[mask, 'fare_amount'],
                    alpha=0.3, s=5, c=colors[c], label=f'Cluster {c}')
axes[0].set_xlabel('Pickup Hour')
axes[0].set_ylabel('Fare Amount ($)')
axes[0].set_title('Clusters: Pickup Hour vs Fare')
axes[0].legend()

# Bar chart: average fare by cluster
cluster_stats = df_sample.groupby('cluster').agg(
    count=('fare_amount', 'size'),
    avg_fare=('fare_amount', 'mean'),
    avg_distance=('trip_distance', 'mean')
).round(2)

cluster_stats['avg_fare'].plot(kind='bar', ax=axes[1], color=colors)
axes[1].set_title('Average Fare by Cluster')
axes[1].set_xlabel('Cluster')
axes[1].set_ylabel('Average Fare ($)')
axes[1].set_xticklabels([f'C{i}' for i in range(4)], rotation=0)

plt.tight_layout()
plt.show()

Observations

Left plot: Cluster 3 (pink) clearly stands out with high fares across all hours — these are the long-distance/airport trips. Clusters 0, 1, 2 overlap in fare range but separate by time of day.
Right plot: Cluster 3 has an average fare of ~$61, roughly 4x the average of the other clusters ($14–$17).

Summary

Aspect	Detail
Algorithm	MiniBatchKMeans (scikit-learn)
Sample size	100,000 rows
Features	PULocationID, pickup_hour, trip_distance, fare_amount
Normalization	StandardScaler (Z-score normalization)
Regularization	Sample size control, feature selection, optimal K selection
Optimal K	4 (Elbow Method)
Key finding	Clear separation between airport/long-distance trips and urban trips

Potential Improvements

Add DOLocationID to capture destination-based patterns
Use pickup_dayofweek to differentiate weekday vs weekend clusters
Try DBSCAN for density-based clustering that doesn’t require specifying K
Apply clustering results as a new feature in the supervised model to improve fare prediction
Experiment with different normalization techniques (MinMaxScaler, RobustScaler)
Use silhouette score as an additional metric for cluster quality
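As a starting point for the silhouette suggestion, the sketch below scores several values of K on synthetic blobs standing in for X_scaled (the data is an assumption; results on the taxi sample may look quite different). The silhouette computation is quadratic in the number of points, so the sample_size argument of silhouette_score is used to keep it tractable.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(7)

# Synthetic stand-in for X_scaled: four well-separated 2-D blobs (assumption)
centers = np.array([[0, 0], [8, 0], [0, 8], [8, 8]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.6, size=(500, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    labels = MiniBatchKMeans(
        n_clusters=k, random_state=42, batch_size=256, n_init=3
    ).fit_predict(X)
    # Score a random subset of points to avoid the full O(n^2) distance computation
    scores[k] = silhouette_score(X, labels, sample_size=1000, random_state=42)
    print(f"K={k}: silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("Highest silhouette at K =", best_k)
```

Unlike inertia, the silhouette score does not improve automatically as K grows, so it can be compared directly across values of K and used alongside the elbow plot.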

Knowledge Check: Unsupervised Learning

1. What is the key difference between supervised and unsupervised learning?

- Supervised learning uses Python; unsupervised uses R
- Supervised learning uses labeled data with a target variable; unsupervised finds patterns without labels
- Unsupervised learning is always faster
- Supervised learning only works on images

2. What algorithm is used in this lab for clustering NYC taxi trips?

- K-Nearest Neighbors (KNN)
- Linear Regression
- MiniBatchKMeans
- Random Forest

3. What method is used to determine the optimal number of clusters (K)?

- Trial and error
- The Elbow Method (plotting inertia vs K)
- The R² score
- Cross-validation

4. What was the key finding of the clustering analysis in this lab?

- All taxi trips are identical
- The model failed to find any clusters
- Clear separation between airport/long-distance trips and urban trips
- Passenger count is the most important feature