Clustering Methods
This notebook walks through the clustering methods and key configuration options available in tsam.
Available Methods
| Method | Description | Best For |
|---|---|---|
| hierarchical | Agglomerative hierarchical clustering | General purpose, recommended default |
| kmeans | K-means with centroids | Fast clustering, large datasets |
| kmedoids | K-medoids (MILP exact) | Optimal solution, smaller datasets (slow) |
| kmaxoids | Selects most dissimilar periods | Capturing extremes |
| contiguous | Hierarchical with temporal constraint | Storage modeling, seasonal patterns |
| averaging | Sequential period averaging | Simple baseline |
Tip: For medoid-based clustering on large datasets, use hierarchical with representation="medoid" instead of kmedoids.
Key Configuration Options
| Option | Description |
|---|---|
| weights | Per-column importance weights (top-level parameter of aggregate()) |
| representation | How to represent cluster centers (mean, medoid, maxoid, distribution, distribution_minmax) |
| normalize_column_means | Normalize columns to same mean before clustering |
| use_duration_curves | Match by value distribution rather than timing |
Setup
%load_ext autoreload
%autoreload 2
import pandas as pd
import plotly.express as px
import plotly.io as pio
import tsam
from tsam import ClusterConfig
pio.renderers.default = "notebook_connected"
import warnings
# Added to every example notebook: silence the v3 column-order
# FutureWarning in the rendered docs (tsam v4 returns result columns in
# input order; see migration guide).
warnings.filterwarnings(
    "ignore", category=FutureWarning, message=".*sorted alphabetically.*"
)
Input Data
The test dataset contains hourly time series for one year with four columns:
- GHI: Global Horizontal Irradiance (solar)
- T: Temperature
- Wind: Wind speed
- Load: Electrical load
raw = pd.read_csv("testdata.csv", index_col=0)
print(f"Shape: {raw.shape} ({raw.shape[0]} hours = {raw.shape[0] // 24} days)")
raw.head()
Shape: (8760, 4) (8760 hours = 365 days)
| | GHI | T | Wind | Load |
|---|---|---|---|---|
| 2009-12-31 23:30:00 | 0 | -2.1 | 7.1 | 375.478394 |
| 2010-01-01 00:30:00 | 0 | -2.8 | 8.6 | 364.541326 |
| 2010-01-01 01:30:00 | 0 | -3.3 | 9.7 | 357.416844 |
| 2010-01-01 02:30:00 | 0 | -3.2 | 9.8 | 350.191306 |
| 2010-01-01 03:30:00 | 0 | -3.2 | 9.4 | 345.161449 |
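With period_duration=24, the 8760 hourly rows are treated as 365 candidate periods of 24 time steps each. A minimal pure-Python sketch of that slicing follows. It assumes the series length is an exact multiple of the period length, and `split_into_periods` is a hypothetical helper for illustration, not a tsam function.

```python
def split_into_periods(values, period_length):
    """Cut a flat sequence into consecutive, equally long periods."""
    if len(values) % period_length != 0:
        raise ValueError("series length must be a multiple of period_length")
    return [
        values[start:start + period_length]
        for start in range(0, len(values), period_length)
    ]


# Synthetic stand-in for one hourly column (not the notebook's test data).
hourly = [float(h % 24) for h in range(8760)]
periods = split_into_periods(hourly, period_length=24)
print(len(periods), len(periods[0]))  # prints "365 24"
```

Each of these periods is one candidate that the clustering methods below assign to a cluster.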
1. Hierarchical Clustering (Recommended Default)
Agglomerative hierarchical clustering builds a tree of clusters and cuts it at the desired number. It's the recommended default because it:
- Produces consistent results (deterministic)
- Works well with various representations
- Handles multi-variate data effectively
result_hierarchical = tsam.aggregate(
    raw,
    n_clusters=8,
    period_duration=24,
    cluster=ClusterConfig(method="hierarchical"),
)
print(f"Accuracy: RMSE = {result_hierarchical.accuracy.rmse.mean():.4f}")
Accuracy: RMSE = 0.1075
2. K-Means Clustering
K-means is fast and widely used. It computes cluster centroids (averages), which may not correspond to actual periods in the data.
result_kmeans = tsam.aggregate(
    raw,
    n_clusters=8,
    period_duration=24,
    cluster=ClusterConfig(method="kmeans"),
)
print(f"Accuracy: RMSE = {result_kmeans.accuracy.rmse.mean():.4f}")
Accuracy: RMSE = 0.0917
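The difference between a centroid and a medoid can be seen on a toy cluster. The sketch below is pure Python with made-up values; `centroid` and `medoid` are hypothetical helpers, and plain Euclidean distance is assumed.

```python
import math


def centroid(periods):
    """Element-wise mean of equally long periods."""
    return [sum(step) / len(periods) for step in zip(*periods)]


def medoid(periods):
    """The member period with the smallest total distance to all others."""
    return min(periods, key=lambda p: sum(math.dist(p, q) for q in periods))


# Three toy 4-step "periods" that ended up in the same cluster.
cluster = [
    [0.0, 1.0, 4.0, 1.0],
    [0.0, 2.0, 6.0, 2.0],
    [0.0, 3.0, 9.0, 2.0],
]

center_mean = centroid(cluster)
center_medoid = medoid(cluster)
print(center_mean in cluster)    # False: the average is a synthetic profile
print(center_medoid in cluster)  # True: the medoid is one of the inputs
```

The centroid minimizes squared error, which is consistent with k-means scoring the lowest RMSE in this notebook, while the medoid is guaranteed to be a profile that actually occurred.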
3. K-Medoids-like Clustering
K-medoids selects actual periods as cluster centers (medoids) rather than computing averages. This preserves realistic patterns.
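Note: The true kmedoids method uses an exact MILP solver, which can be slow for large datasets. For most use cases, hierarchical with representation="medoid" gives similar results much faster. The RMSE below is identical to the default hierarchical run in section 1, which suggests that medoid is already the default representation for hierarchical clustering.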
# Use hierarchical with medoid representation (fast alternative to kmedoids)
result_kmedoids = tsam.aggregate(
    raw,
    n_clusters=8,
    period_duration=24,
    cluster=ClusterConfig(method="hierarchical", representation="medoid"),
)
print(f"Accuracy: RMSE = {result_kmedoids.accuracy.rmse.mean():.4f}")
Accuracy: RMSE = 0.1075
4. K-Maxoids Clustering
K-maxoids selects the most dissimilar periods as cluster centers. This is useful for capturing extreme conditions.
Note: We set preserve_column_means=False below because mean preservation adjusts typical period values to match the original data's mean. For k-maxoids, where the goal is to preserve extreme values, this would diminish the very extremes we're trying to capture. Use preserve_column_means=True (default) when mean preservation is more important than extreme value preservation.
result_kmaxoids = tsam.aggregate(
    raw,
    n_clusters=8,
    period_duration=24,
    cluster=ClusterConfig(method="kmaxoids"),
    preserve_column_means=False,  # Skip mean rescaling so extreme values are kept
)
print(f"Accuracy: RMSE = {result_kmaxoids.accuracy.rmse.mean():.4f}")
Accuracy: RMSE = 0.1739
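Why mean preservation works against extremes can be shown with simple arithmetic. The sketch below assumes a deliberately simplified rescaling (one global factor so that the count-weighted mean matches the original mean); tsam's actual rescaling may differ in detail, and `rescale_to_mean` and all values are hypothetical.

```python
def rescale_to_mean(representatives, counts, target_mean):
    """Scale all representative values by one factor so that the
    count-weighted mean equals target_mean (simplified assumption)."""
    total_steps = sum(c * len(rep) for rep, c in zip(representatives, counts))
    weighted_sum = sum(c * sum(rep) for rep, c in zip(representatives, counts))
    factor = target_mean * total_steps / weighted_sum
    return [[v * factor for v in rep] for rep in representatives], factor


# Two extreme representatives (toy values), each standing for several days.
representatives = [[100.0, 400.0], [300.0, 600.0]]
counts = [3, 2]
original_mean = 280.0  # assumed mean of the full original series

rescaled, factor = rescale_to_mean(representatives, counts, original_mean)
peak_before = max(max(rep) for rep in representatives)
peak_after = max(max(rep) for rep in rescaled)
print(f"factor={factor:.3f}, peak {peak_before:.1f} -> {peak_after:.1f}")
```

Because extreme representatives have a higher weighted mean than the original data, the factor falls below 1 and the peak is pulled down, which is the effect preserve_column_means=False avoids.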
5. Contiguous Clustering
Contiguous clustering enforces temporal continuity: only periods that are adjacent in time can be merged, so each cluster covers one unbroken stretch of the original series. This is important for:
- Storage modeling: State-of-charge must be continuous
- Seasonal patterns: Preserving the natural progression of seasons
result_contiguous = tsam.aggregate(
    raw,
    n_clusters=8,
    period_duration=24,
    cluster=ClusterConfig(method="contiguous"),
)
print(f"Accuracy: RMSE = {result_contiguous.accuracy.rmse.mean():.4f}")
Accuracy: RMSE = 0.1255
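The temporal constraint can be illustrated with a small pure-Python sketch. It is a toy, not tsam's implementation: it assumes a single value per period and repeatedly merges the two neighbouring groups whose means are closest. `contiguous_clusters` is a hypothetical helper.

```python
def contiguous_clusters(daily_values, n_clusters):
    """Greedy agglomerative merging restricted to neighbouring groups."""
    groups = [[v] for v in daily_values]  # one group per period, in time order
    while len(groups) > n_clusters:
        means = [sum(g) / len(g) for g in groups]
        # Only adjacent groups are merge candidates: pick the closest pair.
        i = min(range(len(groups) - 1), key=lambda k: abs(means[k] - means[k + 1]))
        groups[i:i + 2] = [groups[i] + groups[i + 1]]
    return groups


# Toy daily mean temperatures: cold, warm, cold again.
daily = [1.0, 2.0, 1.5, 9.0, 10.0, 9.5, 2.0, 1.0]
groups = contiguous_clusters(daily, n_clusters=3)
print(groups)  # [[1.0, 2.0, 1.5], [9.0, 10.0, 9.5], [2.0, 1.0]]
```

The two cold spells end up in separate clusters even though their values are similar, because they are not adjacent in time. This restriction is also a plausible reason for the higher RMSE compared with unconstrained hierarchical clustering.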
6. Comparison of Methods
# Collect all results for comparison
results = {
    "Original": raw,
    "Hierarchical": result_hierarchical.reconstructed,
    "K-Means": result_kmeans.reconstructed,
    "K-Medoids": result_kmedoids.reconstructed,
    "K-Maxoids": result_kmaxoids.reconstructed,
    "Contiguous": result_contiguous.reconstructed,
}
Duration Curve Comparison
Duration curves show how well each method preserves the value distribution.
# Duration curve comparison - Load
frames = []
for name, df in results.items():
    sorted_vals = df["Load"].sort_values(ascending=False).reset_index(drop=True)
    frames.append(
        pd.DataFrame(
            {"Hour": range(len(sorted_vals)), "Load": sorted_vals, "Method": name}
        )
    )
long_df = pd.concat(frames, ignore_index=True)
px.line(
    long_df,
    x="Hour",
    y="Load",
    color="Method",
    title="Duration Curve Comparison - Load",
)
# Duration curve comparison - GHI
frames = []
for name, df in results.items():
    sorted_vals = df["GHI"].sort_values(ascending=False).reset_index(drop=True)
    frames.append(
        pd.DataFrame(
            {"Hour": range(len(sorted_vals)), "GHI": sorted_vals, "Method": name}
        )
    )
long_df = pd.concat(frames, ignore_index=True)
px.line(
    long_df, x="Hour", y="GHI", color="Method", title="Duration Curve Comparison - GHI"
)
Accuracy Comparison
# Compare RMSE across methods
accuracy_comparison = pd.DataFrame(
    {
        "Method": ["Hierarchical", "K-Means", "K-Medoids", "K-Maxoids", "Contiguous"],
        "Mean RMSE": [
            result_hierarchical.accuracy.rmse.mean(),
            result_kmeans.accuracy.rmse.mean(),
            result_kmedoids.accuracy.rmse.mean(),
            result_kmaxoids.accuracy.rmse.mean(),
            result_contiguous.accuracy.rmse.mean(),
        ],
    }
)
accuracy_comparison.sort_values("Mean RMSE")
| | Method | Mean RMSE |
|---|---|---|
| 1 | K-Means | 0.091694 |
| 0 | Hierarchical | 0.107490 |
| 2 | K-Medoids | 0.107490 |
| 4 | Contiguous | 0.125497 |
| 3 | K-Maxoids | 0.173910 |
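The RMSE values above are far smaller than the raw Load range (roughly 270 to 636), which suggests they are computed on normalized series. The sketch below shows one way such a figure could arise, assuming min-max normalization against the original range. This is an assumption about the metric, not something the notebook states, and `normalized_rmse` and the sample values are hypothetical.

```python
import math


def normalized_rmse(original, reconstructed):
    """RMSE after scaling errors by the original series' value range."""
    span = max(original) - min(original)
    squared = [((o - r) / span) ** 2 for o, r in zip(original, reconstructed)]
    return math.sqrt(sum(squared) / len(squared))


# Toy values: the reconstruction misses both the minimum and the maximum.
original = [270.0, 400.0, 636.0, 500.0]
reconstructed = [330.0, 400.0, 574.0, 500.0]
print(f"{normalized_rmse(original, reconstructed):.4f}")
```

Under this assumption an RMSE of 0.1 corresponds to a typical error of about 10% of the column's value range, which makes columns with different units comparable before averaging.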
7. Configuration Options
Using Weights
When clustering multi-variate time series, you can assign different importance to each column using the weights parameter of aggregate(). This is useful when one variable is more critical for your application. Weights influence all pipeline stages (clustering, segmentation, representation, rescaling).
# Prioritize Load over other columns (e.g., for demand-focused energy systems)
result_weighted = tsam.aggregate(
    raw,
    n_clusters=8,
    period_duration=24,
    cluster=ClusterConfig(method="hierarchical"),
    weights={"Load": 3.0, "GHI": 1.0, "T": 1.0, "Wind": 1.0},
)
print(f"Load RMSE (weighted): {result_weighted.accuracy.rmse['Load']:.4f}")
print(f"Load RMSE (unweighted): {result_hierarchical.accuracy.rmse['Load']:.4f}")
Load RMSE (weighted): 0.0611
Load RMSE (unweighted): 0.1012
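One way to picture the effect of weights is as a scaling of each column's contribution to the distance between periods. The sketch below assumes that interpretation (how tsam applies weights internally is not shown here), uses a single normalized value per column in place of a full period, and `weighted_distance` is a hypothetical helper.

```python
import math


def weighted_distance(a, b, weights):
    """Euclidean distance with each column's difference scaled by its weight."""
    return math.sqrt(
        sum((weights[col] * (a[col] - b[col])) ** 2 for col in weights)
    )


# One normalized value per column stands in for a whole period (toy data).
day = {"Load": 0.50, "GHI": 0.50}
same_load = {"Load": 0.50, "GHI": 0.90}  # matches Load, differs in GHI
same_ghi = {"Load": 0.80, "GHI": 0.50}   # matches GHI, differs in Load

for weights in ({"Load": 1.0, "GHI": 1.0}, {"Load": 3.0, "GHI": 1.0}):
    d_load = weighted_distance(day, same_load, weights)
    d_ghi = weighted_distance(day, same_ghi, weights)
    closer = "same_load" if d_load < d_ghi else "same_ghi"
    print(weights, "closer match:", closer)
```

With equal weights the day is grouped with the period that matches its GHI; tripling the Load weight flips the choice to the period that matches its Load, in line with the lower Load RMSE reported above.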
Using Duration Curves for Clustering
By default, clustering matches periods by their temporal patterns. Setting use_duration_curves=True matches periods by their value distributions instead, ignoring timing.
# Cluster by value distribution rather than temporal pattern
result_duration_curves = tsam.aggregate(
    raw,
    n_clusters=8,
    period_duration=24,
    cluster=ClusterConfig(
        method="hierarchical",
        use_duration_curves=True,
    ),
)
print(f"RMSE with duration curves: {result_duration_curves.accuracy.rmse.mean():.4f}")
RMSE with duration curves: 0.1112
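The difference between the two matching modes can be sketched by comparing two periods hour by hour and then again after sorting each one. This is a pure-Python illustration with toy values, assuming Euclidean distance; the helper names are hypothetical.

```python
import math


def timing_distance(a, b):
    """Euclidean distance between periods compared hour by hour."""
    return math.dist(a, b)


def duration_curve_distance(a, b):
    """Euclidean distance after sorting each period in descending order."""
    return math.dist(sorted(a, reverse=True), sorted(b, reverse=True))


# Same values, different timing (toy data).
morning_peak = [5.0, 9.0, 4.0, 3.0]
evening_peak = [3.0, 4.0, 5.0, 9.0]

print(timing_distance(morning_peak, evening_peak))          # about 8.12
print(duration_curve_distance(morning_peak, evening_peak))  # 0.0
```

Compared by timing, the two periods look very different; compared by duration curve, they are identical and would fall into the same cluster.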
Distribution-Preserving Representation
The distribution_minmax representation preserves both the value distribution and the minimum and maximum values. This makes it well suited to energy system optimization, where both the shape of the distribution and its extremes matter.
# Use distribution_minmax representation
result_dist_minmax = tsam.aggregate(
    raw,
    n_clusters=8,
    period_duration=24,
    cluster=ClusterConfig(
        method="hierarchical",
        representation="distribution_minmax",
    ),
)
# Compare min/max preservation
print("Original data range:")
print(f" Load: {raw['Load'].min():.2f} - {raw['Load'].max():.2f}")
reconstructed_standard = result_hierarchical.reconstructed
reconstructed_dist = result_dist_minmax.reconstructed
print("\nStandard medoid representation:")
print(
    f" Load: {reconstructed_standard['Load'].min():.2f} - {reconstructed_standard['Load'].max():.2f}"
)
print("\nDistribution + MinMax representation:")
print(
    f" Load: {reconstructed_dist['Load'].min():.2f} - {reconstructed_dist['Load'].max():.2f}"
)
Original data range:
 Load: 270.00 - 636.48

Standard medoid representation:
 Load: 329.41 - 573.97

Distribution + MinMax representation:
 Load: 270.00 - 636.48
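How a representative can keep both the distribution and the extremes may be easier to see in a toy version. The sketch below is a simplified reading of the idea, not tsam's algorithm: it averages the members' sorted values, pins the two endpoints to the cluster's true maximum and minimum, and places the values in time using the ranking of the mean profile. `distribution_minmax` here is a hypothetical helper and the data are made up.

```python
def distribution_minmax(cluster):
    """Representative whose sorted values follow the cluster's average
    duration curve, with the endpoints pinned to the cluster's true extremes."""
    n_steps = len(cluster[0])
    # Average duration curve: mean of the k-th largest value across members.
    curves = [sorted(p, reverse=True) for p in cluster]
    avg_curve = [sum(c[k] for c in curves) / len(curves) for k in range(n_steps)]
    # Pin the endpoints to the actual extremes found in the cluster.
    avg_curve[0] = max(max(p) for p in cluster)
    avg_curve[-1] = min(min(p) for p in cluster)
    # Put the values back in time using the ranking of the mean profile.
    mean_profile = [sum(step) / len(cluster) for step in zip(*cluster)]
    order = sorted(range(n_steps), key=lambda t: mean_profile[t], reverse=True)
    representative = [0.0] * n_steps
    for rank, t in enumerate(order):
        representative[t] = avg_curve[rank]
    return representative


cluster = [
    [2.0, 8.0, 5.0, 1.0],
    [3.0, 7.0, 9.0, 2.0],
]
rep = distribution_minmax(cluster)
print(rep)  # [2.5, 9.0, 6.0, 1.0]
```

A plain element-wise mean of this toy cluster would range from 1.5 to 7.5, whereas the sketch keeps the true range of 1.0 to 9.0, mirroring how the reconstructed Load range above matches the original data.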
Comparison: Standard vs Distribution-Preserving
# Comparison: Standard vs Distribution-Preserving
comparison_dist = {
    "Original": raw,
    "Medoid (standard)": reconstructed_standard,
    "Distribution + MinMax": reconstructed_dist,
}
frames = []
for name, df in comparison_dist.items():
    sorted_vals = df["Load"].sort_values(ascending=False).reset_index(drop=True)
    frames.append(
        pd.DataFrame(
            {"Hour": range(len(sorted_vals)), "Load": sorted_vals, "Method": name}
        )
    )
long_df = pd.concat(frames, ignore_index=True)
px.line(
    long_df,
    x="Hour",
    y="Load",
    color="Method",
    title="Effect of Distribution-Preserving Representation",
)
Summary
| Use Case | Recommended Method | Key Options |
|---|---|---|
| General purpose | hierarchical | Default settings |
| Fast clustering | kmeans | - |
| Preserve realistic patterns | hierarchical | representation="medoid" |
| Capture extremes | kmaxoids | preserve_column_means=False |
| Storage modeling | contiguous | - |
| Demand-focused | hierarchical | weights={"Load": 3.0, ...} (top-level) |
| Preserve distribution | hierarchical | representation="distribution_minmax" |
Note: The kmedoids method uses an exact MILP solver and can be slow for datasets with many periods (365+ days). Use hierarchical with representation="medoid" for similar results with much better performance.