Clustering Methods¶

This notebook walks through the clustering methods and configuration options available in tsam.

Available Methods¶

Method	Description	Best For
`hierarchical`	Agglomerative hierarchical clustering	General purpose, recommended default
`kmeans`	K-means with centroids	Fast clustering, large datasets
`kmedoids`	K-medoids (MILP exact)	Optimal solution, smaller datasets (slow)
`kmaxoids`	Selects most dissimilar periods	Capturing extremes
`contiguous`	Hierarchical with temporal constraint	Storage modeling, seasonal patterns
`averaging`	Sequential period averaging	Simple baseline

Tip: For medoid-based clustering on large datasets, use hierarchical with representation="medoid" instead of kmedoids.

Key Configuration Options¶

Option	Description
`weights`	Per-column importance weights (top-level parameter of `aggregate()`)
`representation`	How to represent cluster centers (mean, medoid, maxoid, distribution, distribution_minmax)
`normalize_column_means`	Normalize columns to same mean before clustering
`use_duration_curves`	Match by value distribution rather than timing
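
As a rough illustration of what normalizing columns to the same mean could look like, the sketch below divides each column by its own mean so that every column averages to 1.0. This is an assumption about the general idea, not tsam's actual implementation; the function name `normalize_to_unit_mean` and the values are made up for the example.

```python
def normalize_to_unit_mean(columns):
    """Divide every column by its own mean so all columns average to 1.0."""
    normalized = {}
    for name, values in columns.items():
        mean = sum(values) / len(values)
        if mean == 0:
            raise ValueError(f"column {name!r} has zero mean and cannot be scaled")
        normalized[name] = [v / mean for v in values]
    return normalized


# Synthetic values on very different scales (not taken from the test data).
columns = {"Load": [300.0, 400.0, 500.0], "Wind": [2.0, 4.0, 6.0]}
scaled = normalize_to_unit_mean(columns)
print(scaled["Load"])  # [0.75, 1.0, 1.25]
print(scaled["Wind"])  # [0.5, 1.0, 1.5]
```

After scaling, a column with large absolute values no longer dominates a distance computation simply because of its units.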

Setup¶

In [1]:

%load_ext autoreload
%autoreload 2

import warnings

import pandas as pd
import plotly.express as px
import plotly.io as pio

import tsam
from tsam import ClusterConfig

pio.renderers.default = "notebook_connected"

# Added to every example notebook: silence the v3 column-order
# FutureWarning in the rendered docs (tsam v4 returns result columns in
# input order; see migration guide).
warnings.filterwarnings(
    "ignore", category=FutureWarning, message=".*sorted alphabetically.*"
)

Input Data¶

The test dataset contains hourly time series for one year with four columns:

GHI: Global Horizontal Irradiance (solar)
T: Temperature
Wind: Wind speed
Load: Electrical load

In [2]:

raw = pd.read_csv("testdata.csv", index_col=0)
print(f"Shape: {raw.shape} ({raw.shape[0]} hours = {raw.shape[0] // 24} days)")
raw.head()

Shape: (8760, 4) (8760 hours = 365 days)

Out[2]:

	T	Wind	Load
2009-12-31 23:30:00	-2.1	7.1	375.478394
2010-01-01 00:30:00	-2.8	8.6	364.541326
2010-01-01 01:30:00	-3.3	9.7	357.416844
2010-01-01 02:30:00	-3.2	9.8	350.191306
2010-01-01 03:30:00	-3.2	9.4	345.161449
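
The day count printed above comes from cutting the hourly series into 24-hour blocks; those blocks are the candidate periods that the clustering methods group. A minimal sketch of that splitting step, using a hypothetical helper `split_into_periods` and synthetic data rather than tsam's own code:

```python
def split_into_periods(values, period_length):
    """Split a flat series into consecutive, equal-length periods."""
    if len(values) % period_length != 0:
        raise ValueError("series length must be a multiple of period_length")
    return [
        values[start : start + period_length]
        for start in range(0, len(values), period_length)
    ]


# Three synthetic days of hourly values: 0, 1, ..., 71.
hourly = list(range(72))
periods = split_into_periods(hourly, period_length=24)
print(len(periods), len(periods[0]))  # 3 24
print(periods[1][0])  # 24, the first hour of the second day
```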

1. Hierarchical Clustering (Recommended Default)¶

Agglomerative hierarchical clustering builds a tree of clusters and cuts it at the desired number. It's the recommended default because it:

Produces consistent results (deterministic)
Works well with various representations
Handles multi-variate data effectively
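
To make the "build a tree, then cut it" idea concrete, here is a naive agglomerative sketch in plain Python. It uses average linkage purely for brevity; the linkage criterion tsam uses is not stated here, so treat this as an illustration of the general technique, with hypothetical function names and synthetic points.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(points, cluster_a, cluster_b):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(points, n_clusters):
    """Merge the two closest clusters until only n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # combinations() yields index pairs with a < b, so deleting b keeps a valid.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(
                points, clusters[pair[0]], clusters[pair[1]]
            ),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
labels = agglomerate(points, n_clusters=3)
print(labels)  # [[0, 1], [2, 3], [4]]
```

There is no random initialisation anywhere in the procedure, which is why repeated runs on the same data give the same clusters.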

In [3]:

result_hierarchical = tsam.aggregate(
    raw,
    n_clusters=8,
    period_duration=24,
    cluster=ClusterConfig(method="hierarchical"),
)
print(f"Accuracy: RMSE = {result_hierarchical.accuracy.rmse.mean():.4f}")

Accuracy: RMSE = 0.1075

2. K-Means Clustering¶

K-means is fast and widely used. It computes cluster centroids (averages), which may not correspond to actual periods in the data.
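
The point about centroids not matching real periods can be seen with two synthetic periods. The helper `centroid` below is a hypothetical name for the element-wise average of a cluster's members:

```python
def centroid(periods):
    """Element-wise mean of all periods in a cluster."""
    return [sum(column) / len(periods) for column in zip(*periods)]


# Two synthetic periods with their peak at different time steps.
members = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
center = centroid(members)
print(center)  # [0.0, 0.5, 0.5]
print(center in members)  # False: the average is not an observed period
```

The averaged profile has two half-height peaks, a shape that never occurs in the input.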

In [4]:

result_kmeans = tsam.aggregate(
    raw,
    n_clusters=8,
    period_duration=24,
    cluster=ClusterConfig(method="kmeans"),
)
print(f"Accuracy: RMSE = {result_kmeans.accuracy.rmse.mean():.4f}")

Accuracy: RMSE = 0.0917

3. K-Medoids-like Clustering¶

K-medoids selects actual periods as cluster centers (medoids) rather than computing averages. This preserves realistic patterns.

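Note: The true kmedoids method uses an exact MILP solver, which can be slow for large datasets. For most use cases, hierarchical with representation="medoid" gives similar results much faster. In the run below, the RMSE (0.1075) is identical to the plain hierarchical result from section 1, which suggests that hierarchical already uses the medoid representation by default.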
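
For reference, a medoid can be defined as the cluster member with the smallest total distance to all other members. The sketch below applies that common definition to synthetic data; the function names are hypothetical and this is not taken from tsam's source.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def medoid(periods):
    """Return the member with the smallest summed distance to all members."""
    return min(periods, key=lambda p: sum(euclidean(p, q) for q in periods))


members = [[1.0, 1.0], [2.0, 2.0], [9.0, 9.0]]
representative = medoid(members)
print(representative)  # [2.0, 2.0]
print(representative in members)  # True: a medoid is always an observed period
```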

In [5]:

# Use hierarchical with medoid representation (fast alternative to kmedoids)
result_kmedoids = tsam.aggregate(
    raw,
    n_clusters=8,
    period_duration=24,
    cluster=ClusterConfig(method="hierarchical", representation="medoid"),
)
print(f"Accuracy: RMSE = {result_kmedoids.accuracy.rmse.mean():.4f}")

Accuracy: RMSE = 0.1075

4. K-Maxoids Clustering¶

K-maxoids selects the most dissimilar periods as cluster centers. This is useful for capturing extreme conditions.

Note: We set preserve_column_means=False below because mean preservation adjusts typical period values to match the original data's mean. For k-maxoids, where the goal is to preserve extreme values, this would diminish the very extremes we're trying to capture. Use preserve_column_means=True (default) when mean preservation is more important than extreme value preservation.
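
To see why mean preservation can flatten extremes, consider a simple proportional rescaling: every typical-period value is multiplied by one factor so that the count-weighted mean matches the original mean. Whether tsam rescales exactly this way is an assumption; the function names and numbers are illustrative only.

```python
def weighted_mean(typical_periods, counts):
    """Mean over all values, with each period repeated `count` times."""
    total = sum(count * sum(period) for period, count in zip(typical_periods, counts))
    n_values = sum(count * len(period) for period, count in zip(typical_periods, counts))
    return total / n_values


def rescale_to_mean(typical_periods, counts, target_mean):
    """Scale all values by one factor so the weighted mean hits target_mean."""
    factor = target_mean / weighted_mean(typical_periods, counts)
    return [[value * factor for value in period] for period in typical_periods]


# Synthetic case: the second typical period is an extreme one.
typical_periods = [[10.0, 20.0], [30.0, 40.0]]
counts = [3, 1]  # how many original periods each typical period represents
original_mean = 16.0  # assumed mean of the original series

rescaled = rescale_to_mean(typical_periods, counts, original_mean)
peak_before = max(max(period) for period in typical_periods)
peak_after = max(max(period) for period in rescaled)
print(peak_before, round(peak_after, 6))  # 40.0 32.0
```

Because the extreme representatives pull the weighted mean above the original mean, the correction factor is below 1 and the peak shrinks.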

In [6]:

result_kmaxoids = tsam.aggregate(
    raw,
    n_clusters=8,
    period_duration=24,
    cluster=ClusterConfig(method="kmaxoids"),
    preserve_column_means=False,  # Don't rescale to preserve extreme values
)
print(f"Accuracy: RMSE = {result_kmaxoids.accuracy.rmse.mean():.4f}")

Accuracy: RMSE = 0.1659

5. Contiguous Clustering¶

Contiguous clustering enforces temporal continuity: only periods that are adjacent in time can be merged, so each cluster covers one unbroken block of the original series. This is important for:

Storage modeling: State-of-charge must be continuous
Seasonal patterns: Preserving the natural progression of seasons
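
A temporal constraint of this kind can be sketched as agglomerative merging where only neighbouring segments are allowed to merge. The code below is a simplified illustration on synthetic data (centroid distance between neighbours, hypothetical function names), not tsam's implementation.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(periods):
    return [sum(column) / len(periods) for column in zip(*periods)]


def contiguous_clusters(periods, n_clusters):
    """Repeatedly merge the most similar pair of *adjacent* segments."""
    segments = [[i] for i in range(len(periods))]
    while len(segments) > n_clusters:

        def gap(k):
            # Distance between the centroids of segment k and its right neighbour.
            left = centroid([periods[i] for i in segments[k]])
            right = centroid([periods[i] for i in segments[k + 1]])
            return euclidean(left, right)

        k = min(range(len(segments) - 1), key=gap)
        segments[k : k + 2] = [segments[k] + segments[k + 1]]
    return segments


# Low, low, high, high, low, low: days 0-1 resemble days 4-5 but are not adjacent.
periods = [[1.0, 1.0], [1.1, 1.1], [5.0, 5.0], [5.2, 5.2], [1.0, 1.0], [0.9, 0.9]]
segments = contiguous_clusters(periods, n_clusters=3)
print(segments)  # [[0, 1], [2, 3], [4, 5]]
```

An unconstrained method would likely group days 0, 1, 4 and 5 together; the adjacency rule keeps them in separate blocks so the chronological order survives.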

In [7]:

result_contiguous = tsam.aggregate(
    raw,
    n_clusters=8,
    period_duration=24,
    cluster=ClusterConfig(method="contiguous"),
)
print(f"Accuracy: RMSE = {result_contiguous.accuracy.rmse.mean():.4f}")

Accuracy: RMSE = 0.1255

6. Comparison of Methods¶

In [8]:

# Collect all results for comparison
results = {
    "Original": raw,
    "Hierarchical": result_hierarchical.reconstructed,
    "K-Means": result_kmeans.reconstructed,
    "K-Medoids": result_kmedoids.reconstructed,
    "K-Maxoids": result_kmaxoids.reconstructed,
    "Contiguous": result_contiguous.reconstructed,
}

Duration Curve Comparison¶

Duration curves show how well each method preserves the value distribution.
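
A duration curve is simply the series sorted in descending order, which is what the `sort_values(ascending=False)` calls in the cells of this section do. If a single number is preferred over a plot, one option is the mean absolute gap between two curves; `curve_error` below is a hypothetical helper for that, shown on synthetic values.

```python
def duration_curve(values):
    """Values sorted from largest to smallest; timing information is discarded."""
    return sorted(values, reverse=True)


def curve_error(original, reconstructed):
    """Mean absolute difference between two duration curves of equal length."""
    a = duration_curve(original)
    b = duration_curve(reconstructed)
    if len(a) != len(b):
        raise ValueError("both series must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


original = [3.0, 9.0, 1.0, 5.0]
reconstructed = [4.0, 4.0, 2.0, 6.0]
print(duration_curve(original))  # [9.0, 5.0, 3.0, 1.0]
print(curve_error(original, reconstructed))  # 1.5
```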

In [9]:

# Duration curve comparison - Load
frames = []
for name, df in results.items():
    sorted_vals = df["Load"].sort_values(ascending=False).reset_index(drop=True)
    frames.append(
        pd.DataFrame(
            {"Hour": range(len(sorted_vals)), "Load": sorted_vals, "Method": name}
        )
    )
long_df = pd.concat(frames, ignore_index=True)

px.line(
    long_df,
    x="Hour",
    y="Load",
    color="Method",
    title="Duration Curve Comparison - Load",
)

In [10]:

# Duration curve comparison - GHI
frames = []
for name, df in results.items():
    sorted_vals = df["GHI"].sort_values(ascending=False).reset_index(drop=True)
    frames.append(
        pd.DataFrame(
            {"Hour": range(len(sorted_vals)), "GHI": sorted_vals, "Method": name}
        )
    )
long_df = pd.concat(frames, ignore_index=True)

px.line(
    long_df, x="Hour", y="GHI", color="Method", title="Duration Curve Comparison - GHI"
)

Accuracy Comparison¶

In [11]:

# Compare RMSE across methods
accuracy_comparison = pd.DataFrame(
    {
        "Method": ["Hierarchical", "K-Means", "K-Medoids", "K-Maxoids", "Contiguous"],
        "Mean RMSE": [
            result_hierarchical.accuracy.rmse.mean(),
            result_kmeans.accuracy.rmse.mean(),
            result_kmedoids.accuracy.rmse.mean(),
            result_kmaxoids.accuracy.rmse.mean(),
            result_contiguous.accuracy.rmse.mean(),
        ],
    }
)
accuracy_comparison.sort_values("Mean RMSE")

Out[11]:

	Method	Mean RMSE
1	K-Means	0.091698
0	Hierarchical	0.107490
2	K-Medoids	0.107490
4	Contiguous	0.125497
3	K-Maxoids	0.165900
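
The "Mean RMSE" column averages one RMSE value per data column, as the `.rmse.mean()` calls indicate. The sketch below reproduces that aggregation on synthetic values that are assumed to be already scaled to a 0-1 range; how tsam scales the data before computing its accuracy metrics is not shown in this notebook.

```python
import math


def rmse(original, reconstructed):
    """Root mean squared error between two equally long series."""
    squared = [(o - r) ** 2 for o, r in zip(original, reconstructed)]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic, already-normalised columns.
original = {"Load": [0.0, 1.0, 0.0, 1.0], "Wind": [0.5, 0.5, 0.5, 0.5]}
reconstructed = {"Load": [0.5, 0.5, 0.5, 0.5], "Wind": [0.5, 0.5, 0.5, 0.5]}

per_column = {name: rmse(original[name], reconstructed[name]) for name in original}
mean_rmse = sum(per_column.values()) / len(per_column)
print(per_column)  # {'Load': 0.5, 'Wind': 0.0}
print(mean_rmse)  # 0.25
```

A low mean can hide a poorly reconstructed column, so it is worth checking the per-column values as well.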

7. Configuration Options¶

Using Weights¶

When clustering multi-variate time series, you can assign different importance to each column using the weights parameter of aggregate(). This is useful when one variable is more critical for your application. Weights influence all pipeline stages (clustering, segmentation, representation, rescaling).
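
One way to picture the effect of weights on the clustering stage is a weighted Euclidean distance, where each column's squared difference is multiplied by its weight. This is an assumed mechanism for illustration, not a statement about how tsam applies weights internally; the names and values are synthetic.

```python
import math


def weighted_distance(a, b, weights):
    """Euclidean distance with each column's squared difference scaled by its weight."""
    return math.sqrt(sum(weights[col] * (a[col] - b[col]) ** 2 for col in a))


reference = {"Load": 0.5, "GHI": 0.5}
candidates = {
    "same_load": {"Load": 0.5, "GHI": 0.9},  # differs only in GHI
    "same_ghi": {"Load": 0.8, "GHI": 0.5},  # differs only in Load
}


def nearest(weights):
    """Name of the candidate closest to the reference under the given weights."""
    return min(
        candidates,
        key=lambda name: weighted_distance(reference, candidates[name], weights),
    )


print(nearest({"Load": 1.0, "GHI": 1.0}))  # same_ghi
print(nearest({"Load": 3.0, "GHI": 1.0}))  # same_load
```

With equal weights the candidate with the smaller raw difference wins; tripling the Load weight makes a Load mismatch more expensive, so the candidate with matching Load becomes the nearest one.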

In [12]:

# Prioritize Load over other columns (e.g., for demand-focused energy systems)
result_weighted = tsam.aggregate(
    raw,
    n_clusters=8,
    period_duration=24,
    cluster=ClusterConfig(method="hierarchical"),
    weights={"Load": 3.0, "GHI": 1.0, "T": 1.0, "Wind": 1.0},
)
print(f"Load RMSE (weighted): {result_weighted.accuracy.rmse['Load']:.4f}")
print(f"Load RMSE (unweighted): {result_hierarchical.accuracy.rmse['Load']:.4f}")

Load RMSE (weighted): 0.0611
Load RMSE (unweighted): 0.1012

Using Duration Curves for Clustering¶

By default, clustering matches periods by their temporal patterns. Setting use_duration_curves=True matches periods by their value distributions instead, ignoring timing.
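
The difference between the two matching modes shows up clearly for two periods that contain the same values at different times. In this synthetic sketch, sorting each period before computing the distance stands in for duration-curve matching:

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Same values, but the peak occurs at a different time step.
morning_peak = [1.0, 8.0, 3.0, 2.0]
evening_peak = [2.0, 3.0, 8.0, 1.0]

distance_by_timing = euclidean(morning_peak, evening_peak)
distance_by_curve = euclidean(
    sorted(morning_peak, reverse=True), sorted(evening_peak, reverse=True)
)
print(round(distance_by_timing, 3))  # 7.211
print(distance_by_curve)  # 0.0
```

Compared by timing, the two periods look very different; compared by their sorted values, they are identical and would end up in the same cluster.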

In [13]:

# Cluster by value distribution rather than temporal pattern
result_duration_curves = tsam.aggregate(
    raw,
    n_clusters=8,
    period_duration=24,
    cluster=ClusterConfig(
        method="hierarchical",
        use_duration_curves=True,
    ),
)
print(f"RMSE with duration curves: {result_duration_curves.accuracy.rmse.mean():.4f}")

RMSE with duration curves: 0.1112

Distribution-Preserving Representation¶

The distribution_minmax representation preserves both the value distribution and the min/max values. This suits energy system optimization, where both the shape of the distribution and its extremes matter.
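
One plausible way to build such a representative is sketched below: pool and sort all values of a cluster, average them in chunks to get one value per time step, optionally pin the first and last value to the true minimum and maximum, and then place the values in the time order suggested by the mean profile. This is an illustrative reconstruction under those assumptions, with a hypothetical function name and synthetic data, not tsam's code.

```python
def distribution_representation(members, keep_min_max=False):
    """Build a representative period that follows the pooled value distribution."""
    n_steps = len(members[0])
    n_members = len(members)

    # Pool all values, sort ascending, and average chunks of n_members values.
    pooled = sorted(value for member in members for value in member)
    curve = [
        sum(pooled[i * n_members : (i + 1) * n_members]) / n_members
        for i in range(n_steps)
    ]
    if keep_min_max:
        curve[0], curve[-1] = pooled[0], pooled[-1]

    # Assign the smallest curve value to the time step with the lowest mean, etc.
    mean_profile = [sum(column) / n_members for column in zip(*members)]
    order = sorted(range(n_steps), key=lambda t: mean_profile[t])
    representative = [0.0] * n_steps
    for rank, t in enumerate(order):
        representative[t] = curve[rank]
    return representative


members = [[1.0, 5.0, 3.0], [9.0, 2.0, 4.0]]
print([sum(c) / 2 for c in zip(*members)])  # [5.0, 3.5, 3.5] (plain mean profile)
print(distribution_representation(members))  # [7.0, 1.5, 3.5]
print(distribution_representation(members, keep_min_max=True))  # [9.0, 1.0, 3.5]
```

The plain mean spans only 3.5 to 5.0, the distribution-based representative spans 1.5 to 7.0, and the min/max variant recovers the full 1.0 to 9.0 range of the members.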

In [14]:

# Use distribution_minmax representation
result_dist_minmax = tsam.aggregate(
    raw,
    n_clusters=8,
    period_duration=24,
    cluster=ClusterConfig(
        method="hierarchical",
        representation="distribution_minmax",
    ),
)

# Compare min/max preservation
print("Original data range:")
print(f"  Load: {raw['Load'].min():.2f} - {raw['Load'].max():.2f}")

reconstructed_standard = result_hierarchical.reconstructed
reconstructed_dist = result_dist_minmax.reconstructed

print("\nStandard medoid representation:")
print(
    f"  Load: {reconstructed_standard['Load'].min():.2f} - {reconstructed_standard['Load'].max():.2f}"
)

print("\nDistribution + MinMax representation:")
print(
    f"  Load: {reconstructed_dist['Load'].min():.2f} - {reconstructed_dist['Load'].max():.2f}"
)

Original data range:
  Load: 270.00 - 636.48

Standard medoid representation:
  Load: 329.41 - 573.97

Distribution + MinMax representation:
  Load: 270.00 - 636.48

Comparison: Standard vs Distribution-Preserving¶

In [15]:

# Comparison: Standard vs Distribution-Preserving
comparison_dist = {
    "Original": raw,
    "Medoid (standard)": reconstructed_standard,
    "Distribution + MinMax": reconstructed_dist,
}

frames = []
for name, df in comparison_dist.items():
    sorted_vals = df["Load"].sort_values(ascending=False).reset_index(drop=True)
    frames.append(
        pd.DataFrame(
            {"Hour": range(len(sorted_vals)), "Load": sorted_vals, "Method": name}
        )
    )
long_df = pd.concat(frames, ignore_index=True)

px.line(
    long_df,
    x="Hour",
    y="Load",
    color="Method",
    title="Effect of Distribution-Preserving Representation",
)

Summary¶

Use Case	Recommended Method	Key Options
General purpose	`hierarchical`	Default settings
Fast clustering	`kmeans`	-
Preserve realistic patterns	`hierarchical`	`representation="medoid"`
Capture extremes	`kmaxoids`	`preserve_column_means=False`
Storage modeling	`contiguous`	-
Demand-focused	`hierarchical`	`weights={"Load": 3.0, ...}` (top-level)
Preserve distribution	`hierarchical`	`representation="distribution_minmax"`

Note: The kmedoids method uses an exact MILP solver and can be slow for datasets with many periods (365+ days). Use hierarchical with representation="medoid" for similar results with much better performance.