Understanding Clustering Methods

AddMaple offers four algorithms to discover natural groupings in your data. Here's how each one works, explained without the jargon.

Looking for a quick start? See Clustering for step-by-step instructions on using the clustering tool.

K-Means: The Centroid Champion

What it does: K-Means divides your data into exactly k groups by finding the "center point" of each group and assigning rows to their nearest center.

How it works (step by step):

  1. Pick k random starting points (centers) scattered throughout your data
  2. Assign every row to its nearest center
  3. Calculate a new center for each group based on all the rows assigned to it
  4. Repeat steps 2-3 until the centers stop moving (convergence)
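The loop above can be sketched in a few lines of NumPy. This is an illustrative toy, not AddMaple's implementation:

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Toy K-Means following the four steps above (illustration only)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k random rows as the starting centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign every row to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned rows
        # (an empty cluster keeps its old center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer move (convergence)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Run on two well-separated groups of rows, the centers settle on one group each and every row ends up with its nearest center.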

Survey example: You survey 1,000 customers on numeric scales: overall satisfaction (1-10), likelihood to recommend (1-10), and price sensitivity (1-10). K-Means with k=3 might discover:

  • Group 1 (center: 9.2, 8.8, 2.1): Highly satisfied, loyal, price-insensitive (premium segment)
  • Group 2 (center: 5.5, 5.2, 6.8): Middle of the road on satisfaction, price-conscious (value segment)
  • Group 3 (center: 3.1, 2.9, 8.4): Dissatisfied, likely to shop around, very price-sensitive (deal-hunters)

The algorithm moves those centers around until each customer settles with their nearest group.

When to use it:

  • You want exactly 3, 5, or 10 clusters—no guessing
  • Your survey uses only numeric questions (1-10 scales, other numeric values); note that opinion scales like Likert scales are treated as numeric
  • You want something fast and interpretable
  • Your customer segments should be distinct and roughly balanced

What it's good at:

  • Pure numeric survey data with clear segment patterns (e.g., satisfaction profiles, usage intensity levels)
  • Speed: runs quickly even on surveys with 10,000+ respondents

Limitations:

  • Only works with numeric columns. If you include categorical survey questions (industry, yes/no, single-select responses), those columns get filtered out
  • Assumes clusters are roughly equal size and round-shaped
  • Can struggle if your segments naturally have very different sizes
  • Sensitive to outliers (one respondent with extreme scores can skew things)

Important Note: How Data Types Are Treated

Numeric data includes:

  • Numbers (e.g., age, company size, revenue)
  • Opinion scales (e.g., 1-10 satisfaction, Likert scales, NPS scores)
  • Percentages and currency values

Categorical data includes:

  • Single-select categories (e.g., "Which industry?", "Yes/No/Maybe")
  • Multi-select tags (e.g., "Which features do you use?")

For clustering:

  • K-Means works with numeric data only (opinion scales included)
  • K-Medoids works with both numeric and categorical
  • HDBSCAN works with both numeric and categorical
  • Balanced HDBSCAN works with both numeric and categorical

So if you have a 1-10 opinion scale question, it's treated as numeric and works with all four algorithms. If you have "Select your industry," that's categorical and K-Means will filter it out.
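As an illustration, here is how a numeric-only algorithm such as K-Means might filter a mixed survey table. This is a pandas sketch with made-up data; AddMaple does this filtering for you automatically:

```python
import pandas as pd

# Made-up survey responses: one opinion scale and one single-select question
df = pd.DataFrame({
    "satisfaction_1_10": [9, 5, 3, 7],                  # opinion scale -> numeric
    "industry": ["Tech", "Retail", "Tech", "Finance"],  # single-select -> categorical
})

# A numeric-only algorithm keeps just the numeric columns:
numeric_only = df.select_dtypes(include="number")
print(list(numeric_only.columns))  # prints ['satisfaction_1_10']
```

The "industry" column is simply dropped from the clustering input, which is exactly why mixed surveys are usually better served by the other three algorithms.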


K-Medoids: The Robustness King

What it does: K-Medoids is like K-Means' tougher cousin. Instead of using an average center point, it picks actual respondents from your data as cluster centers (called "medoids").

How it works (step by step):

  1. Pick k actual respondents as starting medoids (using a smart algorithm)
  2. Assign every other respondent to their nearest medoid
  3. For each group, check if a different respondent would be a better representative (lower total distance)
  4. Swap out medoids if it improves the clustering
  5. Repeat steps 2-4 until no more beneficial swaps happen
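In code, the swap loop might look like this toy PAM-style sketch over a precomputed distance matrix. It is illustrative only; the distances can come from any metric, including one that handles categorical answers:

```python
import numpy as np

def k_medoids(D, k, max_iters=100, seed=0):
    """Toy K-Medoids on a precomputed n-by-n distance matrix D (sketch only)."""
    rng = np.random.default_rng(seed)
    n = len(D)
    # Step 1: pick k actual rows as starting medoids
    medoids = list(rng.choice(n, size=k, replace=False))
    for _ in range(max_iters):
        # Step 2: assign every row to its nearest medoid
        labels = D[:, medoids].argmin(axis=1)
        improved = False
        # Steps 3-4: within each group, swap in a better representative
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # Total distance from each member to all other members
            costs = D[np.ix_(members, members)].sum(axis=0)
            best = members[costs.argmin()]
            if best != medoids[j]:
                medoids[j] = best
                improved = True
        # Step 5: stop when no beneficial swap remains
        if not improved:
            break
    return labels, medoids
```

Because the medoids are indices into the original data, each one is a real respondent you can show to stakeholders.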

Survey example: You survey 500 SaaS customers with mixed questions: satisfaction (1-10), usage frequency (daily/weekly/monthly), company size (1-50 / 50-200 / 200+), and churn risk (high/medium/low). K-Medoids with k=4 might pick these actual respondents as medoids:

  • Medoid A: Sarah from TechCorp, satisfaction 9, daily user, 150 employees, low churn risk → represents your power users
  • Medoid B: Mike from StartupXYZ, satisfaction 6, weekly user, 25 employees, medium churn risk → represents growing companies
  • Medoid C: Elena from Enterprise Inc., satisfaction 7, daily user, 500+ employees, low churn risk → represents large accounts
  • Medoid D: James from SmallBiz LLC, satisfaction 4, monthly user, 10 employees, high churn risk → represents at-risk customers

Every respondent gets assigned to whichever medoid (real person) is most similar to them.

When to use it:

  • Your survey has both numeric and categorical questions
  • You want robust results without extreme values pulling things off
  • You want to show stakeholders a "typical" respondent from each cluster ("This is Sarah, our power user profile")

What it's good at:

  • Mixed survey questions: numbers + categories + opinions all work together
  • Robustness: real respondents as centers, so not skewed by extreme answers
  • Interpretability: "Meet your key customer types" becomes real people
  • Irregular segment shapes: handles non-spherical clusters better than K-Means
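How can "distance" work across numbers and categories at once? A Gower-style metric is one common approach; the sketch below shows the general idea with made-up inputs (AddMaple's exact metric may differ):

```python
def gower_distance(a, b, is_numeric, ranges):
    """Gower-style mixed distance between two respondents (sketch only).

    Numeric answers contribute a range-normalised difference in [0, 1];
    categorical answers contribute 0 (match) or 1 (mismatch).
    """
    total = 0.0
    for x, y, numeric, r in zip(a, b, is_numeric, ranges):
        if numeric:
            total += abs(x - y) / r          # e.g. |9 - 4| / 9 for a 1-10 scale
        else:
            total += 0.0 if x == y else 1.0  # "daily" vs "monthly" -> 1
    return total / len(a)                    # average over all questions
```

Two identical respondents get distance 0; each question, numeric or categorical, contributes equally on a 0-1 scale.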

Limitations:

  • Slower than K-Means
  • Still requires you to pick k upfront
  • Still assumes segments are roughly balanced in size

HDBSCAN: The Natural Explorer

What it does: HDBSCAN automatically finds natural, density-based clusters without you specifying how many. It treats sparse outliers as "noise" and focuses on dense regions.

How it works (conceptually):

  1. Build a k-nearest-neighbor graph: for each respondent, find their closest neighbors
  2. Estimate the "density" around each respondent (are they clustered with similar people or isolated?)
  3. Find regions of high density—areas where many respondents cluster together
  4. Mark sparse, isolated respondents as noise/outliers
  5. Group the dense regions into clusters
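Real HDBSCAN is considerably more sophisticated, but the density idea can be illustrated with a drastically simplified DBSCAN-like toy: count neighbours, keep the dense points, flood-fill connected dense regions into clusters, and call the rest noise:

```python
import numpy as np

def density_clusters(X, eps=0.5, min_neighbors=3):
    """Toy density-based grouping in the spirit of the steps above.

    A drastic simplification of HDBSCAN, for intuition only: noise points
    get the label -1, dense connected regions get labels 0, 1, 2, ...
    """
    n = len(X)
    # Steps 1-2: estimate density by counting neighbours within eps
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    dense = (D < eps).sum(axis=1) - 1 >= min_neighbors  # -1 excludes self
    # Step 4: everything starts as noise; sparse points stay that way
    labels = np.full(n, -1)
    # Steps 3 and 5: flood-fill connected dense regions into clusters
    cluster = 0
    for i in range(n):
        if dense[i] and labels[i] == -1:
            stack = [i]
            labels[i] = cluster
            while stack:
                p = stack.pop()
                for q in np.where((D[p] < eps) & dense)[0]:
                    if labels[q] == -1:
                        labels[q] = cluster
                        stack.append(q)
            cluster += 1
    return labels
```

Note that the cluster count falls out of the data: two dense regions give two clusters, and an isolated respondent is labelled -1 (noise) without anyone specifying k.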

Survey example: You survey 800 companies on attitudes to cloud migration: cost concern (1-10), security concern (1-10), innovation priority (1-10), existing cloud usage (%), and industry. HDBSCAN might naturally discover:

  • Cloud-Ready Innovators (200 respondents clustered together): Low cost concern, low security concern, high innovation priority, already 60%+ cloud-using tech/finance companies
  • Cost-Conscious Pragmatists (300 respondents): High cost concern, medium security concern, medium innovation priority, 20-40% cloud usage, mixed industries
  • Security-First Enterprises (180 respondents): Medium cost concern, high security concern, low innovation priority, 30-50% cloud, healthcare/finance heavily represented
  • Outliers/Noise (20 respondents): Scattered responses that don't fit any pattern—maybe early-stage or unusual hybrid approaches

The algorithm automatically discovers these 3 clusters (plus noise) without you telling it "find 3."

When to use it:

  • You're exploring survey data and don't know how many segments exist
  • You want to identify genuinely unusual respondents (the noise group)
  • You have any mix of data types in your survey

What it's good at:

  • Finding the "natural" number of segments automatically
  • Identifying outliers and unusual respondents
  • Handling segments that naturally vary in size
  • Mixed survey data: numbers, categories, opinions all work

Limitations:

  • Slower than K-Means
  • If your min_cluster_size is too high, you might call real segments "noise"
  • Requires tuning: min_cluster_size affects how many respondents form a cluster
  • If respondents are very scattered, most might be marked noise

Balanced HDBSCAN: The Best-of-Both-Worlds

What it does: Combines HDBSCAN's smart outlier filtering with K-Medoids' balanced cluster sizes. It automatically finds natural survey segments, then refines them to be more even-sized.

How it works (step by step):

  1. Run HDBSCAN to find natural customer segments and filter out true outliers
  2. Pick one representative respondent (medoid) from each natural segment
  3. Use K-Medoids to consolidate those representatives into exactly k final segments
  4. Assign all non-outlier respondents to the k segments, balancing sizes
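The consolidation steps (2-4) might look roughly like the sketch below, which takes HDBSCAN-style labels as input. A greedy merge of the closest representatives stands in here for the K-Medoids consolidation; this is illustrative only, not AddMaple's actual implementation:

```python
import numpy as np

def balance_segments(X, natural_labels, k):
    """Consolidate HDBSCAN-style labels (-1 = noise) down to k segments (sketch)."""
    keep = natural_labels != -1                      # Step 1: drop true outliers
    groups = sorted(set(natural_labels[keep].tolist()))
    # Step 2: pick one representative (medoid) per natural group
    medoids = []
    for g in groups:
        members = X[natural_labels == g]
        D = np.linalg.norm(members[:, None] - members[None, :], axis=2)
        medoids.append(members[D.sum(axis=0).argmin()])
    # Step 3: consolidate representatives into k final centers
    # (greedy merge of the two closest; unweighted midpoint for simplicity)
    centers = list(medoids)
    while len(centers) > k:
        C = np.array(centers)
        D = np.linalg.norm(C[:, None] - C[None, :], axis=2)
        np.fill_diagonal(D, np.inf)
        i, j = np.unravel_index(D.argmin(), D.shape)
        merged = (C[i] + C[j]) / 2
        centers = [c for t, c in enumerate(centers) if t not in (i, j)] + [merged]
    # Step 4: assign every non-outlier respondent to its nearest final center
    final = np.full(len(X), -1)
    C = np.array(centers)
    idx = np.where(keep)[0]
    final[idx] = np.linalg.norm(X[idx][:, None] - C[None, :], axis=2).argmin(axis=1)
    return final
```

Noise respondents keep the label -1 throughout, so true outliers never distort the final k segments.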

Survey example: You survey 1,200 B2B software customers on needs: budget size (numeric), decision-maker count (numeric), time-to-decision (numeric), industry (categorical), and product-market fit perception (categorical).

Step 1 (HDBSCAN finds natural clustering): Discovers maybe 7 natural groups, but marks 30 weird responses as noise

Step 2 (Pick medoids): Picks one "representative customer" from each of the 7 groups

Step 3-4 (Consolidate to balanced k=4): Merges similar groups and assigns all 1,170 non-noise respondents into:

  • Enterprise Buyers (290 respondents): Large budgets, many decision-makers, long sales cycles, mostly Fortune 500
  • Mid-Market Pragmatists (280 respondents): Medium budgets, 3-5 decision-makers, mid-length cycles, diverse industries
  • Fast-Growing Companies (285 respondents): Growing budgets, fewer decision-makers, quick cycles, startups and scale-ups
  • Cost-Conscious Small Teams (315 respondents): Small budgets, 1-2 decision-makers, quick decisions, SMBs

Each segment is balanced (~280-320 respondents) and represents a real, meaningful customer type.

When to use it (recommended default):

  • You're analyzing survey data with mixed question types (recommended!)
  • You want clear, balanced customer personas (4-5 segments, not 2 or 12)
  • You want to ignore true outliers but keep borderline respondents
  • You need segments that are both natural and actionable

What it's good at:

  • Survey data: the perfect fit for mixed numeric and categorical questions
  • Balanced personas: great for building 4-5 customer types for business strategy
  • Smart filtering: real outliers get marked as noise; unusual-but-real respondents still get segmented
  • Robustness: not sensitive to extreme individual answers

Limitations:

  • Most complex (slower than K-Means)
  • If your data truly has very unbalanced natural segments, balancing will force artificial boundaries
  • Requires tuning for best results

Quick Comparison Table

Aspect            | K-Means               | K-Medoids                   | HDBSCAN                        | Balanced HDBSCAN
Cluster count     | You specify k         | You specify k               | Auto-detected                  | You specify k
Data types        | Numeric only          | All types ✓                 | All types ✓                    | All types ✓
Speed             | Fast                  | Moderate                    | Slow                           | Slow
Robustness        | Sensitive to outliers | Robust                      | Robust                         | Robust
Outlier detection | No                    | No                          | Yes (noise)                    | Yes (noise)
Balanced sizes    | Tends toward equal    | Tends toward equal          | Variable                       | Balanced ✓
Best for          | Pure numeric surveys  | Mixed surveys, show medoids | Exploration, outlier discovery | Survey data (recommended)

Choosing Your Algorithm

Start here: Use Balanced HDBSCAN for survey data. It combines the best properties: robustness, mixed-data support, noise filtering, and balanced results.

See Clustering to get started running your first clustering analysis. This guide has step-by-step instructions and tips on selecting columns and reviewing results.

Use K-Means if:

  • All your survey questions are numeric (1-10 scales, numeric responses only)
  • You need to process a very large survey (50,000+ respondents) and speed matters
  • You want clean, distinct numeric profiles

Use K-Medoids if:

  • Your survey has mixed questions (ratings + categories + opinions)
  • You want to show stakeholders actual respondent profiles ("Meet your top 3 customer types")
  • You want robustness without filtering outliers

Use HDBSCAN if:

  • You're exploring survey data and don't yet know how many natural segments exist
  • You want to identify and remove survey noise (responses that don't fit any pattern)
  • You're okay with unbalanced segment sizes

Use Balanced HDBSCAN if:

  • You're analyzing survey data (recommended!)
  • You want 3-5 balanced customer personas for business strategy
  • You want mixed survey questions handled naturally

Tips for Better Results

Before you cluster:

  • Include relevant survey questions: mix behavioral (frequency, usage) + attitudinal (satisfaction, priority)
  • Don't duplicate: if you asked "overall satisfaction" and "overall satisfaction rescaled," include only one
  • Check data quality: extreme outlier responses can affect K-Means; balanced HDBSCAN filters them

After you cluster:

  • Review the top features per segment—do they tell a coherent customer story?
  • Look at segment sizes—are they actionable? (If you get 5% and 95%, something's off)
  • If quality is "Poor," try:
    • Different survey questions (maybe the ones you picked don't segment well)
    • A different algorithm
    • Adjusting min_cluster_size (higher = fewer, larger, more stable segments)

For business use:

  • A "Good" quality score with clear, actionable personas beats "Excellent" with confusing segments
  • Test your segments: do they correlate with business outcomes (retention, upgrade, churn)?
  • Iterate: compare different question combinations and pick the version that best matches your business reality