Understanding Clustering Methods
AddMaple offers four algorithms to discover natural groupings in your data. Here's how each one works, explained without the jargon.
Looking for a quick start? See Clustering for step-by-step instructions on using the clustering tool.
K-Means: The Centroid Champion
What it does: K-Means divides your data into exactly k groups by finding the "center point" of each group and assigning rows to their nearest center.
How it works (step by step):
- Pick `k` random starting points (centers) scattered throughout your data
- Assign every row to its nearest center
- Calculate a new center for each group based on all the rows assigned to it
- Repeat steps 2-3 until the centers stop moving (convergence)
Survey example: You survey 1,000 customers on numeric scales: overall satisfaction (1-10), likelihood to recommend (1-10), and price sensitivity (1-10). K-Means with k=3 might discover:
- Group 1 (center: 9.2, 8.8, 2.1): Highly satisfied, loyal, price-insensitive (premium segment)
- Group 2 (center: 5.5, 5.2, 6.8): Middle of the road on satisfaction, price-conscious (value segment)
- Group 3 (center: 3.1, 2.9, 8.4): Dissatisfied, likely to shop around, very price-sensitive (deal-hunters)
The algorithm moves those centers around until each customer settles with their nearest group.
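Those steps can be sketched in a few lines of NumPy. This is an illustrative toy, not AddMaple's implementation: the planted segment centers mirror the example above, and a deterministic farthest-point seeding stands in for the random (or k-means++) initialization a real implementation would use.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical responses: satisfaction, recommend, price sensitivity (1-10),
# with three planted segments mirroring the premium/value/deal-hunter example.
X = np.vstack([
    rng.normal([9.2, 8.8, 2.1], 0.4, size=(50, 3)),  # premium segment
    rng.normal([5.5, 5.2, 6.8], 0.4, size=(50, 3)),  # value segment
    rng.normal([3.1, 2.9, 8.4], 0.4, size=(50, 3)),  # deal-hunters
])

def kmeans(X, k, iters=100):
    # Deterministic farthest-point seeding (illustrative only; real K-Means
    # typically starts from random or k-means++ centers).
    centers = X[:1]
    for _ in range(k - 1):
        d = ((X[:, None] - centers) ** 2).sum(axis=2).min(axis=1)
        centers = np.vstack([centers, X[np.argmax(d)]])
    for _ in range(iters):
        # Step 2: assign every row to its nearest center...
        labels = ((X[:, None] - centers) ** 2).sum(axis=2).argmin(axis=1)
        # Step 3: ...then move each center to the mean of its assigned rows.
        new = np.vstack([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):  # convergence: centers stopped moving
            break
        centers = new
    return centers, labels

centers, labels = kmeans(X, k=3)
```

With well-separated segments like these, the loop settles after a couple of iterations and each center lands near one of the planted profiles.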
When to use it:
- You want exactly 3, 5, or 10 clusters—no guessing
- Your survey uses only numeric questions (1-10 scales, numerical values). Note: Opinion scales like Likert scales are treated as numeric
- You want something fast and interpretable
- Your customer segments should be distinct and roughly balanced
What it's good at:
- Pure numeric survey data with clear segment patterns (e.g., satisfaction profiles, usage intensity levels)
- Speed: runs quickly even on surveys with 10,000+ respondents
Limitations:
- Only works with numeric columns. If you include categorical survey questions (industry, yes/no, single-select responses), those columns get filtered out
- Assumes clusters are roughly equal size and round-shaped
- Can struggle if your segments naturally have very different sizes
- Sensitive to outliers (one respondent with extreme scores can skew things)
Important Note: How Data Types Are Treated
Numeric data includes:
- Numbers (e.g., age, company size, revenue)
- Opinion scales (e.g., 1-10 satisfaction, Likert scales, NPS scores)
- Percentages, Currency
Categorical data includes:
- Single-select categories (e.g., "Which industry?", "Yes/No/Maybe")
- Multi-select tags (e.g., "Which features do you use?")
For clustering:
- K-Means works with numeric data only (opinion scales included)
- K-Medoids works with both numeric and categorical
- HDBSCAN works with both numeric and categorical
- Balanced HDBSCAN works with both numeric and categorical
So if you have a 1-10 opinion scale question, it's treated as numeric and works with all four algorithms. If you have "Select your industry," that's categorical and K-Means will filter it out.
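That filtering rule can be summarized in a small sketch. The column names and the `usable_columns` helper are hypothetical illustrations, not part of AddMaple:

```python
# Hypothetical survey columns mapped to how their answers are typed.
columns = {
    "satisfaction_1_10": "numeric",   # opinion scale -> treated as numeric
    "nps_score": "numeric",
    "industry": "categorical",        # single-select -> categorical
    "features_used": "categorical",   # multi-select -> categorical
}

def usable_columns(columns, algorithm):
    """Which columns each algorithm can cluster on."""
    if algorithm == "K-Means":
        # K-Means is numeric-only: categorical columns get filtered out.
        return [name for name, kind in columns.items() if kind == "numeric"]
    # K-Medoids, HDBSCAN, and Balanced HDBSCAN accept all types.
    return list(columns)

print(usable_columns(columns, "K-Means"))           # -> ['satisfaction_1_10', 'nps_score']
print(usable_columns(columns, "Balanced HDBSCAN"))  # -> all four columns
```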
K-Medoids: The Robustness King
What it does: K-Medoids is like K-Means' tougher cousin. Instead of using an average center point, it picks actual respondents from your data as cluster centers (called "medoids").
How it works (step by step):
- Pick `k` actual respondents as starting medoids (using a smart algorithm)
- Assign every other respondent to their nearest medoid
- For each group, check if a different respondent would be a better representative (lower total distance)
- Swap out medoids if it improves the clustering
- Repeat steps 2-4 until no more beneficial swaps happen
Survey example: You survey 500 SaaS customers with mixed questions: satisfaction (1-10), usage frequency (daily/weekly/monthly), company size (1-50 / 50-200 / 200+), and churn risk (high/medium/low). K-Medoids with k=4 might pick these actual respondents as medoids:
- Medoid A: Sarah from TechCorp, satisfaction 9, daily user, 150 employees, low churn risk → represents your power users
- Medoid B: Mike from StartupXYZ, satisfaction 6, weekly user, 25 employees, medium churn risk → represents growing companies
- Medoid C: Elena from Enterprise Inc., satisfaction 7, daily user, 500+ employees, low churn risk → represents large accounts
- Medoid D: James from SmallBiz LLC, satisfaction 4, monthly user, 10 employees, high churn risk → represents at-risk customers
Every respondent gets assigned to whichever medoid (real person) is most similar to them.
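A toy version of the assign-and-swap loop, using a Gower-style distance so numeric and categorical answers mix: everything here is illustrative (a trimmed version of the example above, a naive spread-out seeding instead of the smart one, and hand-picked field weights), not AddMaple's actual implementation.

```python
import numpy as np

# Hypothetical mixed responses (trimmed from the example above):
# satisfaction (numeric 1-10), usage frequency and churn risk (categorical).
rows = [
    (9, "daily",   "low"),     # power-user-like
    (8, "daily",   "low"),
    (6, "weekly",  "medium"),  # growing-company-like
    (5, "weekly",  "medium"),
    (4, "monthly", "high"),    # at-risk-like
    (3, "monthly", "high"),
]

def gower(a, b, num_range=9.0):
    """Gower-style distance for mixed data: numeric fields contribute a
    range-scaled difference, categorical fields a 0/1 mismatch."""
    d_num = abs(a[0] - b[0]) / num_range
    d_cat = int(a[1] != b[1]) + int(a[2] != b[2])
    return (d_num + d_cat) / 3  # average over the 3 fields

n = len(rows)
D = np.array([[gower(rows[i], rows[j]) for j in range(n)] for i in range(n)])

def k_medoids(D, k):
    medoids = list(range(0, len(D), len(D) // k))[:k]  # naive spread-out init
    while True:
        labels = np.argmin(D[:, medoids], axis=1)  # nearest-medoid assignment
        improved = False
        for j in range(k):
            members = np.where(labels == j)[0]
            # Would a different member be a better representative (lower
            # total distance to its group)? If so, swap it in.
            best = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
            if best != medoids[j]:
                medoids[j], improved = best, True
        if not improved:
            return medoids, labels

medoids, labels = k_medoids(D, k=3)
```

Note that every cluster center is an index into `rows`, i.e. an actual respondent you could show to stakeholders.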
When to use it:
- Your survey has both numeric and categorical questions
- You want robust results without extreme values pulling things off
- You want to show stakeholders a "typical" respondent from each cluster ("This is Sarah, our power user profile")
What it's good at:
- Mixed survey questions: numbers + categories + opinions all work together
- Robustness: real respondents as centers, so not skewed by extreme answers
- Interpretability: "Meet your key customer types" becomes real people
- Irregular segment shapes: handles naturally unbalanced customer types
Limitations:
- Slower than K-Means
- Still requires you to pick `k` upfront
- Still assumes segments are roughly balanced in size
HDBSCAN: The Natural Explorer
What it does: HDBSCAN automatically finds natural, density-based clusters without you specifying how many. It treats sparse outliers as "noise" and focuses on dense regions.
How it works (conceptually):
- Build a k-nearest-neighbor graph: for each respondent, find their closest neighbors
- Estimate the "density" around each respondent (are they clustered with similar people or isolated?)
- Find regions of high density—areas where many respondents cluster together
- Mark sparse, isolated respondents as noise/outliers
- Group the dense regions into clusters
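The density idea behind steps 1-4 can be illustrated with a toy check. This is not HDBSCAN itself — the real algorithm builds a full density hierarchy rather than applying one fixed cutoff — but it shows how "distance to your k-th nearest neighbour" separates dense respondents from isolated ones. The data, `k = 5`, and the `1.0` cutoff are all made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two dense groups of hypothetical respondents plus three scattered outliers.
X = np.vstack([
    rng.normal([0, 0], 0.3, size=(40, 2)),
    rng.normal([5, 5], 0.3, size=(40, 2)),
    [[2.5, 2.5], [8.0, 0.0], [-3.0, 6.0]],
])

k = 5
D = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(axis=2))
# Distance to the k-th nearest neighbour: small in dense regions, large for
# isolated points (akin to HDBSCAN's "core distance").
core_dist = np.sort(D, axis=1)[:, k]  # column 0 is the point itself (distance 0)

noise = core_dist > 1.0  # toy cutoff; HDBSCAN derives this from the hierarchy
print(noise.sum())       # counts the scattered points flagged as noise
```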
Survey example: You survey 800 companies on attitudes to cloud migration: cost concern (1-10), security concern (1-10), innovation priority (1-10), existing cloud usage (%), and industry. HDBSCAN might naturally discover:
- Cloud-Ready Innovators (200 respondents clustered together): Low cost concern, low security concern, high innovation priority, already 60%+ cloud-using tech/finance companies
- Cost-Conscious Pragmatists (300 respondents): High cost concern, medium security concern, medium innovation priority, 20-40% cloud usage, mixed industries
- Security-First Enterprises (180 respondents): Medium cost concern, high security concern, low innovation priority, 30-50% cloud, healthcare/finance heavily represented
- Outliers/Noise (20 respondents): Scattered responses that don't fit any pattern—maybe early-stage or unusual hybrid approaches
The algorithm automatically discovers these 3 clusters (plus noise) without you telling it "find 3."
When to use it:
- You're exploring survey data and don't know how many segments exist
- You want to identify genuinely unusual respondents (the noise group)
- You have any mix of data types in your survey
What it's good at:
- Finding the "natural" number of segments automatically
- Identifying outliers and unusual respondents
- Handling segments that naturally vary in size
- Mixed survey data: numbers, categories, opinions all work
Limitations:
- Slower than K-Means
- If your `min_cluster_size` is too high, you might call real segments "noise"
- Requires tuning: `min_cluster_size` affects how many respondents form a cluster
- If respondents are very scattered, most might be marked noise
Balanced HDBSCAN: The Best-of-Both-Worlds
What it does: Combines HDBSCAN's smart outlier filtering with K-Medoids' balanced cluster sizes. It automatically finds natural survey segments, then refines them to be more even-sized.
How it works (step by step):
- Run HDBSCAN to find natural customer segments and filter out true outliers
- Pick one representative respondent (medoid) from each natural segment
- Use K-Medoids to consolidate those representatives into exactly `k` final segments
- Assign all non-outlier respondents to the `k` segments, balancing sizes
Survey example: You survey 1,200 B2B software customers on needs: budget size (numeric), decision-maker count (numeric), time-to-decision (numeric), industry (categorical), and product-market fit perception (categorical).
Step 1 (HDBSCAN finds natural clustering): Discovers maybe 7 natural groups, but marks 30 weird responses as noise
Step 2 (Pick medoids): Picks one "representative customer" from each of the 7 groups
Step 3-4 (Consolidate to balanced k=4): Merges similar groups and assigns all 1,170 non-noise respondents into:
- Enterprise Buyers (290 respondents): Large budgets, many decision-makers, long sales cycles, mostly Fortune 500
- Mid-Market Pragmatists (280 respondents): Medium budgets, 3-5 decision-makers, mid-length cycles, diverse industries
- Fast-Growing Companies (285 respondents): Growing budgets, fewer decision-makers, quick cycles, startups and scale-ups
- Cost-Conscious Small Teams (315 respondents): Small budgets, 1-2 decision-makers, quick decisions, SMBs
Each segment is balanced (~280-320 respondents) and represents a real, meaningful customer type.
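The filter-then-consolidate idea can be caricatured in one dimension. Everything below — the data, the `noise_gap` threshold, and quantiles as stand-ins for medoid seeding — is illustrative; the real pipeline runs HDBSCAN and K-Medoids proper on all your survey columns.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical 1-D "budget" scores: two natural segments plus two outliers.
x = np.concatenate([rng.normal(2, 0.2, 30), rng.normal(8, 0.2, 30), [20.0, -10.0]])

def balanced_pipeline(x, k, noise_gap=2.0):
    # Step 1: density-style filter -- a point whose 3rd-nearest neighbour is
    # far away is treated as noise (stand-in for the HDBSCAN pass).
    D = np.abs(x[:, None] - x[None, :])
    noise = np.sort(D, axis=1)[:, 3] > noise_gap
    xs = x[~noise]
    # Steps 2-4: seed k spread-out centers and assign every kept respondent
    # to the nearest one (stand-in for the K-Medoids consolidation).
    seeds = np.quantile(xs, [(i + 0.5) / k for i in range(k)])
    labels = np.argmin(np.abs(xs[:, None] - seeds[None, :]), axis=1)
    return noise, labels

noise, labels = balanced_pipeline(x, k=2)
```

The two extreme responses are dropped as noise, and the remaining sixty split into two even segments; borderline respondents would still land in whichever segment is nearest.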
When to use it (recommended default):
- You're analyzing survey data with mixed question types (recommended!)
- You want clear, balanced customer personas (4-5 segments, not 2 or 12)
- You want to ignore true outliers but keep borderline respondents
- You need segments that are both natural and actionable
What it's good at:
- Survey data: the perfect fit for mixed numeric and categorical questions
- Balanced personas: great for building 4-5 customer types for business strategy
- Smart filtering: real outliers get marked as noise; unusual-but-real respondents still get segmented
- Robustness: not sensitive to extreme individual answers
Limitations:
- Most complex (slower than K-Means)
- If your data truly has very unbalanced natural segments, balancing will force artificial boundaries
- Requires tuning for best results
Quick Comparison Table
| Aspect | K-Means | K-Medoids | HDBSCAN | Balanced HDBSCAN |
|---|---|---|---|---|
| Cluster count | You specify k | You specify k | Auto-detected | You specify k |
| Data types | Numeric only | All types ✓ | All types ✓ | All types ✓ |
| Speed | Fast | Moderate | Slow | Slow |
| Robustness | Sensitive to outliers | Robust | Robust | Robust |
| Outlier detection | No | No | Yes (noise) | Yes (noise) |
| Balanced sizes | Tends toward equal | Tends toward equal | Variable | Balanced ✓ |
| Best for | Pure numeric surveys | Mixed surveys, show medoids | Exploration, outlier discovery | Survey data (recommended) |
Choosing Your Algorithm
Start here: Use Balanced HDBSCAN for survey data. It combines the best properties: robustness, mixed-data support, noise filtering, and balanced results.
See Clustering to get started running your first clustering analysis. This guide has step-by-step instructions and tips on selecting columns and reviewing results.
Use K-Means if:
- All your survey questions are numeric (1-10 scales, numeric responses only)
- You need to process a very large survey (50,000+ respondents) and speed matters
- You want clean, distinct numeric profiles
Use K-Medoids if:
- Your survey has mixed questions (ratings + categories + opinions)
- You want to show stakeholders actual respondent profiles ("Meet your top 3 customer types")
- You want robustness without filtering outliers
Use HDBSCAN if:
- You're exploring survey data and don't yet know how many natural segments exist
- You want to identify and remove survey noise (responses that don't fit any pattern)
- You're okay with unbalanced segment sizes
Use Balanced HDBSCAN if:
- You're analyzing survey data (recommended!)
- You want 3-5 balanced customer personas for business strategy
- You want mixed survey questions handled naturally
Tips for Better Results
Before you cluster:
- Include relevant survey questions: mix behavioral (frequency, usage) + attitudinal (satisfaction, priority)
- Don't duplicate: if you asked "overall satisfaction" and "overall satisfaction rescaled," include only one
- Check data quality: extreme outlier responses can affect K-Means; balanced HDBSCAN filters them
After you cluster:
- Review the top features per segment—do they tell a coherent customer story?
- Look at segment sizes—are they actionable? (If you get 5% and 95%, something's off)
- If quality is "Poor," try:
- Different survey questions (maybe the ones you picked don't segment well)
- A different algorithm
- Adjusting `min_cluster_size` (higher = fewer, larger, more stable segments)
For business use:
- A "Good" quality score with clear, actionable personas beats "Excellent" with confusing segments
- Test your segments: do they correlate with business outcomes (retention, upgrade, churn)?
- Iterate: compare different question combinations and pick the version that best matches your business reality