Understanding Clustering Methods

AddMaple offers four algorithms to discover natural groupings in your data. Here's how each one works, explained without the jargon.

Looking for a quick start? See Clustering for step-by-step instructions on using the clustering tool.

K-Means: The Centroid Champion

What it does: K-Means divides your data into exactly k groups by finding the "center point" of each group and assigning rows to their nearest center.

How it works (step by step):

  1. Pick k random starting points (centers) scattered throughout your data
  2. Assign every row to its nearest center
  3. Calculate a new center for each group based on all the rows assigned to it
  4. Repeat steps 2-3 until the centers stop moving (convergence)
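The loop above can be sketched in a few lines of NumPy. This is an illustrative toy, not AddMaple's implementation:

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Toy K-Means following the four steps above (illustration only)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k random rows as the starting centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign every row to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned rows
        # (an empty cluster keeps its old center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer move (convergence)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Run on two well-separated groups of rows, the centers settle on one group each and every row ends up with its nearest center.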

Survey example: You survey 1,000 customers on numeric scales: overall satisfaction (1-10), likelihood to recommend (1-10), and price sensitivity (1-10). K-Means with k=3 might discover:

  • Group 1 (center: 9.2, 8.8, 2.1): Highly satisfied, loyal, price-insensitive (premium segment)
  • Group 2 (center: 5.5, 5.2, 6.8): Middle of the road on satisfaction, price-conscious (value segment)
  • Group 3 (center: 3.1, 2.9, 8.4): Dissatisfied, likely to shop around, very price-sensitive (deal-hunters)

The algorithm moves those centers around until each customer settles with their nearest group.

When to use it:

  • You want exactly 3, 5, or 10 clusters—no guessing
  • Your survey uses only numeric questions (1-10 scales, other numeric values); note that opinion scales like Likert scales are treated as numeric
  • You want something fast and interpretable
  • Your customer segments should be distinct and roughly balanced

What it's good at:

  • Pure numeric survey data with clear segment patterns (e.g., satisfaction profiles, usage intensity levels)
  • Speed: runs quickly even on surveys with 10,000+ respondents

Limitations:

  • Only works with numeric columns. If you include categorical survey questions (industry, yes/no, single-select responses), those columns get filtered out
  • Assumes clusters are roughly equal size and round-shaped
  • Can struggle if your segments naturally have very different sizes
  • Sensitive to outliers (one respondent with extreme scores can skew things)

Important Note: How Data Types Are Treated

Numeric data includes:

  • Numbers (e.g., age, company size, revenue)
  • Opinion scales (e.g., 1-10 satisfaction, Likert scales, NPS scores)
  • Percentages and currency values

Categorical data includes:

  • Single-select categories (e.g., "Which industry?", "Yes/No/Maybe")
  • Multi-select tags (e.g., "Which features do you use?")

For clustering:

  • K-Means works with numeric data only (opinion scales included)
  • K-Medoids works with both numeric and categorical
  • HDBSCAN works with both numeric and categorical
  • Balanced HDBSCAN works with both numeric and categorical

So if you have a 1-10 opinion scale question, it's treated as numeric and works with all four algorithms. If you have "Select your industry," that's categorical and K-Means will filter it out.
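As an illustration, here is how a numeric-only algorithm such as K-Means might filter a mixed survey table. This is a pandas sketch with made-up data; AddMaple does this filtering for you automatically:

```python
import pandas as pd

# Made-up survey responses: one opinion scale and one single-select question
df = pd.DataFrame({
    "satisfaction_1_10": [9, 5, 3, 7],                  # opinion scale -> numeric
    "industry": ["Tech", "Retail", "Tech", "Finance"],  # single-select -> categorical
})

# A numeric-only algorithm keeps just the numeric columns:
numeric_only = df.select_dtypes(include="number")
print(list(numeric_only.columns))  # prints ['satisfaction_1_10']
```

The "industry" column is simply dropped from the clustering input, which is exactly why mixed surveys are usually better served by the other three algorithms.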


K-Medoids: The Robustness King

What it does: K-Medoids is like K-Means' tougher cousin. Instead of using an average center point, it picks actual respondents from your data as cluster centers (called "medoids").

How it works (step by step):

  1. Pick k actual respondents as starting medoids (using a smart algorithm)
  2. Assign every other respondent to their nearest medoid
  3. For each group, check if a different respondent would be a better representative (lower total distance)
  4. Swap out medoids if it improves the clustering
  5. Repeat steps 2-4 until no more beneficial swaps happen
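In code, the swap loop might look like this toy PAM-style sketch over a precomputed distance matrix. It is illustrative only; the distances can come from any metric, including one that handles categorical answers:

```python
import numpy as np

def k_medoids(D, k, max_iters=100, seed=0):
    """Toy K-Medoids on a precomputed n-by-n distance matrix D (sketch only)."""
    rng = np.random.default_rng(seed)
    n = len(D)
    # Step 1: pick k actual rows as starting medoids
    medoids = list(rng.choice(n, size=k, replace=False))
    for _ in range(max_iters):
        # Step 2: assign every row to its nearest medoid
        labels = D[:, medoids].argmin(axis=1)
        improved = False
        # Steps 3-4: within each group, swap in a better representative
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # Total distance from each member to all other members
            costs = D[np.ix_(members, members)].sum(axis=0)
            best = members[costs.argmin()]
            if best != medoids[j]:
                medoids[j] = best
                improved = True
        # Step 5: stop when no beneficial swap remains
        if not improved:
            break
    return labels, medoids
```

Because the medoids are indices into the original data, each one is a real respondent you can show to stakeholders.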

Survey example: You survey 500 SaaS customers with mixed questions: satisfaction (1-10), usage frequency (daily/weekly/monthly), company size (1-50 / 50-200 / 200+), and churn risk (high/medium/low). K-Medoids with k=4 might pick these actual respondents as medoids:

  • Medoid A: Sarah from TechCorp, satisfaction 9, daily user, 150 employees, low churn risk → represents your power users
  • Medoid B: Mike from StartupXYZ, satisfaction 6, weekly user, 25 employees, medium churn risk → represents growing companies
  • Medoid C: Elena from Enterprise Inc., satisfaction 7, daily user, 500+ employees, low churn risk → represents large accounts
  • Medoid D: James from SmallBiz LLC, satisfaction 4, monthly user, 10 employees, high churn risk → represents at-risk customers

Every respondent gets assigned to whichever medoid (real person) is most similar to them.

When to use it:

  • Your survey has both numeric and categorical questions
  • You want robust results without extreme values pulling things off
  • You want to show stakeholders a "typical" respondent from each cluster ("This is Sarah, our power user profile")

What it's good at:

  • Mixed survey questions: numbers + categories + opinions all work together
  • Robustness: real respondents as centers, so not skewed by extreme answers
  • Interpretability: "Meet your key customer types" becomes real people
  • Irregular segment shapes: handles non-spherical clusters better than K-Means
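How can "distance" work across numbers and categories at once? A Gower-style metric is one common approach; the sketch below shows the general idea with made-up inputs (AddMaple's exact metric may differ):

```python
def gower_distance(a, b, is_numeric, ranges):
    """Gower-style mixed distance between two respondents (sketch only).

    Numeric answers contribute a range-normalised difference in [0, 1];
    categorical answers contribute 0 (match) or 1 (mismatch).
    """
    total = 0.0
    for x, y, numeric, r in zip(a, b, is_numeric, ranges):
        if numeric:
            total += abs(x - y) / r          # e.g. |9 - 4| / 9 for a 1-10 scale
        else:
            total += 0.0 if x == y else 1.0  # "daily" vs "monthly" -> 1
    return total / len(a)                    # average over all questions
```

Two identical respondents get distance 0; each question, numeric or categorical, contributes equally on a 0-1 scale.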

Limitations:

  • Slower than K-Means
  • Still requires you to pick k upfront
  • Still assumes segments are roughly balanced in size

HDBSCAN: The Natural Explorer

What it does: HDBSCAN automatically finds natural, density-based clusters without you specifying how many. It treats sparse outliers as "noise" and focuses on dense regions.

How it works (conceptually):

  1. Build a k-nearest-neighbor graph: for each respondent, find their closest neighbors
  2. Estimate the "density" around each respondent (are they clustered with similar people or isolated?)
  3. Find regions of high density—areas where many respondents cluster together
  4. Mark sparse, isolated respondents as noise/outliers
  5. Group the dense regions into clusters
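Real HDBSCAN is considerably more sophisticated, but the density idea can be illustrated with a drastically simplified DBSCAN-like toy: count neighbours, keep the dense points, flood-fill connected dense regions into clusters, and call the rest noise:

```python
import numpy as np

def density_clusters(X, eps=0.5, min_neighbors=3):
    """Toy density-based grouping in the spirit of the steps above.

    A drastic simplification of HDBSCAN, for intuition only: noise points
    get the label -1, dense connected regions get labels 0, 1, 2, ...
    """
    n = len(X)
    # Steps 1-2: estimate density by counting neighbours within eps
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    dense = (D < eps).sum(axis=1) - 1 >= min_neighbors  # -1 excludes self
    # Step 4: everything starts as noise; sparse points stay that way
    labels = np.full(n, -1)
    # Steps 3 and 5: flood-fill connected dense regions into clusters
    cluster = 0
    for i in range(n):
        if dense[i] and labels[i] == -1:
            stack = [i]
            labels[i] = cluster
            while stack:
                p = stack.pop()
                for q in np.where((D[p] < eps) & dense)[0]:
                    if labels[q] == -1:
                        labels[q] = cluster
                        stack.append(q)
            cluster += 1
    return labels
```

Note that the cluster count falls out of the data: two dense regions give two clusters, and an isolated respondent is labelled -1 (noise) without anyone specifying k.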

Survey example: You survey 800 companies on attitudes to cloud migration: cost concern (1-10), security concern (1-10), innovation priority (1-10), existing cloud usage (%), and industry. HDBSCAN might naturally discover:

  • Cloud-Ready Innovators (200 respondents clustered together): Low cost concern, low security concern, high innovation priority, already 60%+ cloud-using tech/finance companies
  • Cost-Conscious Pragmatists (300 respondents): High cost concern, medium security concern, medium innovation priority, 20-40% cloud usage, mixed industries
  • Security-First Enterprises (180 respondents): Medium cost concern, high security concern, low innovation priority, 30-50% cloud, healthcare/finance heavily represented
  • Outliers/Noise (20 respondents): Scattered responses that don't fit any pattern—maybe early-stage or unusual hybrid approaches

The algorithm automatically discovers these 3 clusters (plus noise) without you telling it "find 3."

When to use it:

  • You're exploring survey data and don't know how many segments exist
  • You want to identify genuinely unusual respondents (the noise group)
  • You have any mix of data types in your survey

What it's good at:

  • Finding the "natural" number of segments automatically
  • Identifying outliers and unusual respondents
  • Handling segments that naturally vary in size
  • Mixed survey data: numbers, categories, opinions all work

Limitations:

  • Slower than K-Means
  • If your min_cluster_size is too high, you might call real segments "noise"
  • Requires tuning: min_cluster_size affects how many respondents form a cluster
  • If respondents are very scattered, most might be marked noise

Balanced HDBSCAN: The Best-of-Both-Worlds

What it does: Combines HDBSCAN's smart outlier filtering with K-Medoids' balanced cluster sizes. It automatically finds natural survey segments, then refines them to be more even-sized.

How it works (step by step):

  1. Run HDBSCAN to find natural customer segments and filter out true outliers
  2. Pick one representative respondent (medoid) from each natural segment
  3. Use K-Medoids to consolidate those representatives into exactly k final segments
  4. Assign all non-outlier respondents to the k segments, balancing sizes
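The consolidation steps (2-4) might look roughly like the sketch below, which takes HDBSCAN-style labels as input. A greedy merge of the closest representatives stands in here for the K-Medoids consolidation; this is illustrative only, not AddMaple's actual implementation:

```python
import numpy as np

def balance_segments(X, natural_labels, k):
    """Consolidate HDBSCAN-style labels (-1 = noise) down to k segments (sketch)."""
    keep = natural_labels != -1                      # Step 1: drop true outliers
    groups = sorted(set(natural_labels[keep].tolist()))
    # Step 2: pick one representative (medoid) per natural group
    medoids = []
    for g in groups:
        members = X[natural_labels == g]
        D = np.linalg.norm(members[:, None] - members[None, :], axis=2)
        medoids.append(members[D.sum(axis=0).argmin()])
    # Step 3: consolidate representatives into k final centers
    # (greedy merge of the two closest; unweighted midpoint for simplicity)
    centers = list(medoids)
    while len(centers) > k:
        C = np.array(centers)
        D = np.linalg.norm(C[:, None] - C[None, :], axis=2)
        np.fill_diagonal(D, np.inf)
        i, j = np.unravel_index(D.argmin(), D.shape)
        merged = (C[i] + C[j]) / 2
        centers = [c for t, c in enumerate(centers) if t not in (i, j)] + [merged]
    # Step 4: assign every non-outlier respondent to its nearest final center
    final = np.full(len(X), -1)
    C = np.array(centers)
    idx = np.where(keep)[0]
    final[idx] = np.linalg.norm(X[idx][:, None] - C[None, :], axis=2).argmin(axis=1)
    return final
```

Noise respondents keep the label -1 throughout, so true outliers never distort the final k segments.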

Survey example: You survey 1,200 B2B software customers on needs: budget size (numeric), decision-maker count (numeric), time-to-decision (numeric), industry (categorical), and product-market fit perception (categorical).

Step 1 (HDBSCAN finds natural clustering): Discovers maybe 7 natural groups, but marks 30 weird responses as noise

Step 2 (Pick medoids): Picks one "representative customer" from each of the 7 groups

Step 3-4 (Consolidate to balanced k=4): Merges similar groups and assigns all 1,170 non-noise respondents into:

  • Enterprise Buyers (290 respondents): Large budgets, many decision-makers, long sales cycles, mostly Fortune 500
  • Mid-Market Pragmatists (280 respondents): Medium budgets, 3-5 decision-makers, mid-length cycles, diverse industries
  • Fast-Growing Companies (285 respondents): Growing budgets, fewer decision-makers, quick cycles, startups and scale-ups
  • Cost-Conscious Small Teams (315 respondents): Small budgets, 1-2 decision-makers, quick decisions, SMBs

Each segment is balanced (~280-320 respondents) and represents a real, meaningful customer type.

When to use it (recommended default):

  • You're analyzing survey data with mixed question types (recommended!)
  • You want clear, balanced customer personas (4-5 segments, not 2 or 12)
  • You want to ignore true outliers but keep borderline respondents
  • You need segments that are both natural and actionable

What it's good at:

  • Survey data: the perfect fit for mixed numeric and categorical questions
  • Balanced personas: great for building 4-5 customer types for business strategy
  • Smart filtering: real outliers get marked as noise; unusual-but-real respondents still get segmented
  • Robustness: not sensitive to extreme individual answers

Limitations:

  • Most complex (slower than K-Means)
  • If your data truly has very unbalanced natural segments, balancing will force artificial boundaries
  • Requires tuning for best results

Quick Comparison Table

Aspect            | K-Means               | K-Medoids                   | HDBSCAN                        | Balanced HDBSCAN
Cluster count     | You specify k         | You specify k               | Auto-detected                  | You specify k
Data types        | Numeric only          | All types ✓                 | All types ✓                    | All types ✓
Speed             | Fast                  | Moderate                    | Slow                           | Slow
Robustness        | Sensitive to outliers | Robust                      | Robust                         | Robust
Outlier detection | No                    | No                          | Yes (noise)                    | Yes (noise)
Balanced sizes    | Tends toward equal    | Tends toward equal          | Variable                       | Balanced ✓
Best for          | Pure numeric surveys  | Mixed surveys, show medoids | Exploration, outlier discovery | Survey data (recommended)

Choosing Your Algorithm

Start here: Use Balanced HDBSCAN for survey data. It combines the best properties: robustness, mixed-data support, noise filtering, and balanced results.

See Clustering to get started running your first clustering analysis. This guide has step-by-step instructions and tips on selecting columns and reviewing results.

Use K-Means if:

  • All your survey questions are numeric (1-10 scales, numeric responses only)
  • You need to process a very large survey (50,000+ respondents) and speed matters
  • You want clean, distinct numeric profiles

Use K-Medoids if:

  • Your survey has mixed questions (ratings + categories + opinions)
  • You want to show stakeholders actual respondent profiles ("Meet your top 3 customer types")
  • You want robustness without filtering outliers

Use HDBSCAN if:

  • You're exploring survey data and don't yet know how many natural segments exist
  • You want to identify and remove survey noise (responses that don't fit any pattern)
  • You're okay with unbalanced segment sizes

Use Balanced HDBSCAN if:

  • You're analyzing survey data (recommended!)
  • You want 3-5 balanced customer personas for business strategy
  • You want mixed survey questions handled naturally

Tips for Better Results

Before you cluster:

  • Include relevant survey questions: mix behavioral (frequency, usage) + attitudinal (satisfaction, priority)
  • Don't duplicate: if you asked "overall satisfaction" and "overall satisfaction rescaled," include only one
  • Check data quality: extreme outlier responses can affect K-Means; balanced HDBSCAN filters them

After you cluster:

  • Review the top features per segment—do they tell a coherent customer story?
  • Look at segment sizes—are they actionable? (If you get 5% and 95%, something's off)
  • If quality is "Poor," try:
    • Different survey questions (maybe the ones you picked don't segment well)
    • A different algorithm
    • Adjusting min_cluster_size (higher = fewer, larger, more stable segments)

For business use:

  • A "Good" quality score with clear, actionable personas beats "Excellent" with confusing segments
  • Test your segments: do they correlate with business outcomes (retention, upgrade, churn)?
  • Iterate: compare different question combinations and pick the version that best matches your business reality