Clustering
Clustering groups similar rows together to reveal natural segments (e.g., types of customers or respondent profiles). It works well for survey and behavioral datasets where you have a mix of Numbers, Single/Multi‑Category, and Opinion Scales.
Open the tool
- Click the More menu.
- Choose Create clusters.
This opens a panel where you pick the columns to include, choose an algorithm, and run clustering. After reviewing the results, you can add a calculated cluster column to your dataset.
How it works (quick version)
Clustering finds groups of rows that are more similar to each other than to the rest of the dataset. AddMaple offers multiple algorithms:
- Balanced HDBSCAN (recommended): Works with mixed data types. Automatically balances cluster sizes while finding natural groupings.
- HDBSCAN: Density‑based clustering that can detect outliers. Works with all data types.
- K‑Medoids: Similar to K‑Means but more robust to outliers. Works with all data types.
- K‑Means: Classic algorithm for numeric data only (Numbers, Opinion Scales, Percentages, Currency).
For survey data with mixed categories and numbers, start with Balanced HDBSCAN.
Want more detail? See Understanding Clustering Methods for a deep dive into how each algorithm works and when to use it.
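If you're curious what these algorithm families look like in code, here's a minimal sketch using the open-source scikit-learn library. It's an analogue, not AddMaple's implementation; in particular, Balanced HDBSCAN's size balancing is AddMaple-specific and not shown. The sketch contrasts density-based clustering, which finds the cluster count itself and flags noise, with centroid-based clustering, which needs the count upfront:

```python
# Illustrative analogues of the algorithm families above, using
# scikit-learn. Note: plain HDBSCAN here does NOT balance cluster
# sizes; that behavior is specific to AddMaple's Balanced HDBSCAN.
from sklearn.cluster import HDBSCAN, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Density-based: finds the number of clusters itself and labels
# outliers as -1 ("noise").
hdb = HDBSCAN(min_cluster_size=50).fit(X)
found = len(set(hdb.labels_)) - (1 if -1 in hdb.labels_ else 0)
print(f"HDBSCAN: {found} clusters, {(hdb.labels_ == -1).sum()} noise rows")

# Centroid-based: you choose the number of clusters upfront, and
# every row is assigned to one (no noise label).
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(f"K-Means: every row assigned to one of {km.n_clusters} clusters")
```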
Step 1 — Select columns
Choose the columns you want to use for clustering. You can include:
- Numbers and Opinion Scales
- Single Category and Multi Category
Tip: Include a balanced mix of behavioral and attitudinal variables, and remove near‑duplicate columns so the same idea isn't weighted twice.
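Under the hood, mixed column types have to be put on a common footing before row-to-row distances can be computed. AddMaple handles this for you; the sketch below shows one common approach (scale numeric columns, expand categories into 0/1 indicators), with made-up column names:

```python
# One common way to make mixed survey columns comparable before
# clustering: scale numeric columns, one-hot encode categories.
# Column names here are hypothetical; AddMaple does this for you.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age": [25, 34, 52, 41],                       # Number
    "satisfaction": [4, 5, 2, 3],                  # Opinion Scale
    "plan": ["free", "pro", "pro", "enterprise"],  # Single Category
})

numeric = StandardScaler().fit_transform(df[["age", "satisfaction"]])
categories = pd.get_dummies(df["plan"], prefix="plan")  # 0/1 indicators

features = pd.concat(
    [pd.DataFrame(numeric, columns=["age", "satisfaction"]), categories],
    axis=1,
)
print(features)
```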
Step 2 — Choose algorithm
- Balanced HDBSCAN (recommended): Works with Numbers, Single Category, and Multi Category. Best default for survey data. Balances cluster sizes automatically.
- HDBSCAN: Finds density‑based clusters and marks outliers as noise. Works with all data types.
- K‑Medoids: Requires you to specify the number of clusters. Works with all data types.
- K‑Means: Requires you to specify the number of clusters. Only works with numeric columns (Numbers, Opinion Scales, Percentages, Currency).
When you select K‑Means, incompatible columns are automatically filtered out.
Key choice: Want exactly 3 clusters, or 5, or 10? Use K‑Medoids (works with any data type) or K‑Means (numeric data only). Both let you set the exact number upfront. Use the HDBSCAN variants when you want the algorithm to find the natural number of clusters automatically.
For detailed explanations of each method's strengths and limitations, see Understanding Clustering Methods.
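If you do fix the cluster count upfront, one common heuristic for choosing it is to try a few candidate values and compare a separation score. A rough sketch with scikit-learn's K-Means (illustrative only; in AddMaple the count is a single setting):

```python
# When the cluster count must be chosen upfront (K-Means/K-Medoids),
# trying several values and comparing silhouette scores is a common
# heuristic. Sketch only, using synthetic data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)

for k in (3, 5, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(f"k={k:2d}  silhouette={silhouette_score(X, labels):.2f}")
```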
Advanced options (optional)
Common settings across algorithms:
- Number of clusters: For Balanced HDBSCAN and K‑Medoids/K‑Means, set how many groups you want. Balanced HDBSCAN can use "Auto" to detect the optimal number.
- Min cluster size: Minimum rows required to form a cluster. Higher values produce larger, more stable groups (default: 50).
Each algorithm has additional tuning parameters you can adjust for fine control. Most users should stick with the defaults.
You can reset to recommended settings anytime.
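To build intuition for the Min cluster size setting, here's a sketch using scikit-learn's plain HDBSCAN (again an analogue, not AddMaple's variant): raising the minimum typically merges small groups and changes how many rows end up as noise.

```python
# How Min cluster size trades granularity for stability: larger
# values typically merge small groups and shift borderline rows
# into noise. Sketch only, on synthetic data.
from sklearn.cluster import HDBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=6, cluster_std=1.5,
                  random_state=0)

for mcs in (10, 50, 200):
    labels = HDBSCAN(min_cluster_size=mcs).fit_predict(X)
    n = len(set(labels)) - (1 if -1 in labels else 0)
    noise = (labels == -1).mean()
    print(f"min_cluster_size={mcs:3d}  clusters={n}  noise={noise:.0%}")
```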
Run and review
Click Run Clustering. You'll see:
- Detected clusters (and a possible "Noise" group for outliers)
- Each cluster's size and percent of rows
- Top features per cluster to help explain what makes each group distinct
How to read features:
- Numeric features show the cluster mean and a z‑score vs the dataset mean.
- Categorical features show the percent in the cluster and a lift (×) vs overall. The sketch below shows how both are computed.
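Here's a minimal sketch of how these two diagnostics are typically computed (illustrative formulas with made-up data; not necessarily AddMaple's exact implementation):

```python
# How the two feature diagnostics are typically computed (sketch;
# not necessarily AddMaple's exact formulas). Assumes a DataFrame
# with a "cluster" label column.
import pandas as pd

df = pd.DataFrame({
    "cluster": ["A", "A", "A", "B", "B", "B"],
    "spend": [120, 90, 150, 20, 35, 25],  # numeric feature
    "is_mobile": [1, 1, 0, 0, 0, 1],      # categorical as 0/1
})

# Numeric: z-score of the cluster mean vs. the dataset mean.
overall_mean, overall_std = df["spend"].mean(), df["spend"].std()
cluster_mean = df.loc[df.cluster == "A", "spend"].mean()
z = (cluster_mean - overall_mean) / overall_std
print(f"Cluster A spend: mean={cluster_mean:.0f}, z={z:+.2f}")

# Categorical: percent in the cluster, and lift (x) vs. overall.
overall_pct = df["is_mobile"].mean()
cluster_pct = df.loc[df.cluster == "A", "is_mobile"].mean()
print(f"Cluster A mobile: {cluster_pct:.0%}, "
      f"lift={cluster_pct / overall_pct:.1f}x")
```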
Name and save clusters
You may see suggested names and descriptions for clusters to speed up labeling. When you're happy, click Add Cluster Column to add a new Single Category column to your dataset with the cluster labels. You can rename values later.
Understand cluster quality
After running clustering, AddMaple shows you a Quality rating (Excellent, Good, Fair, or Poor). This helps you judge how cohesive and well-separated your clusters are.
The quality is determined by two statistical measures:
Silhouette Score measures how well-separated your clusters are: how similar rows within each cluster are to each other, compared to rows outside that cluster (see the sketch after the ratings below).
- 0.7+ = Excellent clustering (very tight, well-defined groups)
- 0.5–0.7 = Good clustering (clear separation)
- 0.2–0.5 = Fair clustering (some overlap; patterns are still meaningful)
- <0.2 = Poor clustering (groups are scattered; reconsider your columns or algorithm)
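For reference, this is how a silhouette score is computed with scikit-learn (AddMaple reports it for you automatically after each run):

```python
# Computing a silhouette score with scikit-learn (illustrative;
# AddMaple reports this after each clustering run).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

score = silhouette_score(X, labels)  # ranges from -1 to 1
print(f"silhouette={score:.2f}")     # 0.7+ would rate "Excellent"
```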
PERMANOVA is a statistical test that checks whether your clusters are genuinely different rather than random noise. It returns an R² value and a p-value (sketched in code below):
- R² > 0.5 = Strong separation (differences are substantial)
- R² 0.3–0.5 = Moderate separation (differences exist but are smaller)
- R² < 0.3 = Weak separation (very subtle or no real patterns)
- p-value < 0.05 = Statistically significant (the clustering is real, not due to chance)
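The PERMANOVA idea fits in a few lines: R² = 1 − SS_within / SS_total is the share of total pairwise squared distance explained by the grouping, and the p-value comes from re-computing R² under shuffled labels. A minimal sketch (not AddMaple's implementation):

```python
# A minimal sketch of the PERMANOVA idea (not AddMaple's
# implementation): R-squared is the share of total pairwise
# squared distance explained by the grouping; the p-value comes
# from shuffling the labels.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(2, 1, (50, 4))])
labels = np.array([0] * 50 + [1] * 50)
D2 = squareform(pdist(X)) ** 2  # squared pairwise distances

def r_squared(d2, groups):
    n = len(groups)
    ss_total = d2[np.triu_indices(n, k=1)].sum() / n
    ss_within = 0.0
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        sub = d2[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(len(idx), k=1)].sum() / len(idx)
    return 1 - ss_within / ss_total

observed = r_squared(D2, labels)
perms = [r_squared(D2, rng.permutation(labels)) for _ in range(199)]
p = (1 + sum(r >= observed for r in perms)) / 200
print(f"R2={observed:.2f}, p={p:.3f}")
```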
Why this matters: Don't chase a perfect score. Fair results often capture real patterns. Poor quality suggests either that your columns don't cluster well together or that the algorithm settings need adjusting. Try removing columns, adding more relevant ones, or switching algorithms.
History panel
Your clustering runs are saved during your session, so you can compare different configurations and pick the best one.
Why use history: Often the best insights come from iteration. You might:
- Try different column combinations to see which creates the most meaningful segments
- Switch between algorithms to find one that better captures your data patterns
- Adjust parameters like Min Cluster Size to balance cluster granularity (many small groups vs. fewer large ones)
- Compare quality scores across runs to pick the version that works best for your use case
How to access it: Click the History button in the review screen. You'll see all runs from your current session, labeled with:
- Run number (e.g., "Run 2 of 5")
- Quality rating
- Number of clusters and noise percentage
- Algorithm used
- Column count
Switching between runs: Click any run to instantly reload its configuration and results. You can then:
- Review its clusters and top features again
- Modify settings and run a new iteration
- Compare quality scores and cluster counts across runs
- Choose which result to add to your dataset
Note: History resets when you close the clustering tool. To save a run permanently, add it as a column to your dataset.
Tips and limitations
- Unsupervised insight: Clusters describe patterns; they aren't "right or wrong". Try different column sets and algorithms to find stable themes.
- Rare categories can be noisy: Consider combining very small groups or increasing Min cluster size.
- Algorithm choice: Use Balanced HDBSCAN for survey data. Use K‑Means when your inputs are all numeric and you want a fixed number of clusters. Use HDBSCAN when you want to identify outliers.
- Don't over-optimize: A "Good" quality score with business-intuitive clusters beats an "Excellent" score with uninterpretable groups.
Availability: Clustering is limited to certain plans.