Introduction
We recently launched our new statistics engine, designed to make data analysis both accessible and efficient. This engine automatically performs the appropriate statistical tests on your data, reducing the time it takes to go from dataset to useful insights.
For data experts, the new engine accelerates your workflow by handling the statistical tests, letting you spend more time interpreting results and making decisions. For those less familiar with statistics, it eliminates the complexity and jargon, providing straightforward insights without requiring in-depth statistical knowledge.
In this article, we'll explore how the new engine works, the types of statistical tests it uses, and the practical benefits it offers for your data analysis needs.
The Stats Engine
The new statistics engine in AddMaple is designed to automatically handle a variety of statistical tests, streamlining your data analysis process.
The engine first examines your dataset to determine the types of data in each column. Based on this analysis, it selects the appropriate statistical tests to compare columns, ensuring that the results are both relevant and accurate. Here are the types of tests the engine uses:
Chi-Square Test: Used when both columns contain categorical data. This test helps determine if there is a significant association between the categories of the two variables.
Example: Imagine you conducted a survey to see if there is a relationship between people's preferred type of exercise (running, swimming, cycling) and their age group (under 30, 30-50, over 50). The Chi-Square test can help you determine if the preference for exercise type is related to the age group of respondents.
ANOVA (Analysis of Variance): Applied when comparing one categorical column and one numerical column. It helps identify whether there are any statistically significant differences between the means of different groups.
Example: Consider a medical study that looks at the effect of different diets (low-carb, low-fat, Mediterranean) on blood pressure levels. ANOVA can be used to determine if there are significant differences in blood pressure changes among the different diet groups.
Kruskal-Wallis Test: Applied when comparing one categorical column and one numerical or ordinal column. It is especially useful when some categories have small sample sizes and the data cannot be assumed to follow a normal distribution. It assesses whether there are statistically significant differences between the distributions of different groups.
Example: Consider an ecological study assessing the impact of various conservation strategies (community management, protected areas, none) on the diversity of species in small, isolated patches of habitat. Given the small sample sizes from some habitat patches, the Kruskal-Wallis test is suitable for determining if there are significant differences in species diversity distributions among the conservation strategy groups.
T-Test: Used when comparing one categorical column with only two categories against a numerical column. This test checks if there are significant differences between the two groups.
Example: Suppose you want to compare the test scores of students who studied using two different methods: traditional learning vs. online learning. The T-Test can help determine if there is a significant difference in the test scores between these two groups.
Correlation Tests (Pearson’s and Spearman’s): Used when both columns contain numerical data. Pearson’s correlation is applied if the data is normally distributed, while Spearman’s correlation is used if the data is not normally distributed. These tests measure the strength and direction of the relationship between the two numerical variables.
Example: Imagine you are analyzing data to see if there is a relationship between hours of exercise per week and cholesterol levels. The Pearson’s or Spearman’s correlation tests can help determine if there is a significant correlation between these two numerical columns.
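The selection rules listed above can be sketched in a few lines of Python. This is only an illustration of the idea, not AddMaple's actual code: the function name, the type labels, and the `small_groups` and `normal` flags are hypothetical, and the order in which the rules are checked (for example, preferring Kruskal-Wallis over a T-Test when groups are small) is an assumption.

```python
def choose_test(type_a, type_b, n_categories=None, small_groups=False, normal=True):
    """Pick a test name for a pair of columns.

    type_a / type_b: "categorical", "numerical" or "ordinal" (hypothetical labels).
    n_categories:    number of categories in the categorical column, if any.
    small_groups:    True if some categories have few observations (assumed flag).
    normal:          True if numerical data looks normally distributed (assumed flag).
    """
    types = {type_a, type_b}
    if types == {"categorical"}:
        return "chi-square"
    if types == {"numerical"}:
        return "pearson" if normal else "spearman"
    if "categorical" in types and types & {"numerical", "ordinal"}:
        # Assumption: ordinal data or small groups take priority over group count.
        if "ordinal" in types or small_groups:
            return "kruskal-wallis"
        if n_categories == 2:
            return "t-test"
        return "anova"
    return None  # no rule described for this combination


print(choose_test("categorical", "numerical", n_categories=3))
```

How "small sample size" and "normally distributed" are decided is not covered here, so both are passed in as flags rather than computed.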
By automating these tests, the new statistics engine saves you time and effort, allowing you to focus on interpreting the results rather than worrying about the technical details of statistical analysis. AddMaple automatically highlights the columns most related to the column you are viewing, surfacing hidden insights and patterns that might otherwise be overlooked. This not only streamlines your workflow but also ensures that you don't miss any significant relationships in your data.
How It Works
Our statistics engine is designed to choose the appropriate statistical test for each pair of columns in your dataset, ensuring accurate and relevant results. This optimized algorithm runs very fast, even on large data sets, and performs all relevant tests automatically.
Here is a breakdown of how we perform each statistical test:
Chi-Square Test
For pairs of columns containing categorical data, the engine performs a Chi-Square test. Here’s how it works:
Calculate Expected Frequencies: The engine calculates the expected frequencies for each category combination based on the marginal totals.
Compute Chi-Square Statistic: It then compares the observed frequencies with the expected frequencies to compute the Chi-Square statistic using the formula:
\(\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}\)
Where \(O_i\) is the observed frequency for each category combination and \(E_i\) is the expected frequency.
Determine P-Value: The Chi-Square statistic is compared against the Chi-Square distribution with the appropriate degrees of freedom to determine the p-value, indicating the significance of the association between the categories.
Calculate Cramer's V: To measure the strength of the association between the categories, the engine calculates Cramer's V. Here's how it works:
Chi-Square Value: Use the computed Chi-Square statistic.
Sample Size: Determine the total number of observations.
Minimum Dimension (k): Find the smaller of (number of rows - 1) and (number of columns - 1).
Cramer's V Formula:
\(V = \sqrt{\frac{\chi^2}{n \cdot k}}\)
Where \(\chi^2\) is the Chi-Square statistic, \(n\) is the total number of observations, and \(k\) is the minimum dimension.
By including Cramer's V, the engine not only tells you whether there is a significant association but also how strong that association is, providing a more comprehensive understanding of the relationship between your categorical variables.
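To make the Chi-Square and Cramer's V steps concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions rather than the engine's implementation: the function name and the list-of-rows table layout are hypothetical, every row and column is assumed to contain at least one observation, and the p-value lookup against the Chi-Square distribution is left out.

```python
import math


def chi_square_cramers_v(table):
    """table: contingency table as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)  # total number of observations

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency from the marginal totals.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(col_totals) - 1)
    k = min(len(table) - 1, len(col_totals) - 1)  # minimum dimension
    cramers_v = math.sqrt(chi2 / (n * k))
    return chi2, dof, cramers_v


# Hypothetical 2x2 table of counts.
chi2, dof, v = chi_square_cramers_v([[10, 20], [20, 10]])
print(round(chi2, 3), dof, round(v, 3))
```

Cramer's V falls between 0 and 1, which is what makes it convenient for comparing the strength of association across column pairs with different table sizes.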
ANOVA (Analysis of Variance)
For comparing one categorical variable and one numerical variable, the engine uses ANOVA. Here’s the process:
Calculate Group Means: The engine calculates the mean of the numerical variable for each category.
Compute Variance: It then computes the variance within groups and between groups.
Calculate F-Statistic: The ratio of between-group variance to within-group variance is calculated to obtain the F-statistic.
Determine P-Value: The F-statistic is compared against the F-distribution to determine the p-value, which indicates whether there are significant differences between group means.
Calculate Eta Squared (\(\eta^2\)): To measure the effect size, the engine calculates eta squared. Here's how it works:
Sum of Squares Between (SSB): The sum of squared deviations of each group mean from the overall mean, multiplied by the number of observations in each group.
Sum of Squares Total (SST): The sum of squared deviations of each observation from the overall mean.
Eta Squared Formula:
\(\eta^2 = \frac{SSB}{SST}\)
Where \(\eta^2\) is the proportion of total variance attributable to the factor.
By including eta squared, the engine not only tells you whether there are significant differences between groups but also quantifies the magnitude of the differences, providing a more comprehensive understanding of the relationship between your categorical and numerical variables.
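The ANOVA steps above can be sketched as follows. This is a simplified illustration, not AddMaple's code: the function name and the dictionary input format are hypothetical, and the final p-value lookup against the F-distribution is omitted.

```python
def one_way_anova(groups):
    """groups: dict mapping each category to a list of numeric values."""
    values = [x for g in groups.values() for x in g]
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n

    # Sum of Squares Between: squared deviation of each group mean from the
    # overall mean, weighted by the number of observations in the group.
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    # Sum of Squares Total: squared deviation of every observation.
    sst = sum((x - grand_mean) ** 2 for x in values)
    ssw = sst - ssb  # within-group sum of squares

    f_stat = (ssb / (k - 1)) / (ssw / (n - k))
    eta_squared = ssb / sst
    return f_stat, (k - 1, n - k), eta_squared


# Hypothetical measurements for three groups.
f_stat, dof, eta_sq = one_way_anova({"a": [1, 2, 3], "b": [2, 3, 4], "c": [5, 6, 7]})
print(round(f_stat, 3), dof, round(eta_sq, 4))
```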
Kruskal-Wallis Test
For comparing one categorical variable with one numerical or ordinal variable where there are small sample sizes in some categories, the engine uses the Kruskal-Wallis test. Here’s the process:
- Calculate Group Ranks: The engine assigns ranks to all observations across the groups and calculates the sum of ranks for each group.
- Compute Test Statistic: It computes the H statistic. The rank sum of each group is squared and divided by the number of observations in that group. These values are summed across all groups and adjusted for the total number of observations to derive the H statistic.
- Determine P-Value: The H statistic is compared against the chi-squared distribution to determine the p-value, which indicates whether there are significant differences between group distributions.
- Calculate Eta Squared (\(\eta^2\)): To measure the effect size, the engine calculates eta squared. This allows us to order related columns by the effect size as well as the p-value.
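Here is a rough sketch of the Kruskal-Wallis steps above in plain Python. Several details are assumptions: the function names and input format are hypothetical, no tie correction is applied to H, and the eta squared estimate uses one common formula, (H - k + 1) / (n - k), which may differ from the one the engine uses. The p-value lookup is omitted.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    ordered = sorted(values)
    n = len(ordered)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1       # first 1-based position of v
        last = n - ordered[::-1].index(v)  # last 1-based position of v
        ranks.append((first + last) / 2)
    return ranks


def kruskal_wallis(groups):
    """groups: dict mapping each category to a list of values."""
    values, labels = [], []
    for name, g in groups.items():
        values.extend(g)
        labels.extend([name] * len(g))
    n, k = len(values), len(groups)

    # Rank all observations together, then sum the ranks per group.
    rank_sums = {name: 0.0 for name in groups}
    for name, rank in zip(labels, average_ranks(values)):
        rank_sums[name] += rank

    h = 12 / (n * (n + 1)) * sum(
        rank_sums[name] ** 2 / len(groups[name]) for name in groups
    ) - 3 * (n + 1)
    eta_squared = (h - k + 1) / (n - k)  # assumed effect size formula
    return h, k - 1, eta_squared


h, dof, eta_sq = kruskal_wallis({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
print(round(h, 3), dof, round(eta_sq, 4))
```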
T-Test
For comparing one categorical variable with two categories against a numerical variable, the engine performs a two-sided T-Test:
- Calculate Group Means and Variances: The engine computes the means and standard deviations of the numerical variable for the two categories.
- Compute T-Statistic: The difference between the means is divided by the standard error of the difference to obtain the T-statistic.
- Determine Degrees of Freedom: The degrees of freedom (dof) are calculated based on the sample sizes of the two groups.
- Compute P-Value: The T-statistic is compared against the T-distribution to find the p-value, indicating if there is a significant difference between the two groups.
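The T-Test steps above can be sketched like this. The article does not say which T-Test variant the engine uses, so this sketch assumes the classic pooled-variance (Student's) version, where the degrees of freedom are simply the two sample sizes added together minus 2. The function name is hypothetical and the p-value lookup is left out.

```python
import math
from statistics import mean, variance


def pooled_t_test(a, b):
    """Two-sample t statistic, assuming equal variances in both groups."""
    n1, n2 = len(a), len(b)
    dof = n1 + n2 - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / dof
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t_stat = (mean(a) - mean(b)) / std_err
    return t_stat, dof


# Hypothetical scores for two groups.
t_stat, dof = pooled_t_test([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
print(round(t_stat, 3), dof)
```

For a two-sided test, the p-value is the probability of a T-statistic at least this far from zero in either direction.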
Correlation Tests (Pearson’s and Spearman’s)
For pairs of numerical columns, the engine uses two-sided correlation tests:
Pearson’s Correlation: Used if the data is normally distributed.
- Check Normality: The engine checks if both arrays of data are normally distributed.
- Compute Pearson’s Correlation Coefficient: It calculates the correlation coefficient, which measures the strength and direction of the linear relationship.
Spearman’s Correlation: Used if the data is not normally distributed.
- Rank the Data: The engine ranks the data for both variables.
- Compute Spearman’s Rank Correlation Coefficient: It calculates the correlation coefficient based on these ranks.
For both correlation tests:
- Calculate T-Score: The correlation coefficient is used to calculate the t-score, which involves the sample size.
- Determine Degrees of Freedom: The degrees of freedom (dof) is calculated as the sample size minus 2.
- Compute P-Value: The t-score and degrees of freedom are used to determine the p-value, indicating the significance of the correlation.
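As a minimal sketch of the correlation steps above, the following Python computes Pearson's coefficient, Spearman's coefficient (Pearson's formula applied to ranks), and the t-score with its degrees of freedom. The function names are hypothetical, the normality check and p-value lookup are omitted, and the t-score formula shown, r * sqrt((n - 2) / (1 - r^2)), is the standard one and is assumed to match what the engine does.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    ordered = sorted(values)
    n = len(ordered)
    return [(ordered.index(v) + 1 + n - ordered[::-1].index(v)) / 2 for v in values]


def spearman_r(x, y):
    """Spearman's coefficient: Pearson's formula applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def correlation_t_score(r, n):
    """t-score and degrees of freedom; undefined when r is exactly 1 or -1."""
    dof = n - 2
    return r * math.sqrt(dof / (1 - r * r)), dof


x, y = [1, 2, 3, 4, 5], [2, 1, 4, 3, 5]
r = pearson_r(x, y)
print(round(r, 3), correlation_t_score(r, len(x)))
```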
By performing these tests automatically and highlighting the most relevant columns, AddMaple’s statistics engine surfaces hidden insights and saves you time.
How to use in AddMaple
At AddMaple we want to make these powerful statistical techniques easy to use and explore. Rather than ask you to choose a complex set of options, we choose the most appropriate tests, run them automatically and present the results to you in an intuitive manner.
Please see examples below of the different ways in which we present these results back to you.
Related Columns
When you expand any column in AddMaple, we run all the calculations described above. The results are shown in the stats tab. In the image below we can see that AddMaple has found a moderate relationship between "Age Category" and "Device Used".
Relationship Highlight
When clicking on a related column, AddMaple will take you to a pivot chart with the relationship highlight. For example in the image below we can see Age Category vs Device Used, with a clear preference for Tablets among the 65+ age category. AddMaple provides a highlight sentence about the relationship - in this case it is a moderate relationship.
Relationship Overview
By clicking on the "See more" link or on the "Stats" tab, AddMaple will give you more details on the relationship. We give a series of dynamic paragraphs depending on the columns chosen, the statistical test that was run, and the results of that test. In the example below you can see an explanation of the Chi-Square results for "Age" vs "Device Used".
The numbers behind the overview
Below the overview we provide the underlying numbers from the stats tests performed. If you hover over the name of each item, we give a clear description of what the number is and how it was calculated.
Further Insights
Where applicable, we run additional tests between categories. In the example below you can see that while there is a moderate relationship overall between "Age Category" and "Device Used", there is a strong relationship when comparing Smartphone users vs Laptop and Tablet users. This analysis helps you dive deeper to understand the particular categories that are having the biggest impact on the relationship.
Visual Exploration on the Chart Dashboard
When you've pivoted by a single column, you are able to go back to the chart dashboard and view all other columns pivoted by that column. The columns are ordered by the strength of the relationship. This allows you to quickly explore visually the impact of one column on all other columns in your dataset. Below we can see the two columns with the strongest relationship to "Age Category".
For further details on how to run each type of test please see these guides:
Conclusion
AddMaple's new statistics engine is designed to make your data analysis process more efficient and insightful. By automatically selecting and performing the appropriate statistical tests, it saves you time and reduces the complexity of your workflow. Whether you are an experienced analyst or new to statistical tests, AddMaple helps you quickly uncover significant relationships and hidden insights in your data. With the ability to highlight the most relevant columns and provide clear visualizations, AddMaple ensures that you can focus on getting useful insights from your data in as short a time as possible.
We are a small team passionate about making data analysis fast, intuitive and fun. We are continually improving this feature, so if there is something you'd like to see, please let us know. We hope this module helps you uncover hidden insights in your data.
The AddMaple Team