A Deep Dive into N-grams, Filters, and Contextual Analysis
Before you roll your eyes at the mention of word clouds, please let us explain why ours are useful with the addition of a few extra tricks. We can all agree that analyzing unstructured text is difficult, and that free-text data can contain hidden insights. For example, that open-ended survey question might have brought something to your attention that you didn't know to ask about in the multiple-choice questions. The review data waiting to be explored might help you understand how your customers view your product compared with your competitors'. And getting a handle on support data could help you prioritize the next feature, because you would have a greater understanding of the customer need in the first place.
Exploring unstructured free text (e.g. free text responses to a survey question) is difficult. At AddMaple we’ve explored a few different approaches and would like to share why we think interactive word clouds are an extremely powerful tool.
There is a lot of evidence that “conversational surveys”, or allowing users to enter free-text answers, is valuable. Rather than restricting users to choosing from a preselected list of answers (which by definition will be limited to what we know at the time), free-text answers allow you to discover what users really think about a topic. As well as surveys, analyzing customer support messages or customer reviews is extremely valuable. Open questions are often left out because of the analytical challenges they bring but, dear reader, we have made this much simpler to tackle.
At AddMaple we are big believers in thematic analysis (and the role that AI can play in it). In this post, however, we will explain a simpler and quicker way to explore a large free-text dataset. We’ll explore n-grams, stemming, and interactive word clouds, and we’ll explain why we made the choices that we did.
N-grams, Unigrams, Bigrams, and Trigrams Explained
N-grams are commonly used in natural language processing. Here is a brief explanation of what they are:
N-grams: A contiguous sequence of 'n' items from a given text sample. In text, an item often means a word. The 'n' in n-gram denotes how many words are grouped together, aiding in understanding language structure.
Unigrams: Single words. In the phrase "I love ice cream", unigrams are "I", "love", "ice", and "cream".
Bigrams: Pairs of words. From the same phrase, the bigrams are: "I love", "love ice", and "ice cream".
Trigrams: Sequences of three words. Using the aforementioned example, trigrams are: "I love ice" and "love ice cream".
By using n-grams, we can capture linguistic patterns, with unigrams highlighting individual words, while bigrams and trigrams offer contextual insight by examining words in pairs or groups of three.
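To make the definitions above concrete, here is a minimal Python sketch that slides a window of size n over a list of words. The helper name `ngrams` and the plain whitespace split are assumptions made for illustration; they are not a description of how AddMaple tokenizes text.

```python
def ngrams(words, n):
    """Return every contiguous run of n words, joined into one string."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "I love ice cream".split()

unigrams = ngrams(words, 1)
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)

print(unigrams)  # ['I', 'love', 'ice', 'cream']
print(bigrams)   # ['I love', 'love ice', 'ice cream']
print(trigrams)  # ['I love ice', 'love ice cream']
```

A phrase of k words yields k - n + 1 n-grams, so asking for an n larger than the phrase simply returns an empty list.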
What size n-gram to use?
Traditionally, many analysts have championed the use of bigrams and trigrams when trying to understand the thematic structure of a text. These combinations of two or three words can provide context that a single word - or unigram - might lack. For instance, the phrase "climate change" encapsulates a specific concept, while the words "climate" and "change" separately might not capture the full essence.
But here lies a challenge: What if crucial words that are conceptually linked don't sit right next to each other in a sentence? Take, for instance, a statement like "Climate is experiencing drastic changes due to global activities." The words "climate" and "changes" are separated, and thus wouldn’t be included together in bigrams or trigrams.
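The gap can be checked directly. The sketch below, which assumes a simple lowercase regex tokenizer of our own choosing, lists every bigram and trigram of the example sentence and looks for one that contains both "climate" and "changes".

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "Climate is experiencing drastic changes due to global activities."
tokens = tokenize(sentence)

# Bigrams and trigrams that contain both words of interest.
together = [
    gram
    for n in (2, 3)
    for gram in ngrams(tokens, n)
    if "climate" in gram and "changes" in gram
]
# How many positions apart the two words sit in the sentence.
gap = tokens.index("changes") - tokens.index("climate")

print(together)  # []
print(gap)       # 4
```

The two words sit four positions apart, so only a window of five words or more would capture them together, and windows that wide rarely repeat often enough to be useful in a word cloud.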
We explored the following approaches:
- Combining trigrams, bigrams, and unigrams using a scoring approach that ranks trigrams the highest, followed by bigrams and finally unigrams. The purpose of this technique is to highlight common bigrams and trigrams as well as unigrams. In the above example, this would have resulted in a bigram of “Stack Overflow”. This approach has some issues, however:
- Allowing the user to select between unigrams, bigrams, and trigrams. This gives the user control; however, it still has issues:
After exploring the above (and other approaches), we settled on using unigrams but combined with interactive filters.
Interactive word clouds
Word clouds that use unigrams and interactive filters are the most useful in our experience. Here's how it works: when you click on a word, the cloud can adjust to show you other words most frequently associated with your chosen term. This interactivity means that, regardless of whether the words are immediately adjacent to each other or several words apart in the original text, you can gain insights into their contextual relationships.
Imagine being able to tap on "climate" and immediately seeing words like "change," "impact," and "global" becoming more prominent, thus giving you a clear snapshot of the surrounding discourse. With the flexibility to filter down on two or more words, users enjoy the benefits that bigrams or trigrams traditionally offer, but with the added advantage of broader context and real-time exploration.
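The idea behind clicking a word can be sketched in a few lines of Python: keep only the responses that contain every selected word, then count the remaining unigrams in that subset. This is a simplified illustration, not AddMaple's implementation; the function name `related_words`, the regex tokenizer, and the three sample responses are all made up for the example, and stop words are not removed here.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def related_words(responses, selected):
    """Count the other words in responses that contain every selected word."""
    selected = {word.lower() for word in selected}
    counts = Counter()
    for response in responses:
        tokens = tokenize(response)
        if selected <= set(tokens):  # all selected words appear somewhere
            counts.update(t for t in tokens if t not in selected)
    return counts

# Hypothetical responses, for illustration only.
responses = [
    "Climate is experiencing drastic changes due to global activities",
    "The global climate impact worries me",
    "I love ice cream",
]

cloud = related_words(responses, ["climate"])
narrowed = related_words(responses, ["climate", "impact"])

print(cloud["global"])   # 2
print(sorted(narrowed))  # ['global', 'me', 'the', 'worries']
```

Because the filter tests whether a word appears anywhere in the response, "climate" and "changes" are linked even though they are several words apart, and adding a second word narrows the subset in the way a bigram would.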
AddMaple is optimized to apply filters within milliseconds for datasets of up to a million records. The user experience improvement of instant filters is remarkable. It allows quick analysis and exploration of datasets in a way that is just not possible when each filter takes a few seconds to apply.
After settling on unigrams, we still had the issue of stemming to consider.
Why we chose not to use stemming in our interactive word clouds
Stemming has long been a cornerstone in the realm of textual analysis. At its core, stemming aims to reduce a word to its base or root form, allowing words like "answer" and "answers" to be processed as one entity. By consolidating such variations, analysts can gain a clearer picture of the frequency and prominence of core themes within a body of text.
However, like many tools, stemming is not without its complications. For the uninitiated, a word cloud that displays only stems might seem alien and incoherent. Let's take the word "computing" for example. Its stem, "comput", would baffle many, as it doesn't correspond to any commonly used term in English. While "answer" and "answers" are aggregated as "answer", not all words are so accommodating.
To avoid the confusion of only including the stem, we tried displaying the most frequent variant of a stemmed word in the word cloud. On the surface, this seemed like a viable solution. Yet it introduces a few issues. Firstly, there are many situations where you do want to see the variants of a stem separately; for example, you may not want the noun “researcher” and the verb “researching” to be linked together. In addition, it creates these issues when filtering:
- Filtering solely on the most frequent variant misses the other variants of the stem
- Filtering on the stem alone catches unrelated words, e.g. the stem "operat" could pull in "operation", "operator", and "operate", but could also unintentionally capture words like "operative".
- Cleverly filtering on all variants of a stemmed word works but makes text filtering more complex both from a filtering perspective and for the UI. It breaks the simple connection between the text filter box and the text filtering.
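The first and third bullets can be illustrated with a small Python sketch. The `crude_stem` function below is a deliberately naive suffix stripper written only for this example (it is not the Porter algorithm or any real stemmer), and the responses are invented.

```python
from collections import Counter, defaultdict

# Toy suffix list, longest first. A stand-in for a real stemmer, nothing more.
SUFFIXES = ("ing", "ers", "er", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical responses, for illustration only.
responses = [
    "the researcher was helpful",
    "researching took weeks",
    "our researcher left",
    "researchers disagreed",
]

# Group the surface forms seen in the data under their stem.
variants = defaultdict(Counter)
for response in responses:
    for word in response.split():
        variants[crude_stem(word)][word] += 1

forms = variants["research"]
label = forms.most_common(1)[0][0]  # the variant a cloud would display

def matching(responses, wanted):
    """Return the responses that contain at least one of the wanted words."""
    return [r for r in responses if wanted & set(r.split())]

label_only = matching(responses, {label})
all_variants = matching(responses, set(forms))

print(label)                               # researcher
print(len(label_only), len(all_variants))  # 2 4
```

Filtering on the displayed label finds two of the four responses. Catching all four requires carrying the full set of variants behind every word in the cloud, which is the extra complexity described in the last bullet.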
In light of these challenges, our approach at AddMaple has been to side-step stemming in favor of harnessing the power of interactive filtering in word clouds. When users can swiftly sift through terms and observe relationships in real-time, the distinction between "answer" and "answers" isn’t a problem. The intuitive nature of this method, coupled with its speed, makes it a good choice for exploring and understanding free text data.
In addition to our decision not to use stemming in our interactive word clouds, we also chose to forgo lemmatization. Lemmatization, like stemming, aims to reduce words to a base or dictionary form. However, it considers the context and converts the word to its meaningful base form, unlike stemming, which often chops off word endings. For instance, "better" is lemmatized to "good" and "running" to "run". While this sounds helpful, especially for linguistic accuracy, it introduces complexities in a word cloud context. The transformed words can sometimes deviate significantly from their original form, potentially confusing users who are trying to connect the visualized words back to their raw data.
Additionally, the computational demands of lemmatization, especially for large datasets, can impact the responsiveness and simplicity of the user interface, a critical aspect of our design ethos at AddMaple. Thus, while acknowledging the linguistic sophistication of lemmatization, we chose to prioritize user experience and immediacy of insight, steering clear of any processes that could complicate or obfuscate the intuitive understanding of data.
The beauty of simplicity
After exploring more complex natural language processing, and observing how these options affected both the user interface and the ease with which a user can grasp the meaning of a body of text, we settled on keeping things simple.
The design of user interfaces is as much an art as it is a science. While advanced functionality is technically appealing, it can also obscure what the user is trying to do and lead to confusion. It's a delicate balance between equipping users with powerful tools and ensuring clarity and intuitiveness. Overcomplicate the interface, and you risk alienating users; oversimplify, and you may not deliver on the promise of deep insights.
This is why at AddMaple, we have centered our approach on the principle of intuitive interaction. We’ve designed our word clouds to look good, be intuitive to work with and be quick to dive into the details and explore connections between words.
We also don’t hide the actual raw free text responses from users: once a dataset has been filtered down using our interactive word clouds, it is easy to switch to a table view focused on the column being analyzed.
In our word clouds, we exclude "stop words," which are common but less informative words like 'the', 'is', and 'on'. This helps highlight more meaningful words, making the data analysis clearer and more relevant. Currently we only support English stop words, but let us know and we can add support for your language.
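As a rough illustration of stop-word removal, the Python sketch below counts unigrams while skipping a stop-word set. The `STOP_WORDS` set is a tiny hand-picked subset used only for this example, not the list AddMaple ships, and the function name and sample responses are likewise hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative subset; a real English stop-word list is much longer.
STOP_WORDS = {"the", "is", "on", "a", "and", "of", "to"}

def word_cloud_counts(responses, stop_words=STOP_WORDS):
    """Count unigrams across all responses, skipping stop words."""
    counts = Counter()
    for response in responses:
        tokens = re.findall(r"[a-z']+", response.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts

# Hypothetical responses, for illustration only.
responses = ["The app is great on mobile", "The mobile app is slow"]
counts = word_cloud_counts(responses)

print(counts["app"], counts["mobile"])  # 2 2
print("the" in counts)                  # False
```

Without the filter, "the" and "is" would tie with "app" and "mobile" as the most frequent words and crowd the cloud without saying anything about the responses.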
Our recommended workflow
Load the Data: Begin by loading your dataset into AddMaple. You can load from CSV or SAV files or bring your data in directly from Typeform, SurveyMonkey or Google Drive. AddMaple will automatically detect your column types and prepare text columns for unigram-based analysis.
Instant Word Cloud Visualization: For each text column, a word cloud is auto-generated. Crafted from unigrams, this cloud disregards stop words and captures the essence of the responses by highlighting the most frequently used terms. Whether respondents are discussing generic concepts or mentioning specific product names, you'll be able to see the key words used at a glance.
Interactive Exploration: Dive deeper into the data by clicking around. Employ the filtering options to narrow down topics or themes. The beauty of AddMaple's interface is the fluidity it offers; see a word you are interested in, click it and instantly see related words.
Deep Dive with Table View: After homing in on a subset of data through filters, explore the raw data in our table view. By simply clicking on the 'table' link, you're presented with a neatly organized table showcasing the raw responses related to your filters. Each filtered word is highlighted to make it easy to see how it was actually used in the raw data. This focused approach, devoid of other distracting columns, makes it easy to explore lots of raw text responses in an efficient manner.
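The highlighting step in the table view can be approximated with a whole-word, case-insensitive regular expression. This is only a sketch of the idea: the `highlight` function and the `**` marker are hypothetical stand-ins for whatever styling a real interface applies.

```python
import re

def highlight(response, words, marker="**"):
    """Wrap each whole-word, case-insensitive match of the words in marker."""
    if not words:
        return response
    alternatives = "|".join(re.escape(word) for word in words)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"{marker}{m.group(0)}{marker}", response)

row = "Climate is experiencing drastic changes due to global activities."
marked = highlight(row, ["climate", "changes"])

print(marked)
# **Climate** is experiencing drastic **changes** due to global activities.
```

Matching on word boundaries keeps the highlight aligned with the unigram filter, so selecting "change" would not mark the longer word "changes".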
From Macro to Micro
AddMaple's intuitive design takes you on a journey from overarching themes to the nuanced sentiments of individual respondents. This gradient of insights ensures that while you grasp the bigger picture, you never lose touch with the unique perspectives that often hold the key to understanding your audience better.
If you think we should approach this problem another way, we’d love to hear from you. Designing a data analysis tool that is powerful but intuitive is a big challenge. Hopefully this post explains the why behind our word cloud interface. The next time you have a dataset with free text responses, try exploring it with AddMaple.