I hope you are enjoying the “Advanced Analytics Introduction” blog post series; here is a link to the previous segments (Step One, Step Two, Step Three, Step Four) to provide some helpful background. In the previous installment, I continued to review the practice of “shallow parsing” of natural language content. In this post, I will examine word association mining and analysis.
There are two kinds of relationships words can have within a sentence.
Words with a high context similarity have a paradigmatic relationship. Words with these kinds of relationships can be substituted for each other and still result in a valid sentence. In the diagram below the words “cat” and “dog” can be substituted for one another without affecting the overall meaning of the sentence. By contrast, if we were to substitute “dog” with “computer,” the sentence would no longer be valid.
Figure 1. Example of a Paradigmatic Relationship (ChengXiang Zhai et al., 2016)
Words that co-occur with each other more often than their individual frequencies would predict have a syntagmatic relationship. Unlike paradigmatic relationships, we are looking at how many times two words occur together in a context and then comparing this to their individual occurrences. In the diagram below we are focused on the word “eats,” and, more specifically, looking at how this relationship can predict what other words it’s likely to be associated with:
Figure 2. Example of a Syntagmatic Relationship (ChengXiang Zhai et al., 2016)
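A minimal sketch of how syntagmatic relationships can be surfaced in practice: count how often word pairs co-occur in the same context (here, a sentence) and compare that to the words’ individual counts. The toy corpus and the scoring ratio are illustrative assumptions, not a specific published formula.

```python
from collections import Counter
from itertools import combinations

# Toy corpus; each string is one "context" (here, a sentence).
sentences = [
    "the cat eats fish",
    "the dog eats meat",
    "the dog chases the cat",
]

word_counts = Counter()
pair_counts = Counter()
for sentence in sentences:
    words = set(sentence.split())  # unique words per context
    word_counts.update(words)
    pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))

# Pairs involving "eats" whose co-occurrence is high relative to the
# partners' individual counts suggest a syntagmatic relationship.
for pair, n in pair_counts.items():
    if "eats" in pair:
        other = next(w for w in pair if w != "eats")
        score = n / (word_counts["eats"] * word_counts[other])
        print(other, round(score, 2))
```

On real text, the raw ratio above would normally be replaced by a measure such as pointwise mutual information, but the counting machinery is the same.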
Now that we understand what kind of relationships words can have within a body of text, which we will refer to as a “document” from this point onwards, let’s discuss the techniques to discover the similarity between documents.
Many of the approaches to finding similarities between documents are based on the principle of distributional semantics. The basic idea of distributional semantics is that documents with similar word distributions have similar meanings.
So, how can we represent text in a way to understand word distribution?
One of the first tasks is to turn text into numbers or features, which can then be processed to uncover patterns and relationships. The most commonly used pre-processing step is called a “Bag-of-Words Model” (BoW). The BoW Model counts the frequency of words within a document. It is called a “bag of words” because the order of words is discarded during the process of creating the word distribution. The model is only looking at word occurrence within a document.
Here is a very simple example from an article on this subject by Jason Brownlee, Ph.D. He is taking an excerpt from the book “A Tale of Two Cities” by Charles Dickens. Each line below comprises six words, and we are going to treat each of the four lines as a “document”. The entire “corpus” consists of 24 word occurrences drawn from a vocabulary of ten unique words, so each document is scored against that ten-word vocabulary:
“It was the best of times” = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
“it was the worst of times” = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
“it was the age of wisdom” = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
“it was the age of foolishness” = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
Once we score all the words in each document we can begin to see similarities and differences. Each document is now represented as a “Binary Vector” (more on this later when discussing the Vector Space Model).
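The binary vectors above can be reproduced in a few lines. The fixed vocabulary order is an assumption chosen to match the example; in practice a library would build the vocabulary for you.

```python
# Vocabulary in the order that produces the vectors shown above
# (an assumption made to match the example).
vocab = ["it", "was", "the", "best", "of", "times",
         "worst", "age", "wisdom", "foolishness"]

documents = [
    "It was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
]

def to_binary_vector(doc, vocab):
    """Mark each vocabulary term as present (1) or absent (0)."""
    words = set(doc.lower().split())
    return [1 if term in words else 0 for term in vocab]

for doc in documents:
    print(to_binary_vector(doc, vocab))
# → [1, 1, 1, 1, 1, 1, 0, 0, 0, 0], and so on for the other lines
```

Swapping the 1s for raw counts turns the same routine into a count-based BoW representation.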
Some of the initial tasks you can perform with the BoW model involve simply counting the number of times a word appears in a document. You can also calculate the word frequency within a document. The relative frequency of each word estimates the probability of encountering it within a given document.
Word frequency becomes problematic in the BoW model because of low-value words (e.g., “it”, “the”, and “of”). This causes the important words within the document to become lost in what is essentially word noise.
This is where we use a different weighting technique called “Term Frequency-Inverse Document Frequency (TF-IDF),” a simple function that puts a higher emphasis on words that occur frequently within a document but infrequently across the corpus. This technique tells you how rare a word is across documents.
The diagram below from datameetsmedia.com illustrates the use of TF-IDF. It considers the following three example documents:
- “I love dogs”
- “I hate dogs and knitting”
- “Knitting is my hobby and my passion”
Note that we did not remove stop words from within these documents, so there are still some low-value words mixed in. TF-IDF identified high-value words like “love,” “hobby,” and “passion” while de-emphasizing the lower-value words like “I” and “and.” The de-emphasis of the word “knitting” is not correct, but this is mostly due to the small text sample size.
Figure 3. Example of the Application of the TF-IDF Formula (datameetsmedia.com, 2017)
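A minimal TF-IDF sketch over the three example documents above. Note that several TF-IDF variants exist (with different smoothing and normalization); this uses the basic form tf × log(N / df) as an illustration, not necessarily the exact variant in the diagram.

```python
import math

# The three example documents, pre-tokenized and lowercased.
documents = [
    ["i", "love", "dogs"],
    ["i", "hate", "dogs", "and", "knitting"],
    ["knitting", "is", "my", "hobby", "and", "my", "passion"],
]

def tf_idf(term, doc, docs):
    # Term frequency: share of this document's tokens that are `term`.
    tf = doc.count(term) / len(doc)
    # Inverse document frequency: terms in fewer documents score higher.
    n_containing = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / n_containing)
    return tf * idf

print(tf_idf("love", documents[0], documents))  # in 1 of 3 docs: high weight
print(tf_idf("dogs", documents[0], documents))  # in 2 of 3 docs: lower weight
```

A word like “i,” which appears in two of the three documents, gets a correspondingly small IDF, which is exactly the de-emphasis of low-value words described above.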
In the previous section, we took the four sentences from the Dickens book “A Tale of Two Cities” and created a vector for each line, which we referred to as a “document.” These vectors are also known as “feature vectors,” where each feature is a word (term) and the feature’s value is a term weight.
We then adjusted the term frequency to score higher value words using the TF-IDF function. Using the Vector Space Model, each feature vector is assigned to a point in a vector space. The cosine of the angle between two feature vectors indicates their similarity: the smaller the angle, the more similar the documents. This is known as “Cosine Similarity”.
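Cosine similarity is short enough to sketch directly. Here it is applied to the binary vectors for the first two Dickens “documents” from earlier; with TF-IDF weights the formula is identical, only the vector values change.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two feature vectors:
    # dot(a, b) / (|a| * |b|); 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Binary vectors for "It was the best of times" and
# "it was the worst of times" from the earlier example.
best = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
worst = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
print(round(cosine_similarity(best, worst), 3))  # → 0.833
```

The two sentences share five of their six words, so they point in nearly the same direction in the vector space, hence the similarity close to 1.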
Figure 4. Example of Cosine Similarity using the Vector Space Model. Original image altered with concrete illustrations (“apples,” “bananas,” etc.). All other aspects of the image are original. (blog.christianperone.com, 2019)
The Performance Architects team recently implemented a solution using the principles mentioned in this blog in order to help an organization better understand their customer product feedback. These were the outcomes of that engagement:
- Implemented a data model that used the TF-IDF (Term Frequency-Inverse Document Frequency) methodology to index and weight words in order to make them easily searchable
- Developed executive dashboards that allowed for complex search capabilities and the ability to drill down to the lowest case-level detail
- Developed a model to demonstrate when the company’s products are mentioned on social media platforms and what the context of the mention entails
- The TF-IDF method, along with the tool’s complex search capability, enabled the client team to search for all entered variations of search words, including misspellings and typos, ensuring that nothing was missed
As I have mentioned in the previous blogs in this series, a strong search base of your original text corpus is essential for your success in performing other higher-level advanced analytics tasks such as sentiment analysis. A strong system provides knowledge provenance by allowing you to trace the origin and validity of the information you are working with.
We hope you have enjoyed this topic! Additional blog posts on text and advanced analytics concepts will follow; please contact firstname.lastname@example.org if you have any questions or need further help!