To handle the vast amount of data available today, it’s necessary to slice and dice or break information down, examining it from different viewpoints until you get the right level of detail and understanding. This is the basic concept that underpins the work of data scientists.
The concept is basic, but the skill it takes is akin to a chef's. Chefs need a good set of knives with different edges to slice and dice their ingredients and prepare them for a recipe.
IN THE FIELD OF TEXT ANALYTICS, SLICING AND DICING GETS COMPLICATED:
Let’s look at four techniques that dice words and slice phrases from these two sample texts:
🔪 Keyword extraction:
Keywords are sequences of one, two, or three words in a text, filtered using statistical methods like frequency counting and/or Pointwise Mutual Information (PMI). The keywords help to identify what is most talked about in the text.
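As a rough illustration of the idea, here is a minimal sketch in plain Python that counts one-, two-, and three-word sequences and scores adjacent word pairs with PMI. The sample text, the stopword list, the `min_count` threshold, and all function names are assumptions made for this example; they are not taken from any particular product or library.

```python
import math
import re
from collections import Counter

# Hypothetical sample text, not one of the article's sample texts.
TEXT = (
    "My cat loves the taste of this cat food. "
    "The cat food is cheap, and my other cat loves cat food too."
)

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "and", "is", "my", "of", "other", "the", "this", "too"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bigram_pmi(tokens):
    """PMI(x, y) = log2(p(x, y) / (p(x) * p(y))) for each adjacent word pair."""
    unigrams = ngram_counts(tokens, 1)
    bigrams = ngram_counts(tokens, 2)
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[(x,)] / n_uni
        p_y = unigrams[(y,)] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


def keywords(text, min_count=2):
    """Frequency filter: keep 1- to 3-word candidates seen at least min_count
    times that neither start nor end with a stopword."""
    tokens = tokenize(text)
    found = {}
    for n in (1, 2, 3):
        for gram, count in ngram_counts(tokens, n).items():
            if count < min_count:
                continue
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            found[" ".join(gram)] = count
    return found
```

Calling `keywords(TEXT)` on this toy text keeps "cat food" (seen three times) and drops one-off sequences such as "the cat". `bigram_pmi(tokenize(TEXT))` gives a score per word pair; a positive score means the pair occurs together more often than chance would suggest.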
🔪 Topic modeling:
This technique is applied to a large corpus of data to identify groups of words that share a similar context (based on the statistical probability of their occurring together). A topic name is then assigned to each group based on what is observed. This naming can be subjective; for example, ‘Topic 3’ below could be called ‘Pet problems’.
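One common way to find such word groups is Latent Dirichlet Allocation (LDA). The sketch below is a deliberately small, plain-Python version using collapsed Gibbs sampling; it is one possible technique, not necessarily the one behind the example above. The toy corpus, the function name `lda_gibbs`, and the values of `alpha`, `beta`, and `n_iter` are all assumptions chosen for illustration, and a four-document corpus is far too small to give stable topics.

```python
import random
from collections import Counter

# Hypothetical toy corpus, already tokenized with stopwords removed.
DOCS = [
    ["cat", "food", "taste", "cat", "treats"],
    ["dog", "food", "treats", "taste"],
    ["vet", "fleas", "scratching", "cat"],
    ["dog", "fleas", "vet", "itching"],
]


def lda_gibbs(docs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns one word Counter per topic."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * n_topics for _ in docs]         # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                       # tokens per topic
    assignments = []

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        row = []
        for word in doc:
            t = rng.randrange(n_topics)
            row.append(t)
            doc_topic[d][t] += 1
            topic_word[t][word] += 1
            topic_total[t] += 1
        assignments.append(row)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Take the token's current assignment out of the counts.
                t = assignments[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][word] -= 1
                topic_total[t] -= 1
                # Weight each topic by how much this document uses it and
                # how often that topic produces this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][word] += 1
                topic_total[t] += 1

    # Unary plus drops words whose count fell back to zero.
    return [+counts for counts in topic_word]
```

Printing `topic.most_common(3)` for each returned topic shows its most heavily weighted words. Choosing a label for each group is still left to a person, which is where the subjectivity comes in.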
🔪 Noun phrases:
These are phrases built around a noun and typically preceded by a ‘determiner’ (such as ‘a’, ‘the’, ‘some’, ‘this’). Expanded noun phrases give more detail and contain adjectives, such as ‘male’, ‘huge’, and ‘colorful’.
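To make the pattern concrete, here is a toy chunker in plain Python. It looks each word up in a small hand-made lexicon (a hypothetical stand-in for a trained part-of-speech tagger) and then matches the sequence "determiner, any number of adjectives, one or more nouns". Requiring a determiner at the start is a simplification for this sketch; the lexicon, the tag codes, and the function name are all assumptions.

```python
import re

# Hypothetical hand-made lexicon; a real system would use a trained POS tagger.
LEXICON = {
    "a": "DET", "the": "DET", "some": "DET", "this": "DET",
    "male": "ADJ", "huge": "ADJ", "colorful": "ADJ",
    "cat": "NOUN", "parrot": "NOUN", "taste": "NOUN", "food": "NOUN",
    "loves": "VERB", "of": "ADP",
}

# One character per word so that regex offsets line up with word positions.
CODES = {"DET": "D", "ADJ": "A", "NOUN": "N"}

# Determiner, then any adjectives, then one or more nouns.
NP_PATTERN = re.compile(r"DA*N+")


def noun_phrases(sentence):
    """Return the noun phrases found by the simplified pattern above."""
    words = re.findall(r"[a-z]+", sentence.lower())
    # Unknown words and other parts of speech are encoded as "x".
    tags = "".join(CODES.get(LEXICON.get(word, "X"), "x") for word in words)
    return [
        " ".join(words[match.start():match.end()])
        for match in NP_PATTERN.finditer(tags)
    ]


print(noun_phrases("The huge colorful parrot loves the taste of this food"))
# ['the huge colorful parrot', 'the taste', 'this food']
```

The first phrase is an expanded noun phrase because it carries the adjectives "huge" and "colorful"; the other two are plain determiner-plus-noun phrases.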
🔪 Dependency grammar:
This process captures the relationship between words in a sentence or piece of text using the grammatical structure of that text. Let’s take a closer look at the diagram and the phrase “cat loves the taste”:
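One way to write those relationships down is as a list of (dependent, relation, head) triples. The sketch below hand-codes a conventional analysis of the phrase rather than running a parser; the relation labels follow Universal Dependencies-style naming, which is an assumption and may differ from the labels a given tool or diagram uses. The helper names are hypothetical.

```python
# Hand-written analysis of "cat loves the taste"; no parser is run here.
# Each entry is (dependent, relation, head). Labels are UD-style (assumed).
DEPENDENCIES = [
    ("loves", "root", None),    # the verb is the root of the phrase
    ("cat", "nsubj", "loves"),  # "cat" is the subject of "loves"
    ("taste", "obj", "loves"),  # "taste" is the object of "loves"
    ("the", "det", "taste"),    # "the" is the determiner of "taste"
]


def dependents(deps, head, relation=None):
    """Words attached to `head`, optionally filtered by relation label."""
    return [
        dep for dep, rel, h in deps
        if h == head and (relation is None or rel == relation)
    ]


def subject_verb_object(deps):
    """Extract one simple pattern: (subject, verb, object) around the root."""
    root = next(dep for dep, rel, _ in deps if rel == "root")
    subjects = dependents(deps, root, "nsubj")
    objects = dependents(deps, root, "obj")
    if subjects and objects:
        return (subjects[0], root, objects[0])
    return None


print(subject_verb_object(DEPENDENCIES))
# ('cat', 'loves', 'taste')
```

Subject-verb-object is only one pattern. Other combinations of relation labels define other patterns, which is why the number of candidates grows so quickly on real text.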
There can be thousands of possible patterns in a corpus of mixed-sized text data, but not all are relevant or produce meaningful phrases.
Slicing and dicing in textual analytics is known as ‘lexical semantics’: deconstructing words and phrases within the text. To go back to our chef analogy, these words and phrases are only the prepared ingredients. In themselves, they don’t provide meaning in context or insights (the difference between apples and apple pie).
For many analytics providers, this is the start and end of insights. But in fact, it’s just the beginning. From here, the challenges go beyond the technical ones – it’s about what has been referred to as ‘the last mile’ and requires human engagement, labeling, and interpretation.
The 2020 GRIT report bore this out:
“What is striking is that both Buyers and Suppliers equally indicate that business knowledge (68% for both) is a high priority skill their organization needs and technical/computer expertise is the lowest (28% for Buyers and 42% Suppliers).”
‘The last mile’ hasn’t evolved as rapidly as the technical parts of NLP; many analytics providers reach a certain point and offer what they ‘conveniently’ position as a ‘self-service’ or ‘human in the loop’ service. By doing this, they are effectively offloading the last mile as they lack the subject expertise (like the chef’s knowledge) to complement data science efforts.
If you’d like to learn more about how we go further, please read our article, “Demystifying AI”.