Demystify AI

This article will reallllly enlighten you……..but many will strugle to agre.

If you’re still reading, you’ve successfully cut through the poor language in the title. In Natural Language Processing, this ‘poor language’ is what we call ‘noise’, and includes incorrect character repetitions (such as: ‘reallllly’ and ‘……..’) and spelling errors (like ‘strugle’ and ‘agre’).

As humans, we can decipher this one sentence without too much difficulty. But if we want to learn about consumers’ experiences from the vast amounts of consumer-generated text data, we need to leverage the power of computers.

Consumer generated text data

Although computers can process vast amounts of data, they still rely on us humans to provide instructions on how to handle the noise.

The Natural Language Processing (NLP) community has developed techniques to simplify the process by bringing words back to their base form. This avoids needing individual rules for each type of noise and anticipating new noise.

Common techniques

1. Stemming

With stemming, words tend to be chopped down, which can result in the meaning being lost in a way that we did not intend, for example:

uses > us

troubled > troubl
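To make the chopping concrete, here is a minimal sketch of suffix stripping. It is a toy rule set written for this article, not a production stemmer such as Porter's, and the function name and suffix list are assumptions chosen to reproduce the two examples above.

```python
def naive_stem(word):
    """Strip the first matching suffix. A toy rule set, not a real stemmer."""
    for suffix in ("es", "ed", "ing", "s"):
        # Keep at least two characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


print(naive_stem("uses"))      # us
print(naive_stem("troubled"))  # troubl
```

The rules know nothing about meaning, which is why ‘uses’ ends up looking like the pronoun ‘us’.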

2. Lemmatizing

Lemmatizing also cuts down words, but the result is a legitimate word. For example:

troubled > trouble

Although ‘trouble’ is a valid word, the sentence may have lost meaning because the tense has changed.

3. ‘Stop words’ removal

Lastly, removing the ‘stop words’ such as helping verbs and words like: ‘am’, ‘is’ and ‘has’ can also change the meaning and context of sentences.
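A small sketch shows how this can go wrong. The stop-word list below is hypothetical and deliberately short; it includes ‘not’ because some general-purpose lists also treat negations as stop words.

```python
# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"am", "is", "has", "not", "my", "the", "a"}


def remove_stop_words(text):
    """Drop every word that appears in STOP_WORDS."""
    return " ".join(w for w in text.lower().split() if w not in STOP_WORDS)


print(remove_stop_words("My dog is happy"))      # dog happy
print(remove_stop_words("My dog is not happy"))  # dog happy
```

Two reviews with opposite meanings become identical once the stop words are gone.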

These techniques have limitations when trying to get insights into customer attitudes and behaviors, where maintaining meaning and context matters. We want to understand precisely what the customers have said.

Let’s see the impact of this in the sample customer review below.

Note: As standard practice, all text data is first converted to lowercase to keep the vocabulary consistent. For example, ‘Love’ and ‘love’ are the same word. If we don’t convert ‘Love’ to lowercase, the computer will treat them as two different words, which matters for statistical analysis such as word frequency.
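The effect on word frequency can be sketched in a few lines. The helper name `word_counts` is an assumption for illustration, and splitting on whitespace is a simplification of real tokenization.

```python
from collections import Counter


def word_counts(text, lowercase=True):
    """Count whitespace-separated words, optionally lowercasing first."""
    if lowercase:
        text = text.lower()
    return Counter(text.split())


print(word_counts("Love love love", lowercase=False))  # 'Love' and 'love' counted separately
print(word_counts("Love love love"))                   # a single entry for 'love'
```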

Original text:

‘I’ve been repurchasing these since my dog was a puppy. My dog now weighs 3.5 pounds, he is happy & healthy. They’re a great size training treat and are easy for my GSD to eat quickly when we are walking or out training. They are smelly and keep her interested. We don’t have to worry about giving her too many because they’re so small. My dog lovés this one. Thank you ssssoooo muchhhhh!!!’

Pre-processing (using stemming, lemmatizing, and ‘stop words’ removal):

‘I’ve been repurchasing these since my dog was a puppy. My dog now weighs 35 pounds, he is happy healthy. They’re a great size training treat and are easy for my god to eat quickly when we are walking or out training. They are smelly and keep her interested. We don’t have to worry about giving her too many because they’re so small. My dog lovés this one. Thank you ssssoooo muchhhhh!!!’

Now, let’s see how we can maintain meaning and context.

PetThinQ’s approach:

‘I have been repurchasing these since my dog was a puppy. My dog now weighs 3.5 pounds, he is happy and healthy. They are a great size training treat and are easy for my gsd to eat quickly when we are walking or out training. They are smelly and keep her interested. We do not have to worry about giving her too many because they are so small. My dog loves this one. Thank you so much!'

Not all the mistakes are critical, but you can see some glaring issues:

1. Spell correction

There is a difference between passing ‘gsd’ through a general dictionary versus our domain-specific dictionary, which knows that ‘gsd’ is a valid term and stands for ‘German Shepherd Dog’. In the pre-processed text above, the general approach turned ‘GSD’ into ‘god’.

Spell correction is a critical aspect of the NLP process. You can learn more about our approach in our article ‘Allways chek for speling erors’.
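The difference between the two dictionaries can be sketched with Python's standard `difflib` module. This is an illustration under stated assumptions, not our production spell checker: both vocabularies are tiny, hypothetical stand-ins, and the `correct` helper is invented for this example.

```python
from difflib import get_close_matches

# Hypothetical vocabularies, kept tiny for illustration.
GENERAL_VOCAB = {"god", "dog", "good", "treat"}
DOMAIN_VOCAB = GENERAL_VOCAB | {"gsd"}  # adds the pet-specific term


def correct(word, vocabulary):
    """Return the word if it is known, else its closest match, else the word itself."""
    if word in vocabulary:
        return word
    matches = get_close_matches(word, sorted(vocabulary), n=1, cutoff=0.6)
    return matches[0] if matches else word


print(correct("gsd", GENERAL_VOCAB))  # god
print(correct("gsd", DOMAIN_VOCAB))   # gsd
```

With the general vocabulary, ‘gsd’ is unknown and gets replaced by its nearest neighbor. Once the domain term is in the vocabulary, it is left alone.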

2. Removing unnecessary special characters

Characters like ‘#’, ‘%’, ‘&’ and ‘@’ are non-alphabetical and are sometimes used to make emoticons like ‘:)’ or to make the text more attractive. In NLP, all punctuation is considered ‘special characters’. Some of it needs to be removed for clarity, but if it is all removed, this can happen:

‘My dog now weighs 3.5 pounds, he is happy & healthy.’
‘My dog weighs 35 pounds he is happy healthy’

The meaning has changed dramatically because removing the decimal point has meant the dog’s weight has gone from 3.5 pounds to 35 pounds. Therefore, we keep these specific special characters to maintain meaning.
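Here is a rough sketch of the two behaviors using regular expressions. The function names and the exact set of characters kept are assumptions for illustration, and the input is assumed to be lowercased already.

```python
import re


def strip_all_special(text):
    """Naive cleanup: delete everything except letters, digits and whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", text)
    return " ".join(cleaned.split())


def strip_selectively(text):
    """Keep basic punctuation (including the decimal point) and spell out '&'."""
    text = text.replace("&", " and ")
    cleaned = re.sub(r"[^a-z0-9\s.,!?']", "", text)
    return " ".join(cleaned.split())


review = "my dog now weighs 3.5 pounds, he is happy & healthy."
print(strip_all_special(review))  # my dog now weighs 35 pounds he is happy healthy
print(strip_selectively(review))  # my dog now weighs 3.5 pounds, he is happy and healthy.
```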

3. Expanding contractions

Contractions such as ‘won’t’, ‘don’t’, and ‘isn’t’ are noisy because their apostrophes are treated as additional characters. This creates problems when we try to standardize the words using ‘spell correction’. For example, the term ‘don’t’ won’t be corrected to ‘do not’ directly and might be changed to something unrelated by a spell checker.

don't > donut

You might argue that we could use the contractions as common vocabulary words and avoid spell-checking them, but people often write contractions without apostrophes (like ‘dont’ or ‘wont’). So, rather than dealing with two separate forms of single words (such as ‘don’t’ and ‘dont’), we expand the contractions for consistency.

don't > do not

won't > will not
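A lookup table is one simple way to do this. The sketch below is a minimal version under stated assumptions: the table is a small, hypothetical sample, and it maps forms with and without the apostrophe to the same expansion.

```python
import re

# Small, hypothetical sample; a real table would be much longer.
CONTRACTIONS = {
    "don't": "do not",
    "dont": "do not",
    "won't": "will not",
    "wont": "will not",
    "isn't": "is not",
    "i've": "i have",
    "they're": "they are",
}

# Try longer keys first so "don't" is matched before "dont".
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(CONTRACTIONS, key=len, reverse=True))
    + r")\b"
)


def expand_contractions(text):
    """Lowercase, normalize curly apostrophes, then expand known contractions."""
    text = text.lower().replace("\u2019", "'")
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(1)], text)


print(expand_contractions("We don’t have to worry"))  # we do not have to worry
print(expand_contractions("i dont know"))             # i do not know
```

One caveat: ‘wont’ is also a legitimate (if rare) English word, so treating it as a contraction is a trade-off that suits informal review text.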

4. Reducing exaggerated character repetition

The reviewer is excited about the product and has repeated several characters. So, we need to remove the incorrect character and punctuation repetition to make the text consistent.

ssssoooo muchhhhh!!! > so much!

[It should be noted that we only remove the incorrect repetitions. For example, the word ‘smelly’ repeats the letter ‘l’, but that’s the correct spelling, so we leave it as it is.]
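One simple heuristic, shown here as a sketch rather than our exact method, is to collapse any character repeated three or more times into a single character. Legitimate double letters, such as the ‘ll’ in ‘smelly’, are untouched. The function name is an assumption for illustration.

```python
import re


def collapse_repeats(text):
    """Collapse runs of three or more identical characters into one."""
    return re.sub(r"(.)\1{2,}", r"\1", text)


print(collapse_repeats("ssssoooo muchhhhh!!!"))  # so much!
print(collapse_repeats("smelly"))                # smelly
```

This heuristic alone is not enough for words whose correct spelling has a double letter: ‘reallllly’ becomes ‘realy’, which still needs spell correction afterwards.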

5. Removing accented characters

For the above review, we have an accented character ‘é’ in the sentence:

‘My dog lovés this one’

When misused, these characters are noisy and break the consistency of data. Therefore, we convert these accented characters into standard ASCII form (American Standard Code for Information Interchange – the most common format for text files) to avoid multiple forms of a single word.

lovés > loves
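In Python, this conversion can be sketched with the standard `unicodedata` module: decompose each accented character into a base letter plus a combining mark, then drop anything that is not ASCII. The function name is an assumption for illustration.

```python
import unicodedata


def strip_accents(text):
    """Decompose accented characters and drop the non-ASCII combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents("My dog lovés this one"))  # My dog loves this one
```

Note that this also drops any other non-ASCII character that has no ASCII decomposition, such as curly quotes or emoji, so it is best applied after those have been handled.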

Many analytics providers perform pre-processing as part of a ‘standard procedure’ regardless of the end goal.

Pre-processing receives very little attention in the interactions between clients and providers, yet it is an essential determinant of the quality and depth of insights.


For example, while visually impressive, a word cloud may camouflage an unacceptable level of pre-processing because it only displays high-level information.

Word cloud with broad information

The best way to get deeper, qualitative insights begins with how you prepare the data and handle the noise. Choose a provider using interpretative tools tuned to pet-specific consumer language to gain meaningful insights.


Proper pre-processing techniques = understanding exactly what customers say + qualitative insights.

Thanks for reading through to the end.🐶

If you’d like to learn more about how we go further, please read our article, "Allways chek for speling erors".

