If you’re still reading, you’ve successfully cut through the poor language in the title. In Natural Language Processing, this ‘poor language’ is what we call ‘noise’, and it includes incorrect character repetitions (such as ‘reallllly’ and ‘……..’) and spelling errors (such as ‘strugle’ and ‘agre’).
As humans, we can decipher this one sentence without too much difficulty. But if we want to learn about consumers’ experiences from the vast amounts of consumer-generated text data, we need to leverage the power of computers.
The Natural Language Processing (NLP) community has developed techniques that simplify the process by bringing words back to their base form. This avoids the need to write individual rules for each type of noise and to anticipate every new kind of noise.
With stemming, words are chopped down to a root form, which can strip away meaning in ways we did not intend. For example:
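To illustrate, here is a minimal sketch using NLTK’s classic PorterStemmer; the library choice is an assumption for illustration, not necessarily the tool used in a given pipeline:

```python
# A minimal sketch of stemming, assuming NLTK's PorterStemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["trouble", "troubling", "troubled"]:
    print(word, "->", stemmer.stem(word))
# All three are chopped down to 'troubl', which is not a legitimate word.
```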
Lemmatizing also cuts words down, but the result is always a legitimate word. For example:
Although ‘trouble’ is a valid word, the sentence may have lost meaning because the tense has changed.
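A comparable sketch with NLTK’s WordNetLemmatizer (again an illustrative choice) shows the difference:

```python
# A minimal sketch of lemmatizing, assuming NLTK's WordNetLemmatizer
# (requires nltk.download('wordnet') on first use).
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("troubling", pos="v"))  # 'trouble' - a real word
print(lemmatizer.lemmatize("troubled", pos="v"))   # 'trouble' - but the tense is lost
```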
Lastly, removing ‘stop words’ (helping verbs and common words such as ‘am’, ‘is’, and ‘has’) can also change the meaning and context of a sentence.
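The sketch below, assuming NLTK’s standard English stop-word list, shows how dropping stop words can flip a sentence’s meaning:

```python
# A minimal sketch of stop-word removal, assuming NLTK's English list
# (requires nltk.download('stopwords') and nltk.download('punkt') on first use).
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

review = "the food is not good"
tokens = word_tokenize(review)
kept = [t for t in tokens if t not in stopwords.words("english")]
print(kept)  # ['food', 'good'] - the negation 'not' is gone, reversing the sentiment
```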
Let’s see the impact of this in the sample customer review below.
Note: As a standard practice, all text data is first converted to lowercase to keep the vocabulary consistent. For example, ‘Love’ and ‘love’ are the same word, but if we don’t convert ‘Love’ to lowercase, a computer will treat them as two different words in terms of syntax, which distorts statistical analysis such as word-frequency counts.
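A quick sketch of why this matters for word-frequency counts:

```python
# A minimal sketch: lowercasing collapses 'Love', 'LOVE', and 'love' into one token.
from collections import Counter

review = "Love this food, love the price, LOVE the brand"
counts = Counter(review.lower().split())
print(counts["love"])  # 3 - without lowercasing, these would be three separate words
```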
Now, let’s see how we can maintain meaning and context.
Not all the mistakes are critical, but you can see some glaring issues:
There is a difference between passing ‘gsd’ through a general dictionary versus our domain-specific dictionary, which knows that ‘gsd’ is a valid term and stands for ‘German Shepherd Dog’:
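Here is a hypothetical sketch of that idea; the DOMAIN_TERMS mapping and the pyspellchecker library are illustrative assumptions, not our actual implementation:

```python
# Hypothetical sketch: protect domain-specific terms before general spell correction.
from spellchecker import SpellChecker  # pip install pyspellchecker

DOMAIN_TERMS = {"gsd": "german shepherd dog"}  # illustrative domain dictionary
spell = SpellChecker()

def correct(token: str) -> str:
    if token in DOMAIN_TERMS:              # the domain dictionary wins...
        return DOMAIN_TERMS[token]
    suggestion = spell.correction(token)   # ...otherwise fall back to the general one
    return suggestion or token

print([correct(t) for t in "my gsd strugles with grain".split()])
# expected: ['my', 'german shepherd dog', 'struggles', 'with', 'grain']
```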
Spell correction is a critical aspect of the NLP process. You can learn more about our approach in our article ‘Allways chek for speling erors’.
Characters like ‘#’, ‘%’, ‘&’, and ‘@’ are non-alphabetical and are sometimes used to make emoticons like ‘:)’ or to make the text more attractive. In NLP, all punctuation marks are considered ‘special characters’, and although some need to be removed for clarity, they shouldn’t all be removed, otherwise this can happen:
The meaning has changed dramatically: removing the decimal point takes the dog’s weight from 3.5 pounds to 35 pounds. Therefore, we keep these specific special characters to maintain meaning.
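One illustrative way to do this is a regex that strips punctuation while preserving decimal points between digits (a sketch, not a definitive rule set):

```python
# A minimal sketch: remove special characters but keep '.' between two digits.
import re

review = "my chihuahua weighs 3.5 pounds!! #tinydog :)"
no_punct = re.sub(r"[^a-z0-9\s.]", " ", review.lower())  # drop all but letters, digits, '.'
cleaned = re.sub(r"(?<!\d)\.|\.(?!\d)", " ", no_punct)   # drop '.' unless digits flank it
print(" ".join(cleaned.split()))  # 'my chihuahua weighs 3.5 pounds tinydog'
```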
Contractions such as ‘won’t’, ‘don’t’, and ‘isn’t’ are noisy because their apostrophes are treated as additional characters, which creates problems when we standardize the words using spell correction. For example, ‘don’t’ won’t be corrected to ‘do not’ directly and might be changed to something unrelated by a spell checker.
You might argue that we could treat the contractions as common vocabulary words and skip spell-checking them, but people often write contractions without apostrophes (like ‘dont’ or ‘wont’). So, rather than dealing with two separate forms of the same word (such as ‘don’t’ and ‘dont’), we expand the contractions for consistency.
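A minimal sketch of contraction expansion with a hand-built mapping; real pipelines often use a dedicated library, but the dictionary below is purely illustrative:

```python
# A minimal sketch: expand contractions, with and without apostrophes.
CONTRACTIONS = {
    "don't": "do not", "dont": "do not",
    "won't": "will not", "wont": "will not",
    "isn't": "is not", "isnt": "is not",
}

def expand(text: str) -> str:
    return " ".join(CONTRACTIONS.get(word, word) for word in text.lower().split())

print(expand("My dog dont like it and won't eat it"))
# 'my dog do not like it and will not eat it'
```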
The reviewer is excited about the product and has repeated several characters. So, we need to remove the incorrect character and punctuation repetition to make the text consistent.
[It should be noted that we only remove the incorrect repetitions. For example, the word ‘smelly’ repeats the letter ‘l’, but that’s the correct spelling, so we leave it as it is.]
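One common sketch of this step collapses long runs of repeated characters and then checks the result against a word list; the NLTK ‘words’ corpus used here is an illustrative assumption:

```python
# A minimal sketch: collapse runs of 3+ repeated characters, keeping valid
# double letters like the 'll' in 'smelly'. Requires nltk.download('words').
import re
from nltk.corpus import words

VOCAB = {w.lower() for w in words.words()}

def squeeze(word: str) -> str:
    doubled = re.sub(r"(.)\1{2,}", r"\1\1", word)   # 'realllly' -> 'really'
    if doubled in VOCAB:
        return doubled
    single = re.sub(r"(.)\1+", r"\1", doubled)      # 'soooo' -> 'soo' -> 'so'
    return single if single in VOCAB else doubled

print(squeeze("realllly"), squeeze("soooo"), squeeze("smelly"))
# expected: really so smelly
```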
For the above review, we have an accented character ‘é’ in the sentence:
When misused, these characters are noisy and break the consistency of the data. Therefore, we convert accented characters into standard ASCII form (American Standard Code for Information Interchange, the most common format for text files) to avoid multiple forms of a single word.
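Python’s standard unicodedata module is one straightforward way to do this conversion; a minimal sketch:

```python
# A minimal sketch: decompose accented characters (é -> e + accent mark),
# then drop the non-ASCII accent marks.
import unicodedata

def to_ascii(text: str) -> str:
    return (unicodedata.normalize("NFKD", text)
            .encode("ascii", "ignore")
            .decode("ascii"))

print(to_ascii("My pup devours this pâté entrée"))
# 'My pup devours this pate entree'
```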
Pre-processing receives very little attention in the interactions between clients and providers, yet it is an essential determinant of the quality and depth of insights.
For example, while visually impressive, a word cloud can camouflage inadequate pre-processing because it only displays high-level information.
The best way to get deeper, qualitative insights begins with how you prepare the data and handle the noise, so choose a provider whose interpretative tools are tuned to pet-specific consumer language.
If you’d like to learn more about how we go further, please read our article, "Allways chek for speling erors".