[PYTHON/NLTK] Stop Word Removal, Rare Word Removal and Edit Distance

In this post, Python commands for stop word removal, rare word removal, and finding the edit distance (all parts of text wrangling and cleansing) will be shared.

Stop Word Removal

Stop words are the words that occur commonly across all the documents in the corpus. Stop word removal is one of the most commonly used pre-processing steps across different NLP applications.

Python commands for stop word removal:

>>> tokens = ['hello', 'java', 'python', 'the']
>>> from nltk.corpus import stopwords
>>> stop = set(stopwords.words('english'))
>>> clean_tokens = [tok for tok in tokens if len(tok.lower()) > 1 and (tok.lower() not in stop)]
>>> clean_tokens
['hello', 'java', 'python']
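
Note that stopwords.words('english') raises a LookupError if the stopwords corpus has not been installed yet; it only needs to be downloaded once:

>>> import nltk
>>> nltk.download('stopwords')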


 

Rare Word Removal

Sometimes, you would need to remove words that are very unique in nature, like names, brands, and product names, as well as noise characters such as HTML leftovers. This is called “rare word removal”.

Python commands for rare word removal:

>>> import nltk
>>> tokens = ['hi','i','am','am','whatever','this','is','just','a','test','test','java','python','java']
>>> freq_dist = nltk.FreqDist(tokens)
>>> rarewords = [word for word, count in freq_dist.most_common()[-5:]]
>>> after_rare_words = [word for word in tokens if word not in rarewords]

FreqDist() gives the frequency distribution of the terms in the corpus; the rarest terms are then selected into a list (since the “tokens” array is not very big, only the last 5 items are taken here), and the original token list is filtered against it. I used a very simple array for the tokens, but you can try this out on your own corpus or on individual documents as well (e.g., data parsed from an open API).
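
If “rare” specifically means “occurs only once”, FreqDist also provides a hapaxes() method that returns exactly those words, so the same filtering can be written without slicing most_common():

>>> rarewords = freq_dist.hapaxes()
>>> after_rare_words = [word for word in tokens if word not in rarewords]
>>> after_rare_words
['am', 'am', 'test', 'test', 'java', 'java']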

 

Edit Distance

Some NLP projects would require you to use basic spell checking. A very basic and simple spellchecker can be created by just using a dictionary and an algorithm to compare strings, and “edit distance” is one of the most commonly used methods for this (a small sketch is shown at the end of this post).

In computational linguistics and computer science, edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other.

To find the edit distance, we need three types of operations: insertion, deletion, and substitution. Put briefly, the edit distance is the minimal number of such operations needed to turn one string into the other.

 

Example:

The edit distance (also known as the Levenshtein distance) between “rain” and “shine” is 3. A minimal edit script that transforms the former into the latter is:

  • rain → sain (substitution of “s” for “r”)
  • sain → shin (substitution of “h” for “a”)
  • shin → shine (insertion of “e” at the end)
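
To make the counting concrete, here is a minimal sketch of the classic dynamic-programming computation of the Levenshtein distance (a plain-Python illustration, not NLTK's own implementation):

def levenshtein(s1, s2):
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution (or match)
    return dp[m][n]

>>> levenshtein("rain", "shine")
3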

 

NLTK provides us with an nltk.metrics module that includes an edit_distance function.

Python commands for finding edit distance between two strings:

>>> from nltk.metrics import edit_distance
>>> edit_distance("rain", "shine")
3
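
Building on this, the basic spellchecker mentioned above can be sketched by picking the dictionary word with the smallest edit distance to the input. The tiny word_list below is just a placeholder made up for illustration; in practice you would plug in a real word list (e.g., nltk.corpus.words.words(), after downloading the “words” corpus):

from nltk.metrics import edit_distance

# Placeholder dictionary for illustration only.
word_list = ['rain', 'shine', 'python', 'java', 'hello']

def suggest(word):
    # Return the dictionary word closest to the input by edit distance.
    return min(word_list, key=lambda w: edit_distance(word, w))

>>> suggest('shne')
'shine'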

 

Ref: 

NLTK Essentials, Nitin Hardeniya, Packt Publishing

http://www.nltk.org/

Wikipedia: Edit distance (https://en.wikipedia.org/wiki/Edit_distance)