[PYTHON/NLTK] Getting started with POS tagging

POS stands for “part of speech”: a category of words (or, more generally, of lexical items) that have similar grammatical properties. POS tagging is the process of marking up each word in a text as corresponding to a particular part of speech, based on both its definition and its context.

When we talk about POS, the most frequently used tag notation is that of the Penn Treebank. See its tag set here. There are also other tag sets used in POS tagging, such as the Brown Corpus tag set.

Since this post is about getting started with POS tagging, I will start with the pos_tag function provided by NLTK. It uses one of the pre-trained POS taggers that come with NLTK.

>>> import nltk
>>> sentence = "At eight o'clock on thursday morning"
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'thursday', 'morning']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'NN'), ('on', 'IN'), ('thursday', 'JJ'), ('morning', 'NN')]
>>> for word, pos in tagged:  # print the words tagged NN
...     if pos == 'NN':
...         print(word)
...
o'clock
morning
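
The loop above picks out a single tag. As a minimal sketch of a more general approach, the snippet below groups every word by its tag and also collects all noun-like tags by prefix. The tagged list is copied from the session above so the example runs without NLTK or its data files; group_by_tag is a hypothetical helper name, not part of NLTK.

```python
from collections import defaultdict

# Tagged output copied from the session above (no NLTK needed to run this).
tagged = [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'NN'),
          ('on', 'IN'), ('thursday', 'JJ'), ('morning', 'NN')]


def group_by_tag(tagged_tokens):
    """Hypothetical helper: map each POS tag to the words carrying it."""
    groups = defaultdict(list)
    for word, pos in tagged_tokens:
        groups[pos].append(word)
    return dict(groups)


words_by_tag = group_by_tag(tagged)

# Penn Treebank noun tags all start with 'NN' (NN, NNS, NNP, NNPS).
nouns = [word for word, pos in tagged if pos.startswith('NN')]

print(words_by_tag['NN'])
print(nouns)
```

Grouping once and looking tags up afterwards avoids rewriting the loop for every tag you care about.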

 

NLTK provides documentation for each tag, which can be queried using the tag, e.g. nltk.help.upenn_tagset('RB'), or a regular expression, e.g. nltk.help.upenn_tagset('NN.*'). Some corpora have README files with tagset documentation; see nltk.corpus.???.readme(), substituting in the name of the corpus.
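
To get a feel for what a pattern such as 'NN.*' selects, here is a rough sketch that uses only Python's re module on a hand-picked subset of Penn Treebank tags. It does not call NLTK, and it assumes the pattern must match the whole tag; SAMPLE_TAGS and matching_tags are hypothetical names introduced for illustration.

```python
import re

# A hand-picked subset of Penn Treebank tags, for illustration only.
SAMPLE_TAGS = ['CD', 'IN', 'JJ', 'NN', 'NNS', 'NNP', 'NNPS', 'RB', 'RBR', 'RBS']


def matching_tags(pattern, tags=SAMPLE_TAGS):
    """Hypothetical helper: return the tags that the whole pattern matches."""
    return [tag for tag in tags if re.fullmatch(pattern, tag)]


noun_tags = matching_tags('NN.*')   # every tag that starts with NN
adverb_tags = matching_tags('RB')   # the exact tag RB only

print(noun_tags)
print(adverb_tags)
```

Under that assumption, a plain tag name selects one entry, while a pattern like 'NN.*' selects the whole noun family.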

 

Ref: 

NLTK Essentials, Nitin Hardeniya, PACKT publishing

http://www.nltk.org/

Wikipedia