Part of speech tagger

Fast and Robust Part-of-Speech Tagging Using Dynamic Model Selection, Jinho D. Choi, Proceedings of the AAAI 2015 Student Program, Phoenix, AZ, 2015.

Intrinsic and Extrinsic Evaluations of Word Embeddings, Michael Zhai, Johnny Tan, Jinho D. Choi, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (NAACL'16), San Diego, CA, 2016.

Dynamic Feature Induction: The Last Gist to the State-of-the-Art, Jinho D. Choi.

Our part-of-speech tagger uses the generalized model from dynamic model selection and utilizes ambiguity classes trained on a large corpus. It processes over 82K tokens per second on an Intel Xeon 2.30GHz machine and shows state-of-the-art accuracy (97.64% on the WSJ corpus).

In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST) is also called grammatical tagging or word-category disambiguation.

Token: each entity that is a part of whatever was split up based on rules.

“Improving statistical POS tagging using linguistic features for Hindi and Telugu” by S Phani Kumar Gadde, Meher Vijay Yeleti.

“HMM BASED POS TAGGER FOR HINDI” by Nisheeth Joshi, Hemant Darbari and Iti Mathur.

“Hindi POS Tagger Using Naive Stemming : Harnessing Morphological Information Without Extensive Linguistic Knowledge” by Manish and Pushpak.

I have tried using the Google Translator API to handle the “Unk” tags by translating the words, getting their tags, and then appending them to the NLTK Indian corpus, which gave a pretty good result.

You can try training on the data and predicting the “Unk” words using an SVM or HMM tagger.

We can also use Google Translator to translate the words and get their tags.

We can add more tagged sentences to the NLTK hindi.pos corpus.

We can try handling compound words as proposed by.

We can use frequency-based probabilities for the next word, as proposed by.
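To make the first few ideas a bit more concrete, here is a minimal sketch of patching “Unk” tags from an extra word-to-tag lexicon and checking how many unknowns remain. The helper names `fill_unknown` and `unknown_rate`, the lexicon itself, and the "NN" tag assigned to मशीन are all assumptions made up for illustration; in practice the lexicon could come from translation, hand tagging, or another tagger.

```python
def fill_unknown(tagged_words, fallback_lexicon, unknown_tag="Unk"):
    """Replace `unknown_tag` with the tag from `fallback_lexicon` when the word is listed."""
    return [
        (word, fallback_lexicon.get(word, tag) if tag == unknown_tag else tag)
        for word, tag in tagged_words
    ]


def unknown_rate(tagged_words, unknown_tag="Unk"):
    """Return the share of tokens that still carry `unknown_tag`."""
    if not tagged_words:
        return 0.0
    return sum(tag == unknown_tag for _, tag in tagged_words) / len(tagged_words)


# Tagger output in the same shape as the example in this post.
tagged = [("वाशिंग", "Unk"), ("मशीन", "Unk")]

# Hypothetical fallback lexicon; the entry and its tag are illustrative only.
fallback = {"मशीन": "NN"}

patched = fill_unknown(tagged, fallback)
print(patched)
print(unknown_rate(tagged), unknown_rate(patched))
```

Words that are missing from the fallback lexicon keep their “Unk” tag, so the unknown rate gives a quick way to see how much each extra source of tags actually helps.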

Each part of speech explains how the word is used. In fact, the same word can be a noun in one sentence and a verb or adjective in the next. Tag is a keyword. tTAG is a part-of-speech tagger which can handle plain ASCII text and XML marked-up text.

Let’s use the already tagged data which is given in nltk to train the tagger.

    import nltk
    from nltk.tag import tnt
    from nltk.corpus import indian

    # Train the TnT tagger on the tagged Hindi sentences shipped with NLTK
    train_data = indian.tagged_sents('hindi.pos')
    tnt_pos_tagger = tnt.TnT()
    tnt_pos_tagger.train(train_data)

Let’s tag the text now!

    # text holds the Hindi sentence to tag
    tagged_words = tnt_pos_tagger.tag(nltk.word_tokenize(text))
    print(tagged_words)

The main issue here is that the nltk data is not complete. So we tend to get the tag “Unk” most of the time while tagging the words, e.g. ('वाशिंग', 'Unk'), ('मशीन', 'Unk').

S Phani Kumar Gadde and Meher Vijay Yeleti used a CRF-based tagger and Brants' TnT (Brants, 2000), an HMM-based tagger, for Hindi POS tagging, where they got an accuracy of 94.21%. Nisheeth Joshi, Hemant Darbari and Iti Mathur also researched Hindi POS tagging using a Hidden Markov Model, where the transition probability is the frequency count of two tags seen together in the corpus divided by the frequency count of the previous tag seen independently in the corpus. Manish and Pushpak researched Hindi POS tagging using a simple HMM-based POS tagger with an accuracy of 93.12%.

So today we’ll be using the TnT tagger to tag Hindi words!
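The transition probability described above (count of two tags seen together divided by the count of the previous tag on its own) can be sketched in a few lines of plain Python. This is only an illustration of that ratio, not the code used in any of the papers; the function name `transition_probabilities` and the toy sentences are made up for the example.

```python
from collections import Counter


def transition_probabilities(tagged_sents):
    """Estimate P(tag | previous tag) as count(previous, tag) / count(previous)."""
    pair_counts = Counter()  # how often two tags are seen together, in order
    tag_counts = Counter()   # how often each tag is seen on its own
    for sent in tagged_sents:
        tags = [tag for _, tag in sent]
        tag_counts.update(tags)
        pair_counts.update(zip(tags, tags[1:]))
    return {
        (prev, cur): count / tag_counts[prev]
        for (prev, cur), count in pair_counts.items()
    }


# Toy tagged sentences, invented for illustration (not from a real corpus).
toy_sents = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN")],
]

probs = transition_probabilities(toy_sents)
print(probs[("DET", "NOUN")])
print(probs[("NOUN", "VERB")])
```

Because the denominator counts every occurrence of the previous tag, including sentence-final ones, the probabilities leaving a tag do not have to sum to one; a real tagger would usually add smoothing and sentence boundary markers on top of this.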

Part of speech plays a major role in NLP tasks, as it is important to know how a word is used in every sentence. POS tagging is used mostly for keyword extraction, phrase extraction, Named Entity Recognition, etc.

Before going further on POS tagging, I am assuming that you all know about parts of speech, as we all studied grammar in school. Didn’t we? But anyway, let me give a brief explanation! There are eight main parts of speech: nouns (naming words), pronouns (replace a noun), adjectives (describing words), verbs (action words), adverbs (describe a verb), prepositions (show relationships), conjunctions (joining words) and interjections (expressive words). Most of them are further divided into sub-parts; a noun, for example, is divided into proper nouns, common nouns, concrete nouns, etc. Reminds you of school days?

Okay, now let’s start with Hindi part-of-speech tagging. It is something that people are still doing research on, as we have various techniques and libraries available for English text but rarely for Hindi text.

The part-of-speech tagger assigns parts of speech to tokens based on lexical statistics (the frequency with which a word is assigned a given part of speech) and POS bigram statistics (the frequency with which part of speech X is followed by part of speech Y). Jet provides a tagger file trained on a portion of the part-of-speech tagged Penn.
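As a rough illustration of the lexical statistics mentioned above (how often a word is assigned a given part of speech), here is a small sketch that counts tags per word and picks the most frequent one. The function names, the toy sentences and the tag labels are assumptions for the example and are not taken from Jet or NLTK.

```python
from collections import Counter, defaultdict


def lexical_statistics(tagged_sents):
    """Count how often each word is assigned each tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return counts


def most_frequent_tag(counts, word, default="Unk"):
    """Return the tag seen most often for `word`, or `default` if the word is unseen."""
    if word not in counts:
        return default
    return counts[word].most_common(1)[0][0]


# Toy tagged sentences, invented for illustration (not from a real corpus).
toy_sents = [
    [("I", "PRON"), ("book", "VERB"), ("flights", "NOUN")],
    [("the", "DET"), ("book", "NOUN"), ("fell", "VERB")],
    [("a", "DET"), ("book", "NOUN")],
]

stats = lexical_statistics(toy_sents)
print(dict(stats["book"]))
print(most_frequent_tag(stats, "book"))
print(most_frequent_tag(stats, "table"))
```

Here "book" shows up both as a verb and as a noun, and a word that never appears in the training data falls back to the default tag, which is the same situation that produces “Unk” in a tagger trained on a small corpus.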