What is ViTokenizer?
Introduction to pyvi: Python Vietnamese NLP Toolkit

Utilizing the pyvi package for tokenization, POS tagging and accent mark modifications

I have covered a number of articles on NLP toolkits for various Asian languages in the past.
Today, let's explore Vietnamese instead. By reading this piece, you will learn to perform linguistic analysis on Vietnamese text via an open-source Python package called pyvi. At the time of this writing, pyvi offers the following functionalities: tokenization, POS tagging, accent mark removal, and accent mark adding.
Let's proceed to the next section and start installing the necessary packages.

Setup

Before you continue with the installation, it is recommended to create a new virtual environment. Activate it and run the following command:

    pip install pyvi

Tokenization

Tokenize

In this section, you will learn to perform tokenization on Vietnamese text. Create a new Python file and add the following code inside it:

    from pyvi import ViTokenizer

    # Sample sentence (matches the outputs shown in this article)
    text = 'Xin chào! Rất vui được gặp bạn.'
    result = ViTokenizer.tokenize(text)
    print(result)

You should get the following output:

    Xin chào ! Rất vui được gặp bạn .

Each token is separated by whitespace. You can easily convert the result to a list by splitting it on spaces:

    result.split(' ')

The new output is as follows:

    ['Xin', 'chào', '!', 'Rất', 'vui', 'được', 'gặp', 'bạn', '.']

spacy_tokenize

Besides that, pyvi provides an alternative function called spacy_tokenize for better integration with the spaCy package. Simply call it as follows:

    result = ViTokenizer.spacy_tokenize(text)

The output is a tuple with the following items: a list of tokens, and a list of boolean values with one entry per token.
You should get the following output when you run the file:

    (['Xin', 'chào', '!', 'Rất', 'vui', 'được', 'gặp', 'bạn', '.'], [True, False, True, True, True, True, True, False, False])

Use index 0 to get the list of tokens:

    result[0]

POS Tagging

There are two steps involved in POS tagging: tokenize the text, then pass the tokenized result to the tagger.
postagging

Simply call the postagging function after you have tokenized the text:

    from pyvi import ViPosTagger, ViTokenizer

    result = ViPosTagger.postagging(ViTokenizer.tokenize(text))
    print(result)

The following text will be displayed on your terminal:

    (['Xin', 'chào', '!', 'Rất', 'vui', 'được', 'gặp', 'bạn', '.'], ['V', 'V', 'F', 'R', 'A', 'R', 'V', 'N', 'F'])

Likewise, the result is a tuple with the following items: a list of tokens, and a list of the corresponding POS tags.
Simply loop over the two lists to get the corresponding tag for each token:

    for index in range(len(result[0])):
        print(result[0][index], result[1][index])

You should see the following output:

    Xin V
    chào V
    ! F
    ...

The complete list of POS tags is as follows:
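As a variation on the index-based loop shown earlier in this section, the two lists in the result tuple can be paired with zip. The sketch below hard-codes the tuple from the sample postagging output so that it runs without pyvi installed. The helper name pair_tokens_with_tags is made up for illustration and is not part of pyvi, and the filter relies only on the observation that the punctuation tokens in the sample output carry the tag F.

```python
# Result tuple copied from the sample postagging output in this article.
result = (
    ['Xin', 'chào', '!', 'Rất', 'vui', 'được', 'gặp', 'bạn', '.'],
    ['V', 'V', 'F', 'R', 'A', 'R', 'V', 'N', 'F'],
)


def pair_tokens_with_tags(tagged):
    """Turn a (tokens, tags) tuple into a list of (token, tag) pairs.

    Hypothetical helper for illustration; not a pyvi function.
    """
    tokens, tags = tagged
    return list(zip(tokens, tags))


pairs = pair_tokens_with_tags(result)

# In the sample output, '!' and '.' are tagged 'F', so dropping that tag
# leaves only the word tokens.
words_only = [(token, tag) for token, tag in pairs if tag != 'F']

for token, tag in words_only:
    print(token, tag)
```

Working with pairs instead of parallel indices makes it harder to mix up a token with the wrong tag when you filter or sort the results.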
postagging_tokens

In addition, there is an alternative function called postagging_tokens, which accepts a list of tokens instead. You can use it in conjunction with spacy_tokenize to get the same output:

    tokens = ViTokenizer.spacy_tokenize(text)[0]
    result = ViPosTagger.postagging_tokens(tokens)
    print(result)

Accent Mark Removal

Sometimes accent marks (diacritics) need to be removed from the text. In this case, you can use the remove_accents function:

    from pyvi import ViUtils

    result = ViUtils.remove_accents(text)
    print(result)

It returns a byte string:

    b'Xin chao! Rat vui duoc gap ban.'

If you wish to use the output as a string, all you need to do is decode it as UTF-8:

    result.decode('utf8')

Accent Mark Adding

On the other hand, Vietnamese text can be ambiguous and confusing without accent marks. Use the add_accents function to convert unaccented text to accented text:

    unaccented_text = 'truong dai hoc bach khoa ha noi'
    result = ViUtils.add_accents(unaccented_text)
    print(result)

The output should be as follows:

    Trường Đại học Bách Khoa hà nội

Conclusion

Let's recap what you have learned today. This article started with a brief introduction to the features available in pyvi. Then, it moved on to the installation process via pip install. It continued with explanations of the core functionalities, namely tokenization and POS tagging. Finally, it covered diacritics, including both removing accent marks from Vietnamese text and adding them back. Thanks for reading this piece. Feel free to read my other articles. Have a great day ahead!
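A closing note on the accent mark removal step: the sketch below approximates diacritic stripping using only the Python standard library, and also shows how a byte string is turned into a regular string. It is not pyvi's implementation. The helper strip_accents is hypothetical, and the explicit mapping for đ and Đ is an assumption added because those letters do not decompose under Unicode normalization.

```python
import unicodedata


def strip_accents(text):
    """Approximate accent removal with the standard library (not pyvi's code)."""
    # 'đ' and 'Đ' have no Unicode decomposition, so map them by hand.
    text = text.replace('đ', 'd').replace('Đ', 'D')
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    # Drop the combining marks and keep every other character.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


plain = strip_accents('Xin chào! Rất vui được gặp bạn.')
print(plain)

# A byte string like the one shown in the accent removal section becomes
# a regular str with decode (encode goes the other way, from str to bytes).
as_bytes = b'Xin chao! Rat vui duoc gap ban.'
print(as_bytes.decode('utf-8'))
```

Unlike the byte string shown in the accent removal section, this helper returns a str directly, so no decoding step is needed afterwards.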