
Introduction to pyvi: Python Vietnamese NLP Toolkit

Using the pyvi package for tokenization, POS tagging and accent mark modifications

Ng Wai Foong

Sep 7 · 4 min read

I have covered quite a number of articles on NLP toolkits for various Asian languages in the past:

  • Khmer Natural Language Processing in Python
  • Beginners Guide to PyThaiNLP
  • Korean Natural Language Processing in Python
  • SudachiPy: A Japanese Morphological Analyzer in Python

Today, let's explore a little further with Vietnamese instead. By reading this piece, you will learn how to perform linguistic analysis on Vietnamese text via an open-source Python package called pyvi.

At the time of this writing, pyvi offers the following functionalities:

  • Tokenization
  • POS tagging
  • Accent marks removal
  • Accent marks adding

Let's proceed to the next section and install the necessary package.

Setup

Before you continue with the installation, it is recommended to create a new virtual environment. Activate it and run the following command:

pip install pyvi
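
To confirm that the installation succeeded, you can try importing the package from the activated environment (a quick sanity check, not part of pyvi's documentation):

python -c "from pyvi import ViTokenizer, ViPosTagger, ViUtils; print('pyvi is ready')"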

Tokenization

Tokenize

In this section, you will learn to perform tokenization on Vietnamese text. Create a new Python file and add the following code inside it.

from pyvi import ViTokenizer

text = 'Xin chào! Rất vui được gặp bạn.'
result = ViTokenizer.tokenize(text)
print(result)

You should get the following output:

Xin chào ! Rất vui được gặp bạn .

Each token is separated by a whitespace character. You can easily convert the output to a list by splitting it on whitespace:

result.split(' ')

The new output is as follows:

['Xin', 'chào', '!', 'Rất', 'vui', 'được', 'gặp', 'bạn', '.']
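
Keep in mind that pyvi performs word segmentation rather than plain whitespace splitting, so multi-syllable Vietnamese words are joined into a single token with underscores. Here is a rough sketch of that behaviour (the exact segmentation may differ depending on the model version):

compound_text = 'Trường đại học bách khoa hà nội'
print(ViTokenizer.tokenize(compound_text))
# expected output along the lines of: Trường đại_học bách_khoa hà_nội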

spacy_tokenize

Besides that, pyvi provides an alternative function called spacy_tokenize for better integration with the spaCy package. Simply call it as follows:

result = ViTokenizer.spacy_tokenize(text)

The output is a tuple with the following items:

  • a list of tokenized tokens
  • a list of booleans indicating whether each token is followed by a space

You should get the following output when you run the file:

(['Xin', 'chào', '!', 'Rất', 'vui', 'được', 'gặp', 'bạn', '.'], [True, False, True, True, True, True, True, False, False])

Use index 0 to get the list:

result[0]
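
If you ever need to rebuild the original string from the spacy_tokenize output, the two lists can be zipped together, appending a space only where the corresponding boolean is True. This is a minimal sketch based on the tuple layout shown above:

tokens, spaces = ViTokenizer.spacy_tokenize(text)
# join each token with a trailing space only when the flag says one follows it
rebuilt = ''.join(token + (' ' if has_space else '') for token, has_space in zip(tokens, spaces))
print(rebuilt)  # Xin chào! Rất vui được gặp bạn.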

POS Tagging

There are two steps involved in POS tagging:

  • tokenize the text into a whitespace-delimited string
  • perform POS tagging on the tokenized text

postagging

Simply call the postagging function after you have tokenized the text:

from pyvi import ViPosTagger

result = ViPosTagger.postagging(ViTokenizer.tokenize(text))
print(result)

The following text will be displayed on your terminal:

(['Xin', 'chào', '!', 'Rất', 'vui', 'được', 'gặp', 'bạn', '.'], ['V', 'V', 'F', 'R', 'A', 'R', 'V', 'N', 'F'])

Likewise, the result is a tuple with the following items:

  • a list of tokens
  • a list of POS tags, one for each token

Simply loop over the lists to get the corresponding tag for each token:

for index in range(len(result[0])):
    print(result[0][index], result[1][index])

You should see the following output:

Xin V
chào V
! F
...
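
Alternatively, zip pairs the two lists directly, which reads a little cleaner than indexing and produces the same output:

for token, tag in zip(result[0], result[1]):
    print(token, tag)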

The complete list of POS tags is as follows:

  • A Adjective
  • C Coordinating conjunction
  • E Preposition
  • I Interjection
  • L Determiner
  • M Numeral
  • N Common noun
  • Nc Noun Classifier
  • Ny Noun abbreviation
  • Np Proper noun
  • Nu Unit noun
  • P Pronoun
  • R Adverb
  • S Subordinating conjunction
  • T Auxiliary, modal words
  • V Verb
  • X Unknown
  • F Filtered out (punctuation)
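
As a small illustration of how these tags can be used, the snippet below keeps only the tokens tagged as nouns or verbs and discards the rest. It is just a sketch built on the tuple returned by postagging:

tokens, tags = ViPosTagger.postagging(ViTokenizer.tokenize(text))
# keep content words only; the tag set comes from the list above
content_words = [token for token, tag in zip(tokens, tags) if tag in ('N', 'Np', 'Nc', 'Nu', 'Ny', 'V')]
print(content_words)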

postagging_tokens

In addition, there is an alternative function called postagging_tokens which accepts a list of tokens instead. You can use it in conjunction with spacy_tokenize to get the same output:

tokens = ViTokenizer.spacy_tokenize(text)[0]
result = ViPosTagger.postagging_tokens(tokens)
print(result)

Accent Marks Removal

Sometimes a situation might arise in which accent marks (diacritics) should be removed from the text. In this case, you can use the remove_accents function to do the trick:

from pyvi import ViUtils

result = ViUtils.remove_accents(text)
print(result)

It returns a byte string:

b'Xin chao! Rat vui duoc gap ban.'

If you wish to use the output as a string, all you need to do is decode it using UTF-8:

result.decode('utf-8')

Accent Mark Adding

On the other hand, Vietnamese text can be ambiguous and confusing without accent marks. Use the add_accents function to convert unaccented text to accented text:

unaccented_text = 'truong dai hoc bach khoa ha noi'
result = ViUtils.add_accents(unaccented_text)
print(result)

The output should be as follows:

Trường Đại học Bách Khoa hà nội
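
Putting the two utilities together, you can strip the diacritics from a sentence and then let pyvi try to restore them. This is a rough round-trip sketch, and the restored accents may not always match the original exactly:

stripped = ViUtils.remove_accents(text).decode('utf-8')  # bytes -> str
restored = ViUtils.add_accents(stripped)
print(stripped)
print(restored)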

Conclusion

Let's recap what you have learned today.

This article started with a brief introduction to the features available in pyvi.

Then, it moved on to the installation process via pip install.

It continued with in-depth explanations of some of the core functionalities, such as tokenization and POS tagging. Moreover, this tutorial also covered sections on diacritics, including removing and adding accent marks to Vietnamese text.

Thanks for reading this piece. Feel free to read my other articles. Have a great day ahead!

Reference

  1. pyvi on GitHub
