Sentiment Analysis of Movie Reviews pt.3 -- n-gram

--by Charlie Chengrui Zheng 01/25/2021

N-gram

In Part 1's text preprocessing step, we tokenized the reviews word by word. For example, 'Very boring movie' is tokenized as ['very', 'boring', 'movie'].

This kind of model is called a unigram model, because we take only one token at a time. However, an n-gram model can tokenize the text differently: it takes a sequence of n consecutive words as a single token. For example, in a bigram (2-gram) model, 'Very boring movie' is tokenized as ['very boring', 'boring movie'].

In a trigram (3-gram) model, 'Very boring movie' is tokenized as the single token 'very boring movie'.
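To make this concrete, here is a minimal sketch of n-gram generation in plain Python. The `ngrams` helper is hypothetical, written only for illustration; later we let Scikit-learn do this for us:

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['very', 'boring', 'movie']
print(ngrams(tokens, 2))  # ['very boring', 'boring movie']
print(ngrams(tokens, 3))  # ['very boring movie']
```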

N-gram models are helpful in our sentiment analysis because sequences of words may carry semantics that matter for classification. For example, the unigram 'very' contains no sentiment per se, and 'boring' tells us the reviewer dislikes the movie. However, 'very boring' conveys that the reviewer really hates the movie, so it should be treated differently from 'boring' because it carries a stronger sentiment. Therefore, we need to find a good n-gram model for our sentiment analysis.

Parameter Tuning

In Scikit-learn's TfidfVectorizer, we can choose the n-gram model by passing in the ngram_range parameter, a tuple of the minimum n and maximum n. For example, (1,1) means we use only the unigram model, since the minimum and maximum n are both 1. (1,3) means we use the unigram, bigram, and trigram models together: 'Very boring movie' is tokenized as ['very', 'boring', 'movie', 'very boring', 'boring movie', 'very boring movie'].
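We can verify this with the vectorizer itself. A quick check, assuming scikit-learn 1.0+ (where the vocabulary is exposed via get_feature_names_out; note the features come back in alphabetical order):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit on a single review to see which n-grams (1,3) produces
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
vectorizer.fit(['Very boring movie'])
print(vectorizer.get_feature_names_out())
# ['boring' 'boring movie' 'movie' 'very' 'very boring'
#  'very boring movie']
```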

Therefore, we can refine the preprocess-and-classify function from Part 1 to accept an n-gram range, as below:
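A minimal sketch of the refined function, assuming the Part 1 setup of review texts split into train and test sets; the function and variable names here are placeholders, not necessarily those used in Part 1:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

def preprocess_and_classify(classifier, train_texts, train_labels,
                            test_texts, test_labels, ngram_range=(1, 1)):
    """Vectorize the reviews with the given n-gram range, train the
    classifier, and return its accuracy on the test set."""
    vectorizer = TfidfVectorizer(ngram_range=ngram_range)
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)
    classifier.fit(X_train, train_labels)
    return accuracy_score(test_labels, classifier.predict(X_test))
```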

Naive Bayes Classifier

From Part 1, the Multinomial Naive Bayes classifier was fast and accurate, so we will use MultinomialNB as our baseline model and tune its parameters. We can pass different ngram_range tuples, from (1,1) to (3,3), to the classifier and record the performance in a Pandas dataframe, as below:
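A sketch of that sweep, reusing the preprocess_and_classify function above; train_texts, train_labels, test_texts, and test_labels are assumed to be the train/test split from Part 1:

```python
import pandas as pd
from sklearn.naive_bayes import MultinomialNB

# Sweep every (min n, max n) pair up to trigrams and collect the scores
ngram_ranges = [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]
results = pd.DataFrame(
    [(r, preprocess_and_classify(MultinomialNB(),
                                 train_texts, train_labels,
                                 test_texts, test_labels,
                                 ngram_range=r))
     for r in ngram_ranges],
    columns=['ngram_range', 'accuracy'])
print(results)
```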

We can see that we must include unigrams, because (1,1), (1,2), and (1,3) all achieve great results. (2,2)'s performance is mediocre, and the accuracy of (2,3) and (3,3) drops to 0.5, no better than random guessing on a binary positive/negative task, so they are useless.

Smoothing

In the MultinomialNB model, we can tune the smoothing parameter $\alpha$ of Laplace smoothing to explore whether it yields a better result. For a more detailed introduction to Laplace smoothing, please refer to this article. We can choose $\alpha$ from the list [0.1, 0.5, 1, 1.5, 2, 2.5] and the n-gram model from (1,1), (1,2), and (1,3), then run the sentiment analysis and record the accuracy in a Pandas dataframe. In this way, we can find the best pair of parameters.
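A sketch of that grid, again reusing the preprocess_and_classify function and the assumed train/test variables from above; alpha maps directly to MultinomialNB's smoothing parameter:

```python
import pandas as pd
from sklearn.naive_bayes import MultinomialNB

alphas = [0.1, 0.5, 1, 1.5, 2, 2.5]
ngram_ranges = [(1, 1), (1, 2), (1, 3)]

# Accuracy for every (alpha, ngram_range) pair, one row per alpha
grid = pd.DataFrame(index=alphas,
                    columns=[str(r) for r in ngram_ranges], dtype=float)
for alpha in alphas:
    for r in ngram_ranges:
        grid.loc[alpha, str(r)] = preprocess_and_classify(
            MultinomialNB(alpha=alpha),
            train_texts, train_labels, test_texts, test_labels,
            ngram_range=r)
print(grid)
# The best pair of parameters is the cell with the highest accuracy
```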