Sentiment Analysis of Movie Reviews pt.4 -- BERT

--by Charlie Chengrui Zheng 01/25/2021

In parts 1, 2, and 3 of this study of sentiment analysis of IMDb movie reviews, I used classic supervised learning methods. In this part, I will use transfer learning with the model BERT.

Transfer Learning

Unlike the models I trained myself in parts 1–3, transfer learning is a method of applying an existing pre-trained model to a specific task. The model this time, BERT, is a state-of-the-art deep learning model for multiple purposes in the field of Natural Language Understanding, developed by Google researchers. By fine-tuning BERT on this dataset of IMDb movie reviews, I do not need to train my own model; instead, I adapt the transferred model to the specific task of predicting the sentiment of movie reviews. To put it figuratively: training a supervised learning model is like building a car yourself, while fine-tuning a transfer-learning model is like car tuning, buying a car and modifying it.

PyTorch and CUDA

After reading all the data, I will use PyTorch, a package for deep learning. CUDA in PyTorch is a toolkit that allows the use of a GPU for parallel computing (though I do not have a GPU on my local computer, so I will still use the CPU). To put it figuratively: when delivering cargo, using a CPU is like driving one fast car to move a small load, while using a GPU is like driving multiple pick-up trucks simultaneously to move a large load. Because the computation in deep learning is mostly large-scale matrix multiplication, using a GPU speeds up the process.
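In code, this device check is short; here is a minimal sketch using PyTorch's standard API, which falls back to the CPU when no GPU is present:

```python
import torch

# Use the GPU through CUDA if one is available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
```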

Tokenization

In part 1, I used TF-IDF to convert each sentence into a vector and a set of sentences into a matrix. In this part, I will use the BERT tokenizer directly to convert sentences into encodings, consisting of input IDs and attention masks. Input IDs are the tokens BERT uses, converted from the words in a sentence. Attention masks indicate which tokens the model should attend to: 1 for real tokens and 0 for padding.

Inside the model, the attention mechanism then decides how much weight each token deserves. For example, take the sentence "There are two birds." If "There" gets the most attention, we know that the birds are there, not here. If "two" gets the most attention, we know that there are more than one bird and fewer than three. If "birds" does, we know that the beings there are birds, not something else.

The attention mechanism in BERT makes this encoder effective for sentiment analysis because it notices that the words and phrases carrying emotion, such as "like" and "don't like", deserve the most attention in a movie review. Also, BERT reads sentences bidirectionally, both left-to-right and right-to-left, so it takes the position of a word in the sentence into account.

When tokenizing the sentences, the maximum length is set to 128 tokens. Sentences with more than 126 tokens are truncated (the other 2 tokens are [CLS] and [SEP], which mark the start and the end of a sentence). Sentences with fewer than 126 tokens are zero-padded at the end.

Let us take a look at the first review and see how it is encoded by the BERT tokenizer.
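Below is a minimal sketch of that encoding step using the Hugging Face transformers library; the variable `reviews` (a list of raw review strings) is an assumption for illustration:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# `reviews` is assumed to hold the raw review strings; encode the first one.
encoding = tokenizer(
    reviews[0],
    max_length=128,        # truncate to 128 tokens, [CLS] and [SEP] included
    truncation=True,
    padding="max_length",  # zero-pad shorter reviews up to 128 tokens
    return_tensors="pt",   # return PyTorch tensors
)

print(encoding["input_ids"])       # token IDs, starting with [CLS], ending with [SEP]
print(encoding["attention_mask"])  # 1 for real tokens, 0 for padding
```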

Dataset and DataLoader

After the sentences are encoded, the input IDs, attention masks, and labels are stored as tensors in a TensorDataset and split into training and testing sets. DataLoaders then wrap iterables over the datasets and support automatic batching, sampling, shuffling, and multiprocess data loading.

Batching means that the DataLoader sends one batch of encodings into the model at a time, enabling mini-batch Stochastic Gradient Descent in the training process. In this study, I set the batch size to 16. Each new batch is drawn from a shuffled ordering of the whole training set, as shown in the sketch below.
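A minimal sketch of this step, assuming `input_ids`, `attention_masks`, and `labels` are tensors built from the tokenizer output; the 90/10 train/test split is an assumption for illustration:

```python
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, random_split

# Bundle the encodings and labels into a single tensor dataset.
dataset = TensorDataset(input_ids, attention_masks, labels)

# Randomly split into training and testing sets (90/10 assumed here).
train_size = int(0.9 * len(dataset))
train_set, test_set = random_split(dataset, [train_size, len(dataset) - train_size])

# Shuffle the training data and serve it in mini-batches of 16.
train_loader = DataLoader(train_set, sampler=RandomSampler(train_set), batch_size=16)
test_loader = DataLoader(test_set, batch_size=16)
```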

Training

The number of epochs is how many times the full training set is passed through the model during training. I will set the number of epochs to 3; the usually recommended range for fine-tuning BERT is 2–4. I will use Adam as the optimizer for gradient descent.
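Here is a sketch of the fine-tuning loop under these settings, using `BertForSequenceClassification` from the transformers library and AdamW, the weight-decay-corrected Adam variant commonly used for BERT fine-tuning; the learning rate of 2e-5 is a common choice, not a value stated above:

```python
import torch
from transformers import BertForSequenceClassification

# Pre-trained BERT with a binary classification head (positive/negative).
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

epochs = 3
for epoch in range(epochs):
    model.train()
    for input_ids, attention_mask, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(
            input_ids.to(device),
            attention_mask=attention_mask.to(device),
            labels=labels.to(device),
        )
        outputs.loss.backward()  # the model returns the cross-entropy loss directly
        optimizer.step()
```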

Evaluation

We can see that the accuracy is higher than 90%, better than the classic supervised learning models in parts 1–3, which shows that BERT is a more advanced model for sentiment analysis.
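For reference, a minimal evaluation sketch that reuses `model`, `device`, and `test_loader` from the sketches above:

```python
# Evaluate the fine-tuned model on the held-out test set.
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for input_ids, attention_mask, labels in test_loader:
        outputs = model(
            input_ids.to(device),
            attention_mask=attention_mask.to(device),
        )
        preds = outputs.logits.argmax(dim=1)  # predicted sentiment per review
        correct += (preds == labels.to(device)).sum().item()
        total += labels.size(0)

print(f"Accuracy: {correct / total:.4f}")
```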