Sentiment Analysis of Movie Reviews pt.2 -- LSA

--by Charlie Chengrui Zheng 01/15/2021

Recap

Remember from part 1, I built a sentiment analysis of movie reviews using a dataset of IMDb reviews. There, I first preprocessed the text of all reviews into lowercase stemmed tokens with numbers and punctuation stripped. Then, I used TF-IDF as the word embedding to vectorize all words into a sparse matrix. Afterwards, I ran several selected machine-learning models on those features, the sparse matrix, and evaluated their performance.

However, this was just the basics of sentiment analysis. Because we have a relatively small dataset (1,000 entries), we do not need to worry about the dimensionality of the features. Last time, the feature dimension was 2,317, as shown below:
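(Part 1's exact variables aren't reproduced here, so treat this as a sketch: `reviews` stands in for the 1,000 preprocessed review strings from part 1.)

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# `reviews` is assumed to hold the 1,000 preprocessed review strings from part 1
features = TfidfVectorizer().fit_transform(reviews)
print(features.shape)   # -> (1000, 2317) on the part-1 corpus
```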

However, what if we have a large dataset, like a million entries? A 1,000,000 x 2,317 matrix is enormous even stored sparsely. It requires substantial computing power to store and process, and it can be exceedingly time-consuming. Therefore, we want to reduce the dimensionality of the features and speed up our machine-learning process at the cost of a small loss of accuracy. In this part, I will apply LSA to the sentiment analysis.

LSA and SVD

Latent Semantic Analysis (LSA) is a common word-embedding method used in topic modelling. It is also useful for text classification. In short, LSA performs Singular Value Decomposition (SVD) on the matrix produced by TF-IDF vectorization. If you are not familiar with TF-IDF, please refer to part 1 of this study. SVD is a powerful matrix decomposition method used in Natural Language Processing. In NLP, we often encounter a huge number of feature dimensions. We can use SVD to keep only a small number of those dimensions, yielding a truncated matrix for machine learning. (Here is an informative tutorial of SVD)

To briefly explain SVD, we can decompose a matrix $A$ as

$A = USV^T$

The original matrix $A$ is an $n \times p$ matrix. $U$ is an $n \times n$ matrix containing the $n$ left singular vectors. $V$ is a $p \times p$ matrix containing the $p$ right singular vectors. $S$ is an $n \times p$ diagonal matrix with the singular values on its diagonal; all off-diagonal elements are 0. The singular values are ordered by magnitude: $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_k$.

Among all those singular values, we can keep just the small number we want. For example, I will keep the first 100 of all 2,317 singular values in this case, in order to significantly reduce the dimensionality. We keep the first 100 left singular vectors in $U_k$, the first 100 singular values in the diagonal matrix $S_k$, and the first 100 right singular vectors in $V_k$, and multiply them to get the truncated matrix $A_k$:

$A_k = U_k S_k V_k^T$
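To make the truncation concrete, here is a small numpy sketch with toy values (a random 6 x 4 matrix and $k = 2$, standing in for the real $n \times p$ TF-IDF matrix and $k = 100$):

```python
import numpy as np

# toy stand-in for the n x p TF-IDF matrix
A = np.random.rand(6, 4)
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # s is sorted in decreasing order

k = 2                                              # keep the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # truncated matrix A_k = U_k S_k V_k^T
print(np.linalg.norm(A - A_k))                     # approximation error shrinks as k grows
```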

In Python, we can use scikit-learn to compute the truncated SVD and normalize the result to transform the feature matrix.
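Here is a minimal sketch of that step (assuming `features` holds the sparse TF-IDF matrix from part 1; the component count of 100 follows the discussion above):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline

# keep the first 100 singular values/vectors, then L2-normalize each row
lsa = make_pipeline(TruncatedSVD(n_components=100), Normalizer(copy=False))

# `features` is assumed to be the sparse TF-IDF matrix from part 1
features_lsa = lsa.fit_transform(features)
print(features_lsa.shape)   # -> (1000, 100)
```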

We can integrate this snippet into the preprocess function used in part 1 and transform the dataset. Here is the 2.0 version of the preprocess function:
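Since part 1's helper isn't reproduced here, the sketch below reconstructs it from its description (lowercasing, stripping numbers and punctuation, stemming, TF-IDF) and appends the LSA step; the function name and signature are illustrative:

```python
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline

stemmer = PorterStemmer()

def preprocess(reviews, n_components=100):
    """Version 2.0: part 1's TF-IDF pipeline followed by LSA."""
    # lowercase, strip numbers and punctuation, stem each token (as in part 1)
    cleaned = []
    for review in reviews:
        tokens = re.sub(r"[^a-z\s]", " ", review.lower()).split()
        cleaned.append(" ".join(stemmer.stem(t) for t in tokens))

    # TF-IDF vectorization into a sparse matrix
    features = TfidfVectorizer().fit_transform(cleaned)

    # new in 2.0: truncated SVD plus row normalization (LSA)
    lsa = make_pipeline(TruncatedSVD(n_components=n_components),
                        Normalizer(copy=False))
    return lsa.fit_transform(features)
```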

Performance

We can apply the data transformed by LSA to the machine-learning models and monitor the change in their performance. Remember that in part 1, the result for Linear Discriminant Analysis was:

Time cost of LinearDiscriminantAnalysis(): 0.79s
The accuracy of LinearDiscriminantAnalysis(): 0.71

Now that the data's dimensionality is significantly reduced, we should expect the time cost to improve drastically. Also, we do not even need to densify the data this time: unlike the sparse TF-IDF matrix, the LSA output is already dense.
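A sketch of the timing run (the train/test split and timing harness are my assumptions; `features_lsa` and `labels` are assumed to come from the preprocessing step above):

```python
import time
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# `features_lsa` and `labels` are assumed to come from the preprocess step above
X_train, X_test, y_train, y_test = train_test_split(
    features_lsa, labels, test_size=0.2, random_state=42)

model = LinearDiscriminantAnalysis()
start = time.time()
model.fit(X_train, y_train)          # no .toarray() needed: LSA output is dense
print(f"Time cost of {model}: {round(time.time() - start, 2)}s")
print(f"The accuracy of {model}: {model.score(X_test, y_test)}")
```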

We just reduced the time cost from 0.79s to 0.05s. This is a giant leap in speed!

Remember that last time, we also tried the Logistic Regression, MultinomialNB, SVC, SGD, and MLP classifiers. Their performance was:

Time cost of LogisticRegression(): 0.03s
The accuracy of LogisticRegression(): 0.825

Time cost of MultinomialNB(): 0.0s
The accuracy of MultinomialNB(): 0.825

Time cost of SVC(): 0.09s
The accuracy of SVC(): 0.835

Time cost of SGDClassifier(): 0.0s
The accuracy of SGDClassifier(): 0.82

Time cost of MLPClassifier(): 3.47s
The accuracy of MLPClassifier(): 0.81

We can try all those models and observe their performance:
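A sketch of that loop, reusing the split from the LDA run above (the MinMaxScaler workaround for MultinomialNB is my addition, since truncated SVD can produce negative values that MultinomialNB rejects):

```python
import time
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

models = [LogisticRegression(), MultinomialNB(), SVC(),
          SGDClassifier(), MLPClassifier()]

for model in models:
    X_tr, X_te = X_train, X_test
    if isinstance(model, MultinomialNB):
        # MultinomialNB rejects negative inputs, which truncated SVD can
        # produce; rescaling to [0, 1] is my workaround, not part 1's code
        scaler = MinMaxScaler().fit(X_tr)
        X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
    start = time.time()
    model.fit(X_tr, y_train)
    print(f"Time cost of {model}: {round(time.time() - start, 2)}s")
    print(f"The accuracy of {model}: {model.score(X_te, y_test)}")
```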

We can see that even though the accuracy decreases a little for all those models, the speed is much improved. The time cost is reduced most significantly for the complicated MLP classifier. Therefore, the MLP classifier and the Linear Discriminant classifier can now be included in the ensemble classifier.