In Part 1's text preprocessing step, we tokenized the words in each review one by one. For example, 'Very boring movie' will be tokenized as ['very', 'boring', 'movie'].
This kind of model is called a unigram model, because we take only one token at a time. An n-gram model, however, tokenizes the text differently: it takes a sequence of n consecutive tokens at a time. For example, in a bigram (2-gram) model, 'Very boring movie' will be tokenized as ['very boring', 'boring movie'].
In a trigram (3-gram) model, 'Very boring movie' will be tokenized as the single token ['very boring movie'].
N-gram models are helpful in our sentiment analysis because sequences of words may carry more important semantics for classification. For example, the unigram 'very' carries no sentiment on its own, and 'boring' tells us the reviewer dislikes the movie, but 'very boring' conveys that the reviewer really hates it. Since 'very boring' carries a stronger sentiment than 'boring' alone, it should be treated differently. Therefore, we need to find a good n-gram range for our sentiment analysis.
In Scikit-learn's TfidfVectorizer, we can choose the n-gram range by passing the ngram_range parameter, a tuple of (minimum n, maximum n). For example, (1, 1) means that we only use the unigram model, since the minimum n and maximum n are both 1. (1, 3) means that we use the unigram, bigram, and trigram models together, so 'Very boring movie' will be tokenized as ['very', 'boring', 'movie', 'very boring', 'boring movie', 'very boring movie'].
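To make this concrete, here is a minimal sketch (on a toy one-sentence corpus, not our review data) showing how ngram_range changes the vocabulary that TfidfVectorizer builds:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy example: fit on a single sentence just to inspect the generated n-grams.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
vectorizer.fit(['Very boring movie'])

# get_feature_names_out() is the scikit-learn >= 1.0 name;
# older versions expose the same list via get_feature_names().
print(vectorizer.get_feature_names_out())
# ['boring' 'boring movie' 'movie' 'very' 'very boring' 'very boring movie']
```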
Therefore, we can refine the preprocess and classify functions from Part 1 as below:
```python
from nltk.stem import PorterStemmer                          # stem the words
from nltk.tokenize import word_tokenize                      # tokenize the sentences into tokens
from string import punctuation
from sklearn.feature_extraction.text import TfidfVectorizer  # vectorize the texts
from sklearn.model_selection import train_test_split         # split the testing and training sets

def preprocess(path, ngram):
    '''generate cleaned dataset

    Args:
        path (string): the path of the file of labelled data
        ngram (tuple (min_n, max_n)): the range of the n-gram model

    Returns:
        X_train: the features of the training data
        X_test: the features of the testing data
        y_train (list): the targets of the training data (1 or 0)
        y_test (list): the targets of the testing data (1 or 0)
    '''
    # text preprocessing: iterate through the original file
    with open(path, encoding='utf-8') as file:
        # record all sentences and their labels
        labels = []
        preprocessed = []
        stemmer = PorterStemmer()
        for line in file:
            # get sentence and label
            sentence, label = line.strip('\n').split('\t')
            labels.append(int(label))
            # remove punctuation and numbers
            for ch in punctuation + '0123456789':
                sentence = sentence.replace(ch, ' ')
            # tokenize the words and stem them
            words = []
            for w in word_tokenize(sentence):
                words.append(stemmer.stem(w))
            preprocessed.append(' '.join(words))

    # vectorize the texts with the chosen n-gram range
    vectorizer = TfidfVectorizer(stop_words='english', sublinear_tf=True, ngram_range=ngram)
    X = vectorizer.fit_transform(preprocessed)
    # split the testing and training sets
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)
    return X_train, X_test, y_train, y_test
```
```python
from sklearn.metrics import accuracy_score

def classify(clf):
    '''classify the data using the given machine learning model

    Args:
        clf: the model chosen to analyze the data

    Returns:
        accuracy (float): the accuracy of the model on the testing set
    '''
    # uses the global X_train, X_test, y_train, y_test produced by preprocess()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    return accuracy
```
In Part 1, the Multinomial Naive Bayes classifier was fast and accurate, so we will use MultinomialNB as the baseline model and tune the n-gram range for it. We can pass different ngram_range tuples, from (1, 1) to (3, 3), into the preprocess function and record the performance in a Pandas dataframe as below:
```python
from sklearn.naive_bayes import MultinomialNB
import pandas as pd

# create a dictionary to record the accuracy for each ngram_range
d = {}
# iterate through each ngram_range
for ngram in [(1,1), (1,2), (1,3), (2,2), (2,3), (3,3)]:
    X_train, X_test, y_train, y_test = preprocess('imdb_labelled.txt', ngram)
    d[str(ngram)] = [classify(MultinomialNB())]
df = pd.DataFrame(data=d)
df
```
| | (1, 1) | (1, 2) | (1, 3) | (2, 2) | (2, 3) | (3, 3) |
|---|---|---|---|---|---|---|
| 0 | 0.795 | 0.81 | 0.79 | 0.615 | 0.58 | 0.5 |
We can see that we must include unigrams, because (1,1), (1,2), and (1,3) achieve good results. (2,2)'s performance is mediocre, and the accuracy of (2,3) and (3,3) drops to around 0.5, no better than random guessing, so pure bigram/trigram models are not useful here.
In the MultinomialNB model, we can also tune the smoothing parameter $\alpha$ of Laplace smoothing to look for a better result. For a more detailed introduction to Laplace smoothing, please refer to this article. We can choose $\alpha$ from the list [0.1, 0.5, 1, 1.5, 2, 2.5] and the n-gram range from (1,1), (1,2), and (1,3), then run the sentiment analysis and record the accuracy in a Pandas dataframe. In this way, we can find the best pair of parameters.
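As a quick refresher on what $\alpha$ does: MultinomialNB estimates the probability of a token $w$ given a class $c$ from smoothed counts,

$$\hat{P}(w \mid c) = \frac{N_{w,c} + \alpha}{N_c + \alpha\,|V|}$$

where $N_{w,c}$ is the (tf-idf weighted) count of $w$ in class $c$, $N_c$ is the total count over class $c$, and $|V|$ is the vocabulary size. Setting $\alpha = 1$ gives classic Laplace smoothing, while $0 < \alpha < 1$ is Lidstone smoothing; either way, the smoothing prevents zero probabilities for n-grams that never appear in a class.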
```python
alpha_list = [0.1, 0.5, 1, 1.5, 2, 2.5]
d = {'alpha': alpha_list}
for ngram in [(1,1), (1,2), (1,3)]:
    acc = []
    for value in alpha_list:
        X_train, X_test, y_train, y_test = preprocess('imdb_labelled.txt', ngram)
        acc.append(classify(MultinomialNB(alpha=value)))
    d[ngram] = acc
df = pd.DataFrame(data=d)
df
```
| | alpha | (1, 1) | (1, 2) | (1, 3) |
|---|---|---|---|---|
| 0 | 0.1 | 0.815 | 0.785 | 0.780 |
| 1 | 0.5 | 0.775 | 0.785 | 0.835 |
| 2 | 1.0 | 0.760 | 0.815 | 0.825 |
| 3 | 1.5 | 0.850 | 0.765 | 0.795 |
| 4 | 2.0 | 0.805 | 0.765 | 0.795 |
| 5 | 2.5 | 0.805 | 0.745 | 0.800 |
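Since there is no single obvious winner by eye, a small sketch like the one below (assuming the dataframe above is still in df) can read off the best pair of parameters programmatically. Note that preprocess calls train_test_split without a fixed random_state, so the exact numbers will differ slightly between runs.

```python
# Find the (ngram_range, alpha) pair with the highest accuracy in df.
scores = df.drop(columns='alpha')           # keep only the accuracy columns
best_ngram = scores.max().idxmax()          # column (n-gram range) with the best score
best_alpha = df.loc[scores[best_ngram].idxmax(), 'alpha']
print('best ngram_range:', best_ngram, ' best alpha:', best_alpha)
```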