Bayesian Spam Filter

-- by Chengrui Charlie Zheng

Introduction

A spam filter is a filter in your email to filter spams, the mails you do not want to receive, out of hams, the legitimate emails you want. There are multiple ways of implementing a spam filter. I will use Naive Bayes Classifier and Bag-of-Words model to implement a Bayesian spam filter, the oldest kind of spam filter. This page will walk you through the process of implementation, training and testing.

Email and Text Cleaning

When a email comes into your mailbox, there are a lot of features to determine whether the email is a spam or not. For a simple demonstration, I will only be analyzing the words in the body of a email, by calculating the probability of the word occuring in a spam. Here, I will first create a class of email as in the following code. The class consists of a constructor, which cleans the text, getters and setters for the attributes.

An email should have an "ID", a string list of "words", and a "score" indicating its spam score. The higher the spam score is, the more likely the mail is spam. When a new mail comes in, we will first set its score to be 0 by default. As for "words", we will do the text cleaning by using 'nltk' module in python. First, it will read the text from the file, and tokenize the text to split the text into words. For example, if we have a sentence "What do you mean by tokenizing?". Aftering tokenizing the text, it will extract every tokens(a word, a number or a punctuation), and make the tokens into a string list.

Next, we will filter the stop words out of the tokenized string list. Stop words are the commonly used but semantically insignificant words such as "a" and "by". Here are the new string list from our last example without stop words.

Also, we only want to keep the stem of each words by making the letters lowercase, removing the inflection and some derivation; so that "tokenized", "tokenizing" and "tokenize" will be counted as the same word "token".

Ultimately, we will remove the numbers and punctuations in the string list. After the text is clean, we can pass the emails to the filter.

Filter and Naive Bayes Classifier

This filter will be using a Naive Bayes Classifier to classify the comming emails. The more detailed application of Bayes' Theorem will be explained below. Here is the class of filter.

Constructor

In the constructor, we can set a filter's threshold of the spam score. Hams with scores lower than the threshold will be passed into "inbox", an email list resembling the inbox in your mailbox. Spams with scores higher than the threshold will be passed into "spams", an email list of spams. There are also 2 bag-of-words, "spam_bow" for spams and "ham_bow" for hams. The bag-of-words model is a hashmap or a dictionary containing the words appeared in mails as keys, and the number of mails each word appeared as values. The example of the bag-of-words will be illustrated in the Training section.

Naive Bayes Foundation

After the filter is constructed, it will calculate the spamicity, the probability that a message containing a given word is spam in "word_spamicity" function. The calculation process is based on Bayes' Theorem: $$\Pr(S|W)=\frac{\Pr(W|S)\Pr(S)}{\Pr(W)}$$ $Pr(S|W)$ is the probability that the email is a spam, knowing that the target word is in it. Namely, the probability that the email containing the target word is a spam.
$Pr(W|S)$ is the probability that the target word is in the email, knowing that the email is a spam. Namely, the probability that the target word appears in a spam.
$Pr(S)$ is the probability that the given email is a spam.
$Pr(W)$ is the probability that the word appears in any emails.

To futher break down $Pr(W)$, we can calculate it as: $$\Pr(W)={\Pr(W|S)\Pr(S)}+{\Pr(W|H)\Pr(H)}$$ Because the word appears either appears in a ham or spam, we can calculate $Pr(w)$ as the sum of the probability that the word appears in hams, and the probability that the word appears in spams.

The probability that the word appears in spams can be calculated as $Pr(W|S)Pr(S)$. Namely, the probability that the probability that the target word appears in a spam multiplied by the probability that the given email is a spam.

The probability that the word appears in hams can be calculated as $Pr(W|H)Pr(H)$. Namely, the probability that the probability that the target word appears in a ham multiplied by the probability that the given email is a ham.

To plug in the formula of $Pr(w)$ into the formula of Bayes' Theorem, we can the formula: $$\Pr(S|W)=\frac{\Pr(W|S)\Pr(S)}{\Pr(W|S)\Pr(S)+\Pr(W|H)\Pr(H)}$$

Now, the question is what are $Pr(S)$ and $Pr(H)$? If we are taking more features into account, such as probability of the email from a certain email address is 80% a spam and 20% a ham, we can take the exact value into calculation. However, because we are only using Naive Bayes Classifier to analyze the body of emails, we have unbiased hypothesis on whether a email is spam or ham. Therefore, $\Pr(S)=\Pr(H)=0.5$. Since $Pr(S) = Pr(H)$, we can simplify the formula above as: $$\Pr(S|W)=\frac{\Pr(W|S)}{\Pr(W|S)+\Pr(W|H)}$$

Spamicity and Spam Score

Now we found a way to calculate the probability that the email containing the target word is a spam. In our context, $Pr(W|S)$ can be calculated as the number of spams containing the target word divided by the total number of spams. We can look up the word in "spam_bow" to obtain the number of spams containing the target word, and get the length of "spams" to obtain the total number of spams.

On the other hand, $Pr(W|H)$ can be calculated as the number of hams containing the target word divided by the total number of hams. We can look up the word in "ham_bow" to obtain the number of hams containing the target word, and get the length of "inbox" to obtain the total number of hams.

We also need to consider an edge case. What if the filter has never seen the target word before? Since the target word is not in both of the bag-of-words. $Pr(W|S)$ and $Pr(W|H)$ will be 0, resulting in the divisor of our formula, $Pr(W|S) + Pr(W|H)$ to be 0. To solve this problem, we will be make the filter be more lenient and let the word pass the filter with a spamicity of 0.

After figuring out the way to calculate the spamicity of each word, the "spam_score" function will loop through each word of the email and calculate the spam score of email according to the spamicity of each word.

The last step is to receive a email. In the "receive" function, the filter will receive the email, calculate the spam score of that email and sort it to either "inbox" as a ham, or "spams" as a spam. This function will also further train the filter by adding the words of the email to the corresponding bag-of-words. Note: a certain word may appear multiple times in a email, but we are only adding it once a email, because the value of that word in the bag-of-words denotes the number of emails the word appears.

Training

After constructing the 2 classes, we need to train the filter to let it work. For a real filter, we need a great amount of spams and hams to train it. But for demonstration purpose, I will only use 2 spams and 2 hams for training. The training spams will be subscription confirmation emails we hate to receive. Here is the text of one of the training spams.

Thank you for subscribing. To begin receiving the newsletter, you must first confirm your subscription.

Now, we will pass the training data in a filter with a threshold of 10.

Now the filter is trained, let us explore what are the bag-of-words of spams look like now.

We can see that the 2 training spams are added to the "spams" of "spamFilter"

The 2 training hams are also added to the "inbox" of "spam_filter"

We can calculate the spamicity of a word's stem. For example, let us try "subscript" and "thank".

"Subscript" gets the spamicity of 1 because it appears twice in the 2 spams but not in any hams. "Thank" gets the spamicity of 0.5 because it appears once in the 2 spams and once in the 2 hams. We can see the spam scores of these 4 training emails.

Testing

Let us pass the testing data in the filter to check if the filter is well-trained. This is the text of the testing spam, another subscription confirmation email:

Please Confirm Subscription. Yes, subscribe me to this list. If you received this email by mistake, simply delete it. You won't be subscribed if you don't click the confirmation link above. For questions about this list, please contact:

This is the text of the testing ham, a email about COVID-19:

We understand that the recent news and uncertainty surrounding the COVID-19 situation may have caused you to re-think your travel plans and future travel options. Whether you have a trip booked or are planning upcoming travel, we will do whatever we can to support you. We are continually monitoring the situation, including travel restrictions and updates to travel policies that may impact you.

The filter successfully filtered "spam3_test" as a spam and passed "ham3_test" as a ham.

However, this filter is only for demonstration purpose of the application of simple text cleaning and Naive Bayes Classifier. In the real life, we need to consider more features other than contents, such as senders' addresses and subject. We need more complicated models, like decision trees, to build a spam filter like that in Gmail.