This page describes a semester project for my Machine Learning class at FIT. The project is an application that filters newsgroups for spam (“off-content postings”) by using the Naive Bayes algorithm to tell normal newsgroup posts apart from off-content posts.
To start off, I decided to create a program that could analyze a newsgroup post and tell which newsgroup the post belongs to. The newsgroups that I’m currently working with are:
I created a data set of newsgroup postings by downloading all posts in each newsgroup over a period of a few weeks. I used the Thunderbird e-mail application to subscribe to and download each newsgroup. The Classifier application then reads the files saved by Thunderbird to load the newsgroups. I used the same process at a later time to obtain postings from the same newsgroups that were not included in the original set. This second set contains 10 postings from each newsgroup and serves as test data. Reading each posting in the original data set and manually classifying it as “off-topic” would have taken more time than this project allowed. Therefore, I decided to simply classify each posting in the test data to predict which newsgroup it belongs to.
Bayes' theorem, on which the Naive Bayes algorithm is based, is applied to the problem as follows:
P(N|W) = [ P(W|N) * P(N) ] / P(W) where N = Newsgroups, W = Words.
A newsgroup posting is made up of a list of words. According to the formula above, the probability that a posting is in newsgroup N, given the words it contains, is equal to the probability that those words appear in a posting from the newsgroup, multiplied by the probability that any posting belongs to the newsgroup, divided by the probability that the words appear at all. Naive Bayes assumes that the words are independent of one another, so the probability is calculated for each word in the posting, and the per-word probabilities are then multiplied together to obtain the overall probability of the posting in the newsgroup.
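The calculation can be sketched in a few lines of Python. This is only a minimal illustration, not the Classifier application itself: the function names (`train`, `score`, `classify`) and the newsgroup names in the sample data are made up for the example. It also leaves out the division by P(W), on the assumption that only the ranking of newsgroups matters; P(W) is the same for every newsgroup, so dropping it does not change which newsgroup scores highest.

```python
from collections import Counter

def train(postings_by_group):
    """Build per-newsgroup word counts and priors.

    postings_by_group maps a newsgroup name to a list of postings,
    where each posting is a list of words.
    """
    total_posts = sum(len(posts) for posts in postings_by_group.values())
    model = {}
    for group, posts in postings_by_group.items():
        counts = Counter(word for post in posts for word in post)
        model[group] = {
            "prior": len(posts) / total_posts,   # P(N)
            "counts": counts,
            "total": sum(counts.values()),
        }
    return model

def score(model, group, words):
    """P(N) multiplied by P(w|N) for every word w in the posting."""
    entry = model[group]
    p = entry["prior"]
    for w in words:
        # A word never seen in this newsgroup has count 0, so the product becomes 0.
        p *= entry["counts"][w] / entry["total"]
    return p

def classify(model, words):
    """Return the newsgroup with the highest score for this posting."""
    return max(model, key=lambda group: score(model, group, words))

# Hypothetical toy data, not the real data set.
training = {
    "comp.example": [["linux", "kernel", "driver"], ["kernel", "patch"]],
    "rec.example": [["guitar", "chord"], ["guitar", "amp", "chord"]],
}
model = train(training)
print(classify(model, ["kernel", "driver"]))
```

With this toy data the posting is assigned to the hypothetical group "comp.example", because both of its words occur only there.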
One of the biggest problems with the Naive Bayes algorithm occurs when one of the probabilities is equal to 0. A probability of 0 occurs when a word in the posting is not found in the existing list of words from the newsgroup. Since all of the probabilities are multiplied, the resulting probability is also 0, regardless of the probabilities of the rest of the words in the posting. To correct this problem, I replace any probability of 0 with the probability that the word will be found in the English language. For a 4-letter word, that probability is 1 in 233,378.
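A rough sketch of that replacement step in Python follows. The names `FALLBACK_BY_LENGTH`, `PLACEHOLDER_FALLBACK`, `word_probability`, and `posting_probability` are invented for the example. Only the 4-letter figure comes from the text above; the value used for other word lengths is a placeholder I made up so that the code runs, not a real figure.

```python
from collections import Counter

# Probability of finding a word in the English language, keyed by word length.
# Only the 4-letter entry is taken from the text.
FALLBACK_BY_LENGTH = {4: 1 / 233_378}

# Hypothetical placeholder for word lengths with no known figure.
PLACEHOLDER_FALLBACK = 1e-6

def word_probability(word, counts, total):
    """P(word | newsgroup), with a fallback instead of a probability of 0."""
    count = counts.get(word, 0)
    if count == 0:
        return FALLBACK_BY_LENGTH.get(len(word), PLACEHOLDER_FALLBACK)
    return count / total

def posting_probability(words, counts, total, prior):
    """Prior multiplied by the per-word probabilities; never collapses to 0."""
    p = prior
    for w in words:
        p *= word_probability(w, counts, total)
    return p

# Hypothetical word counts for one newsgroup.
counts = Counter({"kernel": 2, "patch": 1, "linux": 1})
total = sum(counts.values())

# "boot" was never seen in this newsgroup, but the result stays above 0.
print(posting_probability(["kernel", "boot"], counts, total, prior=0.5))
```

Because the unseen word contributes a small but nonzero factor, the other words in the posting still influence the final result.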
The classifier seems to work very well, with around an 85% success rate for most test runs. The classifier can be configured to filter out words shorter than 4, 5, or 6 characters, and returns different results depending on which setting is selected. I would assume that the classifier is more accurate when it filters out words shorter than 6 characters, but I don’t have enough data to prove that to be the case.
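The word-length filter and the success rate could be measured along these lines. This is a sketch under assumptions: `filter_short_words`, `success_rate`, and `toy_classify` are names invented here, the toy classifier merely stands in for the real one, and the test postings are made up rather than taken from the project data.

```python
def filter_short_words(words, min_length):
    """Drop every word shorter than min_length characters."""
    return [w for w in words if len(w) >= min_length]

def success_rate(classify, test_set, min_length):
    """Fraction of test postings assigned to their true newsgroup.

    classify takes a list of words and returns a newsgroup name.
    test_set is a list of (true_newsgroup, words) pairs.
    """
    correct = 0
    for true_group, words in test_set:
        if classify(filter_short_words(words, min_length)) == true_group:
            correct += 1
    return correct / len(test_set)

def toy_classify(words):
    """Stand-in for the real classifier, only here to make the sketch runnable."""
    return "comp.example" if "kernel" in words else "rec.example"

# Hypothetical test postings.
test_set = [
    ("comp.example", ["the", "kernel", "is"]),
    ("rec.example", ["my", "guitar"]),
    ("comp.example", ["boot", "log"]),
]

# Compare the three filter settings mentioned above.
for min_length in (4, 5, 6):
    print(min_length, success_rate(toy_classify, test_set, min_length))
```

Running the same test set through each setting in a loop like this is one way to gather the data needed to compare the 4, 5, and 6 character filters.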