Fun with Google CodePro

Usually when a software tool has “Pro” in the name, it means that it is expensive. Sometimes it just means that the features are unlocked so that you can actually use the tool. In this case it means FREE.

I have used CodePro on another project, but started using it again to help analyze the Stanford Natural Language Parser (also free). I would like to use the StanfordNLP in a side project that I have in mind. The problem is that it has a large code base in a domain (language parsing) that I am not familiar with.

I will have to split this into multiple posts, but you can read about the metrics report below.


The Wumpus Hunter

The design uses a knowledge base to store known information about each location. A rule set is then used to determine what actions the agent should take based upon the information known. The knowledge base uses a forward chaining concept on its tell(percepts) function to add new information to its array of locations. It then uses a backward chaining concept on the ask(location) function to find all possible information about a particular location.
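The tell/ask split described above could be sketched roughly as follows. This is only an illustration, not the program's actual code: the Cell fields, the 4x4 grid, the percept names, and passing the location to tell() alongside the percepts are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    # Hypothetical per-location facts; the real program may track more.
    visited: bool = False
    safe: bool = False
    breeze: bool = False
    stench: bool = False


class KnowledgeBase:
    def __init__(self, size=4):
        self.cells = {(x, y): Cell() for x in range(size) for y in range(size)}

    def neighbors(self, loc):
        x, y = loc
        candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        return [c for c in candidates if c in self.cells]

    def tell(self, loc, percepts):
        """Forward chaining: store the percepts, then derive what follows."""
        cell = self.cells[loc]
        cell.visited = True
        cell.safe = True  # the agent is standing here and is still alive
        cell.breeze = "breeze" in percepts
        cell.stench = "stench" in percepts
        if not cell.breeze and not cell.stench:
            # No warning signs, so every adjacent location must be safe.
            for n in self.neighbors(loc):
                self.cells[n].safe = True

    def ask(self, loc):
        """Backward chaining: start from the question and check the evidence."""
        cell = self.cells[loc]
        seen = [self.cells[n] for n in self.neighbors(loc) if self.cells[n].visited]
        unknown = not cell.safe and bool(seen)
        return {
            "visited": cell.visited,
            "safe": cell.safe,
            # A hazard is possible only if every visited neighbor reported its sign.
            "possible_pit": unknown and all(c.breeze for c in seen),
            "possible_wumpus": unknown and all(c.stench for c in seen),
        }
```

For example, after tell((0, 0), set()) the neighbor (1, 0) is known to be safe, and after tell((1, 0), {"breeze"}) the answer for ask((2, 0)) flags a possible pit but no possible wumpus.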

While the knowledge base is fairly stable, the behavior of the agent can be changed quite drastically by manipulating the rule set. For now, I have the agent programmed to choose its next action by running a random function on a weighted set of possibilities. For example, the agent's first choice is to move to a location that has not been visited and that lies in the direction it is currently traveling.
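A weighted random choice like the one described could look something like this. The action names and weights are made up for the example; the actual rule set is not shown in this post.

```python
import random


def choose_action(weighted_actions, rng=random):
    """Pick one action at random, favoring actions with larger weights."""
    actions = list(weighted_actions)
    weights = [weighted_actions[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


# Hypothetical weights: prefer an unvisited location straight ahead.
options = {
    "forward_to_unvisited": 6,
    "turn_to_unvisited": 3,
    "backtrack_to_visited": 1,
}
action = choose_action(options)
```

Changing the weights is enough to change the agent's personality, which is why the rule set has so much more influence on behavior than the knowledge base does.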

The wumpus world program also includes a player view of the wumpus world, so that the user can play the part of the agent. The program still has some buggy behavior, but the agent consistently finds the gold as long as it is not forced to choose between two unknown locations.


Naive Bayes Classifier

This page contains information that I have gathered for a semester project for my Machine Learning class at FIT. The project consists of implementing an application that can filter newsgroups for spam (“off-content” postings) by using the Naive Bayes algorithm to differentiate normal newsgroup posts from off-content posts.

To start off, I decided to create a program that could analyze a newsgroup post and tell which newsgroup the post belongs to. The newsgroups that I’m currently working with are:

  • alt.guitar
  • alt.rec.camping
  • alt.rec.hiking
  • comp.lang.c++
  • comp.lang.lisp
  • comp.programming
  • comp.programming.threads
  • comp.unix.programmer
  • microsoft.public.win32.programmer.directx.managed

I created a data set of newsgroup postings by downloading all posts in each newsgroup over a period of a few weeks. I used the Thunderbird e-mail application to subscribe to and download each newsgroup. The Classifier application then reads the files saved by Thunderbird to load the newsgroups. The same process was used to obtain postings from the same newsgroups at a later time that were not included in the original set of postings. This second set of newsgroups contains 10 postings in each newsgroup to be used as test data. The time necessary to read each posting in the original set of data and manually classify it as “off-topic” was too great to accomplish for this project. Therefore, I decided to simply attempt to classify each posting in the test data to predict which newsgroup the posting belonged to.

Bayes' theorem is applied to the problem as follows:
P(N|W) = [ P(W|N) * P(N) ] / P(W) where N = Newsgroup, W = Words.
A newsgroup posting is made up of a list of words. According to the formula above, the probability that a posting is in newsgroup N is equal to the probability that the words will appear in the newsgroup, multiplied by the probability that a posting will appear in the newsgroup, divided by the probability that the words will appear at all. P(W|N) is calculated for each word in the posting, and the per-word probabilities are then multiplied together (this is the “naive” part: the words are treated as independent) to obtain the overall probability of the posting in the newsgroup.
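Here is a minimal sketch of that calculation, not the Classifier application itself. The function names and data layout are assumptions. Two liberties are taken: P(W) is dropped because it is the same for every newsgroup and does not affect which one wins, and logarithms are summed instead of multiplying raw probabilities so that long postings do not underflow to zero. The unseen_prob value is only a placeholder for words the newsgroup has never seen.

```python
import math
from collections import Counter


def train(postings_by_group):
    """postings_by_group maps a newsgroup name to a list of postings (lists of words)."""
    total_postings = sum(len(p) for p in postings_by_group.values())
    model = {}
    for group, postings in postings_by_group.items():
        counts = Counter(word for posting in postings for word in posting)
        model[group] = {
            "prior": len(postings) / total_postings,  # P(N)
            "counts": counts,
            "n_words": sum(counts.values()),
        }
    return model


def log_score(model, group, words, unseen_prob=1e-6):
    """log P(N) + sum of log P(w|N); proportional to log P(N|W)."""
    g = model[group]
    score = math.log(g["prior"])
    for word in words:
        count = g["counts"].get(word, 0)
        p = count / g["n_words"] if count else unseen_prob  # placeholder floor
        score += math.log(p)
    return score


def classify(model, words):
    """Return the newsgroup with the highest score for this posting."""
    return max(model, key=lambda group: log_score(model, group, words))
```

With a toy model trained on one guitar posting and one Lisp posting, classify(model, ["lisp", "macro"]) picks the Lisp newsgroup because both words have a nonzero P(w|N) there.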

One of the biggest problems with the Naive Bayes algorithm occurs when one of the probabilities is equal to 0. A probability of 0 occurs when a word in the posting is not found in the existing list of words from the newsgroup. Since all of the probabilities are multiplied together, the resulting probability is also 0, regardless of the probabilities of the rest of the words in the posting. To correct this problem, I replace any probability of 0 with the probability that the word will be found in the English language. For a 4-letter word, the probability is 1 in 233,378.
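The zero replacement could be sketched like this. The table holds only the one figure quoted above (4-letter words); values for other lengths are not given in this post, so the default below is a stand-in assumption, and the function and parameter names are hypothetical.

```python
# Only the 4-letter figure comes from the post; other lengths are not listed.
FALLBACK_BY_LENGTH = {4: 1 / 233_378}


def word_probability(word, counts, n_words,
                     fallback_by_length=FALLBACK_BY_LENGTH, default=1e-6):
    """P(word | newsgroup), never returning 0.

    counts:  word -> occurrences in the newsgroup's training postings
    n_words: total number of words counted for that newsgroup
    default: assumed stand-in for word lengths missing from the table
    """
    count = counts.get(word, 0)
    if count:
        return count / n_words
    # Unseen word: fall back to a small "English language" probability
    # instead of letting a 0 wipe out the whole product.
    return fallback_by_length.get(len(word), default)
```

An unseen 4-letter word such as "tent" then contributes 1/233,378 to the product rather than 0, so the other words in the posting still count.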

The classifier seems to work very well, with around an 85% success rate for most test runs. The classifier can be configured to filter out words shorter than 4, 5, or 6 characters, and it returns different results depending on which selection is made. I would assume that the classifier is more accurate when filtering out words shorter than 6 characters, but I don’t have enough data to prove that to be the case.
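One way to compare the three settings would be to measure the success rate for each minimum word length on the same test data. This is just a sketch: classify stands for any function that takes a list of words and returns a newsgroup name, and both helper names are made up for the example.

```python
def filter_words(words, min_length):
    """Drop words shorter than min_length characters."""
    return [w for w in words if len(w) >= min_length]


def success_rate(classify, test_postings, min_length):
    """test_postings is a list of (true_newsgroup, words) pairs."""
    correct = sum(
        1
        for group, words in test_postings
        if classify(filter_words(words, min_length)) == group
    )
    return correct / len(test_postings)
```

Running success_rate with min_length set to 4, 5, and 6 over a larger test set would show whether the longer cutoff really helps.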

  1. Here’s the application
  2. Here’s the Newsgroups
  3. Here’s the Test Data