Decision Forests

From Advanced Projects Lab
Revision as of 02:00, 19 March 2016 by Wikiuser (talk | contribs) (Code on Github)

A parallel Decision Forest

All of the code that I used for my email spam/ham classification algorithm is available here - GitHub repo

Explanation of some files

  • emaildata.py - contains the EmailData class, whose methods are used to extract data from each individual email.
  • extractwords.py - contains the ExtractWords class, which is used for automatic feature generation.
    • Reads a large number of emails and finds how frequently each word appears in spam emails and in non-spam emails. The two frequencies for each word are then subtracted from each other, and the words whose differences have the largest magnitudes are used as features. (A large magnitude marks a word as particularly spammy, or particularly non-spammy.) A new file, message.fts, is written, containing the features.
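
The feature selection described for extractwords.py can be sketched roughly as follows. This is only an illustration, not the code from the repo: the function names word_frequencies and select_features are hypothetical, "frequency" is assumed here to mean the fraction of emails in which a word appears at least once, and the tokenizer is a simple assumed regular expression. Writing the result to message.fts is left out because the file format is not described on this page.

```python
import re
from collections import Counter


def word_frequencies(emails):
    """Return, for each word, the fraction of emails that contain it.

    Assumption: a word is counted once per email, however often it repeats.
    """
    counts = Counter()
    for text in emails:
        # Hypothetical tokenizer: lowercase runs of letters and apostrophes.
        counts.update(set(re.findall(r"[a-z']+", text.lower())))
    total = len(emails) or 1  # avoid dividing by zero on an empty corpus
    return {word: n / total for word, n in counts.items()}


def select_features(spam_emails, ham_emails, k):
    """Pick the k words whose spam/ham frequency difference is largest in magnitude."""
    spam_freq = word_frequencies(spam_emails)
    ham_freq = word_frequencies(ham_emails)
    vocabulary = set(spam_freq) | set(ham_freq)
    # Positive difference: more common in spam. Negative: more common in ham.
    diff = {w: spam_freq.get(w, 0.0) - ham_freq.get(w, 0.0) for w in vocabulary}
    # Sort by descending magnitude; break ties alphabetically for stable output.
    ranked = sorted(vocabulary, key=lambda w: (-abs(diff[w]), w))
    return ranked[:k]


if __name__ == "__main__":
    spam = ["free money now", "free prize now"]
    ham = ["meeting at noon", "lunch meeting today"]
    print(select_features(spam, ham, 3))
```

Ranking by the absolute value of the difference keeps both strongly spammy words and strongly non-spammy words, since either kind helps a classifier separate the two classes.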