Case Study: Spam Prediction
Background
- Spam dataset originally provided by George Forman of Hewlett Packard Laboratories in Palo Alto, CA. The dataset is available at ftp.ivs.uci.edu. You may also download a copy here.
- It contains information on 4601 emails, of which 1813 are categorized as spam and 2788 as legitimate.
- It has 57 numeric predictors (continuous or ordinal):
  - Percentage of words or characters in the email that match a given word or character, such as 'free', 'address', 'business', '!', '?', and '#'
  - Average, sum, and maximum length of uninterrupted sequences of capital letters
- The goal is to build a spam filter by predicting whether a given incoming email is spam
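The predictors described above can be illustrated with a minimal sketch. The function names and the tokenization rules here (words as runs of letters and digits, capital runs as runs of A-Z) are assumptions made for illustration; the dataset's own definitions may differ in detail.

```python
import re


def capital_run_features(text):
    """Return (average, longest, total) length of uninterrupted capital-letter runs.

    Assumption: a "run" is a maximal sequence of the characters A-Z.
    """
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return 0.0, 0, 0
    total = sum(runs)
    return total / len(runs), max(runs), total


def word_percentage(text, word):
    """Percentage of words in the text equal to `word`, ignoring case.

    Assumption: a word is a maximal run of letters and digits.
    """
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    if not words:
        return 0.0
    return 100.0 * words.count(word.lower()) / len(words)


def char_percentage(text, char):
    """Percentage of characters in the text equal to `char`."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)
```

For example, `capital_run_features("FREE offer NOW!!")` finds two capital runs, "FREE" and "NOW", so the average, longest, and total lengths are 3.5, 4, and 7.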
Predictive Modeling
- 1536 records were randomly selected for testing and 3065 records were allocated to the training data
- Various modeling techniques were used to build a spam prediction model. See more detail...
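A random split of this kind can be sketched as follows. Only the record counts come from the case study; the seed, the function name, and the use of a simple shuffle are assumptions, since the original sampling procedure is not described.

```python
import random

N_RECORDS = 4601  # total emails in the dataset
N_TEST = 1536     # records held out for testing


def split_indices(n_records=N_RECORDS, n_test=N_TEST, seed=0):
    """Randomly partition record indices into (train, test) lists.

    Assumption: a seeded shuffle stands in for the unspecified
    random selection used in the case study.
    """
    rng = random.Random(seed)
    indices = list(range(n_records))
    rng.shuffle(indices)
    test = sorted(indices[:n_test])
    train = sorted(indices[n_test:])
    return train, test


train_idx, test_idx = split_indices()
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the partition reproducible, so that every modeling technique is trained and tested on the same records.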
Model Comparison
- The comparison shows that Gains# delivers excellent prediction results with great interpretability.
Caution: One test does not prove that Gains# can always outperform other methods. It does, however, show that Gains# can compete with these techniques while being transparent and easy to interpret.
Copyright © 2001-2009 InfoDecipher Corp. All Rights Reserved.