Spam Blaster

Author: Michael Bartling

Overview

Spam Blaster is a simple attempt to thwart spam using basic data mining techniques. The idea is to train a bag-of-words classifier to estimate the probability that a sample was drawn from a given language distribution. Each sentence in a document is a feature vector, and the document is classified as spam or not-spam by ensembling the individual sentence predictions.
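As a rough illustration of the sentence-then-document flow, here is a minimal sketch. It assumes the ensemble is a plain average of per-sentence spam probabilities and a naive punctuation-based sentence splitter. The names `split_sentences`, `document_score`, `classify_document`, and the `sentence_model` callable are hypothetical and are not taken from the repository.

```python
import re


def split_sentences(document):
    """Naive splitter (an assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def document_score(sentence_probs):
    """Ensemble per-sentence spam probabilities by taking their mean."""
    if not sentence_probs:
        raise ValueError("document contains no sentences")
    return sum(sentence_probs) / len(sentence_probs)


def classify_document(document, sentence_model, threshold=0.5):
    """Score each sentence with `sentence_model`, then average.

    `sentence_model` is any callable mapping a sentence to a spam
    probability in [0, 1]; the 0.5 threshold is an illustrative default.
    """
    probs = [sentence_model(s) for s in split_sentences(document)]
    score = document_score(probs)
    return score, score >= threshold
```

Any trained sentence classifier can be passed in as `sentence_model`. Averaging lets a few misclassified sentences be outvoted by the rest of the document.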

Notes on Spam

Currently, most of the spam samples are taken directly from the mbed forums. However, this spam is not very diverse and often takes the form WORD1 WORD1 WORD1 RANDOMWORD WORD1 WORD1 WORD1. Although this makes it easy to detect with signature-based approaches, signatures must be hand-crafted and tend to over-specialize. Therefore, I extend the spam training set with samples taken from http://untroubled.org/spam/

SpamBlaster operates on sentence-level feature vectors, where each feature is a character trigram vectorized using the hashing trick. This works because we expect to see common character sequences in real user posts, such as "pin", "read", "write", "the", and "and", whereas advertisements (especially primitive spam) use a different subset of atoms with different frequencies.
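To make the hashing trick concrete, here is a minimal sketch that turns one sentence into a fixed-length vector of trigram counts. The bucket count of 1024, the use of `zlib.crc32` as the hash, the lowercasing step, and the function names are all assumptions made for illustration. The project may use a different vectorizer and different settings.

```python
import zlib


def char_trigrams(sentence):
    """Return every overlapping 3-character window of the lowercased sentence."""
    text = sentence.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def hash_trigrams(sentence, n_buckets=1024):
    """Map trigrams into a fixed-size count vector via hashing.

    No vocabulary is stored: each trigram's bucket is its CRC32 hash
    modulo `n_buckets`, so unseen trigrams still get a slot. Distinct
    trigrams may collide in the same bucket.
    """
    vec = [0] * n_buckets
    for tri in char_trigrams(sentence):
        idx = zlib.crc32(tri.encode("utf-8")) % n_buckets
        vec[idx] += 1
    return vec
```

The vector length stays constant regardless of how many distinct trigrams appear in the training data, which is the main appeal of hashing over a learned vocabulary.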

Results

Training SpamBlaster on notebook/forum posts from the past 10 days and validating with stratified k-fold cross-validation (7 folds) yields an AUC of 86%. However, this is at the sentence level. After bagging the predictions at the document level (taking an average), prediction accuracy improves vastly. Note that nearly all of the spam posts are rudimentary at best. Consequently, if we retrain with more advanced spam samples, we expect to see an AUC of 90%-98%. Update: I added SMS spam to the training samples.
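For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen spam sample receives a higher score than a randomly chosen non-spam sample. The sketch below computes it by direct pairwise comparison. The function name `roc_auc` and the toy labels and scores in the usage note are illustrative only and are not the project's data or evaluation code.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: fraction of (spam, non-spam) pairs ranked correctly.

    `labels` holds 1 for spam and 0 for non-spam; ties count as half.
    This is O(n_pos * n_neg), fine for small examples.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one spam and one non-spam sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

For example, `roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])` ranks three of the four spam/non-spam pairs correctly and returns 0.75. The same function applies to sentence-level scores or to averaged document-level scores.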

ROC on a sentence-by-sentence level

[Figure: ROC curve at the sentence level]

ROC on a document level

[Figure: ROC curve at the document level]
