Statistical (Bayesian) filtering is used by many mail filters. Although the idea is older, practical Bayesian filters took off after Paul Graham's well-known paper A Plan for Spam. Other researchers (e.g. Sahami, Heckerman, Dumais and Horvitz) have published research results since 1998.
There are many implementations of Bayesian filtering. Most of them are quite different and obviously each implementation tries to be the best one…
Although the basic idea is the same, implementations differ in how tokens are extracted from messages, how a score is assigned to the message, and how the filter handles this score.
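To make the three steps concrete (token extraction, score assignment, score handling), here is a minimal sketch in the spirit of Graham's A Plan for Spam. It is not j-chkmail's tokenizer or scoring code, which this page does not describe. The function names, the naive tokenizer, the clamping bounds and the sample counts are all assumptions chosen for illustration, and Graham's extra weighting of ham counts is left out.

```python
import math
import re


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase runs of letters, digits, $, ' and -."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def token_prob(token, spam_counts, ham_counts, n_spam, n_ham, unknown=0.5):
    """Estimate the probability that a message containing `token` is spam."""
    s = spam_counts.get(token, 0)
    h = ham_counts.get(token, 0)
    if s + h == 0:
        return unknown  # token never seen during training
    ps = min(1.0, s / n_spam)
    ph = min(1.0, h / n_ham)
    # Clamp so that a single token can never decide the result alone.
    return min(0.99, max(0.01, ps / (ps + ph)))


def score(text, spam_counts, ham_counts, n_spam, n_ham, nb_tokens=15):
    """Combine the nb_tokens most 'interesting' token probabilities into [0, 1]."""
    probs = [
        token_prob(t, spam_counts, ham_counts, n_spam, n_ham)
        for t in set(tokenize(text))
    ]
    # Most interesting = farthest from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:nb_tokens]
    # prod(p) / (prod(p) + prod(1 - p)), computed in log space for stability.
    log_spam = sum(math.log(p) for p in top)
    log_ham = sum(math.log(1.0 - p) for p in top)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


# Hypothetical training counts: number of messages in which each token was seen.
spam_counts = {"viagra": 40, "free": 30, "meeting": 1}
ham_counts = {"meeting": 50, "free": 5, "report": 40}

print(score("Free viagra", spam_counts, ham_counts, n_spam=100, n_ham=100))
print(score("Meeting report", spam_counts, ham_counts, n_spam=100, n_ham=100))
```

The first message should score close to 1 and the second close to 0. A message made only of unknown tokens stays at the neutral value 0.5.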
Currently, j-chkmail uses the score from the Bayesian filter to confirm or invalidate the score assigned by other filtering criteria. This is called “boosting”.
j-chkmail is designed with speed in mind, so message handling time remains almost the same when Bayesian filtering is enabled.
To let j-chkmail do bayesian filtering, the roadmap is :
To put this in place, you should contact the author.
Some time in the future, learning databases for other languages/countries may be available.
Usually, the only thing to do in the configuration files is to enable Bayesian filtering. The other parameters have default values and usually don't need any tuning.
/etc/mail/jchkmail/j-chkmail.cf
BAYESIAN_FILTER             YES
BAYES_MAX_MESSAGE_SIZE      200K
BAYES_MAX_PART_SIZE         30K
DB_BAYES                    j-bayes.db
BAYES_HAM_SPAM_RATIO        1000
BAYES_NB_TOKENS             64
BAYES_UNKNOWN_TOKEN_PROB    500
BAYESIAN_FILTER, BAYES_MAX_MESSAGE_SIZE and BAYES_MAX_PART_SIZE
You'll need to modify the /var/jchkmail/cdb/Makefile file to add the tokens database (j-bayes.db) to the objects to maintain.
/var/jchkmail/cdb/Makefile
OBJ = j-urlbl.db j-policy.db j-rcpt.db j-bayes.db
The next section explains how to create and maintain the tokens database.
In theory, a statistical filter classifies incoming messages based on what it knows about the usual messages (spams and hams) received by the final user. The filter should know what the recipient's mailbox looks like. This is what people call “learning” or “training the filter”.
So it must have access to a set of messages that is representative, both qualitatively and quantitatively, of the final user's mailbox.
In practice, for many reasons that are out of scope here, it's impossible to build a perfect set of messages, especially if the filter is to be applied to many recipients.
Fortunately, despite this apparent difficulty, it's possible to build a corpus of messages good enough to reduce the quantity of spam to an acceptable level.
There are many ways to create a spam corpus and to train a filter (Train-on-everything, Train-on-errors, Train-until-mature, Train-until-no-errors, …). But if you examine them closely, none of them really matches any theoretical statistical model of Bayesian filtering. Each one has its pros and cons.
This is how I manage the corpus of messages on our production server. It is one approach and may not be the best choice for your environment, but you can certainly begin this way. The ideas are :
To create my spam corpus, I use :
All these spams are classed by source and by month in different files (e.g. spamtrap-2006-09.sbox …).
The corpus of spam messages is updated daily to add new fresh messages and to remove messages older than 6 months.
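The six-month ageing rule can be applied mechanically when mailboxes are named by source and month. The sketch below assumes file names of the form source-YYYY-MM.sbox, as in the spamtrap-2006-09.sbox example above. The function names and the exact cut-off convention (strictly more than six months old) are illustrative assumptions, not part of j-chkmail.

```python
import re
from datetime import date

# Assumed naming scheme: <source>-<YYYY>-<MM>.sbox
MBOX_NAME = re.compile(r"^(?P<source>.+)-(?P<year>\d{4})-(?P<month>\d{2})\.sbox$")


def age_in_months(year, month, today):
    """Whole calendar months elapsed between year/month and today."""
    return (today.year - year) * 12 + (today.month - month)


def expired_mailboxes(filenames, today, max_age_months=6):
    """Return the mailboxes whose month is more than max_age_months old."""
    expired = []
    for name in filenames:
        m = MBOX_NAME.match(name)
        if m is None:
            continue  # not a dated spam mailbox: leave it alone
        if age_in_months(int(m["year"]), int(m["month"]), today) > max_age_months:
            expired.append(name)
    return expired


files = ["spamtrap-2006-09.sbox", "spamtrap-2006-02.sbox", "notes.txt"]
print(expired_mailboxes(files, today=date(2006, 10, 15)))
```

The function only lists candidates. Deleting or archiving them is left to the administrator, which makes it safer to run from a daily job.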
To create my ham corpus, I use :
The corpus of ham messages is updated every 2-3 months to add wrongly classified messages and to remove messages older than 3 years.
The ideas above may seem too empirical, but they really aren't. Filter results are more sensitive to the way the filter tokenizer works than to the quality of the corpus of messages. But this doesn't mean the corpus isn't important : it MUST roughly match the current flow of messages. It's up to you to identify the kinds of messages appearing in real traffic and to choose their approximate proportions in the corpus.
In some way, creating a good corpus of messages is an iterative process :
You don't need to repeat this whole iterative process each time you update the training database, but you should check it from time to time.
The simplest way to maintain the training database is to use the contents of the bayes-toolbox directory, which you'll find inside the j-chkmail distribution tree. This directory contains a Makefile with rules to create the tokens database from mailboxes, and two sample mailboxes (ham and spam).
You can install this directory just after installing j-chkmail. Do it with the following commands from the j-chkmail distribution root directory :
make install
make install-learn
When you've put all the mailboxes together, you can simply type make, and everything will be done.
The relevant features of each message/mailbox will be extracted to generate a .tok file. E.g. features from spamtrap-0609.sbox will be extracted into spamtrap-0609.tok. Features from the .tok files will be aggregated into the training database, whose name will be j-bayes.txt.
If you add or update a mailbox, typing make will recreate the training database and update only what's needed.
After the training database is created or updated, install it in the j-chkmail configuration directory.
The complete training sequence of commands is something similar to :
cd /var/jchkmail/bayes-toolbox
make
make install
cd /var/jchkmail/cdb
make
j-bayes-tbx is a command line tool used to perform most tasks related to the Bayesian filter, other than online filtering. The functions related to training the filter are called from the Makefile inside the bayes-toolbox directory, so you probably won't need to call them yourself.
Most of the time you'll use this tool to evaluate the quality of the learning database and the efficiency of the filter. You'll type something like :
$ j-bayes-tbx -c -x -p mailbox
# Checking mailbox Ham.2006
0 : 0.000 1426 ********************************************************************************
1 : 0.050 5 *****
2 : 0.100 20 ********************
3 : 0.150 23 ***********************
4 : 0.200 12 ************
5 : 0.250 6 ******
6 : 0.300 7 *******
7 : 0.350 5 *****
8 : 0.400 6 ******
9 : 0.450 9 *********
10 : 0.500 4 ****
11 : 0.550 1 *
12 : 0.600 2 **
13 : 0.650 0
14 : 0.700 3 ***
15 : 0.750 0
16 : 0.800 1 *
17 : 0.850 0
18 : 0.900 0
19 : 0.950 1 *
20 : 1.000 0
: 1531 Messages
$
/etc/mail/jchkmail/j-bayes.db). If this isn't the case or if you want to use a database installed elsewhere, use the option -b to specify a different location.
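A histogram like the one above can be reduced to a single figure, for instance the share of ham messages that land in the upper bins. The sketch below parses lines of the form "index : score count stars" as they appear in the sample output. It assumes each score label identifies its bin, which the tool output does not state explicitly, and the function names are illustrative.

```python
import re

# Matches lines such as "10 : 0.500 4 ****" (bin index, score label, count).
BIN_LINE = re.compile(r"^\s*(\d+)\s*:\s*([01]\.\d+)\s+(\d+)")


def parse_histogram(output):
    """Return a list of (score_label, message_count) pairs from the tool output."""
    bins = []
    for line in output.splitlines():
        m = BIN_LINE.match(line)
        if m:  # header, summary and prompt lines do not match and are skipped
            bins.append((float(m.group(2)), int(m.group(3))))
    return bins


def fraction_at_or_above(bins, threshold):
    """Share of messages whose bin label is at or above the threshold."""
    total = sum(count for _, count in bins)
    if total == 0:
        return 0.0
    return sum(count for label, count in bins if label >= threshold) / total


sample = """# Checking mailbox Ham.2006
0 : 0.000 1426 ****
10 : 0.500 4 ****
19 : 0.950 1 *
: 1431 Messages
"""
bins = parse_histogram(sample)
print(bins, fraction_at_or_above(bins, 0.5))
```

Applied to the full Ham.2006 output above, 12 of the 1531 ham messages fall in bins labelled 0.500 or higher, which is about 0.8%.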
???
When applied to a message, the Bayesian filter assigns a score to it. This score is a number in the interval [0,1]. Spam scores are near 1 and ham scores are near 0.
Currently, the Bayesian score is used to confirm or invalidate the score assigned by the other content filtering methods : pattern matching, heuristic filter and URL filtering. The rule is simple :
Let b be the score given by the Bayesian filter
Let c be the score given by the other j-chkmail content filters
b > 0.50 → c = MAX(c, 1)
b > 0.75 → the final score equals c multiplied by some coefficient (greater than 1).
b < 0.25 → the final score equals c divided by some coefficient (greater than 1).
Although the results of the current version of the Bayesian filter seem very good, some areas may be improved and some features may be added in the future. Visible things are :
rsync to get it.
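Returning to the rule that combines the Bayesian score b with the content score c, it can be sketched as a small function. The thresholds 0.50, 0.75 and 0.25 come from the rule as stated. The coefficient value (2.0 here), the function name, and the assumption that the conditions are applied in the order listed are illustrative only, since the actual coefficient is not given.

```python
def boosted_score(b, c, coef=2.0):
    """Combine Bayesian score b in [0, 1] with content score c.

    coef is a hypothetical value; the rule only requires it to be > 1.
    """
    if not 0.0 <= b <= 1.0:
        raise ValueError("b must be in [0, 1]")
    if coef <= 1.0:
        raise ValueError("coef must be greater than 1")
    if b > 0.50:
        c = max(c, 1)  # the Bayesian filter leans towards spam: floor c at 1
    if b > 0.75:
        return c * coef  # strong spam indication: amplify the content score
    if b < 0.25:
        return c / coef  # strong ham indication: attenuate the content score
    return c


for b, c in [(0.9, 0.0), (0.6, 0.0), (0.4, 3.0), (0.1, 4.0)]:
    print(b, c, boosted_score(b, c))
```

With this reading, a message that the other filters scored 0 still ends up with a non-zero score as soon as b exceeds 0.50, while a clearly ham-like message (b below 0.25) has its content score reduced but never increased.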