Heuristic filter ORACLE

Introduction

From “The Merriam-Webster Dictionary”

oracle : one held to give divinely inspired answers or revelations

j-chkmail's oracle is a set of tests for weak spam indicators. Some examples :

  • the message contains a text/html part, but no text/plain part
  • the message contains text in many colours
  • the subject is entirely in capital letters (HI CAPS)
  • the mailer is one usually found in spam
  • the envelope From is the NULL sender (<>), but the header sender isn't postmaster, MAILER-DAEMON, …
  • there are two Subject headers
  • there are HTML tags mostly found in spam
  • the SMTP client is listed in RBLs

Heuristics may include, but are not limited to, searching for regular expressions in some parts of a message.
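
To make a few of these indicators concrete, here is a minimal Python sketch. It is not j-chkmail's own code: the function name weak_indicators and the labels it returns are hypothetical, and it relies only on Python's standard email package.

```python
import email
from typing import List

def weak_indicators(raw_message: str) -> List[str]:
    """Return labels for a few weak spam indicators found in raw_message."""
    msg = email.message_from_string(raw_message)
    found = []

    # Collect the content types of all non-multipart MIME parts.
    types = {part.get_content_type()
             for part in msg.walk() if not part.is_multipart()}
    if "text/html" in types and "text/plain" not in types:
        found.append("html-without-plain")

    # A well-formed message carries at most one Subject header.
    subjects = msg.get_all("Subject") or []
    if len(subjects) >= 2:
        found.append("two-subject-headers")

    # str.isupper() is true when the text has letters and none is lowercase.
    if any(s.isupper() for s in subjects):
        found.append("subject-all-caps")
    return found
```

Each label is only a hint. None of them is meant to decide on its own whether a message is spam.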

The main goal of these heuristics isn't to detect spam on their own, since they are only weak spam indicators. The heuristic filter isn't a main filtering method, but it can help to confirm the results of the two main ones : the Bayes filter and URL filtering.

The number of tests is small : fewer than 40 at present. Only really relevant checks are integrated into the oracle.

RBL checks can be done here to provide a soft alternative : messages are checked against blacklists without being rejected. In the author's opinion, though, this is a bad idea. If an RBL is reliable enough, it should be used to reject the connection; if it isn't, it shouldn't be used at all. The real reason, however, is filter performance. On ordinary hardware, j-chkmail spends something like 20-50 ms checking a message, while RBL (DNS) queries usually take more than 1 second. So using RBLs increases message handling time and, as a result, the mean number of threads in the filter. This may become a performance bottleneck.
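
The effect on the thread count can be estimated with Little's law (mean number of busy threads = arrival rate × mean handling time). The following is a back-of-the-envelope sketch only: the arrival rate of 10 messages per second is an assumed figure, and 35 ms is simply the midpoint of the 20-50 ms range quoted above.

```python
def mean_busy_threads(arrival_rate_per_s: float, handling_time_s: float) -> float:
    """Little's law: mean number of messages being handled at the same time."""
    return arrival_rate_per_s * handling_time_s

RATE = 10.0        # messages per second (assumed, for illustration only)
BASE_TIME = 0.035  # seconds per message without RBL queries (midpoint of 20-50 ms)
RBL_TIME = 1.0     # extra seconds spent on DNS queries (lower bound from the text)

without_rbl = mean_busy_threads(RATE, BASE_TIME)
with_rbl = mean_busy_threads(RATE, BASE_TIME + RBL_TIME)
print(f"without RBL: {without_rbl:.2f} threads, with RBL: {with_rbl:.2f} threads")
```

Under these assumptions the mean goes from well under one busy thread to about ten.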

The ORACLE has 5 check categories :

  1. CONN - Checks in this category are related to the SMTP connection/session
  2. RBL - Checks the SMTP client against Real Time Black Lists
  3. MSGS - Checks done on the message as a whole : headers, …
  4. HTML - Checks the text/html MIME part
  5. PLAIN - Checks the text/plain MIME part

Configuration - Beginner users

Just enable it !

j-chkmail.cf

# SPAM_ORACLE
#     Do heuristic filtering
#  Syntax : -----
#     VALUES :  NO  YES 
SPAM_ORACLE                        YES

If you want to use RBLs with the Oracle, take a look at the “Expert users” section.

Configuration - Expert users

j-chkmail's oracle uses two configuration files :

  • /etc/mail/jchkmail/j-tables - this file is used to enable/disable each Oracle test and assign odds to them.
  • /etc/mail/jchkmail/j-oradata - this file is used to define unwanted things and to assign odds to them. Unwanted things may be one of :
    • HTML-TAGS
    • BAD-EXPR
    • CHARSET
    • BOUNDARY
    • MAILER

The names of these files will probably change in the future, and both files will be merged into a single XML-like file.

To change the names of these files, edit the j-chkmail.cf file :

j-chkmail.cf

# ORACLE_DATA_FILE
#     Some oracle definitions
#  Syntax : -----
ORACLE_DATA_FILE                   j-oradata

# ORACLE_SCORES_FILE
#     Oracle scores
#  Syntax : -----
ORACLE_SCORES_FILE                 j-tables

How to configure RBLs

Declare the RBLs you want to use - no more than 16. The more RBLs you declare, the slower the filter will be !

j-chkmail.cf

# RBL
#     Real-Time Blacklists (used at Oracle)
#  Syntax : RBL[/CODE] - rbl.domain.com/127.0.0.1
RBL   rbl.domain.com/127.0.0.1
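
As background, a DNS blacklist is conventionally queried by reversing the octets of the client's IPv4 address and appending the blacklist zone. The Python sketch below shows only how such a query name is built and how a zone/code value could be split. It performs no DNS lookup, the function names are hypothetical, and reading the CODE part as the answer address that counts as a hit is an assumption, not something this page states.

```python
import ipaddress
from typing import Optional, Tuple

def rbl_query_name(client_ip: str, zone: str) -> str:
    """Build the DNS name looked up for an IPv4 client in a blacklist zone."""
    # IPv4Address() validates the input and raises ValueError on bad addresses.
    octets = str(ipaddress.IPv4Address(client_ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone

def parse_rbl_setting(value: str) -> Tuple[str, Optional[str]]:
    """Split 'zone/code' as written on the RBL line; the code part is optional."""
    zone, _, code = value.partition("/")
    return zone, (code or None)

zone, code = parse_rbl_setting("rbl.domain.com/127.0.0.1")
print(rbl_query_name("192.0.2.1", zone))  # 192.0.2.1 is a documentation address
```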

Enable these RBLs in the j-oradata configuration file and assign their odds.

j-oradata

R00   DISABLE  odds=5.000        Realtime Blacklist
R01   DISABLE  odds=4.000        Realtime Blacklist
R02   DISABLE  odds=10.000       Realtime Blacklist
...

How to change original Oracle checks

If you want to enable/disable tests or change their odds, edit the j-tables configuration file :

j-tables

C05   DISABLE      odds=1.000      SMTP client sending mail to spamtrap
C06   DISABLE      odds=1.000      Bad EHLO parameter
C07   DISABLE      odds=1.000      Myself EHLO parameter - forged
M01   ENABLE       odds=1.000      No HTML nor TEXT parts

If you want to modify the list of unwanted things used by some Oracle checks ( CHARSET | BAD-EXPR | BOUNDARY | MAILER | HTML-TAGS ), edit the j-oradata file :

j-oradata

HTML-TAGS  odds=1.66    <script[^<>]*>
HTML-TAGS  odds=1.40   <script[^<>]+src=[^<>]+>
HTML-TAGS  odds=1.45   <span[^<>]*>

BAD-EXPR   odds=20.88  http[s]?://[^ /#]*#[0-9a-f]
BAD-EXPR   odds=1.00   http[s]?://[^ /&]*&#[0-9]{1,3}
BAD-EXPR   odds=1.03   http[s]?://[^ /@>\\n]*@
BAD-EXPR   odds=6.92   http[s]?://[^ /]*[0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}[.][0-9]{1,3}
BAD-EXPR   odds=3.91   http[s]?://[^>\n\r *]+\\*http[s]?://

CHARSET    odds=13.00    ^big5$
CHARSET    odds=9.00     ^euc-kr$
CHARSET    odds=4519.00  ^gb2312$

Odds ??? What are odds ???

From Wikipedia :

In probability theory and statistics the odds in favour of an event or a proposition are the quantity p / (1 − p), where p is the probability of the event or proposition. In other words, an event with odds of m to n in favour has probability m/(m + n). For example, if you chose a random day of the week, then the odds that you would choose a Sunday would be 1/6 (1 to 6), not 1/7. These 'odds' are actually relative probabilities.

  • Example 1 : if you have 100 messages and the word viagra appears in 75 of them, the odds of viagra are 75/25, that is, 3.
  • Example 2 : Odds, as used in j-chkmail configuration files, are a ratio of conditional probabilities. Suppose you have 200 hams and 100 spams, and the word viagra appears in 90 spams and in 4 hams. The conditional odds are then : (90/100) / (4/200) → 45.

OBS :

  • If the odds value is 1, that means that the event is neutral !!! I'm sure you've noticed this very interesting and important property of odds.
  • If the odds value is < 1, that means that the event is more frequent in hams than in spams
  • If the odds value is > 1, that means that the event is more frequent in spams than in hams
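
The two examples can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, using exact fractions to avoid rounding. The function names are hypothetical, and nothing here describes how j-chkmail combines the odds of several tests.

```python
from fractions import Fraction

def odds_from_probability(p: Fraction) -> Fraction:
    """Odds in favour of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

def conditional_odds(spam_hits: int, spam_total: int,
                     ham_hits: int, ham_total: int) -> Fraction:
    """Frequency of the event in spam divided by its frequency in ham."""
    return Fraction(spam_hits, spam_total) / Fraction(ham_hits, ham_total)

print(odds_from_probability(Fraction(75, 100)))  # Example 1 → 3
print(conditional_odds(90, 100, 4, 200))         # Example 2 → 45
```

With conditional_odds, a result of 1 means the event is equally frequent in spam and ham, which is the neutral case described above.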

Debugging

What's triggering the Oracle

/var/log/j-chkmail shows the tests that were triggered while checking a mail. That's useful if something gets rejected : you will find the reason here.

/var/log/j-chkmail

Mar  4 17:08:46 mx0 j-chkmail[7771]: [ID 000000 local5.info] 47CD740E.001 ORACLE - M02 text/html without text/plain (   0.2)
Mar  4 17:08:46 mx0 j-chkmail[7771]: [ID 000000 local5.info] 47CD740E.001 ORACLE - M13 RFC2822 headers compliance (   1.0)
Mar  4 17:08:46 mx0 j-chkmail[7771]: [ID 000000 local5.info] 47CD740E.001 ORACLE - H06 HTML tag/text ratio (   0.5)
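
To collect these entries from a large log, a small script can pull out the message id, the check code, the description and the number in parentheses. The Python sketch below is based only on the three sample lines above. The names are hypothetical, and the exact meaning of the number in parentheses isn't stated on this page, so it is simply called value.

```python
import re
from typing import NamedTuple, Optional

class OracleHit(NamedTuple):
    message_id: str
    code: str
    description: str
    value: float

# Layout inferred from the samples: <id> ORACLE - <code> <description> ( <number>)
_HIT = re.compile(r"(\S+)\s+ORACLE - (\S+)\s+(.*?)\s*\(\s*([0-9.]+)\s*\)\s*$")

def parse_oracle_line(line: str) -> Optional[OracleHit]:
    """Parse one ORACLE log line; return None for any other line."""
    m = _HIT.search(line)
    if m is None:
        return None
    message_id, code, description, value = m.groups()
    return OracleHit(message_id, code, description, float(value))

sample = ("Mar  4 17:08:46 mx0 j-chkmail[7771]: [ID 000000 local5.info] "
          "47CD740E.001 ORACLE - M02 text/html without text/plain (   0.2)")
print(parse_oracle_line(sample))
```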

How to see how j-chkmail is interpreting the Oracle configuration tables

Terminal

$ j-chkmail -t oradata
$ j-chkmail -t oracle-checks
doc/spam/heuristic_filter.txt · Last modified: 2008/03/07 20:51 by martins