SpamAssassin 3.0.4 URI Fixes

Bug #1

Here at RBL.JP we noticed some Japanese spam passing through our systems undetected even though it contained URLs in the body that are registered in url.rbl.jp. After a bit of investigation we discovered that a URL coming directly after text in the JIS character set was not detected, and thus SpamAssassin never checked whether it was registered in url.rbl.jp.

More precisely, if a URL beginning with http:// comes straight after the ASCII escape sequence ( <ESC> ( B ), which terminates a string of Japanese characters in JIS encoding, then that URL is not detected. The regular expressions SpamAssassin uses to extract URIs (not just URLs) are rather tricky, so we came up with a little hack which seems to do the trick. We think the problem in this case is that the regex patterns only find a URL if the http:// begins on a word boundary, and there is no word boundary between the final B of the escape sequence and the h of http:// because both are word characters.
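The word-boundary explanation can be checked in isolation with a small script. This is only a sketch: the pattern $uri_re is a simplified stand-in chosen for illustration and is not the pattern SpamAssassin uses, and www.example.com and the helper find_uri are placeholders.

```perl
use strict;
use warnings;

# Simplified stand-in for a URI pattern anchored on a word boundary.
# This is NOT SpamAssassin's pattern; it is an assumption for illustration.
my $uri_re = qr{\b(https?://[\w./-]+)};

# ESC ( B switches ISO-2022-JP text back to ASCII.
my $esc_ascii = "\x1b(B";

# "B" and "h" are both word characters, so \b does not match between them.
my $glued = $esc_ascii . "http://www.example.com/";
# With a space after the escape sequence, \b matches just before "h".
my $spaced = $esc_ascii . " http://www.example.com/";

# Hypothetical helper: return the first URI found, or undef.
sub find_uri {
    my ($text) = @_;
    return $text =~ $uri_re ? $1 : undef;
}

for my $line ($glued, $spaced) {
    my $uri = find_uri($line);
    print defined $uri ? "found: $uri\n" : "not found\n";
}
```

With this stand-in pattern the first line is missed and the second is found, which matches the behaviour we observed in SpamAssassin.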

Although we found this problem with JIS encoding it might exist with other language character sets as well. Other Japanese encodings (namely EUC-JP, Shift JIS and Unicode) don't seem to be affected.

The hack is just one line of Perl, which can be applied to the following file:
/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/PerMsgStatus.pm
We are using Perl version 5.8.6, so make sure to change the above path to match whatever version you're using. With our installation (Perl 5.8.6) you can go to line 1816 of PerMsgStatus.pm. Then change the following lines (which are found inside the get_uri_list function):
 for (@$textary) {
    # NOTE: do not modify $_ in this loop
to
 for (@$textary) {
  my $jis = "(\x1b[\x28]B)"; if (/$jis/) { s/$jis/$1 /g; }
  # NOTE: do not modify $_ in this loop
What this one-line hack does is preprocess each line of the mail body before it is searched for URIs: a space is appended to every occurrence of the ASCII escape sequence. This creates a word boundary between the JIS-encoded text and the URI itself, which in our testing fixed the problem of these URLs not being picked up.

As a side effect of this hack, the substitution changes the line in place (despite the NOTE comment, $_ is modified), so if any line that had a space appended appears in the "Content preview" part of the SpamAssassin-modified e-mail, it will be displayed there with the extra space.
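The side effect comes from the way Perl aliases $_ to each array element inside a for loop. The sketch below shows the difference between substituting on $_ directly, as the hack does, and substituting on a copy. The array @body and the sample line are hypothetical stand-ins for the mail body. The copy variant is untested against SpamAssassin itself, and whether it fits into get_uri_list depends on how the rest of that loop uses $_.

```perl
use strict;
use warnings;

# Same pattern as in the hack: ESC ( B, the JIS "back to ASCII" escape.
my $jis = "(\x1b[\x28]B)";

# Hypothetical stand-in for the body text array.
my @body = ("abc\x1b(Bhttp://www.example.com/");

# Variant 1: substitute on $_ directly, as the hack does.
# $_ is an alias for each array element, so @in_place itself changes.
my @in_place = @body;
for (@in_place) {
    s/$jis/$1 /g;
}

# Variant 2: substitute on a copy and leave the array untouched.
my @untouched = @body;
my @scanned;
for (@untouched) {
    (my $line = $_) =~ s/$jis/$1 /g;
    push @scanned, $line;
}

print "in place changed: ", ($in_place[0] ne $body[0] ? "yes" : "no"), "\n";
print "copy left intact: ", ($untouched[0] eq $body[0] ? "yes" : "no"), "\n";
```

In the second variant only $line carries the extra space, so the URI search would have to run on $line rather than on $_.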

Bug #2

URIs whose scheme (or leading www. or ftp. prefix) contains any upper-case letters are not detected. For example, HTTP://j-sine.com, FTP://j-sine.com and any other variants such as HtTp:// or fTp:// are missed.

In the same file (/usr/local/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin/PerMsgStatus.pm) apply the following changes (note that the line numbers could be different for different versions):

line 1738:
my $schemeRE = qr/(?:https?|ftp|mailto|javascript|file)/;
to
my $schemeRE = qr/(?:https?|ftp|mailto|javascript|file)/i;
and line 1743:
my $schemelessRE = qr/(?<![.=])(?:www\.|ftp\.)/;
to
my $schemelessRE = qr/(?<![.=])(?:www\.|ftp\.)/i;
As you can see, the only difference is the letter i added to the end. This tells the regular expression engine to make its pattern matching case-insensitive.
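The effect of the added i can be seen by running the scheme pattern from before and after the change against a few of the example URIs. This is a minimal sketch: the helper has_scheme and the variable names are made up for the demonstration, and it only tests the scheme pattern on its own, not the full URI pattern it is part of.

```perl
use strict;
use warnings;

# The scheme pattern before and after the change described above.
my $scheme_cs = qr/(?:https?|ftp|mailto|javascript|file)/;
my $scheme_ci = qr/(?:https?|ftp|mailto|javascript|file)/i;

# Hypothetical helper: does the text start with a known scheme followed by ":"?
# The /i flag of a qr// pattern is kept when it is interpolated here.
sub has_scheme {
    my ($re, $text) = @_;
    return $text =~ /^$re:/ ? 1 : 0;
}

for my $uri ("http://j-sine.com", "HTTP://j-sine.com", "fTp://j-sine.com") {
    printf "%-20s case-sensitive=%d case-insensitive=%d\n",
        $uri, has_scheme($scheme_cs, $uri), has_scheme($scheme_ci, $uri);
}
```

Only the lower-case URI matches the original pattern, while all three match the pattern with the i flag.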

Restart spamd after you've applied these fixes.

Testing

Bug #1

We'll use a spammer's domain which was captured a while ago.
あいうえおhttp://www.j-sine.com/
Copy the above line (make sure you can see the Japanese characters before the http://) into an e-mail, set the encoding to ISO-2022-JP (JIS) and send it to yourself. If you do a before-and-after test, you should see something similar to the following only in the mail sent after the fix. It indicates that the URL http://www.j-sine.com/ was in fact detected and that the j-sine.com domain is registered in more than one blacklist:
 2.5 URIBL_SBL              Contains an URL listed in the SBL blocklist
                            [URIs: j-sine.com]
 1.5 URIBL_JP_SURBL         Contains an URL listed in the JP SURBL blocklist
                            [URIs: j-sine.com]
 2.0 URIBL_OB_SURBL         Contains an URL listed in the OB SURBL blocklist
                            [URIs: j-sine.com]
 4.0 URLBL_RBLJP            Has URI in url.rbl.jp
                            [URIs: j-sine.com]
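If your mail client makes it hard to tell which encoding was actually used, the test line can also be encoded by hand to confirm that the escape sequence ends up directly in front of the URL. The following is a sketch using Perl's core Encode module. It assumes the script itself is saved as UTF-8, and the variable names are arbitrary.

```perl
use strict;
use warnings;
use utf8;                 # the source below contains literal Japanese text
use Encode qw(encode);

# The test line from above, as a Perl character string.
my $text = "あいうえおhttp://www.j-sine.com/";

# Encode to ISO-2022-JP (JIS), the encoding the test mail should use.
my $bytes = encode("iso-2022-jp", $text);

# The escape sequence ESC ( B should sit directly before "http://".
my $has_escape_before_url = index($bytes, "\x1b(Bhttp://") >= 0 ? 1 : 0;

# Show escape bytes in a readable form instead of printing them raw.
(my $visible = $bytes) =~ s/\x1b/<ESC>/g;
print "$visible\n";
print "escape directly before URL: ", ($has_escape_before_url ? "yes" : "no"), "\n";
```

The resulting bytes can be used as the body of a test message, provided the Content-Type header declares charset=ISO-2022-JP.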

Bug #2

Do the same as above, but copy and paste the following into your mail (put each line in a separate mail):
HTTP://WWW.AKMOTIVATIONALSPEAKER.COM 
WWW.AKMOTIVATIONALSPEAKER.COM 
FTP://WWW.AKMOTIVATIONALSPEAKER.COM 
and you should get something like the following in your mail once SpamAssassin has processed it:
 1.5 URIBL_JP_SURBL         Contains an URL listed in the JP SURBL blocklist
                            [URIs: akmotivationalspeaker.com]
 4.0 URLBL_RBLJP            Has URI in url.rbl.jp
                            [URIs: akmotivationalspeaker.com]