february 07, 2005
carefully orchestrated visceral reactions
Phillip Karlsson's random thoughts, musings, and mindless pabulum.
Adventures in Spam

I've commented on Goats about our move off the crappy Interland machines, over to VoxRox. Last week and over the weekend I finally completed that move by getting our mail server onto the new machine. This is a sort of summary of what I did and my goals in doing it. None of it is particularly original, but I haven't seen anyone lay down a spam-mitigation-summary anywhere, so this is my reference for any future work I do, or problems I have.

The thing about setting up a mail server in this day and age is that it isn't as easy as just setting up a single application and letting it run, at least not in the free (as in beer) software world. You need to install the actual mail server, but you also need to get a web interface up (because people demand that these days), which means you need to get IMAP working in addition to the older POP3. Most of all, you absolutely need some sort of spam solution running on the server.

Prior to the move, I was downloading to my machine well over 1500 spams every day. My goal in the past had been to try and limit the amount of spam that would reach me, but this meant that all those messages had to be processed through that machine, by both the mail server and SpamAssassin in order to be downloaded by me and then deleted. This seemed like a massive waste of resources, which was one of the motivating factors to rework our systems.

In re-thinking this for the new machine, I tried to approach it the way I would network security, and with an eye toward conserving resources. So:

  1. Make sure that there are multiple layers of defenses against spam.
  2. Try to layer those defenses so that the first ones trim the most fat while using the least resources, so that the later ones can use more resources to look at what's left.
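
As a rough illustration of those two rules (a sketch only, not how qmail wires things together), the layers can be thought of as an ordered list of checks that stops at the first rejection. Every name here, including `run_layers`, the two stand-in checks, and the example addresses, is hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# A check looks at a message and returns a rejection reason,
# or None to pass the message on to the next layer.
Check = Callable[[dict], Optional[str]]


def run_layers(message: dict, layers: List[Tuple[str, Check]]) -> Tuple[str, Optional[str]]:
    """Run checks in order and stop at the first one that rejects.

    Layers are listed cheapest first, so the expensive checks only
    ever see the mail that the cheap ones let through.
    """
    for name, check in layers:
        reason = check(message)
        if reason is not None:
            return name, reason
    return "accepted", None


# Hypothetical stand-ins for the real layers.
def cheap_recipient_check(message: dict) -> Optional[str]:
    dead = {"old-character@example.com"}
    return "dead recipient" if message["to"] in dead else None


def expensive_content_check(message: dict) -> Optional[str]:
    return "looks like spam" if "viagra" in message["body"].lower() else None


LAYERS = [
    ("recipient", cheap_recipient_check),   # cheap: one set lookup
    ("content", expensive_content_check),   # costly: reads the whole body
]
```

A message to a dead address never reaches the content check at all, which is the whole point of ordering the layers by cost.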

Our server uses a qmail/vpopmail/courier-imap/squirrelmail/SpamAssassin set-up. I followed many of the steps at qmailrocks.org to set this up, but skipped his bit on setting up SpamAssassin and clamav. I had followed his guide exactly, with qmail-scanner, etc., on my last mail server installation, and had found that letting qmail-scanner do all my anti-[virus|spam] work just used too many resources and ground the server to a halt. It processes every incoming message, and 99% of what hits our servers is going to accounts that don't even exist: a fun side effect of having had a domain for upwards of 8 years.

So, within this set-up, the first step in stopping spam is to just not let it onto the machine. I used two methods for this, rblsmtpd and a "badrcptto" patch written by the author of (and found via) O'Reilly's qmail book. (The badrcptto patch didn't integrate easily with the qmailrocks mega-patch, so I had to manually integrate parts of it in afterwards.)

rblsmtpd checks each incoming connection against "real-time" block lists of known spam servers. It queries pseudo name-daemons as to whether the connection is coming from a known-bad host, and then denies the connection if it's from a suspected bad guy. The two lists I'm using are sbl.spamhaus.org and cbl.abuseat.org. I've always been a bit wary of these block lists, because it means that I never see the message, and if a "good guy" gets mistakenly classified as bad, it can be difficult to get de-listed. In the meantime, I wouldn't be getting any mail from them. While this is worrisome, I decided that I'll see how it holds up, and that in general, this was a far lesser evil than the unmanageable amount of spam I was getting. If I were running an ISP, I might not be able to make that decision, but for a small business, it just makes a lot of sense. Additionally, for the cost of a simple DNS lookup, I'm relieving SpamAssassin of the need to run all its tests on messages destined for the junk heap. Another factor in deciding to install this was reading Cory Doctorow's essay "What to Do About Spam?". He compares running SpamAssassin with something like Razor (more on that later) to a "suggestion mechanism" for spam checking. I decided to consider rblsmtpd as "suggesting" that if you run an open relay, I don't necessarily want to share email with you.
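
For reference, the lookup behind a DNS block list is simple: reverse the octets of the connecting IPv4 address, append the list's zone, and ask DNS for that name. A minimal Python sketch of the idea follows. This is not rblsmtpd's own code, the function names are made up for illustration, and `is_listed` needs network access to do anything useful.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up: reversed IPv4 octets plus the list's zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the block list answers with an address for this host.

    Any lookup failure (including "no such name") is treated as
    "not listed", so a resolver outage lets mail through rather
    than blocking everything.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        return False
```

So a connection from 192.0.2.99 checked against sbl.spamhaus.org becomes a lookup of `99.2.0.192.sbl.spamhaus.org`. Treating failures as "not listed" is a design choice in this sketch; it trades a little extra spam during an outage for not losing legitimate mail.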

badrcptto also blocks messages, but it does so for anything addressed to a recipient on a list I maintain. Since massive amounts of our spam are going to the now-defunct email addresses we had set up for each of the characters about 5 years (or more) ago, this was a simple way of stopping them from getting through.
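
The logic amounts to a set lookup at the point where the sending server names the recipient. Here is a minimal Python sketch of that idea. It is not the patch itself (which is C inside qmail-smtpd), and the one-address-per-line list format, the function names, and the reply strings are all assumptions made for illustration.

```python
def load_bad_recipients(lines):
    """Parse a one-address-per-line list, skipping blanks and '#' comments."""
    bad = set()
    for line in lines:
        entry = line.strip().lower()
        if entry and not entry.startswith("#"):
            bad.add(entry)
    return bad


def rcpt_response(address, bad_recipients):
    """Decide the SMTP reply to a RCPT TO command.

    Rejecting here means the message body is never transferred,
    so nothing downstream has to scan it.
    """
    if address.strip().lower() in bad_recipients:
        return "550 no such mailbox"
    return "250 ok"
```

Because the rejection happens before the DATA phase, it costs almost nothing compared with accepting the message and filtering it later.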

My second layer of defense is SpamAssassin. I tried to be more thorough in setting it up than I had been in the past. Specifically, I decided to make sure that I was using one of the "Hash Sharing Systems", and including the network tests. SpamAssassin offers three different HSSs: Vipul's Razor, Pyzor, and the Distributed Checksum Clearinghouse (or DCC). These work by calculating a (highly probabilistically) unique numerical identifier (hash) for a message and then checking to see if it's in a central database of known good or known spam identifiers. The main variations among these products are which programming language they're written in, whether the database stores both ham (known good) and spam message hashes, and how they decide which parts of a message to include in calculating the hash. While I could really increase the score for any given spam by using all three, I decided to use just one for now, to conserve resources in the case of "lots of spam" (I would want to limit the number of network connections per message). For now, I'm using DCC. My two main reasons are that it checks for both spam and ham, which should increase its "correctness", and that it's written in C (as compared to Perl for Razor and Python for Pyzor), so it should be nominally more resource efficient.
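
To make the hash idea concrete, here is a toy Python sketch. It is not how Razor, Pyzor, or DCC compute their checksums: the real systems use their own fuzzy algorithms and query a remote server, whereas this uses a plain SHA-256 over a lightly normalized body, and a local set stands in for the shared database. The function names are hypothetical.

```python
import hashlib
import re


def message_fingerprint(body: str) -> str:
    """Hash a message body after collapsing whitespace and case.

    The normalization is a crude stand-in for the fuzzy matching
    real systems do so that trivial edits don't change the hash.
    """
    normalized = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_known_spam(body: str, known_spam_hashes: set) -> bool:
    """Check the fingerprint against a (here: local) set of reported hashes."""
    return message_fingerprint(body) in known_spam_hashes
```

With this normalization, "Buy now!" followed by blank lines and "Limited   offer" produces the same fingerprint as "buy NOW! limited offer", while any change to the actual wording produces a different one. That gap is exactly why the real products differ in which parts of a message they feed into the hash.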

The other network checks that SpamAssassin uses mirror rblsmtpd somewhat. However, in addition to just checking the relaying mail host, they also check the URLs in the message, to see if they point to known spammer sites, and up the score if they do. I'm not convinced that these tests are really worth it, but I'll watch how they affect the scoring, and disable them if they seem unnecessary or if I need to speed things up.
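
A rough Python sketch of the URL side of this: pull the host names out of any links in the body and add points for each one found on a list of bad hosts. SpamAssassin does this with DNS queries against published lists; the local set, the regular expression, the function names, and the 2.0 points per hit below are all made up for illustration.

```python
import re
from urllib.parse import urlparse

# Deliberately simple: matches http/https links up to the next
# whitespace, quote, or angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def hosts_in_body(body: str) -> set:
    """Return the set of lower-cased host names linked from the body."""
    hosts = set()
    for url in URL_PATTERN.findall(body):
        host = urlparse(url).hostname
        if host:
            hosts.add(host.lower())
    return hosts


def url_score(body: str, bad_hosts: set, points_per_hit: float = 2.0) -> float:
    """Add a fixed number of points for each linked host on the bad list."""
    hits = hosts_in_body(body) & bad_hosts
    return points_per_hit * len(hits)
```

Unlike the connection-time checks, this one needs the full message body, which is why it belongs in the later, more expensive layer.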

My last layer of defense is my mail client (Eudora). It has its own Junk scanner, which used to catch about half the stuff that SpamAssassin missed. It also has a tendency to pick up some ham just often enough that I need to watch what it catches, but not often enough that it's worth tweaking its scoring.

So, what's the verdict?

So far, so good. It seems that either rblsmtpd, badrcptto, or both are stopping enough spam that SpamAssassin is using a very manageable amount of resources. The spam I'm downloading overnight dropped from ~600-1000 messages to ~60. That's a quantity I can easily go through. Of those, SpamAssassin missed about 5, or just under 10%, instead of closer to 50% as was happening before. I could probably improve this even more by adding a couple more of the block list servers to my configuration, especially the lists that include suspected trojan-ized machines, but I'll wait a little bit to see if that becomes necessary. I suspect that by blocking the spam at connection time, I've cut the load enough that I could reinstate the qmail-scanner/clamav system into my mail set-up, but again I'm going to wait and see how this holds up for a month before I do that.

I'm a programmer, not a sysadmin. I really hate spending my time dealing with this stuff. But the downside of being "the" tech person for a business is that you have to do it. I'm hoping that by incorporating other dynamic systems (DNS block lists, DCC, and Bayesian filtering (not discussed here)) into mine, I can keep my system relatively static and not have to futz with keeping up with the latest rules or whatever. In reality, I know the spammers are constantly figuring out new ways to waste my time, but hopefully I can fend them off long enough to be productive for just a little while.
