SpamAssassin and Bayes

I’ve been messing around with SpamAssassin a lot lately and the topic of database optimization came up. I’m using Bayesian filtering to improve the spam scores and, to increase speed and manageability, I have SpamAssassin set to use MySQL as the database engine. Bayes is fairly resource intensive on both I/O and CPU depending on the current action being performed. Since I decided to use MySQL as the storage engine, most of the I/O is handled there.
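For reference, the relevant bits of my SpamAssassin configuration look something like the following. The directives are the standard ones for pointing Bayes at SQL storage; the database name and credentials here are just placeholders, not my real ones:

    # local.cf -- point Bayes at MySQL instead of the default DB files
    bayes_store_module      Mail::SpamAssassin::BayesStore::SQL
    bayes_sql_dsn           DBI:mysql:spamassassin:localhost
    bayes_sql_username      sa_user
    bayes_sql_password      secret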

I started looking into Bayes performance recently and noticed a few problems that I’ve been trying to work out. The biggest is on the MySQL side: the Bayes database has grown enormous, and queries against it are slow. So my initial thought was to look into reducing the size of the database.

There are a few different tables used by Bayes. The table that grows the largest is bayes_token. That’s where all of the core statistical data is stored, and it just takes up a lot of room. There’s not a lot that can be done about it. Or so I thought. Apparently, if you have SpamAssassin set up to train Bayes automatically, it doesn’t always train the mail for the correct user. For instance, if you receive mail that is BCCed to you, the mail could be learned for the user listed in the To: field. This means the Bayes database can contain a ton of “junk” that will never be used. So my first order of business is to trim out the entries for users that don’t actually exist on the system.
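The cleanup I have in mind looks roughly like this, assuming the stock SpamAssassin SQL schema where bayes_vars maps each username to an id and bayes_token hangs off that id. The usernames below are obviously placeholders for the real local users:

    -- remove tokens learned for users that don't exist on this system
    -- ('alice' and 'bob' stand in for the real list of local users)
    DELETE bayes_token
      FROM bayes_token
      JOIN bayes_vars ON bayes_vars.id = bayes_token.id
     WHERE bayes_vars.username NOT IN ('alice', 'bob');

    -- then drop the orphaned per-user rows themselves
    DELETE FROM bayes_vars
     WHERE username NOT IN ('alice', 'bob');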

The bayes_seen table is used to track the message IDs of messages that have already been parsed and learned by Bayes. It’s a useful table for preventing unnecessary CPU utilization, but there is no automatic trimming function, so it grows indefinitely. The awl table is similar in that it also grows indefinitely and has no autotrim mechanism. For both of these tables I’ve added a timestamp field to track additions and updates. With that in place, I can write some simple Perl code to automatically trim entries that are old enough to be irrelevant. For bayes_seen I plan on using a default lifetime of 1 month. For the awl I’m looking at dropping any entries with a single hit over 3 months old, and any entries over 1 month old with fewer than 5 hits. Since MySQL automatically updates the timestamp field whenever a row changes, this should be enough to keep relevant entries from being deleted.
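Roughly, the schema change and trim queries I have in mind look like this. The last_hit column name is my own invention, and I’m assuming the awl schema’s count column for the hit count; MySQL’s auto-updating TIMESTAMP behavior keeps the column current on every write to the row:

    -- add an auto-updating timestamp to both tables
    ALTER TABLE bayes_seen
      ADD COLUMN last_hit TIMESTAMP NOT NULL
      DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;
    ALTER TABLE awl
      ADD COLUMN last_hit TIMESTAMP NOT NULL
      DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;

    -- bayes_seen: anything not touched in a month can go
    DELETE FROM bayes_seen
     WHERE last_hit < NOW() - INTERVAL 1 MONTH;

    -- awl: single-hit entries older than 3 months, plus low-hit
    -- entries that haven't been touched in a month
    DELETE FROM awl
     WHERE (count = 1 AND last_hit < NOW() - INTERVAL 3 MONTH)
        OR (count < 5 AND last_hit < NOW() - INTERVAL 1 MONTH);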

While researching all of this I was directed to a site about MySQL optimization. The MySQL Performance Blog is run by Peter Zaitsev and Vadim Tkachenko, both former MySQL employees. The entry I was directed to dealt with general MySQL optimization and is a great starting point for anyone using MySQL. I hate to admit it, but I was completely unaware that this much performance could be coaxed out of MySQL with these simple settings. While I was aware that tuning was possible, I just never dealt with a large enough database to warrant it.

I discovered, through the above blog and further research, that the default settings in MySQL are extremely conservative. Out of the box, most of the memory allocation variables are set to a mere 8 MB. I guess the general idea is to ship with settings that are almost guaranteed to work on any machine and let the admin tune the system up from there.
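To give an idea of the knobs involved, here’s the sort of my.cnf stanza I’ve been experimenting with. The variable names are genuine MySQL settings, but the values below are purely illustrative for my hardware, not recommendations:

    [mysqld]
    # MyISAM index cache -- this is the one that defaults to a tiny 8M
    key_buffer_size     = 128M
    # per-connection buffers, so keep these modest
    sort_buffer_size    = 4M
    read_buffer_size    = 2M
    # cache for repeated identical SELECT queries
    query_cache_size    = 32M
    # number of table handles MySQL keeps open
    table_cache         = 512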

I’m still tuning and playing with the parameters, but it looks like I’ve easily increased the speed of this beast by a factor of 5. It’s to the point now where a simple ‘show processlist’ is hardly listing any processes anymore because they’re completing so fast! I’ve been a fan of MySQL for a while now and I’ve been pretty impressed with the performance I’ve seen from it. With these changes and further tuning, I’m sure I’ll be even more impressed.

So today’s blog entry comes with a lesson: research is key when deploying services like this, even if they’re only for yourself. Definitely look into performance tuning for your systems. You’ll thank me later.