To optimize…

… or not to optimize, that is the programmer’s dilemma.  Optimization is a powerful tool in the programmer’s arsenal, but one that should be used sparingly and with care.  The key to optimization is identifying what exactly to optimize.


Shawn Hargreaves, creator of the Allegro game programming library, recently posted a link to the blog of Thomas Aylesworth, a.k.a. SwampThingTom.  According to his blog, Thomas is a software engineer in the aerospace and defense industry.  He’s also a hobbyist game developer.

Tom’s first foray into the world of blogging is a series of posts about optimization, specifically targeting XNA-based games.  In his first post, he explains the importance of design and what Big O notation is.  His second post delves into prototyping and benchmarking, complete with examples.  In part three, he introduces us to the NProf execution profiler and points out a few other potential bottlenecks.

Optimization in general isn’t something you should need to worry about until the very end of the development cycle, if ever.  It’s a great tool for squeezing a few more cycles out of your code when you really need them.  What you normally don’t get from it, however, is a significant increase in speed.  If a significant increase in speed is what you’re after, take a look at your underlying algorithm.

Generally speaking, unless you’re writing extremely specialized code, optimization should be used very sparingly.  You’re better off looking for an elegant way to solve your programming dilemma.  Look into alternative algorithms or possibly re-design the data flow.  If you’re not sure where the bottleneck is, use a code profiler, or simply add debugging statements around suspected “slow” code.  You’ll usually find that a poor design decision is causing the bottleneck and that a simple re-design can yield huge speed increases while keeping your code readable and maintainable.
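To make the point concrete, here’s a quick sketch (in Python rather than C#/XNA, and with names of my own choosing) of bracketing suspected slow code with timers, and of how swapping the underlying algorithm dwarfs any micro-tweak:

```python
import time

def timed(label, fn, *args):
    """Run fn and report how long it took -- a poor man's profiler
    for bracketing suspected slow code."""
    start = time.perf_counter()
    result = fn(*args)
    print(f"{label}: {time.perf_counter() - start:.4f}s")
    return result

def naive_dedupe(items):
    """O(n^2): each membership test scans a list."""
    seen = []
    for x in items:
        if x not in seen:
            seen.append(x)
    return seen

def fast_dedupe(items):
    """O(n): membership tests hit a set instead."""
    seen, out = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

data = list(range(5000)) * 2
slow = timed("naive", naive_dedupe, data)
fast = timed("set-based", fast_dedupe, data)
assert slow == fast  # same answer, very different running time
```

Same output, but the second version wins by orders of magnitude as the input grows — no amount of cycle-shaving on the first version would close that gap.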


If you’re interested in optimization, or even just curious, take a look at Tom’s articles; they’re a great read.

SpamAssassin and Bayes

I’ve been messing around with SpamAssassin a lot lately, and the topic of database optimization came up. I’m using Bayesian filtering to improve the spam scores and, to increase speed and manageability, I have SpamAssassin set to use MySQL as the database engine. Bayes is fairly resource-intensive on both I/O and CPU, depending on the action being performed. Since I decided to use MySQL as the storage engine, most of the I/O is handled there.

I started looking into performance issues with Bayes recently and noticed a few “issues” that I’ve been trying to work out. The biggest is performance on the MySQL side. The Bayes database is enormous and queries are taking a while to complete. So my initial thought was to look into reducing the size of the database.

There are a few different tables used by Bayes. The one that grows the largest is the bayes_token table. That’s where all of the core statistical data is stored, and it simply takes up a lot of room. There’s not a lot that can be done about it. Or so I thought. Apparently, if you have SpamAssassin set up to train Bayes automatically, it doesn’t always train the mail for the correct user. For instance, if you receive mail that was BCCed to you, the mail could be learned for the user listed in the To: field. This means the Bayes database can contain a ton of “junk” that you’ll never use. So my first order of business is to trim out the non-existent users.
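The cleanup itself boils down to one DELETE keyed on the user table. A minimal sketch of the idea, using toy sqlite tables in place of MySQL (SpamAssassin’s real schema ties bayes_token rows to bayes_vars by a numeric user id; the exact column names here are assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bayes_vars  (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE bayes_token (id INTEGER, token TEXT);
""")
conn.executemany("INSERT INTO bayes_vars VALUES (?, ?)",
                 [(1, "me@example.com"), (2, "stranger@example.com")])
conn.executemany("INSERT INTO bayes_token VALUES (?, ?)",
                 [(1, "token-a"), (2, "token-b"), (2, "token-c")])

# Only the users who actually receive mail here are "real"; tokens
# auto-learned for anyone else are junk and can be trimmed.
real_users = ("me@example.com",)
placeholders = ",".join("?" * len(real_users))
conn.execute(f"""
    DELETE FROM bayes_token WHERE id NOT IN
        (SELECT id FROM bayes_vars WHERE username IN ({placeholders}))
""", real_users)

remaining = conn.execute("SELECT COUNT(*) FROM bayes_token").fetchone()[0]
print(remaining)  # only the real user's tokens survive
```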

The bayes_seen table tracks the message IDs of messages that have already been parsed and learned by Bayes. It’s useful for preventing unnecessary CPU utilization, but there is no automatic trimming function, which means the table grows indefinitely. The awl table is similar in that it can also grow indefinitely and has no autotrim mechanism. For both of these tables I’ve added a timestamp field to monitor additions and updates. With that in place, I can write some simple Perl code to automatically trim entries that are old enough to be irrelevant. For the bayes_seen table I plan on using a default lifetime of 1 month. For the awl table I’m looking at dropping any entries with a single hit over 3 months old, and any entries over 1 month old with fewer than 5 hits. Since MySQL automatically updates the timestamp field whenever a row changes, this should be sufficient to keep relevant entries from being deleted.
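With the timestamp column in place, those trim rules reduce to a handful of DELETEs. Here’s the logic sketched against toy sqlite tables (the real schema, and the Perl wrapper around these statements, will differ; column names here are mine):

```python
import sqlite3
import time

DAY = 86400
now = int(time.time())

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bayes_seen (msgid TEXT, last_update INTEGER);
    CREATE TABLE awl        (email TEXT, count INTEGER, last_update INTEGER);
""")
conn.executemany("INSERT INTO bayes_seen VALUES (?, ?)", [
    ("old@msgid",    now - 45 * DAY),   # past the 1-month lifetime
    ("recent@msgid", now -  5 * DAY),   # still fresh: keep
])
conn.executemany("INSERT INTO awl VALUES (?, ?, ?)", [
    ("one-hit",  1, now - 100 * DAY),   # single hit, over 3 months old
    ("few-hits", 3, now -  40 * DAY),   # under 5 hits, over 1 month old
    ("regular", 20, now -  40 * DAY),   # frequent sender: keep
])

# bayes_seen: drop anything older than the 1-month lifetime.
conn.execute("DELETE FROM bayes_seen WHERE last_update < ?",
             (now - 30 * DAY,))

# awl: single-hit rows over 3 months old, then sub-5-hit rows over 1 month.
conn.execute("DELETE FROM awl WHERE count = 1 AND last_update < ?",
             (now - 90 * DAY,))
conn.execute("DELETE FROM awl WHERE count < 5 AND last_update < ?",
             (now - 30 * DAY,))

seen_left = conn.execute("SELECT COUNT(*) FROM bayes_seen").fetchone()[0]
awl_left  = conn.execute("SELECT COUNT(*) FROM awl").fetchone()[0]
```

Run from cron, a script issuing statements like these keeps both tables bounded without touching entries that are still earning their keep.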

While researching all of this I was directed to a site about MySQL optimization. The MySQL Performance Blog is run by Peter Zaitsev and Vadim Tkachenko, both former MySQL employees. The entry I was directed to dealt with general MySQL optimization and is a great starting point for anyone using MySQL. I hate to admit it, but I was completely unaware that this much performance could be coaxed out of MySQL with these simple settings. While I was aware that tuning was possible, I just never dealt with a large enough database to warrant it.

I discovered, through the above blog and further research, that the default settings in MySQL are extremely conservative! By default, most of the memory allocation variables are capped at a mere 8 MB. I guess the general idea is to ship with settings that are almost guaranteed to work and let the admin tune the system from there.
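For reference, the knobs in question live in my.cnf and look something like this — the values below are purely illustrative starting points pulled from that kind of tuning advice, not recommendations, and should be sized against your own workload and available RAM:

```ini
[mysqld]
# Many of these default to a conservative 8M or less.
key_buffer_size         = 128M   # MyISAM index cache
sort_buffer_size        = 4M     # per-connection sort buffer
read_buffer_size        = 4M     # per-connection sequential-scan buffer
query_cache_size        = 32M    # cache for identical SELECTs
table_cache             = 512    # open table handles
innodb_buffer_pool_size = 256M   # InnoDB data/index cache, if you use InnoDB
```

Note that the per-connection buffers are allocated per connection, so multiply by your expected connection count before getting generous with them.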

I’m still tuning and playing with the parameters, but it looks like I’ve easily increased the speed of this beast by a factor of 5. It’s to the point now where a simple ‘show processlist’ is hardly listing any processes anymore because they’re completing so fast! I’ve been a fan of MySQL for a while now and I’ve been pretty impressed with the performance I’ve seen from it. With these changes and further tuning, I’m sure I’ll be even more impressed.

So today’s blog entry has a lesson to be learned. Research is key when deploying services like this, even if they’re for yourself. Definitely check into performance tuning for your systems. You’ll thank me later.