SpamAssassin and Bayes

I’ve been messing around with SpamAssassin a lot lately and the topic of database optimization came up. I’m using Bayesian filtering to improve the spam scores and, to increase speed and manageability, I have SpamAssassin set to use MySQL as the database engine. Bayes is fairly resource intensive on both I/O and CPU depending on the current action being performed. Since I decided to use MySQL as the storage engine, most of the I/O is handled there.

I started looking into performance issues with Bayes recently and noticed a few “issues” that I’ve been trying to work out. The biggest issue is performance on the MySQL side. The Bayes database is enormous and it’s taking a while to deal with the queries. So, my initial thought was to look into reducing the size of the database.

There are a few different tables used by Bayes. The main table that grows the largest is the bayes_token table. That’s where all of the core statistical data is stored and it just takes up a lot of room. There’s not a lot that can be done about it. Or so I thought. Apparently if you have SpamAssassin set up to train Bayes automatically, it doesn’t always train the mail for the correct user. For instance, if you receive mail that is BCCed to you, then the mail could be learned for the user listed in the To: field. This means the Bayes database can contain a ton of “junk” in it that you’ll never use. So my first order of business then is to trim out the non-existent users.
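A rough sketch of that cleanup, assuming the stock SpamAssassin SQL schema, where bayes_vars maps each username to a numeric id that the other Bayes tables are keyed on. The function name purge_user_sql is made up for illustration. It only prints SQL, so the statements can be reviewed before they go anywhere near the mysql client.

```shell
#!/bin/bash
# Print the SQL needed to remove every Bayes row belonging to one user.
# Assumes the stock SpamAssassin schema: bayes_vars(id, username, ...) with
# bayes_token / bayes_seen / bayes_expire rows keyed by that id.
purge_user_sql() {
  local user="$1"
  # Refuse anything outside a conservative character set so the name can be
  # embedded in a quoted SQL string without escaping.
  case "$user" in
    ''|*[!A-Za-z0-9@._-]*)
      echo "refusing suspicious username: $user" >&2
      return 1
      ;;
  esac
  local table
  for table in bayes_token bayes_seen bayes_expire; do
    printf "DELETE FROM %s WHERE id IN (SELECT id FROM bayes_vars WHERE username = '%s');\n" "$table" "$user"
  done
  # Remove the user record last, since the statements above look it up.
  printf "DELETE FROM bayes_vars WHERE username = '%s';\n" "$user"
}
```

Something like `purge_user_sql nosuchuser | mysql spamassassin` would then apply it, where the database name is a placeholder for whatever yours is called.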

The bayes_seen table tracks the message IDs of messages that Bayes has already parsed and learned. It’s a useful table for preventing unnecessary CPU utilization, but there is no automatic trimming function, so it grows indefinitely. The awl table is similar in that it can grow indefinitely and has no autotrim mechanism. For both of these tables I’ve added a timestamp field to monitor additions and updates. With that in place, I can write some simple Perl code to automatically trim entries that are old enough to be irrelevant. For the bayes_seen table I plan on using a default lifetime of 1 month. For the awl I’m looking at dropping any entries with a single hit over 3 months old, and any entries over 1 month old with fewer than 5 hits. Since MySQL automatically updates the timestamp field whenever the row changes, this should be enough to keep any relevant entries from being deleted.
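Here is a minimal sketch of those trimming rules as SQL, generated from a small shell script rather than the Perl mentioned above. The column name lastupdate is an assumption, standing in for whatever the added timestamp field is actually called. The count column is the hit counter in the stock awl table.

```shell
#!/bin/bash
# Hypothetical name for the TIMESTAMP column added to bayes_seen and awl.
TS_COL="lastupdate"

# bayes_seen: drop message IDs older than N months (default 1).
seen_trim_sql() {
  local months="${1:-1}"
  printf 'DELETE FROM bayes_seen WHERE %s < NOW() - INTERVAL %d MONTH;\n' "$TS_COL" "$months"
}

# awl: drop single-hit entries older than 3 months, and entries with fewer
# than 5 hits older than 1 month. As stated, the first rule is already
# covered by the second; both are kept to mirror the text.
awl_trim_sql() {
  printf 'DELETE FROM awl WHERE (count = 1 AND %s < NOW() - INTERVAL 3 MONTH)' "$TS_COL"
  printf ' OR (count < 5 AND %s < NOW() - INTERVAL 1 MONTH);\n' "$TS_COL"
}

# Print both statements; pipe the output into the mysql client to apply them.
seen_trim_sql 1
awl_trim_sql
```

Printing the statements instead of running them makes it easy to swap each DELETE for a SELECT COUNT(*) first and see how many rows would go.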

While researching all of this I was directed to a site about MySQL optimization. The MySQL Performance Blog is run by Peter Zaitsev and Vadim Tkachenko, both former MySQL employees. The entry I was directed to dealt with general MySQL optimization and is a great starting point for anyone using MySQL. I hate to admit it, but I was completely unaware that this much performance could be coaxed out of MySQL with these simple settings. While I was aware that tuning was possible, I just never dealt with a large enough database to warrant it.

I discovered, through the above blog and further research, that the default settings in MySQL are extremely conservative! By default, most of the memory allocation variables are maxed out at a mere 8 Megs of memory. I guess the general idea is to ship with settings that are almost guaranteed to work and allow the admin to tune the system from there.
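To give a feel for what that tuning looks like, here is a small sketch that prints a my.cnf fragment for two of the buffers that defaulted to 8 MB in MySQL of this era. The function name and the sizes passed in are purely illustrative, not recommendations; the right numbers depend on how much RAM the box has and which storage engine the tables use.

```shell
#!/bin/bash
# Print a my.cnf fragment raising two buffers that defaulted to 8 MB.
# Sizes are given in megabytes; the values used below are illustrative only.
mycnf_fragment() {
  local key_mb="$1" innodb_mb="$2"
  cat <<EOF
[mysqld]
# Index cache for MyISAM tables
key_buffer_size = ${key_mb}M
# Data and index cache for InnoDB tables
innodb_buffer_pool_size = ${innodb_mb}M
EOF
}

mycnf_fragment 256 512
```

Running `SHOW VARIABLES LIKE '%buffer%';` in the mysql client before and after a restart is a quick way to confirm the new values took effect.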

I’m still tuning and playing with the parameters, but it looks like I’ve easily increased the speed of this beast by a factor of 5. It’s to the point now where a simple ‘show processlist’ is hardly listing any processes anymore because they’re completing so fast! I’ve been a fan of MySQL for a while now and I’ve been pretty impressed with the performance I’ve seen from it. With these changes and further tuning, I’m sure I’ll be even more impressed.

So today’s blog entry has a lesson to be learned. Research is key when deploying services like this, even if they’re for yourself. Definitely check into performance tuning for your systems. You’ll thank me later.

Linux Upgrades – Installation from a software raid

I recently had to upgrade a few machines with a newer version of Linux. Unfortunately, the CD-ROM drives in these machines were not functioning, so I decided to upgrade the system via a hard drive install. Hard drive installs are pretty simple and are part of the standard distro for Red Hat. However, the machines I had to upgrade were all set up with software raid 1.

My initial thought was to merely put in the raid location where the install media was located. However, the installation program does not allow this and actually presents a list of acceptable locations.

So from here, I decided to choose one of the 2 raid drives and use that instead. The system graciously accepted this and launched the GUI installer to complete the process. All went smoothly until I reached the final step in the wizard, the actual install. At this point the installer crashed with a Python error. Upon inspection of the error, it appeared that the drive I was trying to use for the installation media was not available. Closer inspection revealed the truth: mdadm had started up and activated all of the raid partitions on the system, making the partition I needed unavailable.

So what do I do now? I re-ran the installation and deleted the raid partition, being careful to leave the physical partitions intact. Again, the installer crashed at the same step. It seems that the installer scanned the entire hard drive for raid partitions whether they had valid mount points or not.

I finally solved the problem through a crude hack during the setup phase of the installation. I made sure to delete the raid partition once again, leaving the physical drives intact. I stepped through the entire process but stopped just before the final install step. At this point, I switched to the CLI console via Ctrl-Alt-F2. I created a small bash script that looked something like this:

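#!/bin/bash

# Stop the raid array so the underlying partition is free for the installer.
mdadm --stop /dev/md3
# Re-run this script in place, forever (assumes it was saved as ./myscript.bash).
exec /bin/bash ./myscript.bash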

I ran this script and switched back to the installer via Ctrl-F6. I proceeded with the installation and the installer happily installed the OS onto the drives. Once completed, I switched back to the CLI console, edited the new /etc/fstab file and added a mount point for the raid drive I used, and rebooted. The system came up without any issues and ran normally.
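The same trick can be sketched as a loop with a short pause, instead of a script that re-executes itself as fast as it can. This is just one way to write it, not what I actually ran. The MDADM override and the try counter are additions for dry runs and testing, and it assumes sleep is available in the installer environment.

```shell
#!/bin/bash
# Repeatedly stop a raid array so the installer cannot keep it assembled.
# MDADM can be overridden (e.g. MDADM=echo) for a dry run; that knob is
# hypothetical and only exists in this sketch.
MDADM="${MDADM:-mdadm}"

keep_stopped() {
  local dev="$1" tries="$2" delay="${3:-1}"
  local i=0
  # tries=0 means "loop until interrupted".
  while [ "$tries" -eq 0 ] || [ "$i" -lt "$tries" ]; do
    # Errors are expected whenever the array is already stopped.
    "$MDADM" --stop "$dev" 2>/dev/null
    sleep "$delay"
    i=$((i + 1))
  done
}
```

From the CLI console, `keep_stopped /dev/md3 0 &` would leave it running in the background while the installer does its thing.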

Just thought I’d share this with the rest of you, should you run into the same situation. It took me a while to figure out how to make the system do what I wanted. Just be sure that your install media is on a drive that the installer will NOT need to write files to.

Voting in an electronic world

Well, I did my civic duty and voted this morning. I have my misgivings about the entire election process and the corruption that abounds in the government, but if I refuse to vote, then I really can’t complain, can I?

So, after waking up and getting ready for work, I headed to the local polling location to check out the Diebold AccuVote TSX system they wanted me to vote on. It’s a neat looking machine from afar, but once I got up close, I was sorely disappointed.

I can’t put my finger on it exactly, but these seemed to be very flimsy, rushed systems. The touchscreen didn’t feel right, though it was presumably accurate, lighting up my choices as I chose them. There was a slight delay after I touched the screen, however, and that was annoying. The first time I tried to vote, it rejected the card I was given and flashed an error about being cleared. Well, I hope that’s what it said. Thinking back on it now, I’m upset that I didn’t take more time to read the screen. I’m honestly not sure if the error stated that the card was cleared, or that the machine was cleared. And when I returned the card for one that worked, the lady I gave it to mentioned that there were a bunch of cards she was having problems with. Not good..

On a positive note, the mechanism that held and ejected the voting card seemed to be well built. It worked well. I think that’s about the only piece that I thought was decent though. Kinda pathetic actually.

Speaking of the Diebold machines, I urge you to check out the HBO special, “Hacking Democracy.” The entire show is up on Google Video for your viewing pleasure. You can access the video here.

Firefox 2.0 Released!

Firefox 2.0 was released earlier today. I wrote previously about this latest release while it was still in Beta. I recommended then that you check it out, and with the final release here, I’ll say it again! This is one of the best browsers out there. Give it a try, it’s easy to uninstall if you find you don’t like it!

Internet Explorer 7.0 Released

Well, it looks like Microsoft has finally released Internet Explorer 7.0 to the public. Initially you have to download and install it manually, but they plan on releasing this on Windows Update in the near future.

I’m a huge fan of Firefox, so why am I bringing this to your attention? Well, there are a couple of reasons. It’s more secure than IE 6.0, much closer to being standards compliant, and if you have to use IE at all then this should make life a little safer and easier.

If you use Firefox exclusively, then please, continue doing so! And maybe even take a glance at version 2.0! But if you need IE at all, even for the IE Tab extension for Firefox, then please update IE to this latest version.

Interactive Searching

Saw this over at LetsKillDave.. I guess the Windows Live team has been playing around with the idea of interactive searching. They’ve come up with something they call Ms. Dewey. It’s pretty neat to play with. Just ask her a question and just before she returns the results, she converses with you. Some of the responses are quite funny. Check it out.

Some ideas for questions :

r u hot? (Thanks Ozy)

These are from one of the comments at LetsKillDave :

Where can I get a date?
Where can I get tested for STDs?
How do I tell my mother she’s a grandmother?
Are you gaining weight?
Where do republicans come from?

Or ask her about Bill Gates or George Bush. I think I’ve seen about 20-30 different responses so far. I’m not sure how many there are total.

Other ideas.. Ask about suicide, the weather, or just simply curse at her. All unique responses.. This is quite the time waster.. Highly recommended!

Update : Try telling her to take her clothes off… No, really, try it..

powered by performancing firefox (yeah, I’m checking it out)

Windows Live Writer Beta

I’m writing this post using the new Windows Live Writer Beta. It’s a blogging tool that allows you to write your blog entries offline and upload them later. Useful, I guess, if you’re not connected all the time. For me, it’s just something to play with. Time will tell whether I like it or not.

To use Writer with a Serendipity blog you’ll need to install the XML-RPC plugin. Once that’s up and working you need to tell Writer what kind of blog you’re using. After it fails the auto-detect you’ll need to choose the API to use. I’m using the Metaweblog API and it seems to be working fine. It also asks for the URL for publishing. For the XML-RPC plugin, the URL will be something like this : http://www.example.com/blog/serendipity_xmlrpc.php
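For the curious, a client like Writer is just POSTing XML-RPC calls to that URL. Below is a minimal sketch that builds the request body for metaWeblog.getRecentPosts, whose parameters are blogid, username, password and numberOfPosts. The function name is made up, the values are not XML-escaped, and the blog id Serendipity expects is an assumption on my part.

```shell
#!/bin/bash
# Build an XML-RPC request body for metaWeblog.getRecentPosts.
# Parameters per the MetaWeblog API: blogid, username, password, numberOfPosts.
# Note: values are inserted as-is, with no XML escaping.
recent_posts_request() {
  local blogid="$1" user="$2" pass="$3" count="$4"
  cat <<EOF
<?xml version="1.0"?>
<methodCall>
  <methodName>metaWeblog.getRecentPosts</methodName>
  <params>
    <param><value><string>${blogid}</string></value></param>
    <param><value><string>${user}</string></value></param>
    <param><value><string>${pass}</string></value></param>
    <param><value><int>${count}</int></value></param>
  </params>
</methodCall>
EOF
}
```

Piping it through curl, e.g. `recent_posts_request 1 author secret 5 | curl -s -H 'Content-Type: text/xml' --data-binary @- http://www.example.com/blog/serendipity_xmlrpc.php`, is a handy way to check that the plugin is answering before blaming the client.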

So, for now, I’m just messing around with the system to see what it’s capable of. It seems to be a fairly nice system, pretty at least. Just a document editor with the standard font options on the surface. Hyperlinks are available (as they should be), and it seems to handle media such as pictures, movies, and audio as well. I haven’t dealt with media yet on this blog, so I’m not that interested in those capabilities.

Writer won’t download the categories I have set up on my blog, so I’ll have to hand-edit that after I publish. No big deal I guess, but kinda defeats the purpose of this utility. I also don’t see a way to add serendipity tags, so that’s another hand-edit. You can add third party tags such as those from Technorati, LiveJournal, and others, but I have no interest in that.

The web preview is pretty nice. It shows you exactly what the web page will look like when you publish it. It’s pretty cool and seems to work well.

Well, I guess it’s a little nicer than the JavaScript WYSIWYG editor that’s built into serendipity, but between the need for XML-RPC and the lack of serendipity features, I don’t think I’ll be continuing to use Live Writer. While trying to get Writer to work, I also ran across two other tools, w.bloggar and Performancing. The first is a program similar to Writer that seems to allow offline editing. The second is a Firefox plugin that seems to have a ton of features. I’ll be checking both out in the near future.

Firefox 2.0

The latest incarnation of the Firefox browser is nearing release. Version 2.0 brings with it a smattering of nifty features as well as an updated UI and enhanced add-on handling.

I’m particularly fond of the built-in spell checker, which comes in really handy. It works in a fashion similar to the spell checkers in MS Office and OpenOffice. Each misspelled word is underlined in red. When you right click on the underlined word, Firefox pops up a list of suggestions. You can choose one of the suggested replacements, or add the word to your dictionary. The spell checker only checks text boxes by default, but you can right click on any text entry field to force a spell check.

The new UI places a close icon on each tab, allowing you to close a tab in a rapid fashion. I can see this causing slight problems with people that are too quick to click as it doesn’t prompt you to close the tab. If you have a large number of tabs open, it begins to suppress the close button on all but the current tab. There is also a drop down on the far right side of the tab bar that shows you all of the open tabs in a list, allowing you to read the full title before jumping to the tab you need.

Firefox now defaults to opening all links in new tabs instead of new windows. I prefer this behavior to simply opening new windows. In addition, the popup blocker has apparently been enhanced. Since installing 2.0, I have not seen a single popup.

The default search bar now supports suggestions. As you type, the search engine you have chosen will offer suggestions for search terms, helping you find the information you want. This is the same technology that Google uses for Google Suggest. The new search engine manager allows you to add in additional search engines as well.

Overall, I think this is a real positive step in Firefox’s evolution. You should check it out, it’s a really great browser!

Windows XP ISO Mount Utility

I was looking around earlier today for a tool that would allow me to mount .iso images in Windows XP. I stumbled across a tool Microsoft wrote called the Virtual CD Control Panel. Unfortunately I can’t seem to find a page on the Microsoft web site that directly references this tool, but it is a download from a Microsoft site, and it made it through my virus checker, so my best guess is that it’s ok.

It’s pretty easy to install. Copy the VCdRom.sys file into your system32\drivers folder and then run the executable. From there use the Driver Control button to load and start the driver and then you can add virtual drives that can be used to mount .iso files. Simple!

Just thought I might share my find. I find it extremely easy to mount .iso files in Linux and wanted something on the Microsoft side as well.
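For comparison, this is roughly all it takes on the Linux side. A minimal sketch, assuming root privileges and loop device support in the kernel; the function name and the paths in the usage line are placeholders.

```shell
#!/bin/bash
# Mount an .iso image read-only on a loop device.
mount_iso() {
  local iso="$1" mnt="$2"
  # Bail out early if the image does not exist.
  if [ ! -f "$iso" ]; then
    echo "no such image: $iso" >&2
    return 1
  fi
  # Create the mount point if needed, then mount the image.
  mkdir -p "$mnt" && mount -o loop,ro "$iso" "$mnt"
}
```

Usage would be something like `mount_iso /tmp/image.iso /mnt/iso`, with `umount /mnt/iso` to release it when done.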