Whoa! Slow down! Or else…

There have been rumblings over the past few years about companies that are throttling customer bandwidth and, in some instances, canceling their service. I can confirm one of the rumors, having worked for the company in question, and I would tend to believe the others. The problem with most of these situations is that none of these companies ever solidly defines what will result in throttling or loss of service. In fact, most of them merely put clauses in their Terms of Service stating that the bandwidth customers purchase is not sustained, not guaranteed, etc.

One particular company has been in the news as of late, having cut customers off time and time again. In fact, they have what appears to be a super-secret internal group of customer support representatives that deals with the “offenders.” Really, I’m not making this up. Check out this blog entry. This is pretty typical of companies that enact these types of policies. What I find interesting here is how Comcast determines whom to disable. According to the blog entry by Rocketman, Comcast is essentially determining who the top 1% of users are for each month and giving them a high-usage warning. The interesting bit is that this is almost exactly how my previous employer handled it.

Well, apparently Comcast has come out with a statement to clarify what excessive usage is. According to Comcast, excessive usage is defined as “a user who downloads the equivalent of 30,000 songs, 250,000 pictures, or 13 million emails.” So let’s pull this apart a little. The terms they use are rather interesting. Songs? Pictures? How is this even close to descriptive enough to use? A song can vary wildly in size depending on the encoding method, bitrate, etc. So the same song can range from 1 MB to 100 MB pretty easily. How about pictures then? Well, what kind of pictures? After all, thumbnails are pictures too. So, again, we can vary the size of a picture from 10 KB to 10 MB, depending on the size and detail of the picture. And, of course, let’s not forget emails. An average email is about 10 KB or so, but these can also range up to several MB in size.

So let’s try out some simple math on this. Email seems to be the easiest to deal with, so we’ll use that. 13 million emails in one month, assuming a 10 KB average size for each email, works out to approximately 130 GB of data. That’s only an average of 50 KB per second over the course of 30 days. If we assume a user is only on the computer for 8 hours a day, that’s an average of 150 KB per second for the entire 8 hours each day. Of course, we don’t normally download at such a consistent rate; real usage is much more bursty in nature.
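Sketching that math out quickly (the 10 KB average email size is the assumption from above):

```python
# Back-of-the-envelope check of the email math above.
# The 10 KB average email size is an assumption, not a Comcast figure.
emails = 13_000_000                       # "13 million emails"
avg_email_kb = 10                         # assumed average size in KB

total_kb = emails * avg_email_kb          # 130,000,000 KB
total_gb = total_kb / 1_000_000           # ~130 GB

seconds_per_month = 30 * 24 * 60 * 60     # 2,592,000 seconds in 30 days
rate_24h = total_kb / seconds_per_month   # ~50 KB/s around the clock
rate_8h = rate_24h * 3                    # ~150 KB/s if only online 8 hours/day

print(round(total_gb), round(rate_24h), round(rate_8h))  # 130 50 150
```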

Now, I don’t believe the average user is going to download this much data, but there are business professionals who could easily exceed this rate. But I think the bigger issue here is how these companies are handling these issues. They advertise and sell access rates ranging anywhere from 3 Meg to 10 Meg and then get upset when the customers actually use that bandwidth. Assuming a 3M profile, that means you can download something in the range of 972 GB of data in one month. 10M is even more fun, allowing a maximum of about 3.24 TB. Think about that for a minute. That means you can use only about 13% of a 3M profile’s monthly capacity, and 4% of a 10M profile’s, before they’ll terminate your service.
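The same kind of quick math backs up those percentages:

```python
# How much data a fully-used connection could move in 30 days,
# and how little of it the ~130 GB "excessive use" figure represents.
seconds_per_month = 30 * 24 * 60 * 60          # 2,592,000 s

for mbps in (3, 10):
    bits = mbps * 1_000_000 * seconds_per_month
    gb = bits / 8 / 1_000_000_000              # bits -> bytes -> GB
    print(mbps, round(gb))                     # 3 -> 972 GB, 10 -> 3240 GB

cap_gb = 130                                   # the email-based figure from above
print(round(cap_gb / 972 * 100))               # ~13% of a 3M profile
print(round(cap_gb / 3240 * 100))              # ~4% of a 10M profile
```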

While I understand that providers need to ensure that everyone receives a consistent, reliable service, I don’t believe they can treat customers like this. We’ll see how this turns out over time, but I expect that as video becomes more popular, you’ll see customers that exceed this rate on a much more consistent basis. I wonder how providers will handle that…

Satellite TV Woes

Back in the day, I had Analog and then Digital cable.  Having been employed by a sister company of the local cable company, I enjoyed free cable.  There were a lot of digital artifacts, and the picture wasn’t always that great, but it was free and I learned to live with it.

After I left that company, I had to pay full price for the digital cable I had installed.  Of course, I was used to a larger package and the price just outright shocked me.  With the cable modem included, it was somewhere in the $150 a month range.  Between the signal issues on the cable TV, and the constant cable modem outages, I happily decided to drop both the cable and the cable modem and move on to DSL and Satellite TV.

My first foray into Satellite TV was with Dish Networks.  The choice to do so was mostly guided by my brother’s employment by Dish.  So, we checked it out and had it installed.  At the time, they were running a free DVR promotion, so we grabbed that as well.

Dish is great.  The DVR was a dual tuner, so we were able to hook two TVs up to it.  We could record two shows at once, and watch two recorded shows at the same time, one on each TV.  It was pure TV bliss, and it got better.  Dish started adding little features here and there that I started noticing more and more.  First, the on-screen guide started showing better summaries of the shows.  Then it would show the year the show was produced in.  And finally, it started showing actual episode number information.  Little things, but it made all the difference.

Dish, however, had its problems.  My family and I only watch a few channels.  The kids like the cartoon channels: Cartoon Network, Nickelodeon, Noggin, and Boomerang.  My wife enjoys the local channels for current shows such as CSI and Law and Order, and also the educational channels such as The History Channel, The Science Channel, and Discovery.  As for me, I’m into stuff like Scifi, FX, and occasionally, G4.  CSI and Law and Order are on my menu as well.  The problem is, in order to get all of the channels we wanted, we needed to subscribe to the largest Dish package.  It’s still cheaper than cable, but more money than we wanted to pay to pick up one or two extra channels.

Enter DirecTV.  DirecTV offered all the channels we wanted in their basic package.  So, we ordered it.  As it turns out, they’ve partnered with Verizon, so we can get our phone, DSL, and dish all on the same bill.  Personally, I couldn’t care less about that, but I guess it is convenient.

At any rate, we got DirecTV about a month or so ago.  Again, we got the DVR, but there’s a problem there.  DirecTV doesn’t offer a dual TV DVR.  It’s a dual tuner, so we can tape two shows simultaneously, but you can only hook a single TV up to it.  Our other TV has a normal DirecTV receiver on it.  Strike one against DirecTV, and we didn’t even have it hooked up yet.

So the guy comes and installs all the new stuff.  They used the same mount that the Dish Networks dish was mounted on, as well as the same cables, so that was convenient.  Dish Networks used some really high quality cables, so I was pleased that we were able to keep them.  Everything was installed, and the installer was pretty cool.  He explained everything and then went on his way.

I started messing around with the DVR and immediately noticed some very annoying problems.  The remote is a universal remote.  Dish Networks used them too.  The problem with the DirecTV remote, however, is that apparently when you interact with the TV, VCR, or DVD player, it needs to send the signal to the DirecTV receiver first before it will send the signal to the other equipment.  This means merely pressing the volume control results in nothing.  You need to hold the volume down for about a second before it will change the volume on the TV.  Very, very annoying.  I also noticed a considerable pause between pressing buttons on the controller and having the DVR respond.  The standalone receiver is much quicker, but there is definitely a noticeable lag there.  Strike two.

Continuing to mess around with the DVR, I started checking out how to set up the record timers and whatnot.  DirecTV has a nice guide feature that automatically breaks down the channels into sub-groups such as movie channels, family channels, etc.  They also have a nicer search feature than Dish does.  As you type in what you’re searching for, it automatically refreshes the list of found items, giving you a quick shortcut to jump over and choose what you’re looking for.  Dish allows you to put in arbitrary words and record based on title matches, but I’m not sure if DirecTV does.  I never used that feature anyway.  So the subgroups and the search features are a score for DirecTV.

Once in the guide, however, it gets annoying.  Dish will automatically mask out any unsubscribed channels for you, where DirecTV does not.  Or, rather, if they do, it’s buried somewhere in the options and I can’t find it.  Because of this, I find all sorts of movies and programs that look cool, but give me a “you’re not subscribed to this channel” message when I try to watch them.  Quite annoying.

I set up a bunch of timers for shows my family and I like to watch.  It was pretty easy and worked well.  A few days later, I checked the shows that recorded.  DirecTV groups together episodes for shows which is a really nice feature.  However, I noticed that one or two shows never recorded.  Dish had a problem once in a while with recording new shows where the show didn’t have a “new” flag on it and it would skip it.  Thinking this was the problem with DirecTV, I just switched the timer to record all shows.  I’d have to delete a bunch of shows I already saw, but that’s no big deal.

Another week goes by, still no shows.  Apparently DirecTV doesn’t want me to watch my shows.  Now I’m completely frustrated.  Strike three.

Unfortunately, I’m in a two year contract, so I just have to live with this.  I’m definitely looking to get my Dish Networks setup back at the end, though.  Those extra few bucks we spent on Dish were well worth it.


DirecTV definitely has some features that Dish doesn’t, but the lack of a dual tuner, the lag time between the controller and the receiver, and the refusal to tape some shows is just too much.  The latter two I can live with, but the dual TV DVR was just awesome and I really miss it.  Since I only have the DVR on the main TV in the house, I need to wait until the kids go to bed before I can watch my shows in peace.  Of course, I need to go to bed too since I get up early for work.  This leaves virtually no time for the few shows I watch, and as a result, I have a bunch of stuff recorded that I haven’t been able to watch yet.  And, since it’s that time of the year where most of my shows aren’t being shown, I know that it’s only going to get worse.

I’m just annoyed at this point.  If you have a choice between Dish and DirecTV, I definitely suggest Dish.  It’s much better in the long run and definitely worth the extra few dollars.

Backups? Where?

It’s been a bit hectic, sorry for the long time between posting.


So, backups.  Backups are important, we all know that.  So how many people actually follow their own advice and back their data up?  Yeah, it’s a sad situation for desktops.  The server world is a little different, though, with literally tens, possibly hundreds of different backup utilities available.


My backup tool of choice is the Advanced Maryland Automatic Network Disk Archiver, or AMANDA for short.  AMANDA has been around since before 1997 and has evolved into a pretty decent backup system.  Initially intended for single tape-based backups, it has recently gained options for tape spanning and disk-based backups as well.

Getting started with AMANDA can be a bit of a chore.  The hardest part, at least for me, was getting the tape backup machine running.  Once that was out of the way, the rest of it was pretty easy.  The config can be a little overwhelming if you don’t understand the options, but there are a lot of guides on the Internet to explain it.  In fact, the “tutorial” I originally used is located here.

Once it’s up and running, you’ll receive a daily email from AMANDA letting you know how the previous night’s backup went.  All of the various AMANDA utilities are command-line based.  There is no official GUI at all.  Of course, this causes a lot of people to shy away from the system.  But overall, once you get the hang of it, it’s pretty easy to use.

Recovery from backup is a pretty simple process.  On the machine you’re recovering, run the amrecover program.  You then use regular filesystem commands to locate the files you want to restore and add them to the restore list.  When you’ve added all the files, issue the extract command and it will restore all of the files you’ve chosen.  It works quite well; I’ve had to use it once or twice…  Lemme tell ya, the first time I had to restore from backups I was sweating bullets.  After the first one worked flawlessly, subsequent restores were completed with a much lower stress level.  It’s great to know that there are backups available in case of an emergency.

AMANDA is a great tool for backing up servers, but what about clients?  There is a Windows client as well that runs using Cygwin, a free open-source Linux-like environment for Windows.  Instructions for setting something like this up are located in the AMANDA documentation.  I haven’t tried this, but it doesn’t look too hard.  Other client backup options include remote NFS and SAMBA shares.

Overall, AMANDA is a great backup tool that has saved me a few times.  I definitely recommend checking it out.

Hard drive failure reports

FAST ’07, the File and Storage Technology conference, was held from February 13th through the 16th. During the conference, a number of interesting papers were presented, two of which I want to highlight. I learned of these papers through posts on Slashdot rather than actually attending the conference. Honestly, I’m not a storage expert, but I find these studies interesting.

The first study, “Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?” was written by a Carnegie Mellon University professor, Garth Gibson, and a recent PhD graduate, Bianca Schroeder.

This study compared the manufacturers’ specifications for MTTF (mean time to failure) and AFR (annual failure rate) to real-world hard drive replacement rates. The paper is dense with statistical analysis, making it a rough read for some. However, if you can wade through all of the statistics, there is some good information here.

Manufacturers generally list MTTF rates of 1,000,000 to 1,500,000 hours. AFR is calculated by taking the number of hours in a year and dividing it by the MTTF. This means that the AFR ranges from about 0.58% (at 1,500,000 hours) to 0.88% (at 1,000,000 hours). In a nutshell, this means you have roughly a 0.6 to 0.9% chance of your hard drive failing each year.
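Working that definition through for the two quoted MTTF figures:

```python
# AFR from MTTF, using the definition in the study:
# AFR = (hours in a year) / MTTF.
hours_per_year = 365 * 24                  # 8,760 hours

for mttf in (1_000_000, 1_500_000):
    afr = hours_per_year / mttf * 100      # as a percentage
    print(mttf, round(afr, 2))             # 1,000,000 -> 0.88; 1,500,000 -> 0.58
```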

As explained in the study, determining whether a hard drive has failed or not is problematic at best. Manufacturers report that up to 40% of drives returned as bad are found to have no defects.

The study concludes that real world usage shows a much higher failure rate than that of the published MTTF values. Also, the failure rates between different types of drives such as SCSI, SATA, and FC, are similar. The authors go on to recommend some changes to the standards based on their findings.

The second study, “Failure Trends in a Large Disk Drive Population,” was presented by a number of Google researchers: Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. This paper is geared towards finding trends in the failures. Essentially, the goal is to create a reliable model to predict a drive failure so that the drive can be replaced before essential data is lost.

The researchers used an extensive database of hard drive statistics gathered from the 100,000+ hard drives deployed throughout their infrastructure. Statistics such as utilization, temperature, and a variety of SMART (Self-Monitoring Analysis and Reporting Technology) signals were collected over a five year period.

This study is well written and can be easily understood by non-academicians and those without statistical analysis training. The data is clearly laid out and each parameter studied is clearly explained.

Traditionally, temperature and utilization were pinpointed as the root cause of most failures. However, this study clearly shows a very small correlation between failure rates and these two parameters. In fact, failure rates due to high utilization seemed to be highest for drives under one year old, and stayed within 1% of low-utilization drives otherwise. It was only at the end of a given drive’s expected lifetime that the failure rate due to high utilization jumped up again. Temperature was even more of a surprise, with low-temperature drives failing more often than high-temperature drives until about the third year of life.

The report basically concludes that a reliable model of failure detection is mostly impossible at this time. The reason for this is that there is no clear indication of a reliable parameter for detecting imminent failure. SMART signals were useful in indicating impending failures and most drives fail within 60 days of the first reported errors. However, 36% of their failed drives reported no errors at all, making SMART a poor overall predictor.

Unfortunately, neither of these studies elaborated on the manufacturer or model of the drives used. This is likely due to professional courtesy and a lack of interest in being sued for defamation of character. While these studies will doubtlessly be useful to those designing large-scale storage networks, manufacturer specific information would be of great help.

For me, I mostly rely on Seagate hard drives. I’ve had very good luck with them, having had only a handful fail on me over the past few years. Maxtor used to be my second choice for drives, but they were acquired by Seagate at the end of 2005. I tend to stay away from Western Digital drives having had several bad experiences with them in the past. In fact, my brother had one of their drives literally catch fire and destroy his computer. IBM has also had some issues in the past, especially with their Deskstar line of drives which many people nicknamed the “Death Star” drive.

With the amount of information stored on hard drives today, and the even larger amounts to come, hard drive reliability is a concern for many vendors. It should be a concern for end-users as well, although end-users are not likely to take it seriously. Overall, these two reports are an excellent overview of the current state of reliability and the trends seen today. Hopefully drive manufacturers can use them to design changes that increase reliability and facilitate earlier detection of impending failures.

Linux Software Raid

I had to replace a bad hard drive in a Linux box recently and I thought perhaps I’d detail the procedure I used.  This particular box uses software raid, so there are a few extra steps to getting the drive up and running.

Normally when a hard drive fails, you lose any data on it.  This is, of course, why we back things up.  In my case, I have two drives in a raid level 1 configuration.  There are a number of raid levels that dictate various states of redundancy (or lack thereof in the instance of level 0).  The raid levels are as follows (Copied from Wikipedia):

  • RAID 0: Striped Set
  • RAID 1: Mirrored Set
  • RAID 3/4: Striped with Dedicated Parity
  • RAID 5: Striped Set with Distributed Parity
  • RAID 6: Striped Set with Dual Distributed Parity

There are additional raid levels for nested raid as well as some non-standard raid levels.  For more information on those, see the Wikipedia article referenced above.


The hard drive in my case failed in kind of a weird way.  Only one of the partitions on the drive was malfunctioning.  Upon booting the server, however, the BIOS complained about the drive being bad.  So, better safe than sorry, I replaced the drive.

Raid level 1 is a mirrored raid.  As with most raid levels, the hard drives being raided should be identical.  It is possible to use different models and sizes in the same raid, but there are drawbacks such as a reduction in speed, possible increased failure rates, wasted space, etc.  Replacing a drive in a mirrored raid is pretty straightforward.  After identifying the problem drive, I physically removed the faulty drive and replaced it with a new one.

The secondary drive was the failed drive, so this replacement was pretty easy.  In the case of a primary drive failure, it’s easiest to move the secondary drive into the primary slot before replacing the failed drive.

Once the new drive has been installed, boot the system up and it should load up your favorite Linux distro.  The system should boot normally with a few errors regarding the degraded raid state.

After the system has booted, log in and use fdisk to partition the new drive.  Make sure you set the partition IDs back to Linux raid autodetect.  When finished, the partition table will look something like this :

   Device Boot      Start         End      Blocks   Id  System
/dev/hdb1   *           1          26      208813+  fd  Linux raid autodetect
/dev/hdb2              27        3850    30716280   fd  Linux raid autodetect
/dev/hdb3            3851        5125    10241437+  fd  Linux raid autodetect
/dev/hdb4            5126       19457   115121790    f  W95 Ext'd (LBA)
/dev/hdb5            5126        6400    10241406   fd  Linux raid autodetect
/dev/hdb6            6401        7037     5116671   fd  Linux raid autodetect
/dev/hdb7            7038        7164     1020096   82  Linux swap
/dev/hdb8            7165       19457    98743491   fd  Linux raid autodetect

Once the partitions have been set up, you need to format them with a filesystem.  This is a pretty painless process depending on your filesystem of choice.  I happen to be using ext3, so I use the mke2fs program to format each partition.  To format an ext3 partition use the following command (this command, as well as those that follow, needs to be run as root, so be sure to use sudo) :

mke2fs -j /dev/hdb1

Once all of the raid partitions have been formatted, you can move on to creating the swap partition.  This is done using the mkswap program as follows :

mkswap /dev/hdb7

Once the swap partition has been formatted, activate it so the system can use it.  The swapon command achieves this goal (note that the -a flag activates everything listed in /etc/fstab instead, so name the new partition explicitly) :

swapon /dev/hdb7

And finally you can add the drives to the raid using mdadm.  mdadm is a single command with a plethora of uses.  It builds, monitors, and alters raid arrays.  To add a drive to the array use the following :

mdadm /dev/md1 --add /dev/hdb1

And that’s all there is to it.  If you’d like to watch the array rebuild itself, about as much fun as watching paint dry, you can do the following :

watch cat /proc/mdstat

Software raid has come a long way and it’s quite stable these days.  I’ve been happily running it on my Linux machines for several years now.  It works well when hardware raid is not available, or as a cheaper alternative.  I’m quite happy with the performance and reliability of software raid and I definitely recommend it.

Godshell Toaster Wiki Open

I’m pleased to announce that the Godshell Toaster Wiki is now open for editing.

This wiki is intended to be a complete source of information for the qmail toaster I put together several years ago. This particular toaster uses Pawel Foremski’s excellent qmail-spp patch to allow on-the-fly modifications of the qmail server. With this toaster, a server administrator can write small shell scripts to alter the behavior of the server with minimal programming knowledge.

I have spent a considerable amount of time compiling the information that currently exists in the wiki and will continue to add and edit data in the future. Please feel free to take a look at the site and contribute!

Carmack on the PS3 and 360

John Carmack, the 3D game engine guru from id Software and a game developer I hold in very high regard, and Todd Hollenshead, CEO of id Software, were recently interviewed by GameInformer. Carmack recently received a Technology Emmy for his work and innovation on 3D engines, a well deserved award.

I was a bit surprised while reading the interview. Carmack seems to be a pretty big believer in DirectX these days, and thinks highly of the XBox 360. On the flip side, he’s not a fan of the asymmetric CPU of the PS3 and thinks Sony has dropped the ball when it comes to tools. I never realized that Carmack was such a fan of DirectX. He used to tout OpenGL so highly.

Todd and Carmack also talked about episodic gaming. Their general consensus seems to be that episodic gaming just isn’t there yet. It doesn’t make sense because by the time you get the first episode out, you’ve essentially completed all of the development. Shipping episodes at that point doesn’t make sense since you’ve already spent the capital to make the game to begin with.

Episodic games seem like a great idea from the outside, but perhaps they’re right. Traditionally, initial games have sold well, but expansion packs don’t. Episodic gaming may be similar in nature with respect to sales. If the content is right, however, perhaps episodes will work. But then there’s the issue of release timing. If you release a 5-10 hour episode, when is the optimal time to release the next one? You’ll have gamers who play the entire episode on the day it’s released and then get bored waiting for more. And then there are the gamers who take their time and finish the episode in a week or two. If you release too early, you upset the people who don’t want to pay for content constantly, while waiting too long may cause the bored customers to lose interest.

The interview covered a few more areas such as DirectX, Quakecon, and Hollywood. I encourage you to check it out, it makes for good reading!

iPhone… A revolution?

So the cat’s out of the bag. Apple is already in the computer business, the music business, the video/TV business, and now they’re joining the cell phone business. Wow, they’ve come pretty far in the last 7 years alone.

So what is this iPhone thing anyway? Steve says it’s going to revolutionize phones, and that it’s 5 years ahead of the current generation. So does it really stack up? Well, since it’s only a prototype at this point, that’s a little hard to say. The feature set is impressive, as was the demonstration given at Macworld. Most of the reviews I’ve read have been pretty positive too.

So let’s break this down a little bit and see what we have. The most noticeable difference is the complete and total lack of a keypad/keyboard. In fact, there are a grand total of four buttons on this thing, five if you count up/down volume as two. And only one of them is on the actual face of the device. This may seem odd at first, but the beauty here is that any application developed for the iPhone can arbitrarily create its own buttons. How? Why?

Well, the entire face of the phone is one giant touchscreen. In fact, it’s a multi-touch screen, meaning that you can touch multiple points on the screen at the same time for special effects such as zooming in on a picture. This means that developers are not tied to a pre-defined keypad and can create what they need as the application runs. So, for instance, the phone itself has a large keypad for dialing a telephone number. In SMS and email mode, the keypad shrinks slightly and becomes a full keyboard.

As Steve pointed out in his keynote, this is very similar to what happens on a PC today. A PC application can display buttons and controls in any configuration it needs, allowing the user to interact with it through use of a mouse. Now imagine the iPhone taking the place of the PC and your finger taking the place of the mouse. Your finger is a superb pointing device and it’s pretty handy too.

The iPhone runs an embedded version of OS X, allowing it access to a full array of rich applications. It should also give developers access to a familiar API for programming. While no mention of third-party development has been made yet, you can bet that Apple will release some sort of SDK. The full touchscreen capabilities of this device will definitely make for some innovative applications.

It supports WiFi, EDGE, and Bluetooth 2.0 in addition to Quad-Band GSM for telephony. WiFi access is touted as “automatic” and requires no user intervention. While this is likely true in situations where there is no WiFi security in place, the experience when in a secure environment is unknown. More details will likely be released over the coming months.

Cingular is the provider of choice right now. Apple signed an exclusivity contract with Cingular, so you’re tied to their network for the time being. Being a Cingular customer myself, this isn’t such a bad thing. I like Cingular’s network as I’ve had better luck with it than the other networks I’ve been on.

In addition to phone capabilities, the iPhone is a fully functional iPod. It syncs with iTunes as you would expect, has an iPod docking connector, and supports audio and video playback. One of the cooler features is the ability to tip the iPhone on its side to enable landscape mode. The iPhone automatically switches to landscape mode when it detects the change in orientation. Video must be viewed in landscape mode.

So it looks like the iPhone has all of the current smartphone capabilities and then some. But how will it do in the market? The two models announced at Macworld are priced pretty high: $499 for the 4 Gig model, and $599 for the 8 Gig. This makes the iPhone one of the more expensive phones on the market. However, it seems that Apple is betting that a unified device, phone/iPod/camera/Internet, will be worth the premium price. They may be right, but only time will tell.

UPDATE : According to an article in the New York Times, Jobs is looking to restrict third-party applications on the iPhone. From the article :

“These are devices that need to work, and you can’t do that if you load any software on them,” he said. “That doesn’t mean there’s not going to be software to buy that you can load on them coming from us. It doesn’t mean we have to write it all, but it means it has to be more of a controlled environment.”

So it sounds like Apple is interested in third-party apps, but in a controlled manner. This means extra hoops that third-party developers need to jump through. This may also entail additional costs for the official Apple stamp of approval, meaning that smaller developers may be locked out of the system. Given the price point of the phone, I hope Apple realizes the importance of third-party apps and the impact they have. Without additional applications, Apple just has a fancy phone with little or no draw.

SpamAssassin and Bayes

I’ve been messing around with SpamAssassin a lot lately and the topic of database optimization came up. I’m using Bayesian filtering to improve the spam scores and, to increase speed and manageability, I have SpamAssassin set to use MySQL as the database engine. Bayes is fairly resource intensive on both I/O and CPU depending on the current action being performed. Since I decided to use MySQL as the storage engine, most of the I/O is handled there.

I started looking into performance issues with Bayes recently and noticed a few “issues” that I’ve been trying to work out. The biggest issue is performance on the MySQL side. The Bayes database is enormous and it’s taking a while to deal with the queries. So, my initial thought was to look into reducing the size of the database.

There are a few different tables used by Bayes. The main table that grows the largest is the bayes_token table. That’s where all of the core statistical data is stored and it just takes up a lot of room. There’s not a lot that can be done about it. Or so I thought. Apparently if you have SpamAssassin set up to train Bayes automatically, it doesn’t always train the mail for the correct user. For instance, if you receive mail that is BCCed to you, then the mail could be learned for the user listed in the To: field. This means the Bayes database can contain a ton of “junk” in it that you’ll never use. So my first order of business then is to trim out the non-existent users.

The bayes_seen table tracks the message IDs of messages that have already been parsed and learned by Bayes. It’s a useful table for preventing unnecessary CPU utilization, but there is no automatic trimming function, so the table grows indefinitely. The awl table is similar in that it can also grow indefinitely and has no autotrim mechanism. For both of these tables I’ve added a timestamp field to track additions and updates. With that in place, I can write some simple Perl code to automatically trim entries that are old enough to be irrelevant. For the bayes_seen table I plan on using a default lifetime of 1 month. For the awl I’m looking at dropping any entries with a single hit that are over 3 months old, and any entries over 1 month old with fewer than 5 hits. Since MySQL automatically updates the timestamp field on any change to the row, this should be sufficient to keep relevant entries from being deleted.
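The schema change and trim queries above, collected in Python for clarity. The column name last_hit is my own choice, not part of SpamAssassin’s schema; the count column, however, is part of the stock awl table:

```python
# Maintenance SQL for the bayes_seen and awl tables. Assumes a
# TIMESTAMP column (here called last_hit) is added to both; MySQL
# auto-updates the first TIMESTAMP column in a row on any change.
TRIM_SQL = [
    # one-time schema changes
    "ALTER TABLE bayes_seen ADD COLUMN last_hit TIMESTAMP NOT NULL;",
    "ALTER TABLE awl ADD COLUMN last_hit TIMESTAMP NOT NULL;",
    # bayes_seen: drop anything untouched for a month
    "DELETE FROM bayes_seen WHERE last_hit < NOW() - INTERVAL 1 MONTH;",
    # awl: single-hit entries older than 3 months
    "DELETE FROM awl WHERE count = 1"
    " AND last_hit < NOW() - INTERVAL 3 MONTH;",
    # awl: entries with fewer than 5 hits older than 1 month
    "DELETE FROM awl WHERE count < 5"
    " AND last_hit < NOW() - INTERVAL 1 MONTH;",
]

if __name__ == "__main__":
    for stmt in TRIM_SQL:
        print(stmt)
```

The three DELETE statements are the ones worth putting in a nightly cron job; the two ALTERs only run once.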

While researching all of this, I was pointed to a site about MySQL optimization. The MySQL Performance Blog is run by Peter Zaitsev and Vadim Tkachenko, both former MySQL employees. The entry in question dealt with general MySQL optimization and is a great starting point for anyone using MySQL. I hate to admit it, but I was completely unaware that this much performance could be coaxed out of MySQL with a few simple settings. While I was aware that tuning was possible, I had just never dealt with a large enough database to warrant it.

I discovered, through the above blog and further research, that the default settings in MySQL are extremely conservative! By default, most of the memory allocation variables are capped at a mere 8 MB. I guess the general idea is to ship with settings that are almost guaranteed to work and allow the admin to tune the system from there.
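The variables in question live in the [mysqld] section of my.cnf. A sketch of the sort of settings involved; the values are purely illustrative for a machine with memory to spare, not recommendations for any particular workload:

```
# /etc/my.cnf -- [mysqld] excerpt; illustrative values only
[mysqld]
key_buffer_size      = 128M   # MyISAM index cache (default ~8M)
table_cache          = 512
sort_buffer_size     = 4M
read_buffer_size     = 4M
query_cache_type     = 1
query_cache_size     = 32M
thread_cache_size    = 16
tmp_table_size       = 64M
max_heap_table_size  = 64M
```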

I’m still tuning and playing with the parameters, but it looks like I’ve easily increased the speed of this beast by a factor of 5. It’s to the point now where a simple ‘show processlist’ is hardly listing any processes anymore because they’re completing so fast! I’ve been a fan of MySQL for a while now and I’ve been pretty impressed with the performance I’ve seen from it. With these changes and further tuning, I’m sure I’ll be even more impressed.

So today’s blog entry has a lesson to be learned. Research is key when deploying services like this, even if they’re for yourself. Definitely check into performance tuning for your systems. You’ll thank me later.

Linux Upgrades – Installation from a Software RAID

I recently had to upgrade a few machines to a newer version of Linux. Unfortunately, the CD-ROM drives in these machines were not functioning, so I decided to upgrade via a hard drive install. Hard drive installs are pretty simple and are a standard part of the Red Hat distro. However, the machines I had to upgrade were all set up with software RAID 1.

My initial thought was to simply point the installer at the RAID device holding the install media. However, the installation program does not allow this and instead presents a list of acceptable locations.

So from there, I decided to choose one of the two member partitions of the RAID 1 mirror and use that instead. The system graciously accepted this and launched the GUI installer to complete the process. All went smoothly until I reached the final step in the wizard, the actual install. At that point the installer crashed with a Python error. On inspection, it appeared that the drive I was trying to use for the installation media was not available. Closer inspection revealed the truth: mdadm had started up and activated all of the RAID partitions on the system, making the partition I needed unavailable.

So what to do now? I re-ran the installation and deleted the RAID partition, being careful to leave the physical partitions intact. Again, the installer crashed at the same step. It seems the installer scans the entire hard drive for RAID partitions, whether they have valid mount points or not.

I finally solved the problem with a crude hack during the setup phase of the installation. I made sure to delete the RAID partition once again, leaving the physical partitions intact. I stepped through the entire process but stopped just before the final install step. At this point, I switched to the CLI console via Ctrl-Alt-F2 and created a small bash script that looked something like this:

#!/bin/bash
# stop the array; the installer keeps reactivating it, so re-exec
# this script in a loop to keep the partition free
mdadm --stop /dev/md3
sleep 5
exec /bin/bash ./myscript.bash

I ran this script and switched back to the installer via Ctrl-Alt-F6. I proceeded with the installation, and the installer happily installed the OS onto the drives. Once it completed, I switched back to the CLI console, edited the new /etc/fstab to add a mount point for the RAID drive I had used, and rebooted. The system came up without any issues and ran normally.
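The fstab addition is a single line. Mine looked something like this; the device, mount point, and filesystem type here are placeholders for my particular setup:

```
# /etc/fstab -- remount the RAID array the installer left stopped
/dev/md3    /data    ext3    defaults    1 2
```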

Just thought I’d share this with the rest of you, should you run into the same situation. It took me a while to figure out how to make the system do what I wanted. Just be sure that your install media is on a drive that the installer will NOT need to write files to.