Bad crawler, no cookie!

My wife is a professional SEO consultant with her own business. I work with her on occasion, helping out with the server end of things. It’s fun and challenging, and I think we work pretty well together.

So, the other day she comes to me with an odd question. Why is Google Analytics suddenly showing a high bounce rate for new keywords? Interesting problem, of course. One of the first things that popped into my mind was either a blackhat SEO or a rival of some sort. It sounds paranoid, but it does happen.

So I pulled the access logs and started poring over them. Since the bounce rate came from a keyword search, it was easy enough to locate the offending entries. There were hundreds of log entries, all coming from the same 65.55.0.0/16 address space. A couple more seconds of digging showed that 65.55.0.0/16 was owned by Microsoft. Reverse DNS on some of the IPs revealed that they were part of the MSN web crawler. MSN apparently doesn’t provide reverse DNS for all of their IPs. No matter, there were enough to prove that this was MSN. Here’s an example from the log:

65.55.110.195 - - [24/Mar/2009:03:08:05 -0400] "GET /index.html HTTP/1.0" 200 58838 "http://search.live.com/results.aspx?q=keyword" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)"
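For anyone who wants to do the same kind of digging, here’s a rough Python sketch of how entries like this could be pulled out of a log. It assumes the log uses Apache’s "combined" format (which is what the entry above looks like), and the function names and the SAMPLE line are just illustrative, not the exact commands I ran:

```python
import ipaddress
import re
from urllib.parse import urlparse

# Assumptions taken from the log entry above: the address range and the
# referrer host that the suspicious requests have in common.
MSN_NET = ipaddress.ip_network("65.55.0.0/16")
REFERRER_HOST = "search.live.com"

# Assumed Apache "combined" log format:
# ip ident user [time] "request" status bytes "referrer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it doesn't match."""
    match = LOG_RE.match(line.strip())
    return match.groupdict() if match else None


def is_suspect(entry):
    """True if the request came from the MSN range with a live.com search referrer."""
    try:
        ip = ipaddress.ip_address(entry["ip"])
    except ValueError:
        return False  # logged a hostname instead of an IP, for example
    return ip in MSN_NET and urlparse(entry["referrer"]).hostname == REFERRER_HOST


def suspect_entries(lines):
    """Yield the parsed entries that match both the IP range and the referrer."""
    for line in lines:
        entry = parse_line(line)
        if entry and is_suspect(entry):
            yield entry


# Illustrative input: the same entry shown above, with plain ASCII quotes.
SAMPLE = (
    '65.55.110.195 - - [24/Mar/2009:03:08:05 -0400] "GET /index.html HTTP/1.0" '
    '200 58838 "http://search.live.com/results.aspx?q=keyword" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)"'
)

if __name__ == "__main__":
    for hit in suspect_entries([SAMPLE]):
        print(hit["ip"], hit["referrer"])
```

Feeding it an open log file instead of the one-line list works the same way, since a file object yields lines when you loop over it.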

So what in the world is going on here? Why are we getting pounded by hundreds upon hundreds of requests from the MSN crawler? And why is the MSN crawler reporting itself as Internet Explorer 6.0? The referrer URL showed the source of the request to be from a live.com search, but these being crawler addresses, I’m willing to bet this was programmed in rather than a result of an actual search. It doesn’t really matter, though, because whatever it is, it’s causing a high bounce rate and really screwing up the site statistics. The high bounce rate may be affecting the Google ranking as well.

Before we blocked these requests, though, we wanted to make sure this was unwanted behavior, so we started digging for info. One of the pages we came across described the same behavior we were seeing. As it turns out, this strange activity is intended. Live.com claims they do this to detect cloaking. Of course, it was quite easy to identify these IPs as belonging to Microsoft, and to determine (rather quickly) that the requests were coming from a search engine’s crawler. It would be very simple for anyone doing cloaking to extend it to those IPs, making this crazy technique useless.

Microsoft claims they are continuing to tune their crawler to reduce the spam and make the keywords more relevant. The point is, though, that this seems to hurt more than it helps. As a result, many webmasters are blocking the referrer spam, at the risk of having MSN blacklist their sites. We have followed suit, deeming both MSN and Live.com to be irrelevant search engines.
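To give an idea of what such a block can look like, here’s a minimal sketch written as a Python WSGI middleware. This is not how our server is actually set up; it’s just one way to express the rule, and the names (should_block, BlockReferrerSpam) are made up for the example. It only refuses requests that match both the 65.55.0.0/16 range and a search.live.com referrer, so ordinary crawler requests without that referrer would still get through:

```python
import ipaddress
from urllib.parse import urlparse

# Assumptions: the range and referrer host seen in the log entries above.
BLOCKED_NET = ipaddress.ip_network("65.55.0.0/16")
BLOCKED_REFERRER_HOST = "search.live.com"


def should_block(remote_addr, referrer):
    """True only when both the client IP and the referrer host match."""
    try:
        ip = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # missing or malformed address: let it through
    host = urlparse(referrer or "").hostname
    return ip in BLOCKED_NET and host == BLOCKED_REFERRER_HOST


class BlockReferrerSpam:
    """Hypothetical WSGI middleware that answers matching requests with a 403."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        remote_addr = environ.get("REMOTE_ADDR", "")
        referrer = environ.get("HTTP_REFERER")  # header name is spelled this way
        if should_block(remote_addr, referrer):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return self.app(environ, start_response)
```

Wrapping an existing WSGI application is then a one-liner, something like app = BlockReferrerSpam(app). If the site sits behind a proxy, REMOTE_ADDR will be the proxy’s address, so the check would need adjusting.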

Of course, if someone out there has a better idea of how to handle this, I’m listening…