Troubleshooting 101

There seems to be a severe lack of understanding and technique when it comes to troubleshooting these days. From what I've seen, a great deal of troubleshooting effort is wasted on wild ideas and theories while the simplest and most direct solutions are ignored.

Occam’s Razor states: “entities should not be multiplied beyond necessity.” Simply put, the simplest explanation is usually the best one to check first. This is the perfect mindset for anyone who does troubleshooting. There is no need to delve right into the most obscure reasons for a failure; start with the simple stuff.

For instance, questions like “Is the unit plugged in?” or “Is the power on?” are the perfect place to start. While it would be wonderful to believe that everyone you encounter has the common sense to check these simple things themselves, you’ll find that, unfortunately, the majority of the population isn’t that bright.

So, how about a real-world example? It’s 2am and you get paged that a router has gone unreachable. After notifying the proper people, you delve into the problem. Using the Occam’s Razor principle, what’s the first thing you should check? Well, for starters, let’s make sure the router really is unreachable. A simple ping should accomplish that. And just for good measure, ping something close to that router to make sure you’re actually connected to the network yourself.
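As a quick sketch, here's what that first check might look like from a nearby Cisco router or your management station. The addresses are purely hypothetical: 10.1.1.1 stands in for the troubled router and 10.1.1.2 for a device near it.

    ! Is the router really unreachable?
    ping 10.1.1.1

    ! And am I actually on the network? Ping something close to it.
    ping 10.1.1.2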

Ok, so the router isn’t pingable; now what? Well, let’s look at the next easiest thing to rule out: power. Since the router is in a remote location, power isn’t easy to check directly. However, you can check the uplink to the router. You should be able to get to the router just upstream of the one that’s unreachable. Once there, check the interface that feeds your troubled router. Is it up or down? While you’re there, note the traffic and error counters as well, but don’t focus on them yet; store them for later.
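On a Cisco upstream router, a single show command covers the interface state plus the traffic and error counters you'll want to stash away. Serial0/0 here is just a placeholder for whatever interface feeds the troubled router.

    show interfaces Serial0/0
    ! The first line gives the state: "Serial0/0 is up, line protocol is up"
    ! Also note the 5 minute input/output rates (bits/sec and packets/sec)
    ! and the input errors, CRC, and drop counters for later.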

If the interface is down, then it’s quite possibly a physical line issue or, possibly, power. Just for good measure, I would suggest bouncing the interface to see if it’s something temporary. Often, the interface will come back up and immediately start taking errors, which points to a physical line issue; it may even pass limited traffic until the error threshold is crossed and the line drops again. At this point, I’d call a technician to look at the physical line itself.
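On a Cisco router, bouncing the interface is just a shutdown followed by a no shutdown. Again, Serial0/0 is only an example name.

    configure terminal
     interface Serial0/0
      shutdown
      no shutdown
     end
    ! Then watch the counters to see if errors start climbing:
    show interfaces Serial0/0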

If the interface is up, try pinging the troubled router from the directly connected router. This can help identify a routing issue in the network: a directly connected route is the most specific route available and will be used unless it has been specifically overridden, which isn’t likely. If the ping is successful, take note of the ping time. If it seems overly high, you may be looking at a traffic issue. Depending on the type of router, the traffic may be processor switched, causing high CPU usage; this can be identified by a sluggish interface and high ping times. Note, though, that high ping times don’t always indicate this. Most routers give ICMP traffic destined for the CPU a very low priority, deeming throughput more important.
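From that directly connected router, the check is the same ping, plus a quick look at the CPU if the response times seem inflated. The target address is, again, hypothetical.

    ping 10.1.1.1
    ! If ping times look high, see whether the router itself is busy,
    ! for example process switching a flood of traffic:
    show processes cpu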

Remember the traffic and error counters you noted earlier? They come into play now. If the traffic on the interface is very high, notably higher than usual, then this is likely the cause of the problem. Or, rather, an effect of the actual cause, which may be a DoS attack or a virus outbreak. DoS, or Denial of Service, attacks are targeted attacks against a specific IP or range of IPs. A side effect of these attacks is that the interfaces between the attacker and the victim are often overloaded.

There are a number of different DoS attacks out there, but when traffic is what's causing the outage, you’ll often notice that small packets are being used. One way to quickly identify this is to take the current bps on the interface, divide it by the packets per second, and then divide by 8 to get bytes per packet. Generally speaking, the average packet size on a normal interface falls between 1000 and 1500 bytes. NOTE: this refers to traffic received from a remote source such as a web site. Outgoing traffic, toward the web site, has a much lower average packet size because those packets generally carry control information such as acknowledgements, ICMP, etc.
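As a rough worked example (the numbers are made up for illustration), suppose the interface shows a 5 minute output rate of 6,000,000 bits/sec and 12,000 packets/sec:

    6,000,000 bps / 12,000 pps = 500 bits per packet
    500 bits / 8 = 62.5 bytes per packet

An average of roughly 60 bytes per packet is well below the normal 1000-1500 byte range and is a strong hint that a small-packet flood is in progress.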

Once you’ve identified that there is a traffic issue, the next step is to identify where the traffic is coming from, or where it's destined. Remember, the end goal here is to repair the problem so that normal operations can continue. Since you’re already aware of the overloaded interface, it’s easiest to concentrate your efforts there. Identifying the traffic source and destination is usually pretty easy, provided it’s not a distributed attack. On a Cisco router, you can try the “ip accounting” command. This feature records the source and destination of the output packets on an interface, along with a count of the packets and the bytes they used. Simply look for rapidly increasing source and destination pairs and you’ll likely find your culprit.
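Roughly, on a Cisco router, the feature is turned on per interface and then read and cleared from exec mode. As before, the interface name is just an example.

    configure terminal
     interface Serial0/0
      ip accounting output-packets
     end

    ! Watch the source/destination pairs accumulate:
    show ip accounting
    ! Clear the table between samples to spot the fast movers:
    clear ip accounting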

Another option is to use an access list. If the router can handle it, place an access list on the interface that passes all traffic, but logs each packet. Then you can watch the log and try to identify large sources of traffic. Refine the access list to block that traffic until you’ve halted the attack. Be careful, however: many routers will processor switch the traffic once an access list is applied, which can cause a spike in CPU usage and sometimes a loss of connectivity to the router. If IP accounting is available, use that instead.
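A minimal sketch of such a logging access list, assuming extended list number 199 and the same placeholder interface:

    configure terminal
     access-list 199 permit ip any any log
     interface Serial0/0
      ip access-group 199 in
     end

    ! Check the match counters as well as the log:
    show access-lists 199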

Once you identify the source and/or target of the attack, craft an appropriate access list to block the traffic as far upstream as you can. If the DoS attack is distributed, then the most effective way to stop it is probably to remove the targeted routes from the routing table so the traffic is dropped at the edges. This will likely result in an outage for that specific customer, but with a distributed attack, that’s often the only option. From there you can work with your upstream providers to track down the perpetrator of the attack and take the source offline permanently.
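For example, if the attack turned out to be aimed at a single host, say 192.0.2.10 (a documentation address standing in for the real target), blocking it upstream or simply null routing it might look something like this on a Cisco router:

    configure terminal
     ! Drop the attack traffic at the upstream interface:
     access-list 198 deny ip any host 192.0.2.10
     access-list 198 permit ip any any
     interface Serial0/0
      ip access-group 198 in
     exit
     ! Or, for a distributed attack, stop routing the target entirely
     ! (the victim goes dark, but the rest of the network recovers):
     ip route 192.0.2.10 255.255.255.255 Null0
    end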

The preceding seems a bit long when written down, but in reality this is a 15-30 minute process, and experienced troubleshooters can identify and resolve these problems even quicker. The point, of course, is to rule out the most likely causes as quickly as possible. Oftentimes, the simplest solution is the correct one. Take the extra few seconds to check the obvious before moving on to the more advanced theories. You’ll usually resolve the problem quicker, and sometimes wind up with a funny story as a bonus!

Please, troubleshoot responsibly.