I’ve been working a lot with network monitoring lately. While I mostly deal with utilization monitoring, I do dabble with general network health systems as well.
There are several ways to monitor a network and determine the “health” of a given element. The classic example is the ICMP echo request: ping the device, and if it responds, it’s alive and well.
This doesn’t always work out, however. Take, for instance, a server. A successful ping only indicates that the server’s TCP/IP stack is functioning. But what about the processes running on the server? How do you make sure those are running properly?
Other “health”-related factors include utilization, system integrity, and environment. When designing or implementing a network health system, you need to take all of these into account.
I have used several different tools to monitor the health of the networks I’ve dealt with. These range from custom-written tools to off-the-shelf products. Perhaps at some point in the future I can release the custom tools, but for now I’ll focus on the freely available ones.
For general network monitoring I use a tool called Argus. Argus is a fairly robust monitoring system written in Perl. It’s simple to set up, and the config file is largely self-explanatory. Monitoring capabilities include ping (using fping), SNMP, HTTP, and DNS. You can monitor specific ports on a device, allowing you to determine the health of a particular service.
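To illustrate why a port check says more than a ping, here is a minimal Python sketch of a TCP service check. It is not how Argus implements its TCP tests; the function name tcp_service_up and the 2-second default timeout are assumptions made for illustration. A successful three-way handshake means some process is actually listening on that port, which a ping alone cannot tell you.

```python
import socket


def tcp_service_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """Hypothetical helper: True if a TCP connection to host:port succeeds."""
    try:
        # create_connection resolves the host name and completes the TCP
        # handshake; success means a process is listening on that port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, timeouts (socket.timeout is a subclass of
        # OSError) and name-resolution failures all count as "down".
        return False
```

A monitoring loop would call something like tcp_service_up("mail.example.com", 25) each cycle (the host name is a placeholder) and raise an alert when it returns False. A real service test would usually go one step further and check the service’s response, for example an SMTP banner or an HTTP status line, rather than just the connection.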
Argus also has some unique capabilities that I haven’t seen in many other monitoring systems. For instance, you can monitor a web page and detect when specific strings within that page change. This is perfect for tracking software revisions and being alerted to new releases. Other options include database monitoring via the Perl DBI module.
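The web page string-change idea can be sketched in a few lines of Python. This is not Argus’s code, just an illustration under stated assumptions: the helper names (fetch_page, extract_marker, check_for_change) and the example regular expression are hypothetical, and the first capture group of the pattern is treated as the value to watch.

```python
import re
import urllib.request
from typing import Optional, Tuple


def fetch_page(url: str, timeout: float = 10.0) -> str:
    """Hypothetical helper: download a page and decode it to text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def extract_marker(html: str, pattern: str) -> Optional[str]:
    """Return the first capture group of pattern (or the whole match)."""
    m = re.search(pattern, html)
    if m is None:
        return None
    return m.group(1) if m.groups() else m.group(0)


def check_for_change(
    html: str, pattern: str, last_seen: Optional[str]
) -> Tuple[bool, Optional[str]]:
    """Compare the watched string against the value from the previous run.

    The first run (last_seen is None) only records a baseline and never
    reports a change. A marker that disappears counts as a change.
    """
    current = extract_marker(html, pattern)
    changed = last_seen is not None and current != last_seen
    return changed, current
```

On each polling cycle you would pass the output of fetch_page() to check_for_change() together with the value stored from the previous run, save the returned value, and alert when the first element is True. With a pattern such as r"Latest release: v([\d.]+)", a download page moving from v1.2.3 to v1.3.0 would trigger an alert.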
The program can alert you in a number of ways, such as email or paging (using qpage). Additional notification methods are certainly possible with custom code.
The program provides a web interface similar to that of older versions of What’s Up Gold. There is a fairly robust access control system that allows the administrator to lock users into specific sections of the interface with custom lists of available elements.
Elements can be configured with dependencies, allowing alerts to be suppressed for child elements. Each element can also be independently configured with a variety of options to allow or suppress alerts, modify monitoring cycle times, send custom alert messages, and more. Check out the documentation for more information. There’s also an active mailing list to help you out if you have additional questions.
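Dependency-based suppression comes down to a simple rule, sketched below in Python. The Element class and alerts_to_send function are hypothetical, and the rule itself is an assumption about the general concept rather than a description of Argus’s exact behavior: a down element only alerts if none of its ancestors is also down. With that rule, a single router failure produces one alert instead of one per host behind it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Element:
    name: str
    up: bool = True
    parent: Optional[str] = None  # name of the element this one depends on


def alerts_to_send(elements: Dict[str, Element]) -> List[str]:
    """Return the names of down elements whose alerts are NOT suppressed."""

    def ancestor_down(elem: Element) -> bool:
        # Walk up the dependency chain; 'seen' guards against cycles
        # introduced by a misconfigured dependency.
        seen = set()
        parent = elem.parent
        while parent is not None and parent not in seen:
            seen.add(parent)
            p = elements[parent]
            if not p.up:
                return True
            parent = p.parent
        return False

    return [e.name for e in elements.values() if not e.up and not ancestor_down(e)]
```

For example, if a router, the switch behind it, and a web server behind the switch are all down, only the router alerts. Once the router recovers, the switch becomes the topmost failure and alerts in its place, while the web server stays suppressed.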
In future posts I’ll touch on some of the other tools I have in my personal toolkit such as host intrusion detection systems, graphing systems, and more. Stay tuned!