monitoring – Technological Musings

Prometheus

We will entangle buds and flowers and beams
Which twinkle on the fountain’s brim, and make
Strange combinations out of common things
“Prometheus Unbound” by Percy Bysshe Shelley

This post first appeared on Red Hat’s Enable Sysadmin community. You can find the post here.

Welcome to the world of metrics collection and performance monitoring. As with most things IT, entire market sectors have been built to sell these tools. And, of course, there are a number of open source tools that serve the same purpose. It’s one of these open source tools that we’re going to take a look at.

What is Prometheus?

Prometheus is a metrics collection and alerting tool developed and released to open source by SoundCloud. It is similar in design to Google’s Borgmon monitoring system, and a relatively modest system can handle collecting hundreds of thousands of metrics every second. Properly tuned and deployed, a Prometheus cluster can collect millions of metrics every second.

Prometheus is made up of roughly four parts:

- The main Prometheus app itself, which is responsible for scraping metrics, storing them in the database, and (optionally) retrieving them when queried.
  - The database backend is an internal time series database. This database is always used, but data can also be sent to remote storage backends.
- Exporters, which are optional external programs that ingest data from a variety of sources and convert it to metrics that Prometheus can scrape.
  - Exporters are purpose-built applications for working with specific applications and hardware.
- AlertManager, an alert management system. It is part of the Prometheus project but runs as a separate application.
- Client libraries that can be used to instrument custom applications.

I say “roughly” four parts because there are plenty of additional applications that are often used with a standard Prometheus cluster. If you need or want better graphing capabilities, applications like Grafana can be deployed. If you need to store metrics for long periods of time, remote storage backends are worth looking into. And the list goes on. For the purposes of this article, however, we’re going to focus on Prometheus itself with a small detour into exporters.

What is a metric?

But before we get there, we need to understand why something like Prometheus exists. So let’s start with a question. What are metrics? Simply put, metrics measure something. For instance, the time it takes you to read this article is a metric. The number of words is a metric. The average number of letters in the words of this article is a metric.

But those metrics are fairly static and not something you’d necessarily need a system like Prometheus for. Prometheus excels at metrics that change over time. For instance, what if you wanted to know how many “views” this article is getting? Or what if you wanted to know how much traffic was entering and leaving your network? Or how many build and deploy cycles are happening each hour? All of these are metrics that can be fed into Prometheus.

Now that we understand what a metric is, let’s take a look at how Prometheus gets the metrics it needs to store. The first thing Prometheus needs is a target. Targets are the endpoints that supply the metrics that Prometheus stores. These endpoints can be the actual endpoint being monitored or they can be a piece of middleware known as an exporter. Endpoints can be supplied via a static configuration or they can be “found” through a process called service discovery. Service Discovery is a more advanced topic and will be covered in a future article.

Once Prometheus has a list of endpoints, it can begin to retrieve metrics from them. Prometheus retrieves metrics in a very straightforward manner: a simple HTTP request. The configuration points at a specific location on the endpoint that will supply a stream of text identifying each metric and its current value. Prometheus reads this stream of text, ignoring lines beginning with a # as comments, and stores the metrics it receives in a local database.

Figure 1 – Example metrics output (from itNext)
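To make that stream of text a bit more concrete, here’s a minimal Python sketch of reading it. This is not how Prometheus itself parses metrics. The sample payload, the metric names, and the parse_metrics function are all made up for illustration, and the sketch assumes every line is simply a series followed by a value, with no trailing timestamp.

```python
# Hypothetical scrape payload; the metric names and values are invented.
SAMPLE = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="post",code="200"} 3
process_open_fds 12
"""


def parse_metrics(text):
    """Return {series: value}, skipping blank lines and '#' comment lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumption: the value is the last space-separated field on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


if __name__ == "__main__":
    for series, value in parse_metrics(SAMPLE).items():
        print(series, value)
```

Each distinct combination of metric name and labels ends up as its own entry, which is roughly how a single metric name can turn into many stored series.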

A short sidetrack into Exporters

Prometheus can only talk HTTP to endpoints for metrics collection. So what happens when you’re trying to monitor a router or switch that only communicates using SNMP? Or perhaps you want to monitor a cloud service that doesn’t have a native Prometheus metrics endpoint? Fortunately, there’s a solution. Exporters.

Exporters come in many shapes and sizes. These are small, purpose-built programs designed to stand between Prometheus and anything you want to monitor that doesn’t natively support Prometheus. Some exporters sit idle until Prometheus polls them for data. When this happens, the exporter reaches out to the device it’s monitoring, gets the relevant data, and converts it to a format that Prometheus can ingest. Other exporters poll devices automatically, caching the results locally for Prometheus to pick up later.

Regardless of design, exporters act as a translator between Prometheus and endpoints you want to monitor. Chances are, if you’re trying to monitor a common device or application, there’s an exporter out there for it.
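As a rough illustration of the “sit idle until polled” style of exporter, here’s a small Python sketch using only the standard library. Everything in it is an assumption made for the example: the read_device function stands in for whatever SNMP or API call a real exporter would make, the metric names are invented, and port 9999 is arbitrary. Real exporters are usually built on the official client libraries rather than written by hand like this.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def read_device():
    """Hypothetical stand-in for querying the monitored device."""
    return {"device_temperature_celsius": 41.5, "device_fan_rpm": 3200}


def render_metrics(readings):
    """Convert {name: value} readings into one 'name value' line per metric."""
    lines = [f"{name} {value}" for name, value in sorted(readings.items())]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # The device is only queried when a scrape request arrives.
        body = render_metrics(read_device()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=9999):
    """Build the HTTP server; call serve_forever() on the result to run it."""
    return HTTPServer(("127.0.0.1", port), MetricsHandler)
```

Calling make_server().serve_forever() would expose the readings at /metrics on the local machine. A caching exporter would instead poll the device on its own schedule and have the handler return the most recent stored readings.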

Data Storage

Prometheus uses a special type of database on the back end known as a time series database. Simply put, this database is optimized to store and retrieve data organized as values over a period of time. Metrics are an excellent example of the type of data you’d store in such a database.

External storage is also an option. There are many options such as Thanos, Cortex, and VictoriaMetrics that provide a variety of benefits. One of the primary benefits is to centralize the gathered metrics and allow for long term storage. Tools such as Grafana can query these third party storage solutions directly.

So you have a bunch of metrics…

Now that you’re an expert on Prometheus and you have it storing metrics, how do you use this data? Much like a SQL database, Prometheus has a custom query language known as PromQL. PromQL is pretty straightforward for simple metrics but has a lot of complexity when needed. Simply supplying the name of a metric will show all “instances” of that metric:

Figure 2 – Simple PromQL query (from Digital Ocean)

Or you can use some PromQL methods and generate a graph representing the data you’re after.

Figure 3 – Graphing example (from Digital Ocean)

Of course, if you’re serious about graphing, it’s worth looking into a package such as Grafana. Grafana allows you to create dashboards of metrics, send alerts, and more.
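PromQL queries don’t have to be typed into the built-in browser. Prometheus also exposes them over an HTTP API, and the Python sketch below shows roughly what an instant query looks like from the outside. The base URL, the canned response, and the helper names are assumptions for illustration; the response is hand-written to match the general shape of an instant query result rather than captured from a real server.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    params = urlencode({"query": promql})
    return f"{base_url}/api/v1/query?{params}"


# Hand-written example response; instances, jobs, and values are invented.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "instance": "localhost:9090",
                        "job": "prometheus"},
             "value": [1700000000.0, "1"]},
            {"metric": {"__name__": "up", "instance": "localhost:9100",
                        "job": "node"},
             "value": [1700000000.0, "0"]},
        ],
    },
})


def instant_values(response_text):
    """Map each instance label to the value reported for it."""
    payload = json.loads(response_text)
    if payload["status"] != "success":
        raise ValueError("query failed")
    return {
        item["metric"].get("instance", ""): float(item["value"][1])
        for item in payload["data"]["result"]
    }
```

Supplying just the metric name up returns one result per instance, which mirrors the “all instances of that metric” behavior described above.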

Alerting

While graphs are pretty to look at, metrics can serve another, important, purpose. They can be used to send alerts. Prometheus includes a separate application, called AlertManager, that serves this purpose. AlertManager receives notifications from Prometheus and handles all of the necessary logic to dedupe and deliver the alerts.

Alerts are created by writing alert rules. These rules are simply PromQL queries, and an alert fires for every result a query returns. That is, if you have a query that checks whether the CPU temperature is over 80C, an alert fires for each metric that meets that condition.

Alert rules can also include a time period over which a rule must evaluate to true. Expanding on our temperature example, exceeding 80C is ok if it’s only for a brief period of time, but if it lasts more than 5 minutes, an alert is sent. Alerts can be sent via email, Slack, Twitter, SMS, and pretty much anything else you can write an interface for.

Figure 4 – Alerting rules (from Rancher)
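The “must stay true for 5 minutes” idea can be sketched in a few lines of Python. This is only a model of the logic, not Prometheus’s actual rule evaluation. The firing_times function, the 80 degree threshold, and the 300 second hold time are assumptions taken from the temperature example above.

```python
def firing_times(samples, threshold=80.0, hold_seconds=300):
    """Return the timestamps at which the alert would be firing.

    samples is a time-ordered list of (timestamp_seconds, value) pairs.
    The alert fires only once the value has stayed above the threshold
    for at least hold_seconds without interruption.
    """
    firing = []
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # condition just became true
            if timestamp - breach_start >= hold_seconds:
                firing.append(timestamp)
        else:
            breach_start = None  # any dip below the threshold resets the clock
    return firing


if __name__ == "__main__":
    # Four minutes hot, one cool reading, then hot again for six minutes.
    readings = [(t, 85.0) for t in range(0, 300, 60)]
    readings += [(300, 70.0)]
    readings += [(t, 85.0) for t in range(360, 780, 60)]
    print(firing_times(readings))
```

In this run the first hot spell never fires because it is interrupted before five minutes have passed. Only the second spell lasts long enough.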

Wrap Up

Monitoring is important. It helps identify when things have gone wrong and it can show when things are going right. Proper monitoring can be used across a variety of disciplines to squeeze everything you can out of the object being monitored.

Prometheus is a powerful open-source metrics package. It is highly scalable, robust, and extremely fast. A single modern server can be used to monitor a million metrics or more per second. Distributing Prometheus servers allows for many tens and even hundreds of millions of metrics to be monitored every second.

PromQL provides a robust querying language that can be used for graphing as well as alerting. The built-in graphing system is great for quick visualizations but longer term dashboarding should be handled in external applications such as Grafana.

How to leverage SNMP and not compromise the security of your server

This post first appeared on Red Hat’s Enable Sysadmin community. You can find the post here.

Simple Network Management Protocol, or SNMP, has been around since 1988. While initially intended as an interim protocol as the Internet was first being rolled out, it quickly became a de facto standard for monitoring — and in some cases, managing — network equipment. Today, SNMP is used across most networks, small and large, to monitor the very equipment you likely passed through to get to this blog entry.

There are three primary flavors of SNMP: SNMPv1, SNMPv2c, and SNMPv3. SNMPv1 is, by far, the most popular flavor, despite being considered obsolete due to a complete lack of discernible security. This is likely because of the simplicity of SNMPv1 and because it’s generally used inside of the network and not exposed to the outside world.

The problem, however, is that SNMPv1 and SNMPv2c are unencrypted and even the community string used to “authenticate” is sent in the clear. An attacker can simply listen on the wire and grab the community as it passes by. This gives the attacker access to valuable information on your various devices, and even the ability to make changes if write access is enabled.

But wait, you may be thinking, what about SNMPv3? And you’re right, SNMPv3 *can* be more secure by using authentication and encryption. However, not all devices support SNMPv3 and thus interoperability becomes an issue. At some point, you’ll have to drop down to SNMPv2c or SNMPv1 and you’re back to the “in the clear” issue.

Despite the security shortcoming, SNMP can still be used without compromising the security of your server or network. Much of this security will rely on limiting use of SNMP to read-only and using tools such as iptables to limit where incoming SNMP requests can source from.

To keep things simple, we’ll worry about SNMPv1 and SNMPv2c in this article. SNMPv3 requires some additional setup and, in my opinion, isn’t worth the hassle. So let’s get started with setting up SNMP.

First things first, install the net-snmp package. This can be installed via whatever package manager you use. On the Red Hat-based systems I use, that tool is yum.

$ sudo yum install net-snmp

Next, we need to configure the snmp daemon, snmpd. The configuration file is located in /etc/snmp/snmpd.conf. Open this file in your favorite editor (vim FTW!) and modify it accordingly. For example, the following configuration enables SNMP, sets up a few specific MIBs, and enables drive monitoring.

################################################################################
# AGENT BEHAVIOUR

agentaddress udp:0.0.0.0:161

################################################################################
# ACCESS CONTROL

# ------------------------------------------------------------------------------
# Traditional Access Control

# ------------------------------------------------------------------------------
# VACM Configuration
#       sec.name       source        community
com2sec notConfigUser default mysecretcommunity


#       groupName      securityModel securityName
group   notConfigGroup v1            notConfigUser
group   notConfigGroup v2c           notConfigUser

#       name          incl/excl  subtree             mask(optional)
view    systemview included .1.3.6.1.2.1.1
view    systemview included .1.3.6.1.2.1.2.2
view    systemview included .1.3.6.1.2.1.25
view    systemview included .1.3.6.1.4.1.2021
view    systemview included .1.3.6.1.4.1.8072.1.3.2.4.1.2

#       group          context sec.model sec.level prefix read       write notif
access  notConfigGroup ""      any       noauth    exact  systemview none  none

# ------------------------------------------------------------------------------
# Typed-View Configuration

################################################################################
# SYSTEM INFORMATION

# ------------------------------------------------------------------------------
# System Group
sysLocation The Internet
sysContact Internet Janitor
sysServices 72
sysName myserver.example.com

################################################################################
# EXTENDING AGENT FUNCTIONALITY


###############################################################################
## Logging
##

## We do not want annoying "Connection from UDP: " messages in syslog.
## If the following option is set to 'no', snmpd will print each incoming
## connection, which can be useful for debugging.

dontLogTCPWrappersConnects no

################################################################################
# OTHER CONFIGURATION

disk /         10%
disk /var      10%
disk /tmp      10%
disk /home     10%

Next, before you start up snmpd, make sure you configure iptables to allow SNMP traffic from trusted sources. SNMP uses UDP port 161, so all you need is a simple rule to allow traffic to pass. Be sure to add an outbound rule as well; UDP traffic is stateless.

iptables -A INPUT -s <ip addr> -p udp -m udp --dport 161 -j ACCEPT

iptables -A OUTPUT -p udp -m udp --sport 161 -j ACCEPT

You can set this up in firewalld as well, just search for SNMP and firewalld on Google.

Now that SNMP is set up, you can point an SNMP client at your server and pull data. You can pull data via the name of the MIB object (if you have the MIB definitions installed) or via the OID.

$ snmpget -v 2c -c mysecretcommunity myserver.example.com hrSystemUptime.0
HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (6638000) 18:26:20.00

$ snmpget -v 2c -c mysecretcommunity myserver.example.com .1.3.6.1.2.1.25.1.1.0
HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (6638000) 18:26:20.00

And that’s about it. It’s called SIMPLE Network Management Protocol for a reason, after all.
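One detail in the snmpget output above is worth unpacking. Timeticks are counted in hundredths of a second, which is how 6638000 becomes 18:26:20.00. If you end up processing raw values in your own scripts, a small Python sketch like the following does the conversion. The function names and the “N days,” prefix for longer uptimes are my own choices for this example, not anything defined by net-snmp.

```python
def timeticks_to_parts(ticks):
    """Split SNMP Timeticks (hundredths of a second) into time units."""
    seconds, hundredths = divmod(ticks, 100)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    return days, hours, minutes, seconds, hundredths


def format_uptime(ticks):
    """Render Timeticks as H:MM:SS.hh, prefixed with a day count if needed."""
    days, hours, minutes, seconds, hundredths = timeticks_to_parts(ticks)
    clock = f"{hours}:{minutes:02d}:{seconds:02d}.{hundredths:02d}"
    return f"{days} days, {clock}" if days else clock


if __name__ == "__main__":
    print(format_uptime(6638000))
```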

One additional side note about SNMP. While SNMP is pretty solid, the security shortcomings are significant. I recommend looking at other solutions such as agent-based systems versus using SNMP. Tools like Nagios and Prometheus have more secure mechanisms for monitoring systems.

Network Graphing

Visual representations of data can provide additional insight into the inner workings of your network. Merely knowing that one of your main feeds is peaking at 80% utilization isn’t very helpful when you don’t know how long the peak is, at what time, and when it started.

There are a number of graphing solutions available. Some of these are extremely simplistic and don’t do much, while others are overly powerful and provide almost too much. I prefer using Cacti for my graphing needs.

Cacti is a web-based graphing solution built on top of RRDtool. RRDtool is a round-robin data logging and graphing tool developed by Tobias Oetiker of MRTG fame, MRTG being one of the original graphing systems.

Chock full of features, Cacti allows data collection from almost anywhere. It supports SNMP and script-based collection by default, but additional methods can easily be added. Graphs are fully configurable and can display just about any information you want. You can combine multiple sources on a single graph, or create multiple graphs for better resolution. Devices, once added, can be arranged into a variety of hierarchies allowing multiple views for various users. Security features allow the administrator to tailor the data shown to each user.

Cacti is a wonderful tool to have and is invaluable when it comes to tracking down problems with the network. The ability to graph anything that spits out data makes it incredibly useful. For instance, you can create graphs to show you the temperature of equipment, utilization of CPUs, even the number of emails being sent per minute! The possibilities are seemingly endless.

There is a slight learning curve, however. Initial setup is pretty simple, and adding devices is straightforward. The tough part is understanding how Cacti gathers data and relates it all together. There are some really good tutorials on their documentation site that can help you through this part.

Overall, I think Cacti is one of the best graphing tools out there. The graphs come out very professional looking, and the feature set is amazing. Definitely worth looking into.

Host Intrusion Detection

Monitoring your network includes trying to keep the bad guys out. Unfortunately, unless you disconnect your computer and keep it in a locked vault, there’s no real way to ensure that your system is 100% hack proof. So, in addition to securing your network, you need to monitor for intrusions as well. It’s better to be able to catch an intruder early rather than find out after they’ve done a huge amount of damage.

Intrusion detection systems (IDS) are designed to detect possible intrusion attempts. There are a number of different IDS types, but this post concentrates on the Host Intrusion Detection System (HIDS).

My HIDS of choice is Osiris. Osiris uses a client/server architecture, making it one of the more unique HIDS out there. The server stores all of the configurations and databases, and triggers the scanning process. SSL is used between the client and server to ensure communication integrity.

Once a new client is added, the server performs an initial scan. A configuration file is pushed to the client which then scans the computer accordingly, reporting the results back to the server. This first scan is then used as a baseline database for future comparisons.

The server periodically polls the clients and requests scans. The results of those scans are compared to the baseline database and an alert is sent if there are differences. An administrator can then determine if the changes were authorized and take appropriate action. If the changes are ok, Osiris is updated to use the new results as the baseline database. If the changes are suspect, the administrator can look further into them.

Osiris is very configurable. Scanning intervals can be set, allowing you fine-grained control over the time between scans. Multiple administrators can be set up to monitor and accept changes. Emails can be sent for each and every scan, regardless of changes.

The configuration file allows you to pick and choose what files on the client system are to be monitored. Fine-grained control over this allows the administrator to specify whole directories, or individual files. A filtering system can prevent erroneous results from being sent. For instance, some backup systems change the ctime to reflect when the file was last backed up. Without a filter, Osiris would report changes to all of the files each time a backup is run. Setting up a simple filter to ignore ctime on a file allows the administrator to ignore the backup process.
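The baseline comparison and the ctime filter are easier to picture with a small example. The Python sketch below is not Osiris code and doesn’t use its configuration format. The diff_scans function, the attribute names, and the sample paths are all hypothetical, chosen only to show how ignoring one attribute quiets the noise from a backup run.

```python
def diff_scans(baseline, current, ignore=()):
    """Compare two scans and return {path: [changed attributes]}.

    Each scan maps a file path to a dict of attributes (hash, ctime, ...).
    Attributes listed in ignore are skipped, like a filter rule would do.
    """
    changes = {}
    for path in sorted(set(baseline) | set(current)):
        old, new = baseline.get(path), current.get(path)
        if old is None:
            changes[path] = ["added"]
        elif new is None:
            changes[path] = ["removed"]
        else:
            changed = sorted(
                key for key in set(old) | set(new)
                if key not in ignore and old.get(key) != new.get(key)
            )
            if changed:
                changes[path] = changed
    return changes


if __name__ == "__main__":
    baseline = {
        "/etc/passwd": {"sha256": "aa", "ctime": 1},
        "/bin/ls": {"sha256": "bb", "ctime": 1},
    }
    # After a backup run every ctime has moved, but only /bin/ls has
    # different contents.
    current = {
        "/etc/passwd": {"sha256": "aa", "ctime": 2},
        "/bin/ls": {"sha256": "cc", "ctime": 2},
    }
    print(diff_scans(baseline, current))
    print(diff_scans(baseline, current, ignore=("ctime",)))
```

Without the filter both files are reported. With ctime ignored, only the file whose contents actually changed shows up.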

Overall, Osiris is a great tool for monitoring your server. Be prepared, though: monitoring a HIDS can get cumbersome, especially with a large number of servers. Every update, change, or new program installed can trigger a HIDS alert.

There are other HIDS packages as well. I have not tested most of these, but they are included for completeness:

OSSEC
OSSEC is an actively maintained HIDS that supports log analysis, integrity checking, rootkit detection, and more.

AFICK
AFICK is another actively maintained HIDS that offers both CLI and GUI based operation.

Samhain
Samhain is one of the more popular HIDS that offers a centralized monitoring system similar to that of Osiris.

Tripwire
Tripwire is a commercial HIDS that allows monitoring of configurations, files, databases and more. Tripwire is quite sophisticated and is mostly intended for large enterprises.

Aide
Aide is an open-source HIDS that models itself after Tripwire.

Network Monitoring

I’ve been working a lot with network monitoring lately. While mostly dealing with utilization monitoring, I do dabble with general network health systems as well.

There are several ways to monitor a network and determine the “health” of a given element. The simple, classic example is the ICMP echo request. Simply ping the device and if it responds, it’s alive and well.

This doesn’t always work out, however. Take, for instance, a server. Pinging the server simply indicates that the TCP/IP stack on the server is functioning properly. But what about the processes running on the server? How do you make sure those are running properly?

Other “health” related items are utilization, system integrity, and environment. When designing and/or implementing a network health system, you need to take all of these items into account.

I have used several different tools to monitor the health of the networks I’ve dealt with. These tools range from custom written tools to off-the-shelf products. Perhaps at some point in the future I can release the custom tools, but for now I’ll focus on the freely available tools.

For general network monitoring I use a tool called Argus. Argus is a robust monitoring system written in Perl. It’s simple to set up and the config file is largely self-explanatory. Monitoring capabilities include ping (using fping), SNMP, HTTP, and DNS. You can monitor specific ports on a device, allowing you to determine the health of a particular service.
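To show what checking a specific port means in practice, here’s a minimal Python sketch of a TCP port check. It isn’t how Argus implements its tests. The port_is_open function and the two second timeout are assumptions for the example, and the host and port in the usage line are placeholders.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder target; substitute a host and service you actually run.
    print(port_is_open("127.0.0.1", 22))
```

A successful connection only proves that something is listening on the port. Checking that the service answers sensibly takes a protocol-aware test, such as an HTTP request or a DNS lookup.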

Argus also has some unique capabilities that I haven’t seen in many other monitoring systems. For instance, you can monitor a web page and detect when specific strings within that webpage change. This is perfect for monitoring software revisions and being alerted to new releases. Other options include monitoring of databases via the Perl DBI module.

The program can alert you in a number of different manners such as email or paging (using qpage). Additional notification methods are certainly possible with custom code.

The program provides a web interface similar to that of older versions of What’s Up Gold. There is a fairly robust access control system that allows the administrator to lock users into specific sections of the interface with custom lists of available elements.

Elements can be configured with dependencies, allowing alerts to be suppressed for child elements. Each element can also be independently configured with a variety of options to allow or suppress alerts, modify monitoring cycle times, send custom alert messages, and more. Check out the documentation for more information. There’s also an active mailing list to help you out if you have additional questions.
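Dependency-based suppression is simple to model. The Python sketch below is a generic illustration, not Argus’s implementation or configuration syntax. The alerts_to_send function and the element names are hypothetical, and the sketch assumes the dependency map contains no loops.

```python
def alerts_to_send(down, parent_of):
    """Return the down elements that should actually generate an alert.

    down is a set of element names currently failing their checks.
    parent_of maps each element to the element it depends on.
    An element is suppressed if anything above it in the chain is down.
    """
    alerts = []
    for element in sorted(down):
        ancestor = parent_of.get(element)
        suppressed = False
        while ancestor is not None:
            if ancestor in down:
                suppressed = True
                break
            ancestor = parent_of.get(ancestor)
        if not suppressed:
            alerts.append(element)
    return alerts


if __name__ == "__main__":
    # Hypothetical topology: two servers behind a switch behind a router.
    parent_of = {"switch1": "router1", "web1": "switch1", "db1": "switch1"}
    print(alerts_to_send({"switch1", "web1", "db1"}, parent_of))
    print(alerts_to_send({"web1"}, parent_of))
```

When the switch fails, only the switch alerts, because the two servers behind it can’t be reached anyway. When a single server fails on its own, it alerts normally.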

In future posts I’ll touch on some of the other tools I have in my personal toolkit such as host intrusion detection systems, graphing systems, and more. Stay tuned!