The Hard Drive Shuffle

I recently decided to pick up some new hard drives to replace the ones in my server that have been dutifully running for far too many years (somewhere in the 5-10 years range) without any failures. Because the process of swapping things out was a tad more involved that simply replacing the drives, I thought I’d write it up.

My server has nine drives in it across several SATA controllers. One of the drives is an SSD where I put the OS, two drives are mirrored on a slower SATA controller, and the remaining six are in a RAID6 array. The RAID6 array is the target of this upgrade. I picked up some shiny new Seagate Exos X16 16TB drives to replace them.

Before I could start replacing the drives, I needed to figure out which slots corresponded to which drives. Unfortunately, there are no lights I can blink on the case to figure this out, so a quick shutdown of the system so I can pull all the drives and write down the serial number and what slot they were in. Once complete, I was able to use udevadm to figure out which device corresponded to which serial number.

[root@server root]# udevadm info --query=all --name=/dev/sda | grep ID_SERIAL

The RAID array is set up using Linux Software Raid, so mdadm is the tool I’ll be using to fail and replace the drives, one by one. After some reading, the process is as follows :

First, fail and then remove the device to be replaced from the array :

[root@server root]# mdadm --manage /dev/md0 --fail /dev/sda1
[root@server root]# mdadm --manage /dev/md0 --remove /dev/sda1

The drive can now be replaced with the new drive. I took an additional step to print out labels for each drive with the serial number on them and labelled the slots. Once the drive has been re-added, you can watch /var/log/messages to see when the new drive is detected by the OS. It can take a minute or two for this to happen.

May 5 12:06:00 server kernel: ata11: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May 5 12:06:00 server kernel: ata11.00: ATA-11: ST16000NM001G-2KK103, SNA3, max UDMA/133
May 5 12:06:00 server kernel: ata11.00: 31251759104 sectors, multi 16: LBA48 NCQ (depth 32), AA
May 5 12:06:00 server kernel: ata11.00: Features: NCQ-sndrcv
May 5 12:06:00 server kernel: ata11.00: configured for UDMA/133
May 5 12:06:00 server kernel: scsi 10:0:0:0: Direct-Access ATA ST16000NM001G-2K SNA3 PQ: 0 ANSI: 5
May 5 12:06:00 server kernel: sd 10:0:0:0: [sda] 31251759104 512-byte logical blocks: (16.0 TB/14.6 TiB)
May 5 12:06:00 server kernel: sd 10:0:0:0: [sda] 4096-byte physical blocks
May 5 12:06:00 server kernel: sd 10:0:0:0: [sda] Write Protect is off
May 5 12:06:00 server kernel: sd 10:0:0:0: Attached scsi generic sg0 type 0
May 5 12:06:00 server kernel: sd 10:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
May 5 12:06:00 server kernel: sd 10:0:0:0: [sda] Preferred minimum I/O size 4096 bytes
May 5 12:06:00 server kernel: sd 10:0:0:0: [sda] Attached SCSI disk

Now that the drive has been detected, you can partition it and then re-add it to the array. These are pretty huge disks, so I used gdisk to partition them. The process is quite easy. Run gdisk on the drive you want to partition, choose “o” to write a new GPT, choose “n” to create a new partition. I simply hit enter to choose the partition number, first, and last sector. Then enter “fd00” to set the partition to Linux RAID. Finally, you can print out what you’ve done with “p” or jump right to entering “x” to write the partition table and exit.

[root@server root]# gdisk /dev/sda
GPT fdisk (gdisk) version 1.0.7

Partition table scan:
  MBR: not present
  BSD: not present
  APM: not present
  GPT: not present

Creating new GPT entries in memory.

Command (? for help): o
This option deletes all partitions and creates a new protective MBR.
Proceed? (Y/N): y

Command (? for help): n
Partition number (1-128, default 1):
First sector (34-31251759070, default = 2048) or {+-}size{KMGTP}:
Last sector (2048-31251759070, default = 31251759070) or {+-}size{KMGTP}:
Current type is 8300 (Linux filesystem)
Hex code or GUID (L to show codes, Enter = 8300): fd00
Changed type of partition to 'Linux RAID'

Command (? for help): p
Disk /dev/sdf: 31251759104 sectors, 14.6 TiB
Model: ST16000NM001G-2K
Sector size (logical/physical): 512/4096 bytes
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 31251759070
Partitions will be aligned on 2048-sector boundaries
Total free space is 2014 sectors (1007.0 KiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048     31251759070   14.6 TiB    FD00  Linux RAID

Command (? for help): w

Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING

Do you want to proceed? (Y/N): y
OK; writing new GUID partition table (GPT) to /dev/sda.
mdThe operation has completed successfully.

Now that the drive is paritioned, you can add it into the array.

[root@server root]# mdadm --manage /dev/md0 --add /dev/sda1

And now you have to wait as the array resyncs. With a RAID6 array, you can technically replace two drives at the same time, but you increase your risk because if an additional drive fails, you can kiss goodbye to the data. In fact, when I started this process on my server, the first drive I pulled caused some sort of restart on the SATA controller and I ended up with two failed drives instead of just the one. Since the array already saw the drives as failed, I just went ahead and replaced both at the same time. I have backups, though, so I wouldn’t recommend this to everyone.

You can see the current state of the resync by watching the /proc filesystem. Specifically, the /proc/mdstat file.

[root@server root]# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid6 sdf1[11] sdg1[10] sdh1[9] sdi1[8] sdb1[7] sda1[6]
      7813525504 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/5] [UU_UUU]
      [>....................]  recovery =  0.0% (148008/1953381376) finish=659.8min speed=49336K/sec
      bitmap: 3/15 pages [12KB], 65536KB chunk

unused devices: <none>

It can take a while. Each drive took about 12-14 hours to resync.

Once you’ve replace the last drive and the array has resynced (YAY!), you have a few more steps to follow. First, we need to tell the system that the array size has increased. Again, a simple process.

[root@server root]# mdadm --grow /dev/md0 --size=max

And once again, you’re resyncing, but your array has now grown to it’s new size.

I wasn’t quite complete, however. On my system, I have an LVM on top of the RAID array, so I needed to tell LVM that the array was larger so I could then allocate that space where needed. This was easily accomplished with one simple command.

[root@server root]# pvresize /dev/md0
  Physical volume "/dev/md0" changed
  1 physical volume(s) resized or updated / 0 physical volume(s) not resized

And poof, the PV was resized as was the associated VG. Lots more space available to allocate where necessary.

Welcome to the Fediverse

Ok, I’ve done it. I’ve installed the ActivityPub plugin and put this blog on the Fediverse. Why? Well, it’s new, it’s interesting, and I really want to see the Fediverse grow.

What is the Fediverse, you ask? Well, in a nutshell, it’s a network comprised of various types of social networks that use a common protocol to talk between them. That protocol is ActivityPub.

There are tons of different types of social networks you can use. Mastodon is the primary one folks know about. Mastodon is a social network similar to Twitter, Bluesky, etc. But there are others.

PeerTube is a Youtube style network for hosting videos.

PixelFed is an image sharing network. Think Imgur or Instagram.

Lemmy is a conversation platform similar to Reddit.

And there are a TON more! Each of these is an app you can run on your own, thus owning your own content and making whatever choices you want with respect to the content. And if you’re not interested in running your own instance, there are plenty of instances out there that you can join.

Right now I’m just experimenting to see how this works with a blog on the Fediverse, but I’m hoping it works out and I can get back to blogging on a regular basis.

Risks, Threats, and Vaccines

One of the most common tasks you’ll perform in life is a risk analysis. You may not realize you’re doing this, but it’s happening nonetheless. Do I drink the odd smelling milk I found in the fridge, or do I throw it out? Do I leave my coat at home today and hope that it doesn’t get too cold? Do I exceed the speed limit because I’m late for work, or do I risk my boss being upset with me? All day, every day, risk analysis is a constant.

If you’re reading this in the now time, you’re likely aware of the ongoing SARS-CoV-2 global pandemic. If you’re in the tomorrow time, I trust things have worked out and we’ve finally been able to handle the situation. Regardless, risk analysis is being performed on the world stage by leaders, medical professionals, and average folk, as it pertains to the virus that has affected our lives. Should I go out today? Should I wear a mask? Should I get vaccinated?

Risk analysis is part of a wider field known as Risk Management. According to Wikipedia, “Risk management is the identification, evaluation, and prioritization of risks followed by coordinated and economical application of resources to minimize, monitor, and control the probability or impact of unfortunate events or to maximize the realization of opportunities.” Or, to put it more simply, identifying the risks and dealing with them accordingly.

When it comes to something like the current pandemic, risk management can literally mean a choice between life and death. If you choose to spend time with someone infected by the virus, you are significantly increasing your risk of catching the virus. And it follows that catching the virus significantly increases your risk of your body developing the disease caused by the virus which can then lead, potentially, to death. Not everyone, but often enough that everyone should be thinking seriously about what it means to contract this disease.

Analyzing the potential outcomes at the extreme end of the chain allows us to work backwards and potentially change behaviors that can lead to undesired outcomes. For instance, if you’re interested in prolonging your life, avoiding situations that can lead to being exposed to the virus is desired. You can simply lock yourself in a basement and wait patiently until the virus has either died out or a perfect cure has been developed.

When making decisions based on risk analysis, those decisions often lead to other potential risks. Hiding in your basement may keep you safe from the virus, but now you have a risk of running out of food. You can reduce the starvation risk by obtaining food, but that increases the risk of being exposed to the virus. And on and on it goes. Decisions made as a result of risk analysis are often a balancing act. Sometimes you have to make a decision to increase a risk to reduce another.

Wearing masks, washing your hands, social distancing, and getting a vaccine are all things that reduce the risk of contracting the virus. None of these is a perfect solution and none are guaranteed to prevent you from being infected, but together they can provide an extremely high level of safety.

Vaccines, in particular, are often misunderstood. There are a wide variety of reasons that people don’t want to get vaccinated. Some people believe that vaccines aren’t safe and can cause more problems than they solve. Some believe that vaccines don’t work. Some have religious or political opposition to them.

The fact is, however, that vaccines do work and the evidence is out there. Vaccines are why diseases such as smallpox and polio aren’t around anymore. As with most things, there are also incidents where vaccines cause side effects. Some side effects are fairly mild such as soreness, mild fever, and fatigue. More severe side effects such as shortness of breath, rash, and elevated heartbeat are possible, but extremely rare, occurring for approximately 1 in a million patients. Vaccines were also thought to be linked to Autism, but this has been thoroughly debunked and the doctor responsible for the paper has had his license revoked by England’s General Medical Council, the organization responsible for licensing doctors in the UK.

Much of the confusion about vaccines, though, seem to be in how they work. For instance, an excuse for not getting a flu vaccine that I’ve often heard is that despite getting the vaccine, the recipient contracted the flu anyway. This is definitely possible, but likely not what happened.

Vaccines work by providing the immune system with a template of what to protect against. In the case of the flu vaccine, an inactivated virus is injected into the patient and the immune system builds the necessary defense to the virus. This process often includes an inflammatory response which can manifest as common symptoms of the virus you’ve been inoculated against. This can be unpleasant, but is generally much milder than being infected with the virus itself.

Another possibility for our flu vaccine recipient is that they did, in fact, contract the flu, but a different strain of the flu than the vaccine was designed to protect against. The flu virus mutates from year to year and vaccines are developed to protect against the strains that are expected to be prevalent during flu season. Because the vaccine takes 5-6 months to manufacture, vaccine manufacturers have to guess which strains they’ll need to protect against. It is, of course, an educated guess based on history and sampling done throughout the year. Historically, the flu vaccine has been quite effective.

And finally, it’s also possible the recipient was infected by one of the strains that the vaccine was supposed to protect against. It is a common misconception that receiving a vaccine is a guarantee against contracting the virus it was created for. Flu vaccines generally have an efficacy of 40-60%. That is, if you receive the vaccine, you have a 40-60% chance of not contracting the virus. So you may ask, if the vaccine isn’t guaranteed to protect you, why get one? Well, to put it simply, if you are exposed to an infectious dose of the virus and you have received the vaccine, you only have a 40-60% change of being infected. If you haven’t received the vaccine, you have a nearly 100% chance of being infected.

Further, if a vaccinated person does contract the virus, the severity of the illness is significantly reduced. So yes, you can still get the flu, but it won’t be as severe as it would have been if you didn’t get it. And, it often reduces the chance you will pass the disease onto someone else.

So, back to our risk analysis. If a vaccine is available for a given virus, should you get it? Given the reasonably low risk of severe side effects, the answer is almost always yes. If you suffer from conditions, such as being immunocompromised, that may increase the risk of receiving a vaccine, you need to add that risk into the equation as well. For the current SARS-CoV-2 pandemic, the new mRNA vaccines are considered to be safe for immunocompromised people because there is no live virus in the vaccine. Instead, it uses smaller RNA strands which are used to build immunity against the virus. Similarly, it appears, based on current evidence, that these vaccines are safe for people with autoimmune diseases.

To conclude, social distancing, washing your hands, wearing masks, and getting vaccinated are all ways to reduce the risk of being infected by the pandemic virus as well as reducing the risk of passing it on to someone else. For myself and my family, we practice these on a daily basis and will be vaccinated at the earliest possible convenience. We do this not only for us, but for those around us. Please, join us, analyze the risk to yourself and others and make an informed, responsible decision.


We will entangle buds and flowers and beams

Which twinkle on the fountain’s brim, and make

Strange combinations out of common things

“Prometheus Unbound” by Percy Bysshe Shelley

This post first appeared on Redhat’s Enable Sysadmin community. You can find the post here.

Welcome to the world of metrics collection and performance monitoring. As with most things IT, entire market sectors have been built to sell these tools. And, of course, there are a number of open source tools that serve the same purpose. It’s one of these open source tools that we’re going to take a look at.

What is Prometheus?

Prometheus is a metrics collection and alerting tool developed and released to open source by SoundCloud. Prometheus is similar in design to Google’s Borgmon monitoring system and a relatively modest system can handle collecting hundreds of thousands of metrics every second. Properly tuned and deployed, a Prometheus cluster can collect millions of metrics every second.

Prometheus is made up of roughly four parts:

  • The main Prometheus app itself that is responsible for scraping metrics, storing them in the database, and (optionally) retrieving them when queried.
    • The database backend is an internal Time Series database. This database is always used, but data can also be sent to remote storage backends.
  • Exporters are optional external programs that ingest data from a variety of sources and convert it to metrics that Prometheus can scrape.
    • Exporters are purpose built applications for working with specific applications and hardware.
  • AlertManager is an alert management system. It ships with Prometheus.
  • Client Libraries that can be used to instrument custom applications.

I say “roughly” four parts because there are plenty of additional applications that are often used with a standard Prometheus cluster. If you need or want better graphing capabilities, applications like Grafana can be deployed. If you need to store metrics for long periods of time, remote storage backends are worth looking into. And the list goes on. For the purposes of this article, however, we’re going to focus on Prometheus itself with a small detour into exporters.

What is a metric?

But before we get there, we need to understand why something like Prometheus exists. So let’s start with a question. What are metrics? Simply put, metrics measure something. For instance, the time it takes you to read this article is a metric. The number or words is a metric. The average number of letters in the words of this article is a metric.

But those metrics are fairly static and not something you’d necessarily need a system like Prometheus for. Prometheus excels at metrics that change over time. For instance, what if you wanted to know how many “views” this article is getting? Or what if you wanted to know how much traffic was entering and leaving your network? Or how many build and deploy cycles are happening each hour? All of these are metrics that can be fed into Prometheus.

Now that we understand what a metric is, let’s take a look at how Prometheus gets the metrics it needs to store. The first thing Prometheus needs is a target. Targets are the endpoints that supply the metrics that Prometheus stores. These endpoints can be the actual endpoint being monitored or they can be a piece of middleware known as an exporter. Endpoints can be supplied via a static configuration or they can be “found” through a process called service discovery. Service Discovery is a more advanced topic and will be covered in a future article.

Once Prometheus has a list of endpoints, it can begin to retrieve metrics from them. Prometheus retrieves metrics in a very straightforward manner, a simple HTTP request. The configuration points at a specific location on the endpoint that will supply a stream of text identifying the metric and its current value. Prometheus reads this stream of text, ignoring lines beginning with a # as comments, and stores the metrics it receives in a local database.

Figure 1 – Example metrics output (from itNext)

A short sidetrack into Exporters

Prometheus can only talk HTTP to endpoints for metrics collection. So what happens when you’re trying to monitor a router or switch that only communicates using SNMP? Or perhaps you want to monitor a cloud service that doesn’t have a native Prometheus metrics endpoint? Fortunately, there’s a solution. Exporters.

Exporters come in many shapes and sizes. These are small, purpose-built programs designed to stand between Prometheus and anything you want to monitor that doesn’t natively support Prometheus. Some exporters sit idle until Prometheus polls them for data. When this happens, the exporter reaches out to the device it’s monitoring, gets the relevant data, and converts it to a format that Prometheus can ingest. Other exporters poll devices automatically, caching the results locally for Prometheus to pick up later.

Regardless of design, exporters act as a translator between Prometheus and endpoints you want to monitor. Chances are, if you’re trying to monitor a common device or application, there’s an exporter out there for it.

Data Storage

Prometheus uses a special type of database on the back end known as a time series database. Simply put, this database is optimized to store and retrieve data organized as values over a period of time. Metrics are an excellent example of the type of data you’d store in such a database.

External storage is also an option. There are many options such as Thanos, Cortex, and VictoriaMetrics that provide a variety of benefits. One of the primary benefits is to centralize the gathered metrics and allow for long term storage. Tools such as Grafana can query these third party storage solutions directly.

So you have a bunch of metrics…

Now that you’re an expert on Prometheus and you have it storing metrics, how do you use this data? Much like a SQL database, Prometheus has a custom query language known as PromQL. PromQL is pretty straightforward for simple metrics but has a lot of complexity when needed. Simply supplying the name of a metric will show all “instances” of that metric:

Figure 2 – Simple PromQL query (from Digital Ocean)

Or you can use some PromQL methods and generate a graph representing the data you’re after.

Figure 3 – Graphing example (from Digital Ocean)

Of course, if you’re serious about graphing, it’s worth looking into a package such as Grafana. Grafana allows you to create dashboards of metrics, send alerts, and more.


While graphs are pretty to look at, metrics can serve another, important, purpose. They can be used to send alerts. Prometheus includes a separate application, called AlertManager, that serves this purpose. AlertManager receives notifications from Prometheus and handles all of the necessary logic to dedupe and deliver the alerts.

Alerts are created by writing alert rules. These rules are simply PromQL queries that fire when the query is true. That is, if you have a query that checks if the temperature on the cpu is over 80C then the query fires for each metric that meets that condition.

Alert rules can also include a time period over which a rule must evaluate to true. Expanding on our temperature example, exceeding 80C is ok if it’s a brief period of time. But if it lasts more than 5 minutes, send an alert. Alerts can be sent via email, slack, twitter, sms, and pretty much anything else you can write an interface for.

Figure 4 – Alerting rules (from Rancher)

Wrap Up

Monitoring is important. It helps identify when things have gone wrong and it can show when things are going right. Proper monitoring can be used across a variety of disciplines to squeeze everything you can out of the object being monitored.

Prometheus is a powerful open-source metrics package. It is highly scalable, robust, and extremely fast. A single modern server can be used to monitor a million metrics or more per second. Distributing Prometheus servers allows for many tens and even hundreds of millions of metrics to be monitored every second.

PromQL provides a robust querying language that can be used for graphing as well as alerting. The built-in graphing system is great for quick visualizations but longer term dashboarding should be handled in external applications such as Grafana.

Accidental Forkbomb

This post first appeared on Redhat’s Enable Sysadmin community. You can find the post here.

One of the first industry jobs I ever had was for a small regional ISP. At the time, 56k modems were shiny and new. We had a couple dozen PoPs (Points of Presence) where we installed banks of modems and fed the data back to our main office via a series of full and fractional T1 lines.

We provided the typical slew of services: email, net news, and general internet access. Of course, to provide these services we needed servers of our own. The solution was to set up a cluster of SCO UNIX systems. Yes, *that* SCO. It’s been quite a while, but a cluster setup like this is hard to forget. The servers were set up in such a way that they had dependencies on each other. If one failed, it didn’t cause everything to come crashing down, but getting that one server back up generally required restarting everything.

The general setup was that the servers nfs mounted each other on startup. This, of course, causes a race condition during startup. The engineers had written a detailed document that explained the steps required to bring the entire cluster up after a failure. The entire process usually took 30-45 minutes.

At the time, I was a lowly member of tech support, spending the majority of my time hand-holding new customers through the process of installing the software necessary to get online. I was relatively new to the world of Unix and high speed networking and sucking up as much knowledge as I could.

One of the folks I worked with, Brett, taught me a lot. He wrote the network monitoring system we used and spent his time between that and keeping the network up and running. He was also a bit of a prankster at times.

At the end of one pretty typical day, I happened to be on the unix cluster. Out of the blue, my connection failed and I was booted back to my local OS. This was a bit weird, but it happened occasionally so I simply logged back in. Within a few seconds I was booted out again.

I started doing a bit of debugging, trying to figure out what was going on. I don’t remember everything that I did, but I do remember putting together some quick scripts to log in, check various processes, and try ot figure out what was happening. At some point I determined that I was being booted off the system by another user, Brett.

Once I figured out what was going on, I had to fight back. So I started playing around with shell scripting, trying to figure out how to identify what the pid of his shell was so I could boot him offline. This went back and forth for a bit, each of us escalating the attacks. We started using other services to regain access, launch attacks, etc.

That is, until I launched what I thought would be the ULTIMATE attack. I wrote a small shell script that searched for his login, identified the shell, and subsequently killed his access. Pretty simple, but I added the ultimate twist. After the script ran, it ran a copy of itself. BOOM. No way he can get back on now.

And it worked! Brett lost his access and simply could not gain a foothold over the next 5 or so minutes. And, of course, I had backgrounded the task so I could interact with the console and verify that he was beaten. I had won. I had proven that I could beat the experienced engineer and damn, I felt good about it.


ksh: fork: Resource temporarily unavailable

The beginning of the end

I had never seen such an error before. What was this? Why was the system doing this? And why was it streaming all over my console, making it impossible for me to do anything?

It took a few moments but Brett noticed the problem as well. He came out to see what had happened. I explained my brilliant strategy and he simply sighed, smiled, and told me I’d have to handle rebooting and re-syncing the servers. And then he took the time to explain to me what I had done wrong. That was the day I learned about “exec” and how important it is.

Unfortunately, Brett passed away about a decade or so after this. He was a great friend, a great mentor, and I miss him.

Bret Silberman and his family

You got your web in my firewall

This post first appeared on Redhat’s Enable Sysadmin community. You can find the post here.

Firewalls have been around in one form or another since the beginning of networking. The first firewalls weren’t even identified as firewalls. They were nothing more than physical barriers between networks. It wasn’t until the 1980’s that the first device specifically designed to be, and named, a firewall was developed by DEC. Since then, firewalls have evolved into a myriad of forms.

But what is a firewall? At its core, a firewall is a devices designed to allow or deny traffic based on a set of rules. Those rules can be as simple as “allow http and block everything else” or can be infinitely more complex including protocols, ports, addresses, and even application fingerprinting. Some modern firewalls have even incorporated machine learning into the mix.

Like other technologies, as firewalls have evolved, some niche uses have been identified. Web Application Firewalls (WAFs) are one of those niche uses. A WAF is a firewall specifically designed to handle “web” traffic. That is, traffic using the HTTP protocol. Generally speaking, the role of a WAF is to inspect all HTTP traffic destined for a web server, discard “bad” requests, and pass “good” traffic on. The details of how this works are, as you might suspect, a bit more complicated.

Much like “normal” firewalls, a WAF is expected to block certain types of traffic. To do this, you have to provide the WAF with a list of what to block. As a result, early WAF products are very similar to other products such as anti-virus software, IDS/IPS products, and others. This is what is known as signature-based detection. Signatures typically identify a specific characteristic of an HTTP packet that you want to allow or deny.

For instance, WAFs are often used to block SQL Injection attacks. A very simplistic signature may just look for key identifying elements of a typical SQL Injection attack. For instance, it may look for something like ' AND 1=1 included as part of the GET or POST request. If this matches an incoming packet, the WAF marks this as bad and discards it.

Signatures work pretty well but require a lot of maintenance to ensure that false positives are kept to a minimum. Additionally, writing signatures is often more of an art form rather than a straight-forward programming task. And signature writing can be quite complicated as well. You’re often trying to match a general attack pattern without also matching legitimate traffic. To be blunt, this can be pretty nerveracking.

To illustrate this a bit more, let’s look at ModSecurity. The ModSecurity project is an open source WAF project. It started out as a module for the Apache web server but has since evolved into a modular package that works with IIS, nginx, and others. ModSecurity is a signature-based WAF and often ships with a default set of signatures known as the OWASP ModSecurity Core Rule Set.

The Core Rule Set (CRS) is an excellent starting point for deploying a signature-based WAF. It includes signatures for all of the OWASP Top Ten web application security risks as well as a wide variety of other attacks. The developers have done their best to ensure that the CRS has few false alerts but, inevitably, anyone deploying the CRS will need to tweak the rules. This involves learning the rules language and having a deep understanding of the HTTP protocol.

Technology evolves, however, and newer WAF providers are using other approaches to block bad traffic. There has been a pretty widespread move from static configuration approaches such as allow and block lists to more dynamic methods involving APIs and machine learning. This move has been across multiple technologies including traditional firewalls, anti-virus software, and, you guessed it, WAFs.

In the brave new world of dynamic rulesets, WAFs use more intelligent approaches to identifying good and bad traffic. One of the “easier” methods employed is to put the WAF in “learning” mode so it can monitor the traffic flowing to and from the protected web server. The objective here is to “train” the WAF to identify what good traffic looks like. This may include traffic that matches patterns labeled as bad when signatures were used. Once the WAF has been trained, it’s moved to enforcement mode.

Training a WAF like this is similar to what happens when you train an email system to identify spam. Email systems often use a Bayesian filtering algorithm to identify spam. These algorithms work relatively well but can be poisoned to allow spam. Similar issues exist with algorithms used by WAF providers, especially when the WAF is in the learning mode.

More advanced WAF providers are using proprietary techniques to allow and block traffic. These techniques include algorithms that can identify whether certain attacks will work against the target system and only blocking those that would be harmful. Advanced techniques like this, however, are typically only found in WAF SaaS providers and not in self-contained WAF appliances.

WAFs, and firewalls in general, have evolved a lot over the years, moving from static to dynamic methods for identifying and blocking traffic. These techniques will only get better in the future. There are a variety of solutions available from open-source to commercial providers. No matter what your needs, there’s a WAF out there for you.

Journey of a DevOps Engineer

“The advantage of a bad memory is that one enjoys several times the same good things for the first time.”

Friedrich Nietzsche

This post first appeared on Redhat’s Enable Sysadmin community. You can find the post here.

Memory is a funny construct. What we remember is often not what happened, or even the order it happened in. One of the earliest memories I have about technology was a trip my father and I took. We drove to Manhattan, on a mission to buy a Christmas gift for my mother. This crazy new device, The Itty Bitty Book Light, had come out recently and, being an avid reader, my mother had to have one.

We drove to Manhattan, on a mission to buy a Christmas gift for my mother. This crazy new device, The Itty Bitty Book Light, had come out recently and, being an avid reader, my mother had to have one.

After navigating the streets of Manhattan and finding a parking spot, we walked down the block to what turned out to be a large bookstore. You’ve seen bookstores like this on TV and in the movies. It looks small from the outside, but once you walk in, the store is endless. Walls of books, sliding ladders, tables with books piled high. It was pretty incredible, especially for someone like me who loves reading.

But in this particular store, there was something curious going on. One of the tables was surrounded by adults, awed and whispering between one another. Unsure of what was going on, we approached. After pushing through the crowd, what I saw drew me in immediately. On the table, surrounded by books, was a small grey box, the Apple Macintosh. It was on, but no one dared approach it. No one, that is, except me. I was drawn like a magnet, immediately grokking that the small puck like device moved the pointer on the screen. Adults gasped and murmured, but I ignored them all and delved into the unknown. The year was, I believe, 1984.

Somewhere around the same time, though likely a couple years before, my father brought home a TI-99/4A computer. From what I remember, the TI had just been released, so this has to be somewhere around 1982. This machine served as the catalyst for my love of computer technology and was one of the first machines I ever cut code on.

My father tells a story about when I first started programming. He had been working on an inventory database, written from scratch, that he built for his job. I would spend hours looking over his shoulder, absorbing everything I saw. One time, he finished coding, saved the code, and started typing the command to run his code (“RUN”). Accordingly to him, I stopped him with a comment that his code was going to fail. Ignoring me as I was only 5 or 6 at the time (according to his recollection), he ran the code and, as predicted, it failed. He looked at me with awe and I merely looked back and replied “GOSUB but no RETURN”.

At this point I was off and running. Over the years I got my hands on a few other systems including the Timex Sinclair, Commodore 64 and 128, TRS-80, IBM PS/2, and finally, my very own custom built PC from Gateway 2000. Since them I’ve built hundreds of machines and spend most of my time on laptops now.

The Commodore 64 helped introduce me to the online world of Bulletin Board Systems. I spent many hours, much to the chagrin of my father and the phone bill, calling into various BBS systems in many different states. Initially I was just there to play the various door games that were available, but eventually discovered the troves of software available to download. Trading software become a big draw until I eventually stumbled upon the Usenet groups that some boards had available. I spent a lot of time reading, replying, and even ended up in more than one flame war.

My first introduction to Unix based operating systems was in college when I encountered an IBM AIX mainframe as well as a VAX. I also had access to a shell account at the local telephone company turned internet service provider. The ISP I was using helpfully sent out an email to all subscribers about the S.A.T.A.N. toolkit and how accounts found with the software in their home directories would be immediately banned. Being curious, I immediately responded looking for more information. And while my account was never banned, I did learn a lot.

At the same time, my father and I started our own BBS which grew into a local Internet Service Provider offering dial-up services. That company still exists today, though the dial-up service died off several years ago. This introduced me to networks and all the engineering that comes along with it.

From here my journey takes a bit of a turn. Since I was a kid, I’ve wanted to build video games. I’ve read a lot of books on the subject, talked to various developers (including my idol, John Carmack) and written a lot of code. I actually wrote a pacman clone, of sorts, on the Commodore 64 when I was younger. But, alas, I think I was born on the wrong coast. All of the game companies, at the time, were located on the west coast and I couldn’t find a way to get there. So, I redirected my efforts a bit and focused on the technology I could get my hands on.

Running a BBS, and later an ISP, was a great introduction into the world of networking. After working a few standard school-age jobs (fast food, restaurants, etc), I finally found a job doing tech support for an ISP. I paid attention, asked questions, and learned everything I could. I took initiative where I could, even writing a cli-based ticketing system with an mSQL database backing it.

This initiative paid of as I was moved later to the NOC and finally to Network Engineering. I lead the way, learning everything I could about ATM and helping to design and build the standard ATM-based node design used by the company for over a decade. I took over development of the in-house monitoring system, eventually rewriting it in KSH and, later, Perl. I also had a hand in development of a more modern ticketing system written in Perl that is still in use today.

I spent 20+ years as a network engineer, taking time along the way to ensure that we had Linux systems available for the various scripting and monitoring needed to ensure the network performed as it should. I’ve spent many hours writing code in shell, expect, perl, and other languages to automate updates and monitoring. Eventually, I found myself looking for a new role and a host of skills ranging from network and systems administration to coding and security.

About this time, DevOps was quickly becoming the next new thing. Initially I rejected the idea, solely responding to the “Development” and “Operations” tags that make up the name. Over time, however, I came to realize that DevOps is yet another fancy name for what I’ve been doing for decades, albeit with a few twists here and there. And so, I took a role as a DevOps Engineer.

DevOps is a fun discipline, mixing in technologies from across the spectrum in an effort to automate away everything you can. Let the machine do the work for you and spent your time on more interesting projects like building more automation. And with the advent of containers and orchestration, there’s more to do than ever.

These days I find myself leading a team of DevOps engineers, both guiding the path we take as we implement new technology and automate existing processes. I also spend a lot of time learning about new technologies both on my own and for my day job. My trajectory seems to be changing slightly as I head towards the world of the SRE. Sort of like DevOps, but a bit heavier on the development side of things.

Life throws curves and sometimes you’re not able to move in the direction you want. But if you keep at it, there’s something out there for everyone. I still love learning and playing with technology. And who knows, maybe I’ll end up writing games at some point. There’s plenty of time for another new hobby.

Intrusion Prevention Systems

This post first appeared on Redhat’s Enable Sysadmin community. You can find the post here.

“Intruder alert!  Strace to PID 45555, log and store.”
“Roger that, overseer. Target acquired.”
“Status report!”
“Suspicious activity verified, permission to terminate?”
“Permission granted, deploy Sigkill.”
“Process terminated at 2146 hours.”

Intrusion Prevention Systems, or IPS, are tools designed to detect and stop intrusions in their tracks. They come two basic flavors, network-based and host-based. As you may suspect, a network-based IPS is meant to be deployed to monitor the network and a host-based IPS is deployed on a host with the intention of monitoring just a single host.

How these tools work varies from vendor to vendor, but the basics are the same. The network-based tool monitors traffic on the network and matches it to a long list of known signatures. These signatures describe a variety of attacks ranging from simple corrupt packets to more specific attacks such as SQL injection.

Host-based tools tend to have more capabilities as they have access to the entire host. A host-based IPS can look at network traffic as well as monitor files and logs. One of the more popular tools, OSSEC-HIDS, monitors traffic, logs, file integrity, and even has signatures for common rootkits.

More advanced tools have additional detection capabilities such as statistical anomaly detection or stateful protocol inspection. Both of these capabilities use algorithms to detect intrusions. This allows detection of intrusions that don’t yet have signatures created for them.


Unlike it’s predecessor, the Intrusion Detection System, or IDS, when an IPS detects an intrusion it moves to block the traffic and prevent it from getting to its target. As you can imagine, ensuring that the system blocks only bad traffic is of utmost importance. Deploying a tool that blocks legitimate traffic is a quick way to frustrate users and get yourself in trouble. An IDS merely detects the traffic and sends an alert. Given the volume of traffic on a typical network or host today, an IDS is of limited use.

The Redhat Enterprise Packages for Enterprise Linux repository includes two great HIDS apps that every admin should look into. HIDS apps are a bit of a hybrid solution in that they’re both an IDS and an IPS. That is, they can be configured to simply alert when they see issues or they can often be configured to run scripts to react to scenarios that trigger alerts.


First up from EPEL is Tripwire, a file integrity monitoring tool, which Seth Kenlon wrote about for Enable Sysadmin back in April. Tripwire’s job in life is to monitor files on the host and send an alert if they change in ways they’re not supposed to. For instance, you can monitor a binary such as /bin/bash and send an alert if the file changes in some way such as permissions or other attributes of the file. Attackers often modify common binaries and add a payload intended to keep their access to the server active. Tripwire can detect that.


The second EPEL package is fail2ban. Fail2ban is more of an IPS style tool in that it monitors and acts when it detects something awry. One common implementation of fail2ban is monitoring the openssh logs. By building a signature that identifies a failed login, fail2ban can detect multiple attempts to login from a single source address and block that source address. Typically, fail2ban does this by adding rules to the host’s firewall, but in reality, it can run any script you can come up with. So, for instance, you can write a script to block the IP on the local firewall and then transmit that IP to some central system that will distribute the firewall block to other systems. Just be careful, however, as globally blocking yourself from every system on the network can be rather embarrassing.


OSSEC-HIDS, mentioned previously, is a personal favorite of mine. It’s much more of a swiss army knife of tools. It combines tools like tripwire and fail2ban together into a single tool. It can be centrally managed and uses encrypted tunnels to communicate with clients. The community is very active and new signatures are created all the time. Definitely worth checking out.


Snort is a network-based IDS/IPS (NIDS/NIPS). Where HIDS are installed on servers with the intention of monitoring processes on the server itself, NIDS are deployed to monitor network traffic. Snort was first introduced in 1998 and has more recently been acquired by Cisco. Cisco has continued to support Snort as open source while simultaneously incorporating it into their product line.

Snort, like most NIDS/NIPS, works by inspecting each packet as it passes through the system. Snort can be deployed as an IDS by mirroring traffic to the system, or it can be deployed as an IPS by putting it in-line with traffic. In either case, Snort inspects the traffic and compares it to a set of signatures and heuristic patterns, looking for bad traffic. Snort’s signatures are updated on a regular basis and uses libpcap, the same library used by many popular network inspection tools such as tcpdump and wireshark.

Snort can also be configured to capture traffic for later inspection. Be aware, however, that this can eat up disk space pretty rapidly.


Suricata is a relatively new IDS/IPS, released in 2009. Suricata is designed to be multi-threaded, making it much faster than competing products. Like Snort, it uses signatures and heuristic detection. In fact, it can use most Snort rules without any changes. It also has it’s own ruleset that allows it to use additional features such as file detection and extraction.

Zeek (Formerly Bro)

Bro was first release in 1994, making it one of the oldest IDS applications mentioned here. Originally named in reference to George Orwell’s book, 1984, more recent times have seen a re-branding to the arguably less offensive name, Zeek.

Zeek is often used as a network analysis tool but can also be deployed as an IDS. Like Snort, Zeek uses libpcap for packet capture. Once packets enter the system, however, any number of frameworks or scripts can be applied. This allows Zeek to be very flexible. Scripts can be written to be very specific, targeting specific types of traffic or scenarios, or Zeek can be deployed as an IDS, using various signatures to identify and report on suspect traffic.


One final word of caution. IPS tools are powerful and extremely useful in automating the prevention of intrusions. But like all tools, they need to be maintained properly. Ensure your signatures are up to date and monitor the reports coming from the system. It’s still possible for a skilled attacker to slip by these tools and gain access to your systems. Remember, there is no such thing as a security silver bullet.

Ping, traceroute, and netstat are the trifecta network troubleshooting tools for a reason

This post first appeared on Redhat’s Enable Sysadmin community. You can find the post here.

I’ve spent a career building networks and servers, deploying, troubleshooting, and caring for applications. When there’s a network problem, be it outages or failed deployments, or you’re just plain curious about how things work, three simple tools come to mind: ping, traceroute, and netstat.


Ping is quite possibly one of the most well known tools available. Simply put, ping sends an “are you there?” message to a remote host. If the host is, in fact, there, it returns a “yup, I’m here” message. It does this using a protocol known as ICMP, or Internet Control Message Protocol. ICMP was designed to be an error reporting protocol and has a wide variety of uses that we won’t go into here.

Ping uses two message types of ICMP, type 8 or Echo Request and type 0 or Echo Reply. When you issue a ping command, the source sends an ICMP Echo Request to the destination. If the destination is available, and allowed to respond, then it replies with an ICMP Echo Reply. Once the message returns to the source, the ping command displays a success message as well as the RTT or Round Trip Time. RTT can be an indicator of the latency between the source and destination.

Note: ICMP is typically a low priority protocol meaning that the RTT is not guaranteed to match what the RTT is to a higher priority protocol such as TCP.

Ping diagram (via GeeksforGeeks)

When the ping command completes, it displays a summary of the ping session. This summary tells you how many packets were sent and received, how much packet loss there was, and statistics on the RTT of the traffic. Ping is an excellent first step to identifying whether or not a destination is “alive” or not. Keep in mind, however, that some networks block ICMP traffic, so a failure to respond is not a guarantee that the destination is offline.

 $ ping
PING ( 56 data bytes
64 bytes from icmp_seq=0 ttl=56 time=15.740 ms
64 bytes from icmp_seq=1 ttl=56 time=14.648 ms
64 bytes from icmp_seq=2 ttl=56 time=11.153 ms
64 bytes from icmp_seq=3 ttl=56 time=12.577 ms
64 bytes from icmp_seq=4 ttl=56 time=22.400 ms
64 bytes from icmp_seq=5 ttl=56 time=12.620 ms
--- ping statistics ---
6 packets transmitted, 6 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 11.153/14.856/22.400/3.689 ms

The example above shows a ping session to From the output you can see the IP address being contacted, the sequence number of each packet sent, and the round trip time. 6 packets were sent with an average RTT of 14ms.

One thing to note about the output above and the ping utility in general. Ping is strictly an IPv4 tool. If you’re testing in an IPv6 network you’ll need to use the ping6 utility. The ping6 utility works roughly identical to the ping utility with the exception that it uses IPv6.


Traceroute is a finicky beast. The premise is that you can use this tool to identify the path between a source and destination point. That’s mostly true, with a couple of caveats. Let’s start by explaining how traceroute works.

Traceroute diagram (via StackPath)

Think of traceroute as a string of ping commands. At each step along the path, traceroute identifies the IP of the hop as well as the latency to that hop. But how is it finding each hop? Turns out, it’s using a bit of trickery.

Traceroute uses UDP or ICMP, depending on the OS. On a typical *nix system it uses UDP by default, sending traffic to port 33434 by default. On a Windows system it uses ICMP. As with ping, traceroute can be blocked by not responding to the protocol/port being used.

When you invoke traceroute you identify the destination you’re trying to reach. The command begins by sending a packet to the destination, but it sets the TTL of the packet to 1. This is significant because the TTL value determines how many hops a packet is allowed to pass through before an ICMP Time Exceeded message is returned to the source. The trick here is to start the TTL at 1 and increment it by 1 after the ICMP message is received.

$ traceroute
traceroute to (, 64 hops max, 52 byte packets
 1 (  1747.782 ms  1.812 ms  4.232 ms
 2 (  10.838 ms  12.883 ms  8.510 ms
 3  xx.xx.xx.xx (xx.xx.xx.xx)  10.588 ms  10.141 ms  10.652 ms
 4  xx.xx.xx.xx (xx.xx.xx.xx)  14.965 ms  16.702 ms  18.275 ms
 5  xx.xx.xx.xx (xx.xx.xx.xx)  15.092 ms  16.910 ms  17.127 ms
 6 (  13.711 ms  14.363 ms  11.698 ms
 7 (  12.802 ms (  12.647 ms  12.963 ms
 8 (  11.901 ms  13.666 ms  11.813 ms

Traceroute displays the source address of the ICMP message as the name of the hop and moves on to the next hop. When the source address matches the destination address, traceroute has reached the destination and the output represents the route from the source to the destination with the RTT to each hop. As with ping, the RTT values shown are not necessarily representative of the real RTT to a service such as HTTP or SSH. Traceroute, like ping, is considered to be lower priority so RTT values aren’t guaranteed.

There is a second caveat with traceroute you should be aware of. Traceroute shows you the path from the source to the destination. This does not mean that the reverse is true. In fact, there is no current way to identify the path from the destination to the source without running a second traceroute from the destination. Keep this in mind when troubleshooting path issues.


Netstat is an indispensable tool that shows you all of the network connections on an endpoint. That is, by invoking netstat on your local machine, all of the open ports and connections are shown. This includes connections that are not completely established as well as connections that are being torn down.

$ sudo netstat -anptu
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0    *               LISTEN      4417/master
tcp        0      0   *               LISTEN      2625/java
tcp        0      0*               LISTEN      559/slapd
tcp        0      0    *               LISTEN      1180/sshd
tcp        0      0        ESTABLISHED 2625/java
tcp        0      0      ESTABLISHED 559/slapd

The output above shows several different ports in a listening state as well as a few established connections. For listening ports, if the source address is, it is listening on all available interfaces. If there is an IP address instead, then the port is open only on that specific interface.

The established connections show the source and destination IPs as well as the source and destination ports. The Recv-Q and Send-Q fields show the number of bytes pending acknowledgement in either direction. Finally, the PID/Program name field shows the process ID and the name of the process responsible for the listening port or connection.

Netstat also has a number of switches that can be used to view other information such as the routing table or interface statistics. Both IPv4 and IPv6 are supported. There are switches to limit to either version, but both are displayed by default.

In recent years, netstat has been superseded by the ss command. You can find more information on the ss command in this post by Ken Hess.


As you can see, these tools are invaluable when troubleshooting network issues. As a network or systems administrator, I highly recommend becoming intimately familiar with these tools. Having these available may save you a lot of time troubleshooting.

How to leverage SNMP and not compromise the security of your server

This post first appeared on Redhat’s Enable Sysadmin community. You can find the post here.

Simple Network Management Protocol, or SNMP, has been around since 1988. While initially intended as an interim protocol as the Internet was first being rolled out, it quickly became a de facto standard for monitoring — and in some cases, managing — network equipment. Today, SNMP is used across most networks, small and large, to monitor the very equipment you likely passed through to get to this blog entry.

There are three primary flavors of SNMP: SNMPv1, SNMPv2c, and SNMPv3. SNMPv1 is, by far, the more popular flavor, despite being considered obsolete due to a complete lack of discernible security. This is likely because of the simplicity of SNMPv1 and that it’s generally used inside of the network and not exposed to the outside world.

The problem, however, is that SNMPv1 and SNMPv2c are unencrypted and even the community string used to “authenticate” is sent in the clear. An attacker can simply listen on the wire and grab the community as it passes by. This gives the attacker access to valuable information on your various devices, and even the ability to make changes if write access is enabled.

But wait, you may be thinking, what about SNMPv3? And you’re right, SNMPv3 *can* be more secure by using authentication and encryption. However, not all devices support SNMPv3 and thus interoperability becomes an issue. At some point, you’ll have to drop down to SNMPv2c or SNMPv1 and you’re back to the “in the clear” issue.

Despite the security shortcoming, SNMP can still be used without compromising the security of your server or network. Much of this security will rely on limiting use of SNMP to read-only and using tools such as iptables to limit where incoming SNMP requests can source from.

To keep things simple, we’ll worry about SNMPv1 and SNMPv2c in this article. SNMPv3 requires some additional setup and, in my opinion, isn’t worth the hassle. So let’s get started with setting up SNMP.

First things first, install the net-snmp package. This can be installed via whatever package manager you use. On the Redhat based systems I use, that tool is yum.

$ yum install net-snmp

Next, we need to configure the snmp daemon, snmpd. The configuration file is located in /etc/snmp/snmpd.conf. Open this file in your favorite editor (vim FTW!) and modify it accordingly. For example, the following configuration enables SNMP, sets up a few specific MIBs, and enables drive monitoring.


agentaddress udp:


# ------------------------------------------------------------------------------
# Traditional Access Control

# ------------------------------------------------------------------------------
# VACM Configuration
#       source        community
com2sec notConfigUser default mysecretcommunity

#       groupName      securityModel securityName
group   notConfigGroup v1            notConfigUser
group   notConfigGroup v2c           notConfigUser

#       name          incl/excl  subtree             mask(optional)
view    systemview included .
view    systemview included .
view    systemview included .
view    systemview included .
view    systemview included .

#       group          context sec.model sec.level prefix read       write notif
access  notConfigGroup ""      any       noauth    exact  systemview none  none

# ------------------------------------------------------------------------------
# Typed-View Configuration


# ------------------------------------------------------------------------------
# System Group
sysLocation The Internet
sysContact Internet Janitor
sysServices 72


## Logging

## We do not want annoying "Connection from UDP: " messages in syslog.
## If the following option is set to 'no', snmpd will print each incoming
## connection, which can be useful for debugging.

dontLogTCPWrappersConnects no


disk /         10%
disk /var      10%
disk /tmp      10%
disk /home     10%

Next, before you start up snmpd, make sure you configure iptables to allow SNMP traffic from trusted sources. SNMP uses UDP port 161, so all you need is a simple rule to allow traffic to pass. Be sure to add an outbound rule as well; UDP traffic is stateless.

iptables -A INPUT -s <ip addr> -p udp -m udp --dport 161 -j ACCEPT

iptables -A OUTPUT -p udp -m udp --sport 161 -j ACCEPT

You can set this up in firewalld as well, just search for SNMP and firewalld on Google.

Now that SNMP is set up, you can point an SNMP client at your server and pull data. You can pull data via the name of the MIB (if you have the MIB definitions installed) or via the OID.

$ snmpget -c mysecretcommunity hrSystemUptime.0
HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (6638000) 18:26:20.00

$ snmpget -c mysecretcommunity .
HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (6638000) 18:26:20.00

And that’s about it. It’s called SIMPLE Network Management Protocol for a reason, after all.

One additional side note about SNMP. While SNMP is pretty solid, the security shortcomings are significant. I recommend looking at other solutions such as agent-based systems versus using SNMP. Tools like Nagios and Prometheus have more secure mechanisms for monitoring systems.