This post first appeared on Red Hat’s Enable Sysadmin community. You can find the post here.
One of the first industry jobs I ever had was for a small regional ISP. At the time, 56k modems were shiny and new. We had a couple dozen PoPs (Points of Presence) where we installed banks of modems and fed the data back to our main office via a series of full and fractional T1 lines.
We provided the typical slew of services: email, net news, and general internet access. Of course, to provide these services we needed servers of our own. The solution was to set up a cluster of SCO UNIX systems. Yes, *that* SCO. It’s been quite a while, but a cluster setup like this is hard to forget. The servers were set up in such a way that they had dependencies on each other. If one failed, it didn’t cause everything to come crashing down, but getting that one server back up generally required restarting everything.
The general setup was that the servers NFS-mounted each other on startup, which, of course, created a race condition during boot. The engineers had written a detailed document that explained the steps required to bring the entire cluster up after a failure. The entire process usually took 30-45 minutes.
At the time, I was a lowly member of tech support, spending the majority of my time hand-holding new customers through the process of installing the software necessary to get online. I was relatively new to the world of Unix and high-speed networking, and I was soaking up as much knowledge as I could.
One of the folks I worked with, Brett, taught me a lot. He wrote the network monitoring system we used and split his time between that and keeping the network up and running. He was also a bit of a prankster at times.
At the end of one pretty typical day, I happened to be logged into the Unix cluster. Out of the blue, my connection dropped and I was booted back to my local OS. This was a bit weird, but it happened occasionally, so I simply logged back in. Within a few seconds I was booted out again.
I started doing a bit of debugging, trying to figure out what was going on. I don’t remember everything that I did, but I do remember putting together some quick scripts to log in, check various processes, and try to figure out what was happening. At some point I determined that I was being booted off the system by another user, Brett.
Once I figured out what was going on, I had to fight back. So I started playing around with shell scripting, trying to figure out how to identify the PID of his shell so I could boot him offline. This went back and forth for a bit, each of us escalating the attacks. We started using other services to regain access, launch attacks, etc.
That is, until I launched what I thought would be the ULTIMATE attack. I wrote a small shell script that searched for his login, identified his shell, and killed his access. Pretty simple, but I added the ultimate twist: after finishing its work, the script launched a fresh copy of itself. BOOM. No way he can get back on now.
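I can’t reproduce the exact script after all these years, so the sketch below is a hypothetical reconstruction: the username, the ps/awk matching, and the sleep interval are all illustrative, and it assumes modern ps and awk behavior rather than whatever SCO shipped at the time. The important part is the last line.

```sh
#!/bin/ksh
# Hypothetical reconstruction of my "ultimate" attack.
# Find the target user's login shell, kill it, then start a fresh
# copy of this script so he can never get a foothold again.

TARGET=brett    # illustrative username

# Pull the PID of the target's ksh out of the process list
PID=`ps -ef | awk -v u="$TARGET" '$1 == u && $NF ~ /ksh$/ { print $2; exit }'`

# Boot him off the system
[ -n "$PID" ] && kill -9 "$PID"

sleep 5

# The twist that doomed me: launch another copy of this script.
# Without exec, this copy sits here waiting on its child, the child
# waits on its own child, and so on -- every pass leaves one more
# process behind until the kernel refuses to fork anything at all.
"$0"
```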
And it worked! Brett lost his access and simply could not gain a foothold over the next 5 or so minutes. And, of course, I had backgrounded the task so I could interact with the console and verify that he was beaten. I had won. I had proven that I could beat the experienced engineer and damn, I felt good about it.
Until…
```
ksh: fork: Resource temporarily unavailable
```
The beginning of the end
I had never seen such an error before. What was this? Why was the system doing this? And why was it streaming all over my console, making it impossible for me to do anything?
It took a few moments, but Brett noticed the problem as well. He came out to see what had happened. I explained my brilliant strategy and he simply sighed, smiled, and told me I’d have to handle rebooting and re-syncing the servers. And then he took the time to explain to me what I had done wrong. That was the day I learned about “exec” and how important it is.
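For anyone who hasn’t run into it: exec replaces the current shell process with the command you hand it, rather than forking a child and waiting around for it to finish. Applied to the relaunch-yourself pattern above, the difference looks roughly like this:

```sh
# Without exec: each copy waits on the copy it spawned, so the chain
# grows by one process on every pass until fork() starts failing.
"$0"

# With exec: the running script is replaced in place by the new copy.
# The process count stays flat no matter how long it keeps going.
exec "$0"
```

One word, and it’s the difference between a harmless loop and a self-inflicted fork bomb.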
Unfortunately, Brett passed away about a decade or so after this. He was a great friend, a great mentor, and I miss him.