John Carmack is something of an idol to me. He’s an incredible programmer who has created some of the most graphically advanced games ever seen on the PC. He also dabbles in amateur rocketry with his rocketry company, Armadillo Aerospace, which I’ve written about before.
I joined the Amateur Rocketry mailing list a couple of years ago. The aRocket list is a great place to read about what’s going on in the amateur rocketry scene. The various rocket scientists on the list openly discuss designs, fuel mixtures, and a host of other topics. There’s also plenty of advice, both for those just getting into the game and for those who have been at it a while.
Recently, John posted a note about the Rocket Racing League and some advice about the software controlling vital components of the jets. Unfortunately, the mailing list archives require you to be a member of the list to view them, but I’ll include some snippets of his post here.
The test pilot for the rocket racing league project made the suggestion that we should not allow the computer to shut down the engine during the critical five to fifteen second period where the plane is at takeoff speed, but too close to the ground to make the turn needed to get back down on a runway. We currently shut the engine down whenever a sensor goes out of expected range, and there are indeed plausible conditions where this can happen even when the engine is operating acceptably, such as a pressure transducer line cracking from vibration. On the other hand, there are plausible conditions where you really do want the computer to shut the valves immediately, such as a manifold explosion blowing the chamber off.
Disregarding the question of whether it was a good idea or not, this seems like a really straightforward thing to implement. However, I cautioned everyone that this type of modification has all the earmarks of something that will almost certainly cause some problems when implemented.
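To make the scenario concrete, here’s a minimal sketch of what that kind of inhibit logic might look like. To be clear, this is not Armadillo’s actual flight code; every name, field, and threshold here is my own invention, just following the behavior John describes.

```c
#include <stdbool.h>
#include <stdio.h>

/* A toy sketch of the shutdown-inhibit logic -- NOT Armadillo's
 * actual code. All names, fields, and thresholds are invented. */

typedef struct {
    double seconds_since_takeoff;
    bool   sensor_out_of_range;   /* e.g. a cracked transducer line */
    bool   catastrophic_fault;    /* e.g. a manifold explosion */
} engine_state;

/* The window where the plane is at takeoff speed but too low to
 * turn back to the runway: assumed 5-15 seconds after takeoff. */
static bool in_inhibit_window(const engine_state *s)
{
    return s->seconds_since_takeoff >= 5.0 &&
           s->seconds_since_takeoff <= 15.0;
}

static bool should_shut_down(const engine_state *s)
{
    /* A catastrophic fault closes the valves immediately,
     * inhibit window or not. */
    if (s->catastrophic_fault)
        return true;

    /* An ordinary out-of-range sensor only triggers a shutdown
     * outside the inhibit window. */
    if (s->sensor_out_of_range)
        return !in_inhibit_window(s);

    return false;
}

int main(void)
{
    engine_state s = { 8.0, true, false };  /* bad sensor at T+8s */
    printf("shutdown? %s\n", should_shut_down(&s) ? "yes" : "no");  /* no */
    return 0;
}
```

Even in this toy form you can see where the risk hides: the change adds a new branch to the shutdown path, and classifying a fault as “catastrophic” versus “ordinary” now matters most in exactly the seconds when a wrong answer is least affordable.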
Shutting off the engines on a regular plane is bad enough, but we’re talking about a full-blown rocket with wings here. I can imagine that a sudden loss of engines is enough to cause a good deal of stress for any pilot, but losing the engines just as the plane is taking off could be devastating. Of course, the engine exploding could be pretty devastating too.
We did implement it, and guess what? It caused a problem. We caught it in a static test and fixed it, and haven’t seen another problem with it since, but it still fell into the category of “risky to implement”. If we weren’t operating at a high testing tempo, I wouldn’t have done it. I certainly wouldn’t have done it if we only got one testing opportunity a year (of course, I wouldn’t undertake a project that only got one testing opportunity a year…).
Our flight control code really isn’t all that complicated, the change was only a few lines of code, and I’m a pretty good programmer. The exact nature of why I considered it a bit risky deals with internal details, but the point is that even fairly trivial-sounding changes aren’t risk-free to make. There are certainly some classes of changes that I make to our codebase regularly that I don’t bat an eyelash at, but you can’t usually tell the difference without intimate knowledge of the code.
I’ve found similar situations in my own programs. There are areas of the code that I’ll change, knowing it will have no real effect on anything else, and then there are those areas where the changes are trivial, but they cause odd problems that come back to bite you later. Testing is, of course, the best way to find these problems, but testing isn’t always possible. But then, I’m not writing code that could mean the difference between life and death for a pilot. Now *that* has to be some serious stress.
Many software guys do not have a reasonable gut check feel for the assessment of software changes in an aerospace project. My normal coding practice has over an order of magnitude more test cycles than Armadillo does physical tests, and Armadillo does over an order of magnitude more tests than most other aerospace companies. Things are fundamentally different at different orders of magnitude.
John’s team probably runs more tests than any other team out there. He has successfully married the typical programming cycle with aerospace engineering. They constantly make incremental improvements and then run out to test them. And as surprising as it sounds, it seems to cost them less to do this. By making incremental improvements, they can control, to some degree, the impact on the system. What this means in the end is that they don’t spend an inordinate amount of time building a huge, complex system, only to have it explode on the first test. Not that they haven’t had their share of failures, but theirs have been a bit less spectacular than some.
John also shared a cautionary tale from his day job at Id Software.
As another cautionary tale, I recently had the entire codebase for our current Id Software project analyzed by a high end static analysis company. I was very pleased when they reported that our discovered defect rate was well under half the average that they find on codebases of comparable size. However, there were still over a hundred things that we could look at and say, “yes, that is broken”. Sure, most of them wouldn’t really be a problem, but it illustrates the inherent danger of large software solutions. The best defense, by far, is to be as small and simple as possible.
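That “yes, that is broken, but it probably wouldn’t bite us” category will be familiar to anyone who has run a static analyzer. Here’s a contrived example of my own (nothing to do with Id’s codebase) of the kind of latent defect these tools flag:

```c
#include <stdio.h>
#include <string.h>

/* A contrived defect of the kind static analyzers flag: the NULL
 * check happens after the pointer has already been dereferenced,
 * so it can never protect anything. Harmless as long as no caller
 * ever passes NULL -- and broken the day one does. */
static void print_name(const char *name)
{
    size_t len = strlen(name);   /* dereferences name */

    if (name == NULL)            /* flagged: check after use */
        return;

    printf("name (%zu chars): %s\n", len, name);
}

int main(void)
{
    print_name("Armadillo");     /* works fine today... */
    return 0;
}
```

The fix is trivial, but a hundred of these scattered through a large codebase is exactly the kind of lurking danger he’s talking about.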
Small and simple is definitely best. The more complexity you add, the more bugs and odd behaviors pop up. Use the KISS principle!