Journey of a DevOps Engineer

“The advantage of a bad memory is that one enjoys several times the same good things for the first time.”

Friedrich Nietzsche

This post first appeared on Red Hat’s Enable Sysadmin community. You can find the post here.

Memory is a funny construct. What we remember is often not what happened, or even the order it happened in. One of the earliest memories I have about technology was a trip my father and I took. We drove to Manhattan, on a mission to buy a Christmas gift for my mother. This crazy new device, The Itty Bitty Book Light, had come out recently and, being an avid reader, my mother had to have one.

After navigating the streets of Manhattan and finding a parking spot, we walked down the block to what turned out to be a large bookstore. You’ve seen bookstores like this on TV and in the movies. It looks small from the outside, but once you walk in, the store is endless. Walls of books, sliding ladders, tables with books piled high. It was pretty incredible, especially for someone like me who loves reading.

But in this particular store, there was something curious going on. One of the tables was surrounded by adults, awed and whispering to one another. Unsure of what was going on, we approached. After pushing through the crowd, what I saw drew me in immediately. On the table, surrounded by books, was a small grey box, the Apple Macintosh. It was on, but no one dared approach it. No one, that is, except me. I was drawn like a magnet, immediately grokking that the small puck-like device moved the pointer on the screen. Adults gasped and murmured, but I ignored them all and delved into the unknown. The year was, I believe, 1984.

Somewhere around the same time, though likely a couple years before, my father brought home a TI-99/4A computer. From what I remember, the TI had just been released, so this has to be somewhere around 1982. This machine served as the catalyst for my love of computer technology and was one of the first machines I ever cut code on.

My father tells a story about when I first started programming. He had been working on an inventory database, written from scratch, that he built for his job. I would spend hours looking over his shoulder, absorbing everything I saw. One time, he finished coding, saved the code, and started typing the command to run it (“RUN”). According to him, I stopped him with a comment that his code was going to fail. Ignoring me, as I was only 5 or 6 at the time (according to his recollection), he ran the code and, as predicted, it failed. He looked at me in awe and I merely looked back and replied, “GOSUB but no RETURN”.

At this point I was off and running. Over the years I got my hands on a few other systems, including the Timex Sinclair, Commodore 64 and 128, TRS-80, IBM PS/2, and finally, my very own custom-built PC from Gateway 2000. Since then I’ve built hundreds of machines and spend most of my time on laptops now.

The Commodore 64 helped introduce me to the online world of Bulletin Board Systems. I spent many hours, much to the chagrin of my father and the phone bill, calling into various BBSes in many different states. Initially I was just there to play the various door games that were available, but I eventually discovered the troves of software available to download. Trading software became a big draw until I stumbled upon the Usenet groups that some boards carried. I spent a lot of time reading, replying, and even ended up in more than one flame war.

My first introduction to Unix-based operating systems was in college, when I encountered an IBM AIX mainframe as well as a VAX. I also had access to a shell account at the local telephone company turned Internet service provider. The ISP I was using helpfully sent out an email to all subscribers about the S.A.T.A.N. toolkit and how accounts found with the software in their home directories would be immediately banned. Being curious, I immediately responded looking for more information. And while my account was never banned, I did learn a lot.

At the same time, my father and I started our own BBS which grew into a local Internet Service Provider offering dial-up services. That company still exists today, though the dial-up service died off several years ago. This introduced me to networks and all the engineering that comes along with it.

From here my journey takes a bit of a turn. Since I was a kid, I’ve wanted to build video games. I’ve read a lot of books on the subject, talked to various developers (including my idol, John Carmack), and written a lot of code. I actually wrote a Pac-Man clone, of sorts, on the Commodore 64 when I was younger. But, alas, I think I was born on the wrong coast. All of the game companies, at the time, were located on the west coast and I couldn’t find a way to get there. So, I redirected my efforts a bit and focused on the technology I could get my hands on.

Running a BBS, and later an ISP, was a great introduction to the world of networking. After working a few standard school-age jobs (fast food, restaurants, etc.), I finally found a job doing tech support for an ISP. I paid attention, asked questions, and learned everything I could. I took initiative where I could, even writing a CLI-based ticketing system backed by an mSQL database.

This initiative paid off, as I was later moved to the NOC and finally to Network Engineering. I led the way, learning everything I could about ATM and helping to design and build the standard ATM-based node design the company used for over a decade. I took over development of the in-house monitoring system, eventually rewriting it in KSH and, later, Perl. I also had a hand in the development of a more modern ticketing system written in Perl that is still in use today.

I spent 20+ years as a network engineer, taking time along the way to ensure that we had Linux systems available for the scripting and monitoring needed to keep the network performing as it should. I’ve spent many hours writing code in shell, Expect, Perl, and other languages to automate updates and monitoring. Eventually, I found myself looking for a new role, armed with a host of skills ranging from network and systems administration to coding and security.

About this time, DevOps was quickly becoming the next new thing. Initially I rejected the idea, reacting solely to the “Development” and “Operations” labels that make up the name. Over time, however, I came to realize that DevOps is yet another fancy name for what I’ve been doing for decades, albeit with a few twists here and there. And so, I took a role as a DevOps Engineer.

DevOps is a fun discipline, mixing in technologies from across the spectrum in an effort to automate away everything you can. Let the machine do the work for you and spend your time on more interesting projects, like building more automation. And with the advent of containers and orchestration, there’s more to do than ever.

These days I find myself leading a team of DevOps engineers, guiding the path we take as we implement new technologies and automate existing processes. I also spend a lot of time learning about new technologies, both on my own and for my day job. My trajectory seems to be changing slightly as I head towards the world of the SRE. Sort of like DevOps, but a bit heavier on the development side of things.

Life throws curves and sometimes you’re not able to move in the direction you want. But if you keep at it, there’s something out there for everyone. I still love learning and playing with technology. And who knows, maybe I’ll end up writing games at some point. There’s plenty of time for another new hobby.

Does it do the thing?

Back in 2012 I gave a talk at Derbycon 2.0. This was my first infosec talk and I was a little nervous, to say the least. Anyway, I described a system I wanted to write that handled distributed baseline scanning.

After a lot of starts and stops, I finished a basic 1.0 version in 2014. It’s still quite rough, and I’ve since been working, intermittently, on making the system more robust and solid. I’ve also been working on a Python replacement for the current PHP GUI. The repository is located here, if you’re interested in taking a look.

Why am I telling you all of this? Well, as part of the updates I’m making, I wanted to do things the “right way” and make sure I have unit testing in place before I start making additional changes to the code. Problem is, while I learned about unit testing, I’ve never really implemented it in any meaningful way, so this is a bit new to me.

So why unit testing? Well, the hypothesis is that by creating tests that check every line of code, you ensure that the code is working as expected. Thus, if the tests pass, the code should be solid and bug-free. In reality, this is rarely the case. Tests can be just as flawed as any other code. Additionally, you may fail to cover certain corner cases and miss potential bugs. In the end, the general consensus is that the value of unit testing is a complicated religious argument.

Let’s assume that we want to unit test anyway and move on to the actual testing bits, shall we? We’ll start with a contrived example to make things easier. Assume we have the following code in a file called mytestcode.py:

#!/usr/bin/env python

def add(value1, value2):
    return value1 + value2

Simple enough: just a function that returns the sum of its two arguments. Let’s create some test cases.

#!/usr/bin/env python

from mytestcode import add

class TestAdd(object):
    def test_add(self):
        assert add(1,1) == 2

    def test_add_fail(self):
        assert add(1,1) != 3

What we have here are two simple test cases. First, we test to make sure that if we call the add function with two values, 1 and 1, we get 2 as the return value. Second, we test that providing the same values as input does not return 3. Simple, right? But have we really tested all of the corner cases? What happens if we feed the function a negative number? How about a non-numeric value? Are there cases where we can cause an exception?
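For instance, a quick interactive session (just a sketch, not part of the test suite) shows how loosely the function behaves:

>>> from mytestcode import add
>>> add(-1, 1)
0
>>> add('a', 'b')
'ab'
>>> add(1, 'a')
Traceback (most recent call last):
  ...
TypeError: unsupported operand type(s) for +: 'int' and 'str'

Negative numbers work fine, strings quietly concatenate, and mixed types raise an exception. None of that is covered by the two tests above.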

To be fair, the original function is poorly written and is merely being used as a simple example. This is the problem with contrived examples, of course. They miss important details, often simplify things too much, and can lead to beginners making big mistakes when using them as teaching tools. So please be aware, the above code really isn’t very good code. It’s intended to be simple to understand.

Let’s take a look at some “real” code directly from my distributed scanner project. This particular code is something I found on Stack Overflow when I was looking for a way to identify whether a process was still running or not.

#!/usr/bin/python

import errno
import os
import sys

def pid_exists(pid):
    """Check whether pid exists in the current process table.
    UNIX only.
    """
    if pid < 0:
        return False
    if pid == 0:
        # According to "man 2 kill" PID 0 refers to every process
        # in the process group of the calling process.
        # On certain systems 0 is a valid PID but we have no way
        # to know that in a portable fashion.
        raise ValueError('invalid PID 0')
    try:
        os.kill(pid, 0)
    except OSError as err:
        if err.errno == errno.ESRCH:
            # ESRCH == No such process
            return False
        elif err.errno == errno.EPERM:
            # EPERM clearly means there's a process to deny access to
            return True
        else:
            # According to "man 2 kill" possible error values are
            # (EINVAL, EPERM, ESRCH)
            raise
    else:
        return True

Testing this code should be relatively straightforward, with the exception of the os.kill call. For that, we’ll need to delve into mock objects. Let’s tackle the simple cases first:

#!/usr/bin/env python

import errno                     # needed by the mocked tests further down
import pytest

from unittest.mock import patch  # the standalone 'mock' package on Python 2

from libs.funcs import pid_exists

class TestFuncs(object):
    def test_pid_negative(self):
        assert pid_exists(-1) == False

    def test_pid_zero(self):
        with pytest.raises(ValueError) as e_info:
            pid_exists(0)

    def test_pid_typeerror(self):
        with pytest.raises(TypeError):
            pid_exists('foo')
        with pytest.raises(TypeError):
            pid_exists(5.0)
        with pytest.raises(TypeError):
            pid_exists(1234.4321)

That’s relatively simple. We verify that False is returned for a negative PID and that a ValueError is raised for a PID of zero. We also test that a TypeError is raised if we don’t provide an integer value. What’s left is handling a valid PID: testing that True is returned for a running process and False otherwise. In order to test the rest, we could go through a lot of elaborate setup to start a process, get its PID, and then test our code, but there’s a lot that can go wrong there. Additionally, we’re looking to test our logic and not the entirety of another module. So, what we really want is a way to provide an arbitrary return value for a given call. Enter the mock module.

The mock module is part of Python’s unittest framework. Essentially, the mock module allows you to identify a call or an object that you want to create a fake version of, and then provide the behavior you’re expecting that mocked version to have. So, for instance, you can mock a function call and simply provide the return value you’re looking for instead of having the real function run. This functionality allows you to precisely test your logic rather than performing a deeper integration test.
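As a tiny, self-contained illustration (a sketch separate from the project code), patching os.getpid() shows the basic mechanics:

#!/usr/bin/env python

import os
from unittest.mock import patch  # the standalone 'mock' package on Python 2

with patch('os.getpid') as fake_getpid:
    fake_getpid.return_value = 42   # the mocked call returns whatever we choose
    assert os.getpid() == 42        # the real getpid() is never invoked

print(os.getpid())                  # outside the patch block, the real function is restored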

To finish up our testing code for the pid_exists() function, we want to mock the os.kill() function and have it return specific values so we can check the various branches of code we have.

    @patch('os.kill')
    def test_pid_exists(self, oskillobj):
        oskillobj.return_value = None
        assert pid_exists(100) == True

    @patch('os.kill')
    def test_pid_does_not_exist(self, oskillobj):
        oskillobj.side_effect = OSError(errno.ESRCH, 'No such process')
        assert pid_exists(1234) == False

    @patch('os.kill')
    def test_pid_no_permissions(self, oskillobj):
        oskillobj.side_effect = OSError(errno.EPERM, 'Operation not permitted')
        assert pid_exists(1234) == True

    @patch('os.kill')
    def test_pid_invalid(self, oskillobj):
        oskillobj.side_effect = OSError(errno.EINVAL, 'Invalid argument')
        with pytest.raises(OSError):
            pid_exists(2468)

    @patch('os.kill')
    def test_pid_os_typeerror(self, oskillobj):
        oskillobj.side_effect = TypeError('an integer is required (got type str)')
        with pytest.raises(TypeError):
            pid_exists(1234)

The above code tests all of the branching available in the rest of the function, verifying the logic we’ve written. The code should be pretty straightforward. The return_value attribute of a mock object directly defines what we want the mocked function to return, while the side_effect attribute allows us to raise an exception in response to the function call. With those two features of a mocked object, we’re able to successfully test the rest of the cases we need.

This little journey to learn how to write unit tests has been fun and informative. I just need to finish up the rest of the code, striving to hit as close to 100% coverage as I can while keeping the test cases reasonable. It’s taken a while to get going, but the more code I’ve been writing, the faster and more accurate I’m getting. As they say, “practice makes perfect,” though I’d settle with functionally complete and relatively bug-free.
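If you want to keep score along the way, the pytest-cov plugin (assuming you have it installed; the libs path matches my project layout) will report exactly which lines remain untested:

$ pip install pytest-cov
$ pytest --cov=libs --cov-report=term-missing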

One final word of caution. I’m a sole developer working on this code, so I’m the only one around to write test cases. In a larger shop, the originator of the logic should not be the one writing the test cases. The reason for this is that the original coder typically knows their code quite well and has expectations regarding how the code will be used. For instance, I’m expecting anyone calling the add() function I wrote above to supply only numbers, and I haven’t added any sort of type checking or input validation. As a result, I avoided adding test cases that supply invalid inputs, knowing they would fail. Someone else writing the test cases would likely have provided a number of different inputs and found that input validation was missing. So if you’re in a larger shop, do yourself a favor and have someone else write your test cases. And to ensure they provide robust test cases, only provide the function prototypes and not the full function definitions.

Contemplating the Future

In 2005 I obtained a job at a regional ILEC as a Data Operations Technician. As part of this job, I took over development of one of the tools we used to diagnose customer DSL connections. Problem was, this tool was written in PHP, a programming language I was, as yet, unfamiliar with.

At the same time, I was also looking for a web-based tool I could use to keep track of various tasks. While there were a few open-source tools I could use, none had the features I was looking for. So I decided to write one myself, and to write it in PHP so I could learn the language better. In the end, I’m glad I did as PHP has become indispensable for writing web-based tools.

The tool I wrote was a web-based todo manager called phpTodo. Since the alpha release in 2005, I have released 7 more versions. Work on phpTodo has ebbed and flowed with time, often interrupted by work and life in general. In fact, the last formal release was made almost 5 years ago, bringing the current version up to 0.8.1. In 2009, I found out that phpTodo was being packaged and released with Fedora as well.

After releasing 0.8.1, I decided to switch from using categories to using tags, similar to how the blogging system I use, Serendipity, uses them. This required rewriting a good deal of the back end of the system, as well as making extensive changes to the front end. I also started using the Prototype and Scriptaculous JavaScript frameworks, and later switched to jQuery. In all, a great deal of code has been rewritten.

I’m quite happy with the general feel of the new version I’ve been working on. While there is a good deal more code to be written, I’m confident there will be a code release soon enough.

I’ve been thinking a lot about the future of phpTodo and where I want to take it. When I originally started, I wrote the system such that I could see my todo list items via an RSS feed. At the time, I had a Blackberry phone and this worked brilliantly. Of course, this was purely a one-way feed with no way to update any todo items on the go. Since that time, I started working on a mobile view for the system, but stopped quickly after I realized how horrible working with WAP was. Fortunately, technology has progressed quickly since that time and WAP is no longer necessary. So, I’m considering working on a mobile version again.

A mobile version brings new challenges, however. It should be trivial to develop a mobile view that can be used while online, but my hope was to have an offline version as well that can be synchronized with the online version. One possibility is to develop an app that can be loaded onto a phone. That, of course, severely limits the platforms it can be run on. Another possibility is an HTML5 version, though that brings challenges of its own.

Another thought was to build a web service into phpTodo. The basic premise is an XML generator that, given a set of parameters, can supply an XML feed for external systems to use as input, plus an XML parser that can receive data from external systems in order to update phpTodo data. I believe this can be used as the interface for the mobile view.

A web service can also be used to power another idea I had. I stumbled across the website of Brett Terpstra a while back and found a treasure trove of interesting ideas and useful code snippets. Among them is his obsession with recording notes to keep track of projects, interesting ideas, and helpful snippets. Brett uses a number of custom scripts and software packages, most of which are exclusive to his platform of choice, OS X. To be honest, I find this incredibly intriguing, and potentially useful. So, I’ve been thinking about developing a command-line tool I can use to interact with phpTodo. A web service could make this a great deal easier.

I have no plans to stop working on the project, and, in fact, I’m eager to keep moving forward. Since I rely on phpTodo for my daily work, I benefit directly from every improvement I make to the system. So overall, the future of phpTodo is bright.

Blacklisted!

Back in October of 2011, a bill was introduced in the House of Representatives called HR.3261, or the “Stop Online Piracy Act (SOPA).” Go take a look, I’ll wait. It’s a relatively straightforward bill, especially compared to others I’ve looked at. Hell, it’s only 15 pages long! And it’s going to kill the Internet.

OK, OK… it won’t *KILL* the Internet, but it has the potential to ruin what we consider to be the Internet. Personally, I believe that if this passes, it could turn the Internet into nothing more than a collection of business websites, at least in the US.

So how does this thing work? Well, it’s actually pretty straightforward. If your website is suspected of infringing on copyrighted material, your website is taken down, any advertising you have on your site is cut, and you are removed from search engines. But so what, you deserve it! You were breaking copyright law!

Not so fast. This applies to *any* content on your website. So if someone comments on a blog entry, or you innocently link to a website that infringes copyright, or other situations out of your control, you’re responsible. Basically, you have to police every single comment, link, etc. that appears on your website.

It’s even worse for service providers since they have to do the blocking. Every infringing site is blocked via DNS. And since the US doesn’t have control of all of DNS, and some infringing sites are not located in the US, this means we move into the realm of having DNS blacklist files. The ISP becomes the responsible party if they fail to block these sites, which in turn means more overhead for the ISP. Think you pay a lot for Internet access now?

So what can you do? Well, for one, you can contact your representative and tell them how insane this whole idea is. And you can protest SOPA itself by putting up a protest overlay on your site. There’s a GitHub project with all of the source code you need to add an overlay to your website. Or, if you have a Serendipity web blog, you can download the Stop SOPA plugin I’ve written.

Get out there and protest!

Fixing the Serendipity XMLRPC plugin

A while ago I purchased a copy of BlogPress for my iDevices. It’s pretty full-featured and seems to work pretty well. Problem was, I couldn’t get it to work with my Serendipity-based blog. Oh well, a wasted purchase.

But not so fast! Every once in a while I go back and search for a possible solution. This past week I finally hit paydirt. I came across this post on the s9y forums.

This explained why BlogPress was crashing when I used it. In short, it was expecting to see a categoryName tag in the resulting XML from the Serendipity XMLRPC plugin. Serendipity, however, used description instead, likely because Serendipity has better support for the MetaWeblog API.

Fortunately, fixing this problem is very straightforward. All you really need to do is implement both APIs, returning the necessary data for both at the same time. To fix this particular problem, it’s a single-line addition to the serendipity_xmlrpc.inc.php file located in $S9YHOME/plugins/serendipity_event_xmlrpc. That addition is as follows:


if ($cat['categoryid']) $xml_entries_vals[] = new XML_RPC_Value(
    array(
      'description'   => new XML_RPC_Value($cat['category_name'], 'string'),
      // XenoPhage: Add 'categoryName' to support mobile publishing (Thanks PigsLipstick)
      'categoryName'  => new XML_RPC_Value($cat['category_name'], 'string'),
      'htmlUrl'       => new XML_RPC_Value(serendipity_categoryURL($cat, 'serendipityHTTPPath'), 'string'),
      'rssUrl'        => new XML_RPC_Value(serendipity_feedCategoryURL($cat, 'serendipityHTTPPath'), 'string')
    ),
    'struct'
);

And poof, you now have the proper category support for Movable Type.

Sort By Sound ?

I ran across this a few weeks ago and I thought it was simply brilliant. Sorting algorithms are, for better or worse, among the most used algorithms in a programmer’s toolbox. For many, sorting is just something you need to learn to pass a computer science course. Others devote their lives to researching them.

The following two videos show an interesting view of sorting. An enterprising programmer decided to add a bit of sound to the sorting. There are endless ways the initial data can be arranged, so these sounds don’t represent how every sort of that type will sound. But the sound coupled with the visual representation of the sort make these videos worth a glance.

SQL Query Conundrum…

I have a brain teaser for ya. I’m looking for a way to solve a SQL problem efficiently, specifically using MySQL. The goal is to get a count of the number of unique rows returned by a complex query. It’s actually for a pagination system, so I can determine the limits necessary to efficiently query the database for the right amount of data rather than return everything and try to brute-force it.

Let’s say I have three tables as follows:

mysql> describe person;
+-------+------------------+------+-----+---------+-------+
| Field | Type             | Null | Key | Default | Extra |
+-------+------------------+------+-----+---------+-------+
| id    | int(10) unsigned | NO   | PRI | NULL    |       |
| first | char(15)         | YES  |     | NULL    |       |
| last  | char(15)         | YES  |     | NULL    |       |
+-------+------------------+------+-----+---------+-------+
3 rows in set (0.02 sec)

mysql> describe interests;
+----------+------------------+------+-----+---------+-------+
| Field    | Type             | Null | Key | Default | Extra |
+----------+------------------+------+-----+---------+-------+
| id       | int(10) unsigned | NO   | PRI | NULL    |       |
| interest | char(15)         | YES  |     | NULL    |       |
+----------+------------------+------+-----+---------+-------+
2 rows in set (0.00 sec)

mysql> describe interest_link;
+-------------+------------------+------+-----+---------+-------+
| Field       | Type             | Null | Key | Default | Extra |
+-------------+------------------+------+-----+---------+-------+
| person_id   | int(10) unsigned | NO   |     | NULL    |       |
| interest_id | int(10) unsigned | NO   |     | NULL    |       |
+-------------+------------------+------+-----+---------+-------+
2 rows in set (0.00 sec)

Simple enough. I’m mapping interests to people. I’ve entered data into these tables as follows:

mysql> select * from person;
+----+-------+----------+
| id | first | last     |
+----+-------+----------+
|  1 | John  | Doe      |
|  2 | Bob   | Jones    |
|  3 | Joe   | Smith    |
+----+-------+----------+
3 rows in set (0.00 sec)

mysql> select * from interests;
+----+-----------+
| id | interest  |
+----+-----------+
|  1 | Computers |
|  2 | Music     |
|  3 | Food      |
|  4 | Beer      |
|  5 | Gaming    |
+----+-----------+
5 rows in set (0.00 sec)

mysql> select * from interest_link;
+-----------+-------------+
| person_id | interest_id |
+-----------+-------------+
|         1 |           1 |
|         1 |           2 |
|         1 |           4 |
|         2 |           1 |
|         2 |           5 |
|         2 |           4 |
|         3 |           3 |
|         3 |           2 |
|         3 |           4 |
+-----------+-------------+
9 rows in set (0.00 sec)

So far, so good. Now, I want to do a search to find which users are interested in music. Simple enough: I’d do this with a select statement as follows:

mysql> select * from person as p left join interest_link as il on il.person_id = p.id where interest_id = 2;
+----+-------+--------+-----------+-------------+
| id | first | last   | person_id | interest_id |
+----+-------+--------+-----------+-------------+
|  1 | John  | Doe    |         1 |           2 |
|  3 | Joe   | Smith  |         3 |           2 |
+----+-------+--------+-----------+-------------+
2 rows in set (0.00 sec)

But what if I want to find out who’s interested in music *and* beer?

mysql> select * from person as p left join interest_link as il on il.person_id = p.id where interest_id in (2,4);
+----+-------+----------+-----------+-------------+
| id | first | last     | person_id | interest_id |
+----+-------+----------+-----------+-------------+
|  1 | John  | Doe      |         1 |           2 |
|  1 | John  | Doe      |         1 |           4 |
|  2 | Bob   | Jones    |         2 |           4 |
|  3 | Joe   | Smith    |         3 |           2 |
|  3 | Joe   | Smith    |         3 |           4 |
+----+-------+----------+-----------+-------------+
5 rows in set (0.00 sec)

That’s a problem: now I have 5 rows. How do I make this a unique list? Well, I’m merely interested in names and IDs, so I can do this:

mysql> select p.id, p.first, p.last from person as p left join interest_link as il on il.person_id = p.id where interest_id in (2,4);
+----+-------+----------+
| id | first | last     |
+----+-------+----------+
|  1 | John  | Doe      |
|  1 | John  | Doe      |
|  2 | Bob   | Jones    |
|  3 | Joe   | Smith    |
|  3 | Joe   | Smith    |
+----+-------+----------+
5 rows in set (0.00 sec)

But that’s still 5 rows… so what now?

mysql> select distinct p.id, p.first, p.last from person as p left join interest_link as il on il.person_id = p.id where interest_id in (2,4);
+----+-------+----------+
| id | first | last     |
+----+-------+----------+
|  1 | John  | Doe      |
|  2 | Bob   | Jones    |
|  3 | Joe   | Smith    |
+----+-------+----------+
3 rows in set (0.00 sec)

Aha! Perfect. That’s what I need… almost. For this particular application, I want to paginate, so I need the total number of matching rows so I can properly identify the limits as well as the upper bound on page numbers. So, I’ll just replace the specific field names with a count(*):

mysql> select distinct count(*) from person as p left join interest_link as il on il.person_id = p.id where interest_id in (2,4);
+----------+
| count(*) |
+----------+
|        5 |
+----------+
1 row in set (0.00 sec)

And here is where I’m stuck. I need the total count of DISTINCT names, not the total number of rows returned. I tried a GROUP BY, but that didn’t help much:

mysql> select count(*) from person as p left join interest_link as il on il.person_id = p.id where interest_id in (2,4) group by p.id;
+----------+
| count(*) |
+----------+
|        2 |
|        1 |
|        2 |
+----------+
3 rows in set (0.00 sec)

Sure, I get 3 rows, but what I’m looking for here is a single row with the total number of items. So, what if I count the number of returned rows?

mysql> select count(*) from (select count(*) from person as p left join interest_link as il on il.person_id = p.id where interest_id in (2,4) group by p.id) as foo;
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.00 sec)

BUT… at what cost? This seems like a rather complex query that might break down significantly when there’s a lot of data. And the examples above are rather simplistic. In reality, we’re talking about more fields and more tables, so even the simpler query gets a little complex to begin with. Yes, I am aware of indexing and how it speeds things up; I use indexing, I just eliminated it from the above example to simplify things. I’m open to ideas on how to do this properly via SQL.
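For what it’s worth, one more form I plan to benchmark is letting COUNT() do the deduplication itself, which collapses the subselect into a single query. Against the sample data above, it returns the single row of 3 I’m after:

select count(distinct p.id) from person as p left join interest_link as il on il.person_id = p.id where interest_id in (2,4);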

I can simply return all the rows with the DISTINCT clause, count them programmatically, and then proceed with the rest of the program, but depending on the selections made by the user, there could be a significant amount of data returned. I’m worried about both memory exhaustion on the part of the scripting language and the processing and transmission time required to pass all of that data back from the SQL database. Besides, this is the sort of problem that SQL was designed to solve.

I don’t think this is a unique problem, so someone out there has a solution. Perhaps the subselect *is* the better solution, but I don’t think so. I’m open to ideas. You can leave a comment here, or hit me up on twitter.


The Authentication Problem

Authentication is a tricky problem. The goal of authentication is to verify the identity of the person, device, machine, etc. that is attempting to gain access to the protected system. There are many factors to consider when designing an authentication system. Here is a brief sampling:

  • How much security is necessary?
  • Do we require a username?
  • How strong should the password be?
  • Do we need multi-factor authentication?

The need for authentication typically means that the data being accessed is sensitive in some way. This can be something as simple as a todo list or a user’s email, or as important as banking or top secret information. It can also mean that the data being accessed is valuable in some way such as a site that requires a subscription. So, the security necessary is dependent on the data being protected.

Usually, authentication systems require a username and some form of a password. For more secure systems, multi-factor authentication is used. Multi-factor authentication means that multiple pieces of information are used to authenticate the user. These vary depending on the security required. In the United States, federal regulators recognize the following factors:

  • Something the user knows (e.g., password, PIN)
  • Something the user has (e.g., ATM card, smart card)
  • Something the user is (e.g., biometric characteristic such as a fingerprint)

A username and password combination is an example of a single-factor authentication mechanism. When you use an ATM, you supply it with your ATM card and then enter your PIN. This is an example of two-factor authentication.

The U.S. Federal Financial Institutions Examination Council (FFIEC) recommends the use of multi-factor authentication for financial institutions. Unfortunately, most of the authentication systems currently in place are still single-factor systems, despite asking for several pieces of information. For example, when you log into your bank you use a username and password. Once those pass, you are often asked for additional information, such as answers to challenge questions. These are all examples of things the user knows, and thus still only a single factor.

Some institutions have begun using additional factors to identify the user such as a one-time password sent to an email address or cell phone. This can be cumbersome, however, as it can often take additional time to receive this information. To combat this, browser cookies are used after the first successful authentication. After the user logs in for the first time, they are offered a chance to have the system place a “secure token” on their system. Subsequent logins use this secure token in addition to the username and password to authenticate the user. This is arguably a second factor as it’s something the user has, as opposed to something they know. On the other hand, it is extremely easy to duplicate or steal cookies.

There are other ways that two-factor authentication can be circumvented as well. Since most institutions only use a single communication mechanism, hijacking that communication medium can result in a security breach.

Man-in-the-middle attacks use fake websites to lure users in and steal the credentials they supply. This can happen transparently to the user by forwarding the information to the actual institution and letting the user continue to access the system. More sophisticated attacks have the user “fail” authentication the first time and let them in on subsequent tries. The attacker can then use the first authentication attempt to gain access themselves.

Another method is the use of Trojans. If a user can be tricked into installing malicious software into their system, an attacker can ride on the user’s session, injecting their own transactions into the communications channel.

Defending against these attacks is not easy and may be impossible in many situations. For instance, requiring a second method of communication for authentication may help to authenticate the user, but if an attacker can hijack the main communication path, they can still obtain access to the user’s current session. Use of encryption and proper training of users can help mitigate these types of attacks, but ultimately, any system using a public communication mechanism is susceptible to hijacking.

Session Security

Once authentication is complete, session security comes into play. Why go through all the trouble of authenticating the user if you’re not protecting the data they’re accessing? Assuming that the data itself is protected, we need to focus on protecting the data being transferred to and from the user. Additionally, we need to protect the user’s session itself.

Session hijacking is the term used to describe stealing a user’s session information in order to gain access to whatever the user is accessing. There are four primary methods of session hijacking.

  • Session Fixation
  • Session Sidejacking
  • Physical Access
  • Cross-site Scripting

Physical access is pretty straightforward. This involves an attacker directly accessing the user’s computer terminal and copying the session data. Session data can be something as simple as an alphanumeric token displayed right in the URL of the site being accessed. Or, it can be a piece of data on the machine such as a browser cookie.

Session fixation refers to a method by which an attacker can trick a user into using a pre-determined session ID. Once the user authenticates, the attacker gains access by using the same session ID. The system recognizes the session ID as an authenticated session and lets the attacker in without further verification.

Session sidejacking involves an attacker intercepting the traffic between a user and the system. If a session is not encrypted, the attacker can obtain the session ID or cookie used to identify the user’s session. Once this information is obtained, the attacker can use the same information to gain access to the user’s session.

Finally, cross-site scripting is when an attacker tricks the user’s browser into sending session information to the attacker. This can happen when a user accesses a website that contains malicious code. For instance, an attacker can create a website with a special link to a well-known site such as a bank. The link contains additional code that, when run, sends the user’s authentication or session information to the attacker.

Encryption of the communications channel can mitigate some of these attack scenarios, but not all of them. Programmers should ensure that additional information is used to verify a user’s session. For instance, something as simple as verifying the user’s source IP address in addition to a session cookie is often enough to mitigate both physical access and session sidejacking. Not allowing a pre-defined session ID can prevent session fixation. And finally, proper coding can prevent cross-site scripting.
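As a rough sketch of a couple of those mitigations (the helper names and in-memory store are purely illustrative), the checks only take a few lines:

import secrets

sessions = {}  # server-side store: token -> session record

def create_session(user, client_ip):
    # Issue a fresh, unpredictable token at login and never accept a
    # client-supplied session ID, which prevents session fixation.
    token = secrets.token_hex(32)
    sessions[token] = {'user': user, 'ip': client_ip}
    return token

def validate_session(token, client_ip):
    record = sessions.get(token)
    if record is None:
        return None  # unknown or logged-out session
    if record['ip'] != client_ip:
        return None  # cookie presented from a different source address
    return record['user']

The IP check is the source-address verification mentioned above. It isn’t bulletproof (users behind changing proxies will be logged out), but it raises the bar considerably.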

Additionally, any session information stored on the remote system being accessed should be properly secured as well. Merely securing the data accessed isn’t enough if an attacker can access the remote system and steal session information.

Unauthentication

Finally, how and when should a user be unauthenticated? Unauthentication is often overlooked when designing a secure system. If the user fails to log out, then attacks such as session hijacking become easier. Unauthentication can be tricky, however. There are a number of factors to consider, such as:

  • How and when should a user’s session be closed?
  • Should a user’s session time out?
  • How long should the timer be?

Most unauthentication currently consists of a user’s session timing out. After a pre-determined period of inactivity, the system will log a user out, deleting their session. Depending on the situation, this can be incredibly disruptive. For example, if a user’s email system has a short time out, they run the risk of losing a long email they’ve been working on. Some systems can mitigate this by recording the user’s data prior to logging them out, making it available again upon login so the user doesn’t lose it. Regardless, the longer the time out, the less secure a session can be.

Other unauthentication mechanisms have been discussed as well. When a physical token such as a USB key is used, the user can be unauthenticated when the key is removed from the system. Or, a device with some sort of radio in it, such as Bluetooth, can unauthenticate the user when it is removed from the proximity of the system. Unfortunately, users will likely end up leaving these devices behind, significantly reducing their effectiveness.

As with authentication, unauthentication methods can depend on the sensitivity of the data being protected. Ultimately, though, every system should have some form of automatic unauthentication.

Data security in general can be a difficult nut to crack. System designers are typically either very lax in their security design, often overlooking session security and unauthentication, or they can be very draconian, opting to make the system very secure at the expense of the user. Designing a user-friendly, but secure, system is difficult, at best.


phpTodo … In Fedora!!

Apparently I’m always the last to know… but I found out today that phpTodo, the todo list manager I wrote (and continue to write), has been included in Fedora. In fact, it seems it’s been in there since Fedora 7. It’s not in the main distribution, nor should it be, but apparently it’s a maintained package. Thanks, Marc!

Honestly, I’m truly honored. I wrote this on a whim and it has served me well. I use it every day! And since writing it, I’ve had a handful of people make suggestions and offer patches. I think it’s been pretty successful for a small project.

So, how about an update? Well, I’ve been working on phpTodo in my spare time, which, unfortunately, has been relatively lacking as of late. I have been able to add a number of fixes and new features, however. The biggest change in the next release will be the removal of categories in favor of tags. In using phpTodo over the years, I’ve noticed a number of times where I’d like to be able to put an item in multiple categories, and display multiple categories at once. While this may have been possible with categories as they were implemented, I think tags work a bit better. I’ve borrowed an idea from the Serendipity blogging platform to implement tags in a user-friendly manner, so I think the implementation works pretty well. I still have some more work to do to tie it all together, but it is coming along.

Another change is the addition of the Prototype and Scriptaculous JavaScript frameworks. There are a few reasons I decided to go this route. First and foremost, it significantly reduces the amount of work necessary to perform cross-platform JavaScript operations. To date, I’ve used relatively simple JavaScript functions, mostly for front-line input validation, but with the addition of tags, I wanted to move into some more advanced techniques. I’m striving to keep it simple and not overdo it, so don’t worry.

And, of course, there are the various bug fixes that need to be added. Overall, I’m excited about the next release of phpTodo. I don’t have a timetable as of now, but I’m hopeful that my free time will increase shortly, giving me more time to work on it. If so, then I’m optimistic about a new release sometime in the next 3-4 months. We’ll see what happens.

If you’re using phpTodo, I’d like to hear from you. I’m interested in what you like and what you dislike about the program, the interface, the workflow, etc. What features would you like to see? What features would you hate to see?

Thanks!


CVS to Subversion…

I’ve been using CVS for a number of years and it has served me well. I have looked at Subversion a number of times, but never really had the time to deal with it. That has changed somewhat and I have had the chance to use SVN a bit more recently. SVN feels a bit more elegant, and, in most cases, faster than CVS. But, I’m also having a bit of trouble as well. Perhaps someone out there can provide me with some insight into my problems.

Most, if not all, of my recent coding has been in languages such as Perl and PHP. Additionally, I mainly code alone, so my use of a revisioning system is purely for historical data rather than proper merging. I also use CVS to handle updates of deployed code. This alone has proven to be the strongest reason to continue using a revisioning system.

With CVS, I develop code until I’m ready to deploy it. At that point, I tag the current revision, usually with a tag of RELEASE. Code is then deployed by checking out the code currently tagged as RELEASE. From there, when I update the code for a new release, I use the -F flag to force the RELEASE tag onto the new code. A simple cvs update handles updating the deployed code to the latest release. If the deployed code was changed for some reason, as sometimes happens, CVS handles merging and I can make any necessary adjustments. Overall, this has worked quite well for some time. There are hiccups here and there, but it has been pretty much flawless.
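In command form, the whole cycle looks something like this (the module name is just an example):

# development side: force the RELEASE tag onto the current revisions
cvs tag -F RELEASE

# deployment side: check out RELEASE once, then update to follow the tag
cvs checkout -r RELEASE mymodule
cvs update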

Recently, I used cvs2svn to convert my existing CVS repositories over to SVN. After some false starts, some research, and a few minor headaches, I have all of my code converted over to SVN. I was able to get websvn running as well, which is a nice change as I can browse the repositories freely. I started playing around a bit and noticed that all of the imports have three additional directories, trunk, tags, and branches. More research and I discovered that SVN doesn’t handle tags the same way that CVS does… This concerned me as I used tags pretty heavily for deployment.

So now we come to my problem. I have identified how to create new tags using svn copy. This works great for the first copy of a given tag, but it breaks down when updating a tag. A second copy fails because the files already exist. I can use svn delete to remove the files before copying the new ones, but that’s an additional step I have no desire to do. After all, the purpose of moving to SVN is to make life easier, not harder.
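The best I’ve come up with so far is hiding the two steps behind a trivial wrapper (the repository URL is a placeholder), which at least gets re-tagging back to a single command:

#!/bin/sh
# retag.sh -- point tags/RELEASE at the current trunk
REPO=file:///var/svn/myproject

svn delete "$REPO/tags/RELEASE" -m "Remove old RELEASE tag"
svn copy "$REPO/trunk" "$REPO/tags/RELEASE" -m "Re-tag trunk as RELEASE"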

After some more reading, I find that I can merge releases. Presumably, I can check out the tagged version and then merge changes from the trunk version. However, this is still more complicated as I have to merge the code and then commit it back to the repository. So, again, we have more steps than I want to deal with.

I think I understand the reason behind not being able to copy twice. I’m also aware that the way I was using CVS was fairly non-standard, but it worked for me. The code base I normally worked on could have multiple features I’m implementing at any given time, and deployment of one feature may get prioritized. So, merely copying the base to a new tag doesn’t quite work as not everything in that code may be complete at a given time.

So what are my options here? SVN has some advantages that I really like, including the web view and better handling of authentication and permissions. However, being unable to re-tag is kind of a pain. One way or another, I think I’ll be using SVN anyway, but I was kind of hoping to find a decent way to handle everything… Anyone out there have any suggestions?