By Christopher Cicchitelli
As a fervent Second Amendment supporter and the founder of a technology company, I’ve followed the ATF eForms story with great personal interest. It showed great promise at first, cutting wait times down to under 60 days…and then it imploded. Exactly what happened was a question we’ve all asked, and until reading the ATFs announcement yesterday I didn’t really have a good idea. However, in that announcement I believe the ATF tipped us off to the key that unravels the mystery . . .
First, some full disclosure: prior to starting my company, CastleOS, I worked as a contractor to a government R&D lab where, among other responsibilities, I designed integrated systems and managed servers for multiple civilian and military agencies.
Now for some key facts we know: despite the ATFs initial claims otherwise, the eForms system was unusable even when batch processing was not taking place. Even if we presume Silencer Shop initiated batch processes outside the 4-5 AM window they claim, I find it hard to believe they did it all day every day.
In addition, in today’s ATF announcement, they claimed there are “memory allocation errors” within the system. In order to keep it usable for all, it needs to be rebooted multiple times a day, taking up to an hour for each reboot. That’s key, and I’ll come back to it later.
In addition to these facts, there are several questions we are all asking. How was batch processing ongoing if the system wasn’t built for it? Why does the system need to be rebooted every few hours (this is 2014 after all!)? Whose fault is it really?
After doing some research, I think I have narrowed down the realm of possibilities for the answers to those questions, so here goes. One of the first things I noticed when visiting the eForms website (well, after I got to the website to load – it took a few minutes) was the technology it’s built upon: JavaServer Pages. To say JSP isn’t exactly the most popular platform these days would be an understatement. While still around for legacy reasons, you don’t often seen it used for new projects without a specific reason – a reason I don’t see the ATF having. From my own experience as a contractor, I wonder if this was a matter of the contracting company having available labor in this speciality, and thus pushing the platform on the ATF. If so, the ATF contracting officer would certainly deserve some blame for this as well.
The other immediate observation was the speed of the website loading up. It repeatedly took about five minutes to reach the login screen at around 7pm EST. What this shows is a faulty system architecture design. A proper design would separate all tiers, including at the servers themselves, into the following three categories: user interface, data broker, and data warehouse.
There’s no doubt that there is a heavy amount of processing of data going on in this system, and it’s slowing the whole system down – so there isn’t a proper separation of tiers in this design. The fact that the data warehousing can slow down the most basic parts of the user interface – the website itself and login page – is a major failure. So much so that for the millions of dollars spent on this website, I’d argue the government would have recourse to recoup (at least some of) the money it laid out.
Next is the question of how batch processing was happening if it wasn’t designed for it. The ATF has continually sounded the horn about the batch processing, but the truth is it’s misdirection. Whether the eForms system has a batch option or whether the Silencer Shop is using a custom method — possibly something as simple as a macro — to automate the process is 100% irrelevant. The reason is for all intents and purposes, a batch process just simulates the effect of multiple stores logging in at the same time. What we are seeing is a system that goes down with very little load, possible just a few dozen simultaneous users, or one user uploading a few dozen forms a night, and I think the reason lies in the ATF’s “memory allocation error”.
Traditionally, a memory allocation error is when data is corrupted or misplaced on actual memory. I don’t think the ATF has an actual error with the server memory itself – using JavaServer Pages should prevent that – but rather is PR spin for a more complicated resource allocation error.
What I think is happening is as forms are entered into the system, they enter a bureaucratically-inflated workflow. In that workflow, the system is probably generating loads of actual documents in addition to lots of database entries and so forth. (After all, do you really expect the ATF not to keep a backup paper trail? Pfff.) In other words, each application isn’t as simple as the typical web form we are used to using, and it does require some real horsepower to process it from start to finish.
Now normally that’s not an issue, each server has its maximum number of users, and you add servers as needed. The government even has a gov-only cloud it can use to deploy servers on demand – so keeping up with demand shouldn’t require more than a few minutes to boot up a new instance. However with the eForms system, the opposite seems to be true, and they can never bring enough servers online to keep up with demand (they tried at the beginning). I believe the reason is because load isn’t the issue, batch or otherwise, but a fundamental flaw in the design of the system: applications are gobbling up resources as they are processed, and then when complete, not releasing those resources.
The proof of that is the reboots every few hours to clear the “memory allocation errors” – if the code isn’t releasing the resource, you need to reboot. (Why it takes an hour isn’t clear – but that may have more to do with their attempts jury rig fixes into the system, and/or having to bring systems online in a certain order.)
I also believe that in the beginning, before too many people were using the eForms system, this same flaw was present, but was being covered up by nightly reboots. Once the load increased, the problem literally spread exponentially, and now the system can only run for roughly 4 hours until all the servers run out of resources.
The saddest thing about this for me, is not that the system failed, I’ve seen that far too many times before in government. It’s that so many millions were spent to bring a system online with such a fundamental flaw. I’d give the Obamacare website team a break long before I’d cut ATF any slack for this fiasco.