The Mystery of the Missing Service Provider
In the Defense Department, there is a web-conferencing service called “Defense Connect Online”. Until about two weeks ago, no one was responsible for providing this service. DCO is a finalist for the 2013 Excellence.gov awards, and no one was responsible for providing the service. (I kid you not.) This service has 800,000+ registered users, and there were over a half-billion user-conferencing-minutes used last year across the DoD. It is a huge success, especially in an era of declining budgets and severe travel restrictions. People love it, and they hate it too. They love it because it helps them do their jobs. They hate it because they see how much more it could be. Even the people who criticise DCO (and there are such) do so because they absolutely need the capabilities it provides them, and they wish it did more.
How could this be?
There is a Program Manager for DCO; I’ll call her “Carla” (not her real name). Carla is smart and hardworking, but she can only do what the bureaucracy lets her do. Once, last year, I noticed that a major part of DCO was “down”, and I contacted her immediately: “Hey, Carla, DCO is down!” She responded calmly, “I know.” (I expected her hair to be on fire.) “Well, what are you doing about it?” With a resigned, almost dejected tone, she replied: “I’m eating lunch. There isn’t anything I can do.” Me, incredulous: “What?!” Carla replied, “The DECC (Defense Enterprise Computing Center) guys do that. It’s up to the operations guys. My team isn’t allowed to touch anything in their data centers.”
Here’s the rub: the “operations guys” in the Enterprise Computing Centers who actually run the servers don’t know anything about web conferencing. They aren’t accountable for providing a web-conferencing service; they are accountable for making sure the servers in their data centers are up-and-running. They are the same guys who run the servers for the timecard-and-leave system, and logistics databases, and command-and-control systems. They are responsible for making sure the datacenter has a proper Authority To Operate, that systems in the datacenter have an Authority To Connect, that the OSes on the servers have all the current patches, and that the mandated intrusion detection systems are installed and operating. They run the servers, SANs, firewalls, routers and switches.
The PM is responsible for “acquiring the system”. The Enterprise Computing Centers are responsible for “running the servers”. There is a huge responsibility gap in between: “providing the service”. A system can be 100% available, working as designed, have passed Operational Test, and meet the JROC-validated requirements, and still not satisfy user needs.
An acquisition program manager is graded on cost, schedule, & performance. A datacenter operator is graded on uptime, capacity, & security. A service provider is graded on customer satisfaction – which, I argue, is the thing that really matters.
The Capacity Crisis
In Nov 2011, DCO started hitting its user limit of 3500 concurrent users for the web conferencing capability. The operations team responded by analyzing the performance of the system components and determined that they could safely raise the limit to 3750 users without compromising performance. (The limit was based both on licensing and on the system architecture.) DCO hit the new limit within two weeks. Later, a previously scheduled upgrade to a newer version of the server software allowed an increase to 4000 concurrent users. I don’t know what bottleneck caused this limit, but I understand that there was no simple fix. The 4000-user limit was hit two days after the upgrade.
Because of the responsibility gap, no one had the authority, ability, and motivation to fix this in a timely fashion. This failure should have been anticipated (it was, actually) and corrective action taken in advance (not so much), but there was no perceived urgency, because no one was responsible for end-user satisfaction. The government PM understood the problem, and wanted to fix it, but was constrained by not-enough-money and too-much-process.
When the system first started hitting the 4000-user limit for concurrent users, the PM asked her contractor to develop courses-of-action for increasing capacity. The way the system was designed, when the 4001st user tried to log on, he would be denied access. The system architecture was essentially unchanged from the initial pilot configuration from 2007. Carla convinced the Program Executive Officer (her boss’s boss) to allocate money to do the engineering work, and managed to get a contract mod awarded to redesign the system to be elastic, so that additional capacity would be easy to add in the future (and to add 500 additional users off the bat).
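The hard cap described above behaves, in essence, like the toy sketch below. This is my own illustration, not DCO’s actual design; the class and method names are invented. The point is the difference between a fixed architectural limit, where the 4001st login is simply refused, and an elastic design, where raising the cap is a routine operation rather than a redesign:

```python
class ConferenceGate:
    """Hypothetical sketch of a hard concurrent-user cap (not DCO's real code)."""

    def __init__(self, max_users):
        self.max_users = max_users  # the architectural/licensing limit
        self.active = 0             # users currently logged in

    def try_login(self):
        """Admit a user if under the cap; deny the next one outright."""
        if self.active >= self.max_users:
            return False  # e.g., the 4001st user is turned away
        self.active += 1
        return True

    def logout(self):
        self.active = max(0, self.active - 1)

    def raise_cap(self, new_max):
        # In an elastic design this is cheap and routine; in the original
        # architecture it required funding, a contract mod, and a redesign.
        self.max_users = new_max


gate = ConferenceGate(max_users=4000)
```

In this framing, “elastic” just means `raise_cap` (adding capacity) is an operational act, not an engineering program.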
I don’t know how long it took Carla to get funds allocated and put on contract. My guess is that it took 6 months, at least. The contractor delivered the new system architecture on schedule in Dec 2012.
If my guesses are correct, then the actual engineering only took about a month (they maybe started in advance of the contract mod award). [UPDATE: my guesses were wrong and my timeframe was off.] During those months, the usage/capacity issue had gotten progressively worse: at first the system only occasionally reached the capacity limit for brief periods, but eventually it spent several hours at the limit every weekday. This was like the frog in the pan of hot water: the first time it doesn’t work, you assume it’s a glitch. If the performance degrades slowly enough, you don’t realize that there’s a real problem.
When the new architecture was delivered, I asked Carla how long it would take to get it operational. She estimated that the upgraded solution could be fielded no earlier than 9 months out. I found this outrageous. Other members of the team (based on past experience) felt this was unreasonably optimistic, and predicted that it would be 18-24 months before the upgrade was fielded.
The reasons for this delay were myriad: no approved security tests for the new 64-bit OS, the primary data center was out of electrical power reserves, some of the components of the new architecture weren’t on the approved product list, the certification & accreditation process “just takes a long time”, etc. All surmountable, but all time-consuming, driven by a bureaucratic culture of risk-avoidance. Worst of all, there was no plan and no schedule. Many of the critical players did not work for Carla, and could not be “hurried”.
Consider: If Facebook or Twitter hit their underlying system-capacity limits, how long do you think it would take them to upgrade? Would they let their own rules slow them down from doing what they believed was the right solution?
In the Fall of 2012, the capacity issue rapidly got worse. The GSA conference scandals, the continuing budget resolution, and the impending sequestration forced widespread restrictions on travel across all of DoD. Since DCO is our designated enterprise web-conferencing solution, the capacity issues that DCO had been struggling with became an acute problem.
In late January 2013, an order went out to all DISA employees instructing them not to use the web-conferencing capability of DCO until further notice. People were confused, some astonished, and many angry. Personally, I thought it was brilliant, maybe even breathtaking.
It accomplished three things:
- On its face, temporarily forbidding DISA employees to use DCO freed up additional capacity for DISA’s customers.
- It forced each and every DISA employee to realize that DISA has customers, and implicitly stated that service to their customers is more important than their own convenience. DISA is a combat support agency – they exist to support the warfighting mission.
- It created a burning-platform issue hot enough to get the Agency leadership’s attention.
Effect #3 caused Carla’s boss to be chewed out personally by the Director of DISA, and forced him to explain to the Director how this crisis arose and what needed to be done to fix it. I’m sure the conversation was extremely unpleasant. The important outcome was that suddenly the Director was personally engaged in fixing the problems. He stood up and accepted that, as the Director of DISA, providing DCO as-a-service to the Defense Department was his responsibility, and he was not going to let his own agency’s bureaucracy keep him from succeeding. Carla and her chain became empowered to “break glass” to fix the problem.
The Happy Ending
Two weeks later, DISA fielded upgrades to double DCO capacity to 8000 concurrent users. This was an interim step towards fielding an elastic solution, which will take another month or two. (The additional capacity is temporarily at https://www2.dco.dod.mil/)
In my analysis, DISA has started acting like a service provider for DCO, rather than an acquisition agency. The agency has started to accept that they are responsible for providing web conferencing as a service. This is a really good start, but much work remains.
I believe the jury is still out on whether the Agency will accept this as a long-term change of perspective. Can the agency take the lessons learned from this DCO crisis to other DISA-provided services? Or will the bureaucracy squash their newfound agility as a one-time aberration? It would be easy to see the broken glass as shameful: “We had to break our processes,” rather than as courageous: “Our processes were breaking us, so we rose above them.”
Time will tell. Hope springs eternal.
This post is meant neither to criticise nor applaud DISA or DCO. I believe these types of problems are common in the Defense IT community (and probably across the government). My intent is to show an example of the confusion between “acquiring a system” and “providing a service” that I described in my previous post, and why the difference matters.