
Scott Hetrick, Director of Technical Services
OK, so I kind of like being the good guy. Granted, there’s nothing too heroic about showing people the problems that sit right on the surface, but that’s exactly what I do. I’ve spent the bulk of my career visiting countless data centers of all types and sizes, and it astounds me every time how many people are not aware of potentially devastating problems until systems are down, data is lost, and business is impacted. And, I don’t mean impacted in a good way.
Service Level Agreements (SLAs), when written, are an attempt to ensure that downtime and data loss are minimized to a tolerable level that will not negatively impact business processes and revenue. However, in my experience, far too often the written agreement can’t even be achieved or, worse yet, is never drafted – it’s just an “understanding,” if that.
The impact of not meeting an SLA differs from organization to organization; however, one thing remains constant: there are monetary, productivity, and morale consequences. You may be one of those organizations that has an external SLA with a client, and you know the exact dollar value of your “fine” should you underachieve, not to mention the risk of losing the client completely. On the other hand, you might have internal agreements, written or understood, which require staff to put in tremendous amounts of overtime to meet, should something bad happen. What’s the cost of overtime, lost productivity, and negative morale? Hopefully, you are one of the lucky organizations that knows the answers to these questions and has the systems in place to meet a clearly defined SLA. I would like to meet you, because, unfortunately, you are a rare client.
I always ask clients if they test restores. To date, no one we’ve found has a documented procedure for testing recovery. Without tested restores, you don’t know if software upgrades have corrupted anything. You don’t know if you’ve backed up bad data. You don’t know if the tape is bad. What happens when a restore doesn’t work?
When it comes to SLAs, what is the confidence level of the typical CIO I meet? Frankly, not too confident. Even when he/she hears the department say, “Yes, we’re covered,” the story I hear too frequently is the CIO discovering SLAs cannot be met when something bad happens – discovering this AFTER something bad happens. And, what reasons are most often given by departments for missing SLAs? “Well, budgets got cut.” “We have new staff.” “We were never correctly trained on the products we have.” I hear these excuses repeatedly and have yet to find a CIO willing to accept any of them.
The unfortunate part of my customer interaction is witnessing the disconnect between business managers and the IT department. Surely there are organizations that do everything by the book, but more often than not, that’s not the case. For example, I worked with a client who had an Exchange failure. The system admin, a very capable individual, worked to fix it with the understanding that he had 24 hours to do so. Meanwhile, the CEO was calling his manager exactly every 14 minutes for a status update. Needless to say, there were unfortunate ramifications for individuals due to the miscommunication. What was the SLA with regard to their Exchange system? It was unknown, because no one was on the same page.