How to Track Service Level Agreements in Cloud Computing
A service level agreement (SLA) is a contractual obligation between you and your cloud computing services provider. Negotiating SLAs is often a dance between IT and the provider.
Some service levels are nonnegotiable. If a mission-critical application must be available for all but one hour per month, you can’t agree to a compromise. If the provider can’t meet that service level, you should reconsider the cloud option. Other SLAs have more wiggle room.
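To see what a requirement such as "all but one hour per month" means as a percentage, a little arithmetic helps. The following is a minimal sketch, assuming a 30-day month; the function names are made up for illustration and aren't part of any provider's tooling. It also runs the calculation in reverse, for targets such as 99.9 percent or five nines (99.999 percent).

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month (720 hours)


def availability_percent(allowed_downtime_hours, period_hours=HOURS_PER_MONTH):
    """Availability (in percent) implied by a downtime allowance over a period."""
    return 100.0 * (1 - allowed_downtime_hours / period_hours)


def downtime_budget_minutes(availability_pct, period_hours=HOURS_PER_MONTH):
    """Minutes of downtime that an availability target permits over a period."""
    return (1 - availability_pct / 100.0) * period_hours * 60


# One hour of downtime per month, expressed as an availability percentage
print(f"{availability_percent(1):.3f}%")             # 99.861%

# Monthly downtime allowed by two common targets
print(f"{downtime_budget_minutes(99.9):.1f} min")    # 43.2 min
print(f"{downtime_budget_minutes(99.999):.2f} min")  # 0.43 min
```

Under these assumptions, one hour per month is stricter than 99.8 percent but looser than 99.9 percent, so it pays to check which period the provider uses to measure availability.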
IT and the service provider must work together to establish these SLAs. Typical SLAs include the following:
Response times (possibly varying by transaction)
Availability on any given day
Overall uptime target
Agreed-on response times and procedures in the event a service goes down
The agreement theoretically gives you some assurance that the provider will meet certain service levels.
But, buyer beware! You need to determine the following:
Downtime: Depending on how critical your applications running in a cloud are, you will need a certain level of availability. Is 99.9 percent enough for you? Or, do you require five nines? How does the provider plan to ensure that it will meet its SLA? What failover and disaster recovery mechanisms does the provider have in place? Are you comfortable with them?
You need to read the fine print. Does the SLA’s availability figure count planned maintenance as downtime, or is maintenance handled separately? If it is separate, how do the maintenance windows affect you?
How the lines of responsibility are drawn: You don’t want to end up in a situation where the SaaS provider points a finger at the infrastructure provider and says the outage wasn’t its fault.
Cost of downtime: What does it mean to your operations if the cloud is down? Service providers might compensate simply based on the number of hours systems are down. What about the cost to your business?
Past incidents: Has your provider struggled with excessive downtime in the past? Check the record. Also look at service desk metrics, including the following:
Time to identify problem: Did a problem exist for a long time before it was reported? Is performance varying widely without warning? If this is true, it means that the monitoring system isn’t performing well and should be reviewed.
Time to diagnose: Time between an event report and the identification of the cause of the problem.
Time to fix: Time between diagnosis and system repair or resumption of service.
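If the provider shares incident records, the three service desk metrics above can be computed directly from timestamps. The following is a rough sketch only: the Incident record, its field names, and the two sample incidents are all hypothetical, and a real provider's incident data will be organized differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record layout; the field names are assumptions,
    # not any provider's actual schema.
    started: datetime    # when the problem actually began
    reported: datetime   # when the problem was reported
    diagnosed: datetime  # when the cause was identified
    resolved: datetime   # when the system was repaired or service resumed

    @property
    def time_to_identify(self) -> timedelta:
        return self.reported - self.started

    @property
    def time_to_diagnose(self) -> timedelta:
        return self.diagnosed - self.reported

    @property
    def time_to_fix(self) -> timedelta:
        return self.resolved - self.diagnosed


def mean(durations):
    """Average a non-empty collection of timedelta values."""
    durations = list(durations)
    return sum(durations, timedelta()) / len(durations)


# Made-up sample data for illustration only
incidents = [
    Incident(datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 40),
             datetime(2024, 1, 5, 10, 10), datetime(2024, 1, 5, 11, 0)),
    Incident(datetime(2024, 2, 12, 14, 0), datetime(2024, 2, 12, 14, 20),
             datetime(2024, 2, 12, 15, 30), datetime(2024, 2, 12, 16, 0)),
]

print("Mean time to identify:", mean(i.time_to_identify for i in incidents))  # 0:30:00
print("Mean time to diagnose:", mean(i.time_to_diagnose for i in incidents))  # 0:50:00
print("Mean time to fix:     ", mean(i.time_to_fix for i in incidents))       # 0:40:00
```

A long or widely varying time to identify is the figure to watch first, because it points to weak monitoring rather than slow repair.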
Ideally, your service provider gives you enough visibility into its operations that you can check these metrics yourself.
The SLA information you should capture from your provider is part of the overall key performance indicators (KPIs) for your company.
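One KPI worth considering is the gap between what the provider pays for an outage and what the outage costs your business, as raised under cost of downtime earlier. The following is a minimal sketch, assuming the provider compensates at a flat rate per hour of downtime; the function name and every dollar figure are hypothetical.

```python
def downtime_gap(hours_down, credit_per_hour, business_cost_per_hour):
    """Return (provider credit, business cost, uncovered loss) for an outage.

    Assumes flat hourly rates on both sides, which is a simplification.
    """
    credit = hours_down * credit_per_hour
    cost = hours_down * business_cost_per_hour
    return credit, cost, cost - credit


# Hypothetical figures: a 3-hour outage, a $50/hour service credit,
# and $2,000/hour in lost business
credit, cost, gap = downtime_gap(hours_down=3, credit_per_hour=50,
                                 business_cost_per_hour=2000)
print(f"Provider credit: ${credit}")  # Provider credit: $150
print(f"Business cost:   ${cost}")    # Business cost:   $6000
print(f"Uncovered loss:  ${gap}")     # Uncovered loss:  $5850
```

Your own numbers will differ, but running the comparison before you sign shows how much of the risk the SLA actually covers.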