Having returned from my Winter Sun break I can now publish some findings from our data.
Our average uptime across all clients was 99.75%, which equates to an average of 3.6 minutes of downtime per day. Our worst weekly uptime was 97.82% (the best was, obviously, 100%!).
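For anyone who wants to check the arithmetic, here is a minimal sketch of the conversion from an uptime percentage to minutes of downtime per day. The function name is my own, purely for illustration; it is not part of our monitoring tooling.

```python
MINUTES_PER_DAY = 24 * 60  # 1440


def downtime_minutes_per_day(uptime_percent: float) -> float:
    """Average minutes of downtime per day implied by an uptime percentage."""
    downtime_fraction = (100.0 - uptime_percent) / 100.0
    return downtime_fraction * MINUTES_PER_DAY


# 99.75% uptime leaves 0.25% of 1440 minutes as downtime.
print(round(downtime_minutes_per_day(99.75), 2))  # 3.6
```

The same function shows why small differences in the percentage matter: every 0.01% of uptime is worth about 8.6 seconds per day.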
Now, consider the reasons for these outages: typically, a large Rails deployment will take a site down for a short period. If we take out the periods where deployments were scheduled, we see slightly different figures:
Average uptime rises to 99.91%, or roughly 1.3 minutes of downtime per day (mostly scheduled infrastructure work), with a worst uptime of 99.42%.
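To illustrate what "taking out" scheduled deployments means, here is a rough sketch under some stated assumptions: downtime that falls inside a scheduled deployment window is simply not counted, the measurement period itself is left unchanged, and deployment windows do not overlap one another. The dates, outages and function names are invented for the example and are not our real data or tooling.

```python
from datetime import datetime


def overlap_seconds(a_start, a_end, b_start, b_end):
    """Seconds shared by the intervals [a_start, a_end) and [b_start, b_end)."""
    start = max(a_start, b_start)
    end = min(a_end, b_end)
    return max((end - start).total_seconds(), 0.0)


def uptime_percent(period_start, period_end, outages, excluded_windows=()):
    """Uptime over a period, ignoring downtime inside excluded windows.

    Assumes the excluded windows do not overlap each other.
    """
    total = (period_end - period_start).total_seconds()
    down = 0.0
    for o_start, o_end in outages:
        # Clip the outage to the measurement period first.
        o_start, o_end = max(o_start, period_start), min(o_end, period_end)
        if o_end <= o_start:
            continue
        outage = (o_end - o_start).total_seconds()
        scheduled = sum(
            overlap_seconds(o_start, o_end, w_start, w_end)
            for w_start, w_end in excluded_windows
        )
        down += max(outage - scheduled, 0.0)
    return 100.0 * (total - down) / total


# Hypothetical day: one 10-minute outage during a deployment window
# and one 2-minute unscheduled outage in the afternoon.
day_start, day_end = datetime(2008, 1, 1), datetime(2008, 1, 2)
outages = [
    (datetime(2008, 1, 1, 2, 0), datetime(2008, 1, 1, 2, 10)),
    (datetime(2008, 1, 1, 14, 0), datetime(2008, 1, 1, 14, 2)),
]
deployments = [(datetime(2008, 1, 1, 2, 0), datetime(2008, 1, 1, 2, 30))]

raw = uptime_percent(day_start, day_end, outages)
adjusted = uptime_percent(day_start, day_end, outages, deployments)
print(round(raw, 2), round(adjusted, 2))  # 99.17 99.86
```

One could instead remove the deployment windows from the measurement period as well; for short windows the difference between the two approaches is small.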
As you will have seen on our website, we gather data, schedule updates and roll out changes on a rolling monthly cycle. This makes identifying deployments easy, and not just because we maintain deployment audit trails. Business forecasting and scheduling also get easier with the monthly cycle. Large marketing campaigns leading up to busy periods are protected: any downtime is planned around the main retail peaks, with a change freeze if necessary, and is otherwise scheduled out of hours.
What of the other outages? We've seen a mixture of:
- scheduled infrastructure maintenance (scheduled outside of business hours)
- post-deployment issues requiring patches (something we strive to eliminate)
- upstream connectivity issues (new pipes now in use)
- load balancer outages (resolved by Engine Yard)
- Mongrels (the Ruby web server processes) restarting due to memory leaks (also resolved through the new deployment process)
So, I think we have a good basis to build on in terms of reliability. '08 is looking pretty neat already: we are nearly at the end of January and so far, at 99.97%, we are exceeding last year's averages.