Friday 6 July 2007

Service Availability

As a service provider we have to make sure our service is available for our customers to use whenever they want to use it.

When we opened Esendex Australia a couple of years ago, we did so confident in the knowledge that we were running a 24/365 system. Unfortunately that confidence was a little misplaced.

The service was running, but we'd got into the habit of running all those little system maintenance jobs in the early hours, when our UK customers were generally in bed or had low volume requirements. Not so those pesky Australians: they insisted on using the system during their office hours! They expected a very responsive system and weren't always getting it.

We soon moved things around and everyone is happy, but providing 24/365 availability does require a cultural shift in an organisation's approach.

The development team are used to this approach: the architecture of the various service components that handle our message processing and routing is designed to be updated live without impacting service.

Our DBA (Database Administrator), on the other hand, has an especially tough job keeping the databases optimised while also keeping them constantly online, or at least that's what he tells me ;).

Earlier this year we realised we needed an external monitoring service to give us a customer's-eye view of our service. We have always had an internal monitoring system running all the time, alerting the relevant people. But it is an internal system, and the danger is that you make assumptions that are not correct for customers.

We settled on Alertra as they seemed to provide both the breadth of monitoring points and the depth of protocol monitoring we needed. Their alerting system also seemed pretty reliable.

We've set up monitoring on our key service access points, and thanks to Alertra's rather nifty Public Uptime Statistics I can share the current status with you now.

Pretty good results, though not necessarily the 100% we were hoping for. It turns out we were really thankful for the external monitoring because we did have an outage that our internal monitoring didn't catch.

We host our own DNS (Domain Name System) servers and it turns out we had an issue with the configuration. So while our service was happily alive and our internal monitoring was happily reporting all was good, some of our customers couldn't find our servers.

We've now added monitoring of our DNS servers so that base is covered.
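
To make the DNS point concrete, here is a minimal sketch of a check that looks up the name first and only then tries to connect, so a DNS failure shows up separately from a service failure. It is not how Alertra or our own monitoring works; the function name check_service, the status strings and the hostname in the usage note are all made up for illustration.

```python
import socket


def check_service(hostname, port, resolve=socket.getaddrinfo,
                  connect=socket.create_connection, timeout=5.0):
    """Return 'dns_failure', 'connect_failure' or 'ok' for one endpoint.

    Hypothetical helper: the name and the status strings are assumptions
    made for this sketch. resolve and connect can be swapped out so the
    logic can be exercised without touching the network.
    """
    # Step 1: look the name up, as a customer's machine would have to.
    try:
        addresses = resolve(hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"
    if not addresses:
        return "dns_failure"

    # Step 2: the name resolved, so try each address until one accepts.
    for _family, _type, _proto, _canonname, sockaddr in addresses:
        try:
            conn = connect(sockaddr[:2], timeout=timeout)
        except OSError:
            continue  # this address refused or timed out, try the next
        conn.close()
        return "ok"
    return "connect_failure"
```

Calling check_service("service.example.com", 443) from a machine outside your own network would report "dns_failure" when the name cannot be resolved, even if the servers themselves are perfectly healthy. A check run from inside the network can miss exactly that case if it reaches the servers without going through the public DNS.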

It has shown us that there is no room for complacency and that a service truly is the sum of its parts.
