Part of the reason I've been so quite on the blogging front of late is that I've got carried away with a bit of development spiking. Nothing that goes straight into production I hasten to add. I'm too far away from our development processes to be trusted with access to the live source tree, and anyway Nicholas (Development Manager) probably wouldn't let me.
I am in the enviable position of being able to create proptotypes, test things out and hand them over to the professionals to make them a reality. It's one of these projects that became SLAP (Service Location and Announcement Protocol).
One of the key technical challenges facing an organisation like ours is scaling our applications out to meet demand while at the same time keeping everything highly available.
Our architecure comprises a series of software agents that pass messages between each other routing and manipulating traffic based on a series of rules. For example
- A outbound bulk SMS will head straight to an SMS Router for it to decide based on the state of a set of Mobile Carrier connections which is the best current route.
- A premimum SMS message will be passed to a Subscription checking agent to confirm that the subscriber is valid, and then possibly onto a network lookup agent to resolve the destination network before heading onto the SMS Router for onward submission via the correct network for billing purposes.
As I'm sure you can imagine running multiple instances of these agents across multiple servers and have them know the state of the agents around them so messages get passed around reliably is no mean feat. Should an Mobile Operator connection become unavailable or we lose a server, the agents should be able to adapt to that.
One option is to use hardware load balancers, as we do in front of our web servers, but this quickly becomes very expensive and very costly to maintain as you end up with multiple layers of load balancing. Also it's very hard to dynamically configure the network of agents to cope with peak periods of traffic.
After reading Scalable Internet Architectures I was inspired to come up with a simpler, and cheaper, solution. SLAP was born which is an alternative, software approach that is deliverying huge benefits to us already.
Nodes broadcast one of 2 message types: ANNOUNCE and LOCATE over UDP which are subscribed to by other nodes on the network. These nodes can be application monitoring agents or clients wishing to consume services offered by the node.
It's simplicity hides a power that has allowed us to architect our system to self configure and repair itself in the event of failures. In subsequent posts I'll describe the specifics of the protocol and our describe our experiences using it.