Building Firebreaks Instead of Fighting Fires
When Google recently shared a story from their site reliability engineering (SRE) team, I couldn’t help but admire the systems they had in place to respond to disruptions for their end users, people like me.
But our work is different and the stakes are often much higher. Our end users are often women and girls from Asia and Africa who may not own a phone, let alone have access to a smartphone or the internet. For them, the messaging services we provide or websites we maintain may be their only entry point to vital services and information. Our SRE team works to make sure that access is never at risk: as Jeremy Thurgood, our Resident Rocket Surgeon, states, “SRE is everything that happens behind the service visible to the customer.”
Over the last few years, we’ve moved away from the traditional approach to system administration, where work was managed through discrete actions like “install this thing” or “add this user,” to a system where as much as possible is automated. Essentially, through SRE we use automation so that deployments of code and servers are as robust and reliable as possible, without 24/7 manual management. This sounds wonderful, but to keep users’ vital data secure, we often also have to ensure the strength of these systems in countries where servers and infrastructure are wholly unreliable.
This means that a big part of our SRE team’s job is building a system that is capable of handling hardware failure. Bearnard Hibbins, our SRE team lead, explained, “We are ensuring apps always run regardless of hardware failure. In some of the countries we work in, our hosting and service providers are unreliable, so we have built systems to ensure that everything is running smoothly. So if one of the servers fails and you lose everything on that machine, you have it somewhere else and you can keep going.” It’s all about redundancy, replication, failure detection, and recovery.
For our maternal health platforms in Africa this is crucial. Many of our mHealth services, like the South African Department of Health’s MomConnect platform, use message sending systems. With a robust platform, it doesn’t matter which server the messages are sent from. If one server breaks down, messages are still sent since the work is not tied to a particular computer. This reliability builds trust between us and the people we serve, as well as with our partners as we are able to work within existing local systems.
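As a rough illustration of work that is not tied to a particular computer, here is a minimal Python sketch of failover across a pool of servers. It is not the MomConnect implementation: the names `send_with_failover`, `broken_server`, and `healthy_server` are hypothetical, and each "server" is just a function standing in for a real machine.

```python
from typing import Callable, Sequence


class AllServersFailed(Exception):
    """Raised when no server in the pool could send the message."""


def send_with_failover(message: str, servers: Sequence[Callable[[str], str]]) -> str:
    """Try each server in turn and return the first successful result."""
    failures = 0
    for send in servers:
        try:
            return send(message)
        except ConnectionError:
            # This server is unavailable; move on to the next one.
            failures += 1
    raise AllServersFailed(f"{failures} server(s) failed to send the message")


# Hypothetical stand-ins for real servers (assumptions for illustration only).
def broken_server(message: str) -> str:
    raise ConnectionError("server A is down")


def healthy_server(message: str) -> str:
    return f"sent by server B: {message}"


result = send_with_failover("Your clinic visit is due", [broken_server, healthy_server])
print(result)
```

The caller never needs to know which server did the work, which is the property the paragraph above describes.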
The irony for most successful SRE teams is that the better they are at what they do, the more they go unnoticed. “No one notices there may have been a problem because our work ensured we didn’t hit one,” Jeremy added. “Essentially we want to be building firebreaks rather than putting out fires: predicting what problems we’ll have, building out capacity and system architecture, so we can foresee problems,” Bearnard pointed out.
SRE for us is also about visibility. Nelson Koma, an engineer on our team, mentioned, “SRE enhances complex system-level issues, making those solutions simple enough for use by anyone, not just developers. This enables our project managers to see how a system or service has been performing in terms of availability and capacity.” The team builds tools (like metrics monitoring) to see where the fires are coming from. “We want to be able to solve for potential ‘fires’ and also to be able to have historical metrics for retrospective troubleshooting.”
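To make "availability" concrete, here is a small Python sketch that turns a history of health checks into a single percentage a project manager could read. It is an assumption about how such a metric might be computed, not a description of our monitoring tools; the function name `availability_percent` and the sample data are hypothetical.

```python
from typing import Sequence


def availability_percent(checks: Sequence[bool]) -> float:
    """Return the share of health checks that succeeded, as a percentage."""
    if not checks:
        raise ValueError("at least one health check is required")
    # True counts as 1 and False as 0, so sum() gives the number of successes.
    return 100.0 * sum(checks) / len(checks)


# Hypothetical history: nine successful checks and one failed check.
history = [True] * 9 + [False]
uptime = availability_percent(history)
print(f"Availability: {uptime:.1f}%")
```

Keeping the raw check history, rather than only the latest result, is what makes retrospective troubleshooting possible.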
Jamie Hewland, another engineer on the team, added that SRE teams aren’t only automating existing processes; they also bring in new technologies and new functionality. Recently, our organization worked to do this with a certificate issuing system. “Previously, if developers wanted to add HTTPS security and encryption to a website they would have to request it for the specific site and then we would configure it manually. Now we have an automated system, and the certificates are renewed and managed automatically. This allows us to run thousands of sites, to turn on HTTPS on any new sites that come along, without having to do much work per site manually. Rather than having to worry about each site, it happens without you noticing.” Something that seems as simple as a certificate issuing system has kept user information on our Girl Effect sites secure.
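The core decision behind automatic renewal can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 30-day renewal window, the function names `needs_renewal` and `sites_to_renew`, and the example domains are all hypothetical, and the sketch leaves out the actual issuing of certificates.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

# Assumed policy: renew any certificate that expires within 30 days.
RENEW_BEFORE = timedelta(days=30)


def needs_renewal(expires_at: datetime, now: datetime,
                  renew_before: timedelta = RENEW_BEFORE) -> bool:
    """Return True if the certificate is expired or inside the renewal window."""
    return expires_at - now <= renew_before


def sites_to_renew(expiry_by_site: Dict[str, datetime], now: datetime) -> List[str]:
    """Return the sites whose certificates are due for renewal, sorted by name."""
    return sorted(site for site, expires_at in expiry_by_site.items()
                  if needs_renewal(expires_at, now))


# Hypothetical sites and expiry dates, relative to a fixed "now".
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
expiries = {
    "a.example.org": now + timedelta(days=10),  # inside the renewal window
    "b.example.org": now + timedelta(days=80),  # still has plenty of time
    "c.example.org": now - timedelta(days=1),   # already expired
}
due = sites_to_renew(expiries, now)
print(due)
```

Run on a schedule, a check like this treats every site the same way, so adding a new site adds no manual work.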
Aside from data security and local hosting, the SRE team helps us with our commitment to iterative development. We prioritize user experience and take user feedback seriously. When we add improvements to our platforms, the SRE team helps us do this amid many moving parts and interconnected systems, hence the analogy often applied to site reliability engineering of changing tires on a moving car. The end user is not affected by downtime or performance issues when new versions of our system are released. We offer platforms for girls to engage in safe spaces online, or for women to receive quality information on their pregnancy. Timely and smooth service is key to achieving impact.
SRE team members are both problem-solvers and inventors: my SRE colleagues have degrees in electronic engineering, system designing and logic, physics, computer science, and informatics. In Jeremy’s words, “This isn’t really a job you can be specifically trained for. Sometimes you need a piece of infrastructure and there’s nobody else to set it up. Other times, you get tired of doing something manually and want to automate it. Then one morning you wake up and discover that you’re running an SRE team.”
SRE teams help keep things seamless between the end user and our platforms, staying true as much as possible to what our project, service design, partners, and development teams envisioned. In this way, they are not just automating a task, but building trust. And while it’s often said trust takes years to build, I reckon our SRE team can do it in 15 minutes.
Written by Ambika Samarthya-Howard, Head of Communications, with the Site Reliability Engineering Team