I'm an engineer at heart, with over twenty years' experience helping individuals, teams and organisations improve their software delivery. For the past seven years I've been a consultant with Equal Experts, leading a range of engineering teams in the delivery of large-scale IT services. Problem domains have ranged from public sector organisations (the UK's tax and passport offices) to global online retailers. I love helping bridge the gaps between user and business needs, engineering disciplines and service operation.
A common risk in most systems, particularly distributed ones, is the unexpected failure of a component. As a system's complexity and its number of subsystems grow, so does the likelihood of a subsystem failure. Subsystem failures can compound in such a manner that catastrophic system failure becomes a certainty. The only uncertainty is when the system will fail.
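The compounding effect is easy to see with a back-of-envelope calculation. A minimal sketch (the function name is illustrative, not from any particular tool): if a system needs all of its n subsystems working, and each is independently available with probability p, overall availability is p to the power n, which erodes quickly as n grows.

```python
def system_availability(p: float, n: int) -> float:
    """Availability of a system that requires all n independent
    subsystems (each available with probability p) to be up."""
    return p ** n

# Even highly reliable parts compound into an unreliable whole:
for n in (10, 100, 1000):
    print(f"{n:>4} subsystems at 99.9% each -> {system_availability(0.999, n):.1%}")
```

At 1,000 subsystems, per-component availability of 99.9% leaves the system up only about a third of the time, which is why failure has to be treated as inevitable rather than exceptional.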
Chaos Engineering addresses the risks inherent in distributed systems that stem from unexpected component failure. It does so by running experiments that explore the impact of subsystem failures, deliberately inducing different types of failure in different components. Outcomes are then analysed and learnings applied to improve the system's resilience. These learnings deepen our understanding of the system and its failure modes, which aids the identification of new failure scenarios. In addition, planned failures provide a safe environment for teams to improve their incident response and how they conduct subsequent postmortems.
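The experiment loop described above can be sketched in a few lines. This is a hypothetical illustration, not the API of any real chaos tool: a wrapper injects failures into a dependency call at a chosen rate, and the experiment checks whether a steady-state hypothesis (users always get some response) still holds.

```python
import random

def inject_failure(call, failure_rate=0.5, rng=random.random):
    """Wrap a dependency call so it raises instead of succeeding,
    failure_rate of the time (hypothetical helper for illustration)."""
    def wrapper(*args, **kwargs):
        if rng() < failure_rate:
            raise ConnectionError("injected failure")
        return call(*args, **kwargs)
    return wrapper

def fetch_recommendations():
    # Stand-in for a call to a downstream recommendations service.
    return ["personalised", "items"]

def recommendations_with_fallback(fetch):
    # Steady-state hypothesis: the user always receives recommendations,
    # falling back to a default list when the dependency fails.
    try:
        return fetch()
    except ConnectionError:
        return ["default", "items"]

# Experiment: force the dependency to fail every time and verify the
# steady state survives.
flaky_fetch = inject_failure(fetch_recommendations, failure_rate=1.0)
assert recommendations_with_fallback(flaky_fetch) == ["default", "items"]
```

If the assertion fails, the experiment has surfaced a resilience gap to analyse and fix before a real outage does.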
Chaos experiments take many forms, ranging from continuous, automated failure injection (Netflix's Chaos Monkey) to one-off Chaos/Game Days, where disruption is manually instigated. The session will explore why you'd run a Chaos Day, and how to know when you and your platform are ready to do so. We'll share our learnings on the actual mechanics of running one: how to plan, execute and retrospect a Chaos Day.