Microsoft/Azure | Germany
TALK: Learning from incidents: understanding how things went right
When things go wrong, we tend to focus on mistakes, miscalculations, and deficiencies in design. By limiting our investigations to the details of what went wrong, we ignore a far richer and more interesting source of learning: how things went right.
Research across safety-critical industries such as aviation and medicine is changing what we know about how to build systems and organizations that are resilient to failure. We will look at the findings of that research and see how to avoid common investigative traps that curtail our ability to learn. The research shows that the best results come when we can answer questions such as:
- How does the system normally work?
- How did we recover?
- How do teams adapt to surprising circumstances?
- Where did we get lucky, and what worse outcomes did we avoid?
We will share stories from beyond our own industry to show how powerful some of these new investigative techniques can be. We will move beyond a shallow analysis of root causes and remediation items in an effort to build truly resilient engineered systems. You’ll leave this talk with some simple, practical steps you can take in your own team to learn not only “what went wrong?” but also “what went right?”