A shared sense of responsibility for the product and its outcomes is the ninth factor in 12-factor DevOps. Let’s examine what that means for the team.
Factor 9: Take Responsibility
A hallmark of a high-performing DevOps culture is a shared sense of ownership. You must move beyond developers having ownership without accountability and operations having accountability without ownership. The whole team is responsible for delivering products on time and ensuring that they meet their service-level objectives. In exchange, the team gains the freedom to act autonomously. Promoting an ownership culture is not always easy; some people will only ever see that you’re asking them to take on additional responsibilities for no perceivable benefit.
How do you lay the groundwork for this shared ownership? By now you’ve organized your product team to reduce or eliminate silos, and you’ve aligned them to the business and customer vision. Furthermore, you have solid documentation about the product, its components, and the surrounding infrastructure.
With the common goal and tools in place, establish that the team that builds something also defines the procedures for fixing it and is responsible for fixing it when it breaks. The team cannot throw problems over the fence for another group to solve or say “let X figure out how to run it”. Conversely, the team no longer has to wait for an external agent to restore service.
There are a few approaches to dealing with alerts and outage notifications. On one hand, you can send every alert to the product team. This ensures that the people who know the product best are the first to respond, but it may not be the best use of their time. On the other hand, you could send all alerts to your service desk, provide them with documentation and runbooks, and expect them to take the first pass at every issue. Unfortunately, that either asks the service desk to be experts on everything or slows the return to normal operations, so it is rarely feasible.
A hybrid approach, where the service desk handles low-severity and non-service-impacting issues and the product team handles service-impacting events and outages, may be more appropriate. In this model, your product teams aren’t perpetually on-call, and your service desk doesn’t have to know everything. What the service desk must know is who to call when an issue is outside of its scope.
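To make the hybrid model concrete, here is a minimal Python sketch of the routing rule it implies. Everything named in it is a hypothetical stand-in rather than part of any real alerting tool: the Alert fields, the ON_CALL directory, the "checkout" product, and the queue names are assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    LOW = 1
    HIGH = 2


@dataclass(frozen=True)
class Alert:
    product: str
    severity: Severity
    service_impacting: bool


# Hypothetical directory: product name -> on-call contact for the owning team.
ON_CALL = {"checkout": "checkout-team-pager"}

# Hypothetical destination for everything the service desk handles first.
SERVICE_DESK = "service-desk-queue"


def owner_for(product: str) -> str:
    """Who the service desk calls when an issue is outside its scope."""
    # Fall back to the service desk if no owning team is registered.
    return ON_CALL.get(product, SERVICE_DESK)


def route(alert: Alert) -> str:
    """Pick the first responder for an alert under the hybrid model."""
    if alert.service_impacting or alert.severity is Severity.HIGH:
        # Outages and service-impacting events go to the team that built the product.
        return owner_for(alert.product)
    # Low-severity, non-service-impacting issues get a first pass from the service desk.
    return SERVICE_DESK
```

In a sketch like this, the service desk would call owner_for whenever a low-severity issue turns out to exceed its runbooks, so the escalation path is written down rather than remembered.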
Once the outage is resolved and normal operations have resumed, a team following 12-factor DevOps will determine why the failure occurred but will avoid ascribing blame to an individual. Even when you can trace an outage to a specific person’s commit, treat the failure as a learning opportunity, and couple it to punishment only when strictly necessary. The team approaches the problem together, openly, to resolve it and prevent a recurrence.