Day 2 Operations
Creating a blameless culture for healthy incident management
I have spent most of my career on the operations side, so this is my wheelhouse. I spent a solid 15 years carrying around some sort of paging device that could go off at any time without warning, and I would have to drop what I was doing and atteennnn HUT. I’ve spent years working what we called in the Navy mid-check, or graveyard shift. Although the pay was handsome, the toll it takes on your mental and physical health can sometimes outweigh it. The happiest SREs/DevOps/Platform Engineers are the ones who A.) never get paged, B.) get paged rarely, or C.) work in a blameless culture, where getting paged just means an interesting problem to solve.
Creating that blameless culture is crucial to running large-scale distributed systems. The people who operate the systems are what keep the lights on, so you have to keep them happy. Constant fires are not what makes engineers happy, unless they are former firefighters (there’s always an outlier, amiright). In this article, I want to talk about how to create a blameless culture and the tools available to make on-call suck less. Let's GO
What does blameless mean?
Blameless culture in the context of operations and DevOps is rooted in the idea that mistakes and failures are opportunities for learning and improvement, rather than occasions for assigning fault. This approach fosters an environment of transparency, trust, and continuous learning, where team members feel safe to report issues, share insights, and innovate without fear of retribution. The benefits of a blameless culture from an operations perspective are manifold. It leads to enhanced collaboration, higher resilience, and more rapid recovery from incidents since teams are focused on solving problems together rather than covering up mistakes. This culture supports a shift from reactive to proactive management, where preventative measures and improvements are continually identified and implemented. To cultivate a blameless culture, organizations must start with leadership setting the example, encouraging open communication, and actively promoting a mindset of collective responsibility for outcomes. This involves training on effective incident review practices, such as conducting blameless postmortems, where the focus is on identifying systemic issues and learning points rather than individual errors.
Observability (logs, metrics, and traces), in the context of DevOps and operations, is a foundational pillar for building a blameless culture within an organization. It refers to the capability to monitor systems, understand their internal states, and derive insights from their outputs or behaviors in real time. This comprehensive visibility is crucial for identifying, diagnosing, and resolving issues before they escalate into significant problems. By implementing observability tools and practices, such as logging, tracing, and metrics, teams gain a deep understanding of their system's performance and behavior. This enhanced awareness enables them to proactively address potential issues, optimize performance, and ensure reliability. Moreover, observability fosters an environment where data-driven decisions prevail, allowing for a more objective analysis of incidents and system behaviors. It eliminates the guesswork and biases that can often cloud judgment, ensuring that when things go wrong, the focus remains on understanding the 'how' and 'why' behind an issue rather than the 'who.' Mystery solved.
The integration of observability into an organization's operations is instrumental in cultivating a blameless culture. It provides the technical backbone for transparency and accountability, where every team member has access to the same information and insights.
This shared understanding encourages open discussions about failures and lessons learned without fear of blame. It empowers teams to collectively analyze failures as systemic issues rather than individual faults, aligning perfectly with the principles of a blameless culture. Observability ensures that the focus is always on improving processes, systems, and team dynamics. It enables continuous learning and improvement cycles, where insights from monitoring and analysis lead to better practices, tools, and approaches. By embedding observability into their culture, organizations not only enhance their operational resilience but also foster a more inclusive, supportive, and innovative working environment. This approach ultimately leads to a more robust, efficient, and dynamic operation, underpinned by a culture that values growth, learning, and collaboration over fault-finding.
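To make the logging piece a little more concrete, here is a minimal sketch of structured logging using only the Python standard library. The field names (service, trace_id, latency_ms) and the "checkout" service are hypothetical, chosen for illustration; real setups usually lean on a dedicated logging or tracing stack. The point is that each event records what happened and how, which is the raw material for a blameless review.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Hypothetical fields passed via `extra=`; missing ones become None.
            "service": getattr(record, "service", None),
            "trace_id": getattr(record, "trace_id", None),
            "latency_ms": getattr(record, "latency_ms", None),
        }
        return json.dumps(event)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON events to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


log = build_logger("checkout")
# One trace_id shared by related events lets everyone follow the same request.
trace_id = uuid.uuid4().hex
log.warning(
    "payment call timed out",
    extra={"service": "checkout", "trace_id": trace_id, "latency_ms": 5000},
)
```

Because every event carries the same fields, anyone on the team can filter by trace_id and see the same sequence of events.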
Incident Command 🧑‍🚒
Incident command plays a crucial role in the tech industry, especially when it comes to managing service events or outages effectively. By leveraging military systems like the Incident Management System (IMS), organizations can significantly improve their uptime and system reliability.
Having a structured mechanism for handling incidents allows teams to respond promptly and efficiently. The IMS, modeled after military command structures, provides a clear hierarchy of roles and responsibilities during an incident. This ensures that the right people are involved at each level and that there is a centralized decision-making process.
One of the key benefits of implementing an incident command system is the ability to maintain clear and effective communication channels. With defined roles such as Incident Commander, Operations Chief, and Public Information Officer, teams can coordinate their efforts and share critical information promptly. This helps prevent miscommunication and ensures that everyone is on the same page during an incident.
Additionally, incident command systems emphasize the importance of a blameless culture. Instead of focusing on assigning blame, the emphasis is on learning from incidents and preventing similar issues in the future. This shift in mindset encourages open and honest communication, enabling teams to collaborate and solve problems more effectively.
By adopting military-inspired incident command practices and leveraging tools like the IMS, organizations can enhance their incident response capabilities and minimize the impact of service events or outages. Structured mechanisms for incident management not only improve system reliability but also contribute to a healthier and happier work environment for engineers and operations teams.
Here’s the position structure that I have used in the past. It can be adjusted to fit your situation, but you need to have these four roles set for the incident:
Incident Commander: The person in charge of overall incident management and decision-making. ICs drive to resolution and set time contracts, and their main priority is to fix problems and get back to a steady state. They can be involved in the actual work of fixing the problem, but most of the time they are not.
Executive Liaison: This person normally sits in on the incident and gathers notes for executives. VP/C levels and the like tend to add more stress and less value to incidents, so it's nice to keep them separated and updated accordingly. The liaison can also work with the IC to drive resolution but is primarily there to fill in the execs.
Yeoman/Scribe: Provides administrative support and documentation for the incident management team. Creates a timeline of events and notes time contracts. This is an important job and one that I suck at because I am so ADHD. Put your best note-takers on this job, or it will make things difficult in the postmortem.
Engineers/Analysts: These are the boots on the ground fixing the issue. As an IC, the best thing you can do is keep them focused on the task at hand and set time contracts. When they say they need to upgrade server A, get a time estimate and follow up to make sure that the ball continues to move forward. Don’t get in their way, but also don't let them veer off the path.
This structured approach ensures clear roles and responsibilities within the incident management process, facilitating effective communication, decision-making, and coordination during service events or outages.
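As a rough illustration of the role structure, here is a small Python sketch that checks whether all four roles are filled and tracks time contracts. The class and field names are hypothetical, and I am reading a "time contract" as an agreed time by which the IC follows up on a task, which is an assumption based on how the term is used in this article.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# The four roles that should be set for every incident.
REQUIRED_ROLES = ("incident_commander", "executive_liaison", "scribe", "engineers")


@dataclass
class TimeContract:
    """Who is doing what, and when the IC should follow up (assumed meaning)."""
    owner: str
    task: str
    due: datetime


@dataclass
class Incident:
    incident_commander: str = ""
    executive_liaison: str = ""
    scribe: str = ""
    engineers: list[str] = field(default_factory=list)
    contracts: list[TimeContract] = field(default_factory=list)

    def missing_roles(self) -> list[str]:
        """Return the required roles that nobody has been assigned to yet."""
        return [role for role in REQUIRED_ROLES if not getattr(self, role)]

    def set_contract(self, owner: str, task: str, now: datetime, minutes: int) -> TimeContract:
        """Record a follow-up time `minutes` from `now` for a task."""
        contract = TimeContract(owner, task, now + timedelta(minutes=minutes))
        self.contracts.append(contract)
        return contract

    def overdue(self, now: datetime) -> list[TimeContract]:
        """Contracts whose follow-up time has passed."""
        return [c for c in self.contracts if now > c.due]


start = datetime(2024, 1, 1, 2, 0)
incident = Incident(incident_commander="alex", scribe="sam", engineers=["jo"])
print(incident.missing_roles())  # ['executive_liaison']

incident.set_contract("jo", "upgrade server A", now=start, minutes=20)
late = incident.overdue(start + timedelta(minutes=30))
print([c.task for c in late])  # ['upgrade server A']
```

Passing `now` in explicitly keeps the sketch easy to exercise in a tabletop drill without waiting on a real clock.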
I learned this next system from one of my favorite on-site training courses as an SRE, run by Black Rock 3. It goes like this:
Current Status: Where is the ball?
Actions Taken: Who kicked the ball?
Needs: Who needs the ball?
Next Steps: Who is getting the ball next?
This system is highly effective for communication during an outage or service event. It is simple, straightforward, and provides everything necessary for effective communication during chaotic times. When conducting tabletop exercises, it is important to prioritize practicing communication. This is often the area where most organizations face the greatest challenges, but once it is improved, operations run much more smoothly.
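One way to make this format stick is to template it, so every update posted to the incident channel has the same four parts. This is a minimal sketch; the StatusUpdate class and the sample incident details are hypothetical and not part of the training material.

```python
from dataclasses import dataclass


@dataclass
class StatusUpdate:
    current_status: str  # Where is the ball?
    actions_taken: str   # Who kicked the ball?
    needs: str           # Who needs the ball?
    next_steps: str      # Who is getting the ball next?

    def render(self) -> str:
        """Format the update as four labeled lines, always in the same order."""
        return "\n".join([
            f"Current Status: {self.current_status}",
            f"Actions Taken: {self.actions_taken}",
            f"Needs: {self.needs}",
            f"Next Steps: {self.next_steps}",
        ])


update = StatusUpdate(
    current_status="Checkout errors are elevated",
    actions_taken="Rolled back the last deploy",
    needs="A database engineer to review slow queries",
    next_steps="Jo reports back on query latency in 15 minutes",
)
print(update.render())
```

Because all four fields are required, an update cannot be created with a part left out, which is a useful habit to practice in tabletop exercises.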
Incident command is crucial in managing service events or outages effectively because it provides a structured mechanism for handling incidents, ensuring prompt and efficient response. By establishing clear roles and responsibilities, incident command facilitates effective communication, coordination, and decision-making during critical situations. It also promotes a blameless culture by focusing on learning from incidents and preventing future issues. Through incident command, organizations can minimize the impact of service disruptions, maintain system reliability, and create a healthier work environment for their teams.
On Call- Managing Mental Health
Managing mental health amidst the demanding schedules of on-call rotations or night shifts is crucial for maintaining not only personal well-being but also professional effectiveness. The nature of these roles, with their unpredictable demands and potential for life interruption, can take a significant toll on one’s mental and emotional health. However, by adopting proactive strategies, it's possible to mitigate these challenges and maintain a balance that supports both personal well-being and professional commitment.
Move or Exercise | Control your schedule | Therapy
Firstly, exercise plays a pivotal role in managing mental health under such demanding conditions. Regular physical activity is not just beneficial for physical health; it's also a powerful stress reliever and mood booster. Incorporating a routine of consistent exercise, whether it’s a brisk walk, a cycle around the park, or a session at the gym, can significantly reduce the stress and anxiety often associated with on-call responsibilities. My good days have exercise or at least a long walk in them. My bad days have very little movement. Exercise helps in clearing the mind, improving focus, and enhancing sleep quality, which is essential for those with irregular schedules.
Being ruthless with sleep hygiene and controlling your schedule are equally vital strategies. Prioritizing sleep is not just about quantity but also quality. This means creating a conducive sleep environment, maintaining a consistent bedtime routine, and minimizing sleep disruptions. For those on night shifts or irregular schedules, this might involve blackout curtains, using sleep masks, or establishing a 'wind-down' routine before bed. On night shift, I also always slept better when I ate before bed.
Controlling your schedule outside of work hours is also critical. This involves setting boundaries around work and ensuring there is time set aside for rest, hobbies, and social activities. It’s about making conscious choices to ensure work doesn’t consume all aspects of life, allowing for recovery and personal time. I would have PMs ask me to join 11 am meetings after a window, and I would politely decline and send them my availability, which was normally around 11-2 am for meetings. Those who don’t control their schedule are always the busiest and, in my experience, the least productive on the team.
Lastly, seeking professional support through therapy can provide a structured way to deal with the stresses and challenges of demanding job roles. Therapy offers a confidential space to explore feelings, develop coping strategies, and gain insights into managing stress and anxiety more effectively. It can be a valuable tool in maintaining mental health, offering perspectives and techniques that might not be immediately apparent. I started therapy because I was having issues with anger due to sleep deprivation. Now I look at it this way: my therapist has 11 years of data to use when diagnosing and working with me and all my shit. Therapy is like changing the oil in your brain: maintenance.
In conclusion, creating a blameless culture is crucial for running large-scale distributed systems effectively. By embracing a blameless culture, organizations can foster transparency, trust, and continuous learning, where mistakes and failures are seen as opportunities for growth.
Observability, including the use of logs, metrics, and traces, plays a vital role in cultivating a blameless culture by providing comprehensive visibility and promoting data-driven decision-making. Implementing incident command practices and structured mechanisms for incident management further enhances system reliability and encourages collaboration.
Additionally, prioritizing mental health and well-being, through strategies like exercise, sleep management, and seeking therapy, is essential for maintaining personal well-being and professional effectiveness in demanding roles. By incorporating these principles and practices, organizations can cultivate a culture that values learning, collaboration, and resilience, ultimately leading to more robust and efficient operations.