Report on How Google Has Streamlined Postmortem Analysis
In this episode, I want to examine how to construct an incident report, also known as a postmortem. Rather than presenting something I wrote myself, I will walk through a Google incident report from early 2013, which I think is an excellent example.
Before we get started, let me clarify that I have no affiliation with Google in any way. I am just impressed with how they handled the incident, and I believe their report should serve as a model for others to follow. The episode notes below contain a link to the incident report.
As IT professionals, we all know that even with the best intentions and careful planning, things occasionally go wrong. When something goes seriously wrong, you may be asked to prepare an incident report to share with senior executives, other employees, or even customers. Whether or not anyone ever reads these reports, I still advise you to go through the process: it forces you to examine your environment when things break and to develop strategies for avoiding the same mistakes in the future.
Google's incident report about an outage of their API service impressed me because it addressed all of my concerns and reassured me that they were a competent company. We won't read the complete report here, but let's take a closer look at its format and some of the topics it covers.
The structure is surprisingly simple, yet effective. The report is divided into five sections: a summary of the problem, a timeline, a root cause analysis, a description of resolution and recovery, and finally, corrective and preventive actions. Let's take a closer look at each of these sections.
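To make that structure concrete, here is a minimal sketch in Python of how those five sections could be captured as a structured record while an incident is still unfolding. The class and field names are my own illustration, not anything Google uses in its report.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TimelineEntry:
    time: str   # e.g. "2013-04-17 21:12 PT"
    event: str  # what happened, or what action was taken

@dataclass
class IncidentReport:
    # The same five sections as Google's report, as plain fields.
    summary: str                                    # what broke, who was affected, for how long
    timeline: List[TimelineEntry] = field(default_factory=list)
    root_cause: str = ""                            # the underlying defect, not just the trigger
    resolution_and_recovery: str = ""               # how service was restored
    corrective_and_preventive_actions: List[str] = field(default_factory=list)
```

Filling in something like this as events happen gives you most of the raw material for the written report afterwards, instead of reconstructing everything from memory.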
This incident report also highlights the internal systems and procedural machinery Google has running in the background. In my opinion, these are best practices for any business. For instance, we can tell they have automated service monitoring and alerting, because the report records the start of the outage and the pager alarm that went out to the team. They also have change management in place, which lets them track who changed what and, when necessary, roll those changes back. This is crucial: without that visibility into changes, it takes a long time just to identify the initial cause of a problem, never mind reversing it. They also did not sugarcoat the fact that testing was skipped and that the configuration push was not done in the safest way.
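To give a sense of what automated service monitoring and alerting looks like in its simplest form, here is a minimal Python sketch that polls a hypothetical health-check URL and "pages" the on-call engineer when the check fails. The URL, polling interval, and paging function are placeholders of my own; Google's actual tooling is far more sophisticated and is not described in the report.

```python
import time
import urllib.request

SERVICE_URL = "https://api.example.com/healthz"  # placeholder endpoint, not a real service
CHECK_INTERVAL_SECONDS = 60

def service_is_healthy(url: str) -> bool:
    """Return True if the service answers with HTTP 200 within 5 seconds."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except Exception:
        return False

def page_on_call(message: str) -> None:
    """Stand-in for a real paging integration (PagerDuty, Opsgenie, etc.)."""
    print(f"PAGE: {message}")

if __name__ == "__main__":
    # Poll the health check forever and page whenever it fails.
    while True:
        if not service_is_healthy(SERVICE_URL):
            page_on_call(f"{SERVICE_URL} failed its health check")
        time.sleep(CHECK_INTERVAL_SECONDS)
```

Even a crude loop like this is enough to timestamp when an outage began and who was notified, which is exactly the kind of evidence that makes the timeline section of a postmortem credible.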
So if you ever find yourself in a position where you must write an incident report, I strongly recommend looking at Google's report, linked in the episode notes below. It is also worth considering how you might replicate their internal systems and procedural machinery in your own environment.