This year, Canada has experienced an unprecedented wildfire season: fires started earlier, grew faster, reached atypical regions, and impacted large population centers like Kelowna, British Columbia's third-largest city, and Yellowknife, the capital of the Northwest Territories, where the entire city of over 20,000 residents was evacuated.
My neighbour is an ex-wildfire commander, current firefighter, and safety trainer for heavy industry. True to my motto of "Staying Curious", I had the pleasure of sharing notes with him, and to my surprise, our worlds are not that different. Threats, whether cyber or wildfire, need to be managed in much the same manner: prevent them, detect them, respond to them, and recover from them. However, as I learned more about the philosophy of wildfire management, I realized there is a stark difference between our worlds: a different balance.
In the world of wildfires, beyond basic education, fire bans, and proactive maintenance, there is little prevention we can do: the attack surface is unfathomably large, the climate is changing, and the threat is, by definition, wild and untameable. Thus, the unsung heroes of wildfire management accept the inevitable and focus on detection and rapid-response capabilities. In our cyber world, however, the ratios are inverted, with far more weight on preventive and detective capabilities. And that got me thinking: do we have it all wrong?
I have been a blue teamer for my entire career, and all my education, training, certifications, and certainly the “magic bean marketing” gave me the impression that we can and must stop all cyber breaches. However, we haven’t, we aren’t, and sadly it seems like we won’t. And yet we act like we will, pontificating as armchair incident responders that “if that CISO had only done X or had only purchased Y, none of this would have happened.” Yet it has happened and it will again, because we are stuck trying to play a finite game of defence in an infinite game of cyber Whack-A-Mole. And therein lies the problem with the blue team: we believe we can win if we just try a bit harder with a bit more money.
It is likely an overgeneralization, but the blue team measures success by how many attacks that fancy next-generation, AI-powered silver bullet thwarts. Is that the right metric for the blue team's success? From the old-school castle-and-moat perspective, certainly, but if a single circumvention of the blue team's fortifications takes down the company and recovery is prolonged, or worse, then maybe we need to realign the blue team's mission with the full needs of the business: to continue to exist.
“Assume breach.” It’s not a new catchphrase, but it is one I think the greater population of IT stakeholders and business risk owners needs to embrace. It is a shift away from the belief that we can buy and design our way out of cyber risk and toward the belief that cyberattacks, like wildfires, will happen regardless of our preventative efforts, and that the business needs to be back online as soon as possible. So, what can we do?
The first step is engagement: an honest discussion with the broader IT team on how to achieve the actual goals of the business during an incident. Our incident response plans are developed and commanded by the cyber team, yet much of the incident response legwork, such as containment, eradication, and recovery, depends heavily on the IT administrators, not the cyber team. Even so, when building out incident response capabilities, the IT team is seldom leading the conversation.
Consequently, botched incident responses happen. Poor incident response plans and a lack of collaboration with non-cyber stakeholders, such as the IT team, have resulted in “cure is worse than the disease” containment scenarios and premature recovery, leading to re-infection and prolonged business interruptions. This is because our cyber risk culture has put all the eggs in the preventative and detective basket and neglected response and recovery capabilities. Let’s fix this.
First and foremost, rebalancing incident response is not about defunding prevention and detection capabilities but rather ensuring that when those layers fail, the business is prepared to respond and recover in a timely manner. After all, from the business's perspective, the data breach is a nuisance compared to the downtime of business operations.
This is where we need to change our culture and strategies to improve our odds of minimizing the blast radius when the inevitable does happen and, more importantly, to expedite the recovery of the business back to a nominal state.
To oversimplify, improving response capabilities is about ensuring there is a plan for what to do and that it is ingrained in the muscle memory of the team. Collaborative development of an incident response plan with the greater IT team, combined with periodic tabletop exercises, is the best way to keep muscle memory sharp and to identify gaps in the plan so the capability continuously improves. As Mirai’s Human Risk Lead always says: “You don’t rise to the occasion; you fall to your training.”
Organizations can start by ensuring their incident response plan is not shelfware but a living document that the team relies on, continuously practices, and updates. Performance metrics such as time-to-contain, percentage of impacted systems after containment, and, of course, time-to-recover can be defined and measured within tabletop exercises, as sketched below.
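As a minimal sketch of what that measurement could look like, the following captures a hypothetical tabletop timeline and derives the metrics above. The timestamps, field names, and counts are illustrative assumptions, not taken from any real incident or framework.

```python
from datetime import datetime

# Hypothetical timeline captured during a tabletop exercise;
# all values and field names are illustrative.
exercise = {
    "detected":  datetime(2023, 9, 14, 9, 5),
    "contained": datetime(2023, 9, 14, 11, 40),
    "recovered": datetime(2023, 9, 15, 16, 30),
    "systems_in_scope": 120,                      # assets covered by the scenario
    "systems_impacted_after_containment": 9,      # assets still affected once contained
}

def hours_between(start, end):
    """Elapsed time in hours between two timeline events."""
    return (end - start).total_seconds() / 3600

time_to_contain = hours_between(exercise["detected"], exercise["contained"])
time_to_recover = hours_between(exercise["detected"], exercise["recovered"])
impact_pct = 100 * exercise["systems_impacted_after_containment"] / exercise["systems_in_scope"]

print(f"Time to contain: {time_to_contain:.1f} h")
print(f"Time to recover: {time_to_recover:.1f} h")
print(f"Systems still impacted after containment: {impact_pct:.1f}%")
```

Tracking these numbers exercise over exercise is what turns the plan from shelfware into a capability you can see improving.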
Once the fire is put out, and usually well before, business stakeholders are demanding the business gets back online ASAP. While most companies do have some form of backup, recovering from an incident is not the time to learn whether the backup process works. Beyond the obvious inconvenience of malicious actors deleting the backups, there is also the less obvious threat to the business: the backed-up data may not be immediately recoverable, prolonging the business interruption.
Anecdotally, in a recent ransomware incident, a company had offline backups, meaning they didn’t need to pay. However, the existence of the backups did not equate to quick recovery. This is because the ransomware had infected most of the systems, requiring an entire rebuild of the environment. Having the data was great, but the lack of documentation and modern deployment solutions resulted in an almost 3-month outage of business-critical systems.
When considering how organizations can improve recovery capabilities, they should first ensure their complex environments are documented. This can be done through traditional Visio and Word docs, or by modernizing the environment and introducing DevOps concepts such as Infrastructure as Code and containerization. These concepts separate the data from the system, making recovery of data and systems discrete, repeatable, and fast.
On the measurement side of the house, as with incident response tabletop exercises, practice makes progress, and utilizing Infrastructure as Code and containerization greatly simplifies redeploying servers, operating systems, and applications in a repeatable manner. Running tabletop exercises and testing the redeployment of business-critical systems before you need to is a surefire way to know how prepared your organization is for the inevitable bad day; a sketch of such a drill follows below.
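For teams already running containerized workloads, a timed recovery drill can be as simple as the following sketch. It assumes a hypothetical docker-compose.yml describing the business-critical stack and a hypothetical restore_backup.sh script that reloads the latest data backup; both names are placeholders, not a prescribed toolchain.

```python
import subprocess
import time

# Illustrative drill steps: rebuild the stack, then restore its data.
# docker-compose.yml and restore_backup.sh are hypothetical placeholders.
STEPS = [
    ("rebuild stack", ["docker", "compose", "-f", "docker-compose.yml", "up", "-d"]),
    ("restore data",  ["./restore_backup.sh"]),
]

def run_drill():
    start = time.monotonic()
    for name, cmd in STEPS:
        step_start = time.monotonic()
        subprocess.run(cmd, check=True)  # fail loudly: a broken step is a finding, not a footnote
        print(f"{name}: {time.monotonic() - step_start:.0f}s")
    print(f"time-to-recover (drill): {(time.monotonic() - start) / 60:.1f} min")

if __name__ == "__main__":
    run_drill()
```

The number that script prints on a quiet Tuesday is a far better estimate of your real recovery time than anything written in a plan that has never been exercised.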
In closing, I want to be clear: we should not be trading one silver-bullet solution for another. Prevention is critical and will keep the business operating most days. However, when that bad day comes, and statistically speaking it will, response and recovery capabilities are going to be the difference between a nuisance data breach and a catastrophic business interruption.