Disclaimer

The following story is a work of fiction. Any resemblance to actual persons, living or dead, or actual events is purely coincidental. …Or is it?


In the midst of the COVID-19 pandemic, when remote work had become the norm, I found myself in the heart of a Cyber Threat Intelligence (CTI) team within a large Italian corporation. The pandemic forced us to adapt quickly to new working conditions, adding layers of complexity to our already challenging roles. As the world was grappling with a global health crisis, we were battling a different kind of threat—a digital one, lurking in the shadows of cyberspace.

Our team was tasked with a critical mission: to analyze and validate an overwhelming number of Indicators of Compromise (IoCs) that the company collected from various sources, including Open Source Intelligence (OSINT) and Closed Source Intelligence (CLOSINT). These IoCs, once vetted and categorized, were then distributed to the relevant systems, such as proxies and firewalls for domain, URL, and IP blocking, as well as Endpoint Detection and Response (EDR) systems and anti-malware solutions for hash-based detection.

Given the sheer volume of indicators we had to process daily, and the perpetual shortage of skilled analysts, we were constantly on the lookout for ways to streamline our workflow. So, when one of my analysts proposed the idea of using a semi-automated script he had developed, it seemed like a solution to our problems—or so we thought.

The Allure of Automation

The script in question was a Python-based tool designed to accept a list of IoCs as input and, after validating their reliability using APIs from various services like VirusTotal, RecordedFuture, and IntelX, return a list with confidence scores. The logic was simple: any IoC with a score greater than 70% was considered trustworthy enough to be automatically sent to the appropriate systems without further human intervention, while the rest would require additional analysis by a specialist.
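Conceptually, the heart of the script looked something like the sketch below. This is a reconstruction for illustration, not the original code: the function names, the equal treatment of the three services, and the exact 80/20 blend between service verdicts and source reputation are all assumptions.

```python
import statistics

CONFIDENCE_THRESHOLD = 0.70  # scores above this skip human review

def score_ioc(service_scores: list[float], source_reputation: float) -> float:
    """Blend per-service verdicts with the reputation of the feed's source.

    service_scores: normalized 0.0-1.0 verdicts from services such as
    VirusTotal, RecordedFuture, and IntelX. source_reputation: 0.0-1.0
    weight assigned to the feed provider. The 80/20 blend is illustrative.
    """
    return 0.8 * statistics.mean(service_scores) + 0.2 * source_reputation

def triage(iocs, source_reputation, lookup):
    """Split IoCs into auto-distribute and needs-an-analyst buckets."""
    auto_send, needs_analyst = [], []
    for ioc in iocs:
        score = score_ioc(lookup(ioc), source_reputation)
        (auto_send if score > CONFIDENCE_THRESHOLD else needs_analyst).append(ioc)
    return auto_send, needs_analyst
```

Anything in the auto-send bucket went straight to the proxies, firewalls, and EDR; everything else landed in an analyst's queue.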

Initial Success and the Honeymoon Period

After several weeks of testing, the script’s reliability appeared to be solid. We subjected it to a variety of scenarios, feeding it IoCs from different sources and cross-referencing the results with manual analysis. The script performed admirably, consistently producing results that aligned with our expectations.

The team unanimously decided to put the script into production, and it didn’t take long to see the benefits. The time it took to process IoCs dropped dramatically, allowing us to handle the ever-growing influx of data more efficiently. The manual workload on our analysts decreased, freeing them up to focus on more complex and strategic tasks, such as threat hunting and in-depth analysis of advanced persistent threats (APTs).

There was a palpable sense of relief and even excitement in the team. We had found a way to cope with the relentless demands of our job, and the script seemed to be the answer to our prayers. Automation was our new best friend—or so we believed.

The Dark Side of Automation

However, as is often the case with seemingly perfect solutions, the honeymoon period didn’t last. One fateful day, a series of unfortunate events unfolded that completely changed my perspective on the dangers of over-reliance on automation.

The Feed That Triggered a Catastrophe

It all started with a feed provided by a reputable Italian agency, which contained indicators for blocking a recent campaign of the RemCos malware—a Remote Access Trojan (RAT) that had been making headlines in the cybersecurity world. The agency was considered highly reliable, and their indicators were typically spot-on. As a result, these IoCs were processed by our beloved automated script without much scrutiny.

What happened next was a textbook example of how small oversights and errors can snowball into a full-blown disaster.

The Perfect Storm: A Series of Unfortunate Events

Several factors contributed to the catastrophe that ensued:

  1. Human Error at the Source:
    • The analyst at the agency had mistakenly exported from the sandbox not just the hashes of the malicious files but the hashes of every file the malware had touched during execution. This included critical Windows system libraries, such as NTDLL.DLL and USER32.DLL.
    • In a rush to disseminate the indicators, the analyst failed to filter out these benign files, assuming that the sandbox’s output was clean and reliable. This error was the first link in a chain of events that would soon lead to disaster.
  2. Source Credibility Bias:
    • The high authority of the source contributed to an inflated confidence score for these indicators. Our script, designed to weigh the credibility of the source heavily in its scoring algorithm, pushed these IoCs above the 70% threshold, effectively marking them as “safe” for automatic distribution.
    • This decision bypassed the usual checks and balances that would have caught the inclusion of non-malicious files, highlighting the dangers of over-reliance on trusted sources without additional verification.
  3. API Failures and Their Consequences:
    • On that particular day, we experienced an issue with our VirusTotal API keys, which prevented the script from performing a complete verification of all the hashes. VirusTotal’s API is a critical component in our validation process, as it aggregates data from multiple antivirus engines and other security tools, providing a comprehensive assessment of a file’s threat level.
    • The API failure left the script blind to the true nature of the hashes it was processing. Under normal circumstances, the VirusTotal lookup would have identified the Windows system files as well-known, benign binaries and excluded them; without that verification, the script proceeded as if everything was in order.
  4. Flawed Error Handling:
    • The script’s error-handling mechanism had a critical flaw: it ignored the failed verification caused by the API outage and continued to process the indicators as if nothing were wrong. This was a classic silent failure, an error that goes unnoticed because the system never raises an alarm. A simplified sketch of this failure mode follows this list.
    • As a result, the system treated the harmless Windows files as confirmed threats, marking them for quarantine or deletion by our EDR systems.
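To make that failure mode concrete, here is a minimal sketch of the pattern. The names are invented and our actual code was more elaborate, but the essential bug, a broad except that substitutes a fallback score instead of stopping, was the same.

```python
class ApiError(Exception):
    """Raised when a reputation service is unreachable or over quota."""

def lookup_virustotal(file_hash: str) -> float:
    """Return a 0.0-1.0 maliciousness verdict from VirusTotal (stubbed)."""
    raise ApiError("quota exceeded")  # simulates the broken API keys that day

def verify_hash(file_hash: str, source_reputation: float) -> float:
    try:
        vt_score = lookup_virustotal(file_hash)
    except ApiError:
        # BUG: the failure is swallowed and the hash is scored on source
        # reputation alone. From a highly trusted feed, even the hash of
        # a core Windows DLL clears the 70% threshold and is queued for
        # automatic quarantine.
        return source_reputation
    return 0.8 * vt_score + 0.2 * source_reputation
```

With the source reputation of the agency's feed set high, say 0.95, every unverified hash sailed past the threshold.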

The Catastrophic Outcome

The confluence of these factors led to our EDR system receiving a list of hashes that included not only the malicious ones but also critical Windows system files. The EDR, trusting the automated process, attempted to quarantine or delete these files. The result was nothing short of disastrous.

Within minutes, over 2,000 endpoints across the organization, all operating in full remote mode due to the pandemic, crashed to a Blue Screen of Death (BSOD). The affected machines were left completely unbootable, locking employees, who were already struggling with the challenges of remote work, out of their systems entirely.

The Immediate Aftermath: Chaos and Recovery

The incident threw the entire organization into chaos. IT teams scrambled to identify the problem, but the root cause wasn’t immediately clear. Was it a widespread malware attack? A critical system failure? It took hours of frantic investigation to trace the issue back to the automated script and the flawed IoCs that had been fed into it.

Once the cause was identified, the next challenge was recovery. With thousands of machines rendered inoperable, the task of restoring them was monumental. The IT department had to guide remote workers through a complex recovery process, which involved booting from external media, restoring system files, and in some cases, completely re-imaging the affected systems.

The impact on business operations was severe. Critical projects were delayed, and productivity took a significant hit. Moreover, the incident led to a loss of trust—not just in our automated processes but in our entire cybersecurity infrastructure. The incident raised serious questions about our reliance on automation and the robustness of our validation processes.

Lessons Learned: The Human Element in Automation

In the days following the incident, we conducted a thorough post-mortem analysis to understand how things had gone so terribly wrong. The conclusions we reached have profoundly shaped my approach to cybersecurity ever since.

1. Automation Needs Oversight

No matter how sophisticated an automated process might be, it should never be left to run entirely without human supervision. In our case, the script’s failure to properly handle API errors and the misplaced trust in the authority of the source contributed to a disaster that could have been avoided with more stringent oversight.

Automated systems are excellent at processing large volumes of data quickly, but they lack the intuition and critical thinking that human analysts bring to the table. It’s essential to strike a balance between speed and accuracy, ensuring that automation complements, rather than replaces, human judgment.

2. Human Error is Inevitable

Even the most reputable sources can make mistakes, and those errors can have far-reaching consequences. It’s essential to maintain a healthy level of skepticism and always cross-check information, especially when it comes to something as critical as system security.

In this case, a simple mistake by an external analyst—one that could happen to anyone—led to a catastrophic chain of events. This underscores the importance of having multiple layers of validation and not relying solely on the authority of a source.

3. Error Handling is Crucial

In complex systems, robust error handling is not just a nice-to-have; it’s a necessity. Our script’s failure to handle API errors correctly was a key factor in the incident. Ensuring that automated processes can gracefully handle and respond to unexpected issues is critical to preventing small glitches from escalating into major incidents.

This includes proper logging, alerting, and fallback mechanisms that can catch and address issues before they cause widespread damage. In our case, a simple check to ensure that all required API calls had been successfully completed could have prevented the entire incident.
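A fail-closed version of the same routine, again a hedged sketch rather than our production code, makes the difference obvious: if verification cannot be completed, the failure is logged loudly and the IoC is routed to a human instead of being scored on guesswork.

```python
import logging

logger = logging.getLogger("ioc_pipeline")

class ApiError(Exception):
    """Raised when a reputation service is unreachable or over quota."""

def verify_hash_fail_closed(file_hash, source_reputation, lookup):
    """Return (score, verified); an unverified hash must never be auto-blocked."""
    try:
        vt_score = lookup(file_hash)
    except ApiError as exc:
        # Fail closed: record the problem and defer to an analyst.
        logger.error("Verification failed for %s: %s", file_hash, exc)
        return None, False
    return 0.8 * vt_score + 0.2 * source_reputation, True

def dispatch(file_hash, source_reputation, lookup, threshold=0.70):
    score, verified = verify_hash_fail_closed(file_hash, source_reputation, lookup)
    if not verified or score <= threshold:
        return "manual_review"  # unverified or low confidence: a human decides
    return "auto_block"
```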

4. Don’t Over-Rely on Automation

While automation can dramatically increase efficiency, it’s important to recognize its limitations. The rush to implement automated solutions should never come at the cost of rigorous testing and validation. In our case, the desire to streamline processes led to a blind spot that cost us dearly.

Automation should be seen as a tool to augment human capabilities, not as a replacement for them. There are certain tasks, particularly those involving complex decision-making and nuanced analysis, where human intervention is still irreplaceable.

5. Adequate Staffing is Essential

The root cause of our over-reliance on automation was a shortage of skilled analysts. This incident underscored the importance of having enough qualified personnel to handle critical tasks.

In cybersecurity, where the stakes are high and the consequences of failure can be severe, having a well-staffed team of skilled analysts is essential. Automation can help manage the workload, but it should never be a substitute for the expertise and insight that experienced professionals bring to the table.

6. The Importance of Continuous Improvement

Finally, this incident highlighted the need for continuous improvement in our processes and tools. The cybersecurity landscape is constantly evolving, and what works today might not be sufficient tomorrow. Regular reviews, updates, and improvements to both automated systems and manual processes are essential to staying ahead of threats.

In the aftermath of the incident, we implemented several changes to our processes, including more rigorous testing of automated tools, improved error handling mechanisms, and additional training for our analysts on the potential pitfalls of automation.
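The "more rigorous testing", for example, now includes regression tests that simulate an API outage and assert that nothing gets auto-blocked. A minimal sketch of such a test, reusing the hypothetical dispatch helper and ApiError from the earlier sketches:

```python
import unittest
# Hypothetical module name; dispatch and ApiError are the helpers
# sketched earlier in this post.
from ioc_pipeline import ApiError, dispatch

class ApiOutageTest(unittest.TestCase):
    def test_outage_routes_to_manual_review(self):
        def broken_lookup(file_hash):
            raise ApiError("service unavailable")
        # Even a hash from a maximally trusted source must not be
        # auto-blocked when verification is impossible.
        self.assertEqual(
            dispatch("44d88612fea8a8f36de82e1278abb02f",  # EICAR test file MD5
                     source_reputation=1.0, lookup=broken_lookup),
            "manual_review")

if __name__ == "__main__":
    unittest.main()
```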

Conclusion

This incident served as a harsh reminder of the dangers of over-automation, particularly in fields as sensitive and complex as cybersecurity. While automation can be an incredibly powerful tool, it should never be treated as a silver bullet. The human element remains crucial, and no amount of automation can replace the need for skilled, thoughtful analysis.

As we continue to navigate an increasingly digital world, the lessons learned from this experience are more relevant than ever. The balance between efficiency and security is a delicate one, and it’s imperative that we approach automation with both caution and respect for its limitations.

In the end, this event taught me that while technology can help us do our jobs better and faster, it's our responsibility to ensure that convenience never overrides the need for careful, deliberate decision-making. Our ultimate goal should always be to protect the systems and data entrusted to us, and sometimes that means taking a step back from automation to make sure we're doing things the right way.