All Articles

How to Benchmark Agentic AI in the SOC: A Practical Guide

Learn how to benchmark agentic AI solutions in your SOC effectively. This guide provides a practical approach to evaluating performance in real-world environments.

Ron Marsiano
URL copied

SOC teams today are under pressure. Alert volumes are overwhelming, investigations are piling up, and teams are short on resources. Many SOCs are forced to suppress detection rules or delay investigations just to keep pace. As the burden grows, agentic AI solutions are gaining traction as a way to reduce manual work, scale expertise, and speed up decision-making.

At the same time, not all solutions deliver the same value. With new tools emerging constantly, security leaders need a way to assess which systems actually improve outcomes. Demos may look impressive, but what matters is how well the system works in real environments, with your team, your tools, and your workflows.

This guide outlines a practical approach to benchmarking agentic AI in the SOC. The goal is to evaluate performance where it counts: in your daily operations, not in a sales environment.

What Agentic AI Means in the SOC

Agentic AI refers to systems that reason and act with autonomy across the investigation process. Rather than following fixed rules or scripts, they are designed to take a goal, such as understanding an alert or verifying a threat, and figure out the steps to achieve it. That includes retrieving evidence, correlating data, assessing risk, and initiating a response.

These systems are built for flexibility. They interpret data, ask questions, and adjust based on what they find. In the SOC, that means helping analysts triage alerts, investigate incidents, and reduce manual effort. But because they adapt to their environment, evaluating them requires more than a checklist. You have to see how they work in context.

1. Start by Understanding Your Own SOC

Before diving into use cases, you need a clear understanding of your current environment. What tools do you rely on? What types of alerts are flooding your queue? Where are your analysts spending most of their time? And just as importantly, where are they truly needed?

Ask:

  • What types of alerts do you want to automate?
  • How long does it currently take to acknowledge and investigate those alerts?
  • Where are your analysts delivering critical value through judgment and expertise?
  • Where is their time being drained by manual or repetitive tasks?
  • Which tools and systems hold key context or history that investigations rely on?

This understanding helps scope the problem and identify where agentic AI can make the most impact. For example, a user-reported phishing email that follows a predictable structure is a strong candidate for automation.

On the other hand, a suspicious identity-based alert involving cross-cloud access, irregular privileges, and unfamiliar assets may be better suited for manual investigation. These cases require analysts to think creatively, assess multiple possibilities, and make decisions based on a broader organizational context.

Benchmarking is only meaningful when it reflects your reality. Generic tests or template use cases won’t surface the same challenges your team faces daily. Evaluations must mirror your data, your processes, and your decision logic.

Otherwise, you’ll face a painful gap between what the system shows in a demo and what it delivers in production. Your SOC is not a demo environment, and your organization isn’t interchangeable with anyone else’s. You need a system that can operate effectively in your real world, not just in theory but in practice.

2. Build the Benchmark Around Real Use Cases

Once you understand where you need automation and where you don’t, the next step is selecting the right use cases to evaluate. Focus on alert types that occur frequently and drain analyst time. Avoid artificial scenarios that make the system look good but don’t test it meaningfully.

Shape the evaluation around:

  • The alerts you want to offload
  • The tools already integrated into your environment
  • The logic your analysts use to escalate or resolve investigations

If the system can’t navigate your real workflows or access the data that matters, it won’t deliver value even if it performs well in a controlled setting.

3. Understand Where Your Context Lives

Accurate investigations depend on more than just alerts. Critical context often lives in ticketing systems, identity providers, asset inventories, previous incident records, or email gateways.

Your evaluation should examine:

  • Which systems store the data your analysts need during an investigation
  • Whether the agentic system integrates directly with those systems
  • How well it surfaces and applies relevant context at decision points

It’s not enough for a system to be technically integrated. It needs to pull the right context at the right time. Otherwise, workflows may complete, but analysts still need to jump in to validate or fill gaps manually.

4. Keep Analysts in the Loop

Agentic systems are not meant to replace analysts. Their value comes from working alongside humans: surfacing reasoning, offering speed, and allowing feedback that improves performance over time.

Your evaluation should test:

  • Whether the system explains what it’s doing and why
  • If analysts can give feedback or course-correct
  • How easily logic and outcomes can be reviewed or tuned

When it comes to accuracy, two areas matter most:

  • False negatives: when real threats are missed or misclassified
  • False positives: when harmless activity is escalated unnecessarily

False negatives are a direct risk to the organization. False positives create long-term fatigue.

Critically, you should also evaluate how the system evolves over time. Is it learning from analyst feedback? Is it getting better with repeated exposure to similar cases? A system that doesn’t improve will struggle to generalize and scale across different use cases. Without measurable learning and adaptation, you can’t count on consistent value beyond the initial deployment.

5. Measure Time Saved in the Right Context

Time savings is often used to justify automation, but it only matters when tied to actual analyst workload. Don’t just look at how fast a case is resolved: consider how often that case type occurs and how much effort it typically requires.

To evaluate this, measure:

  • How long it takes today to investigate each alert type
  • How frequently those alerts happen
  • Whether the system fully resolves them or only assists

Use a simple formula to estimate potential impact:

  • Time Saved = Alert Volume × MTTR
    (where MTTR = MTTA + MTTI)

This provides a grounded view of where automation will drive real efficiency. MTTA (mean time to acknowledge) and MTTI (mean time to investigate) help capture the full response timeline and show how much manual work can be offloaded.

Some alerts are rare but time-consuming. Others are frequent and simple. Prioritize high-volume, moderately complex workflows. These are often the best candidates for automation with meaningful long-term value. Avoid chasing flashy edge cases that won’t significantly impact operational burden.

6. Prioritize Reliability

It doesn’t matter how powerful a system is if it fails regularly or requires constant oversight. Reliability is the foundation of trust, and trust is what drives adoption.

Track:

  • How often do workflows complete without breaking
  • Whether results are consistent across similar inputs
  • How often manual recovery is needed

If analysts don’t trust the output, they won’t use it. And if they constantly have to step in, the system becomes another point of friction, not relief.

Final Thoughts

Agentic AI has the potential to reshape SOC operations. But realizing that potential depends on how well the system performs in your real-world conditions. The strongest solutions adapt to your environment, support your team, and deliver consistent value over time.

When evaluating, focus on:

  • Your actual alert types, workflows, and operational goals
  • The tools and systems that store the context your team depends on
  • Analyst involvement, feedback loops, and decision transparency
  • Real time savings tied to the volume and complexity of your alerts
  • Reliability and trust in day-to-day performance

The best system is the one that fits — not just in theory, but in the reality of your SOC.

URL copied

The security industry spent years debating when attackers would gain capabilities once out of reach — nation-state-level offensive tooling, zero-day discovery at scale, exploits built and iterated in minutes.

That gap was real. And it gave organizations the impression that the decision about which AI to bring into security operations, and how to do it right, could wait until the picture was clearer.

Mythos ended that assumption. 

Not because of the model's size or strength, but because by the time Anthropic announced it, Mythos had already found thousands of high-severity vulnerabilities across every major operating system and browser in use today, without being told where to look. The decision not to release is the signal everyone was looking for.

That changes the implementation question. It was never acceptable to deploy AI badly in the SOC. Now it's not acceptable to deploy it slowly either. The organizations that will come out on top in the next 12 months are the ones that move fast and get it right, and most of the industry is about to discover that those aren't the same thing.

Level set: defenders have always been behind

The average breach lifecycle was already 258 days before AI-assisted attacks became the norm. This has nothing to do with the capabilities of analysts. Human-speed defense against machine-speed offense was always a losing equation.

Mythos-class models will almost certainly expand this breach lifecycle delta. 

Most Implementations Are Getting It Wrong

87% of organizations experienced an AI-driven cyberattack in the past year. Security teams know they need AI. Most are already moving. But most implementations are failing for the same reason, and it is not the technology. It is a missing critical datapoint.

You. The context that shapes your business.

Most AI SOC tools treat every organization as interchangeable. They integrate with your SIEM, your EDR, your threat intel platforms, and assume that is enough. It is not. What determines whether AI actually works in your environment has nothing to do with the list of integrations. It is the organizational context that no integration can capture.

How is your organization structured? Where does data actually live versus where it is supposed to live? Who owns what, and how does that map to investigation and response when something goes wrong? How do escalation paths work in practice, not on paper? And critically, how do you enable the business without interrupting it?

The difference shows up clearly in practice. A heavily regulated enterprise running investigations across proprietary internal platforms looks nothing like a technology company. The organizational context that shapes every investigation, every escalation decision, and every response action is invisible to a system that only sees tool outputs. 

Closing that gap is the foundational requirement that most implementations skip entirely.

Org Context Is Not a One-Time Setup

This is where most implementations fail, even when they start well.

Organizational context is not a configuration you complete on day one. Your organization is a living thing. Teams change. Tools get added. Processes evolve. New subsidiaries appear. Risk posture shifts with every acquisition, every regulatory update, every new product line the business launches. 

An AI system that ingested your context six months ago and stopped learning is already drifting from your reality. It is making decisions based on an organization that no longer exists.

The right model is not a one-off ingestion. It is a continuous learning system that stays embedded in how your organization actually operates, tracks how investigations unfold, incorporates analyst feedback, and updates its understanding as your environment changes. 

Not a snapshot. 

A persistent model of your specific organization that evolves with it.

What Good Implementation Actually Looks Like

First, AI systems needs to understand how your organization actually operates. Not how it is documented, but how investigations really unfold, where data actually lives, and how decisions get made under pressure. The gap between what is written down and what actually happens is where most AI systems fail.

Second, that understanding cannot be static. Organizations change constantly. New teams, new tools, new processes, new risk priorities. Any system that relies on a snapshot of your environment will drift from reality and degrade over time. The AI working in your environment needs to keep learning it, not just learn it once.

Third, it needs to operate within that context, not around it. Producing technically correct outputs is not enough. The system needs to produce outcomes that are actionable within your organization as it exists today. That means working within your existing workflows, tools, and constraints without asking you to change how you operate to accommodate it.

That is the standard. Systems built around this model behave differently from the start. They do not ask organizations to adapt to them. They adapt to the organization. That distinction is where most implementations succeed or fail, and it is where the industry is slowly converging.

The Only Durable Path

The organizations getting AI right in the SOC aren't the ones with the longest integration lists or the biggest models. They're the ones that treated organizational context as the foundation rather than the afterthought, and built systems that keep learning their environment rather than freezing it in place on day one.

That is a harder implementation. It requires more from the vendor and more from the buyer. But Mythos made the timeline for getting there non-negotiable. The organizations that move fast on the wrong implementation will spend the next year rebuilding. The ones that move slowly on the right one will spend it exposed. The only durable path is moving quickly on the version that actually holds up. Systems built on continuous organizational context, deployed now rather than after the next incident, force the question.

The gap that used to buy time for deliberation is gone. What's left is the quality of the decision you make in its absence.

AI
Fast and Right: Implementing AI in the SOC After Mythos
April 23, 2026
12
min read

Mythos ended the debate on whether AI belongs in the SOC. The new question is how to deploy it right and why organizational context is the foundation most implementations skip.

Ron Marsiano

Anthropic was right (and responsible) to release Mythos first to cybersecurity researchers and a select group of organizations through Project Glasswing. It is a genuinely remarkable model. And the security community should take it seriously. What is available to defenders today will be in the hands of attackers in a few months. That window is closing fast.

Mythos raises the ceiling on what AI can do in cybersecurity tasks. It discovers zero-day vulnerabilities in codebases that previous models could not find. It reverse-engineers complex systems. It constructs sophisticated, multi-path exploits at scale. The capabilities that were previously accessible only to well-funded nation-state actors can now be replicated by a far broader set of threat actors. No longer do you need teams of expert reverse engineers and months of reconnaissance.

The threat landscape is structurally shifting. We will be determined by our ability to shift our defense in kind. Quickly.

Where AI in defense needs to go first

The industry is converging, rightly, on vulnerability research and remediation as the priority. Scanning your own codebase with the same class of models that attackers are using is a clear first step. In many cases, defenders actually have an asymmetric advantage here, as we have better access to our own code than attackers do.

The harder problem is remediation. We already carry significant backlogs of unresolved, sometimes exploitable, vulnerabilities. Unlike an attacker who has nothing to lose, defenders cannot afford mistakes. Our systems are in production. Downtime has real costs. The asymmetry of attacker agility versus defender accountability is where the gap widens.

AI-assisted vulnerability remediation at scale is necessary. But it is not a solved problem, and any honest assessment of the landscape has to acknowledge that.

What this means for security operations

The idea of static detections designed to discover dynamic adversaries is fundamentally misguided. The future is better trip wires and an assume-breach mentality.

For SOC teams, the implications are direct. The scale and complexity of attacks is accelerating. We should expect a higher volume of sophisticated attacks that actively evade detection, that do not conform to known signatures or behavioral patterns, and that are designed from the ground up to stay invisible.

This breaks the model that most SOC programs are built on. The idea of maintaining a library of static detections to catch dynamic adversaries has always had limits. Those limits are now being exposed in real time.

What we need instead is the ability to detect a high volume of low-fidelity signals, such as anomalies in endpoint behavior, data access patterns, email activity, network flows, and identity. This requires teams to investigate each one as if it were the leading edge of a sophisticated breach. Not because every alert is a nation-state intrusion, but rather, we should expect that a higher percentage now may be.

The question is no longer whether to adopt AI in security operations. This is clearly needed. We cannot scale defenses solely on human labor. 

The question is how to do it in a way that actually works inside the operational reality that security teams live in.

The real challenge is operational reality

Enterprises have legacy and custom tools, established processes, compliance and audit requirements, escalation paths, and oversight obligations that are not optional. AI cannot simply replace this infrastructure. It has to work within it.

You cannot properly scale your defenses without giving AI access to your organizational context, including your tools, your processes, your detection logic, and your escalation criteria. AI agents need to be able to investigate with the consistency and rigor of an incredible IR analyst, operate transparently, and support human oversight at the points where it matters.

This is precisely what we built Legion to do: meet organizations where they are. Our platform learns your existing tools, processes and context and makes them accessible to the latest frontier models (now Mythos, and every model in the future). From that we create structured, repeatable workflows where consistency is required or fully agentic investigations that require depth and judgment. Every action is auditable. Human-in-the-loop controls are configurable. And the system integrates across your entire stack.

My conclusion - Assume breach, investigate everything, build for the attacker that has already found the vulnerability you have not patched yet, and is using Mythos-level models to stay ahead of your detections.

AI
The Future is Not Better Detections
April 14, 2026
10
min read

In the wake of Mythos and Project Glasswing, security operations teams need AI that meets them where they are.

Ely Abramovitch

Picture a senior analyst mid-investigation. Eight browser tabs open across CrowdStrike, VirusTotal, Defender, and Microsoft Entra. She's running a hunting query in one window, checking an IP reputation score in another. And somewhere in between, she's documenting. Taking screenshots, copying log entries into a case note, capturing context before it slips away.

This is the job. Investigations today aren't just about finding the threat. They're about moving across tools, pulling together evidence from a dozen different sources, and building a record that another analyst, or an auditor, or a manager, can actually follow. The documentation isn't a distraction from the work. It is part of the work.

Everyone in security has lived that. 

Which raises a question that's been easy to ignore until now: if we wouldn't accept an analyst who said "trust me, I looked at it"- why are we accepting that from AI agents?

Evidence Has Always Been the Standard

The reason SOC analysts document isn't distrust. It's precision. A good investigation has always meant showing your work. The summary an analyst writes is their claim, the insight they've drawn from what they saw. The screenshot is the fact. Undisputable evidence, captured at the moment of discovery. Together they tell the full story: here is what I found, and here is the proof.

Evidence gathering has always been a core part of the job. Screenshots and logs aren't bureaucratic overhead. They're how you distinguish signal from noise, how you close out audit findings, how you hand off a case without losing context.

You Wouldn't Accept "Trust Me" From an Analyst. Stop Accepting It From AI

We hold human analysts to a clear standard. When an analyst closes a case, we expect to see their work. The exact screen they reviewed, the exact query they ran, the exact result that informed their decision. A summary of what they found is a claim. The screenshot is the proof.

We should hold AI agents to the same standard.

Today, most AI SOC give you a verdict and a reason. The agent processed the alert, evaluated the indicators, and concluded it was malicious. But if you ask what it actually saw, you're directed to API logs and structured JSON responses. That's not evidence. That's a reconstruction built after the fact, from data that was never meant to be read by a human auditor in the first place.

The gap between what an AI agent did and what you can actually verify is where hallucination risk lives. A summary can sound confident and still be wrong. Without visual evidence captured at the moment of the decision, you have no way to know what the system actually encountered.

Legion operates differently. Instead of calling APIs, Legion navigates your source systems directly through the browser, the same way a human analyst would. It opens the actual system, reads the actual screen, and captures a screenshot of exactly what it sees at every step. The summary is the claim. The screenshot is the fact.

That's the standard we believe AI investigations should meet. And it's the only architecture that meets it. 

How Legion Automates Evidence Gathering

Legion Evidence Gathering captures visual proof of every action Legion takes as it navigates your source systems, automatically, in real time.

Take a malware investigation spanning CrowdStrike, VirusTotal, and Defender. Legion opens the originating ticket, reads the case, and begins investigating. As it moves through each tool, it takes a screenshot at every step. The CrowdStrike detection page as it appeared. The VirusTotal result in context. The Defender hunting query and its output. Every interface, exactly as Legion saw it.

By the time an analyst opens the case, the full evidence gallery is already there. Screenshots organized sequentially, labeled by tool, timestamped, and ready to review. Not just a summary. Not just a log. The complete picture: the analysis and the visual evidence behind every conclusion.

And it stays there. Every investigation Legion runs is stored and searchable. When an auditor asks a question, when a peer analyst picks up a handoff, when someone needs to understand why a decision was made, you go back to the session and everything is right there. Every step. Every screen. Nothing reconstructed. Nothing missing.

Different alert types. Different toolchains. The same complete evidence gallery, every time.

AI SOC Tools SOC Analyst Legion
How evidence is captured API responses logged Manual screenshots and log exports taken during investigation Screenshots captured automatically / API response logged
Session replay Not available Not available Full replay of every investigation, step by step, exactly as Legion navigated it
Time to evidence Automatic evidence creation 20–30 minutes Automatic evidence creation
Time to understand Significant analyst effort Easy to consume Easy to consume
Consistency Consistent data capture Varies by analyst Consistent data capture

This Is What Accountable AI Looks Like

We've always known what a good investigation looks like. You show your work. You back your conclusions with evidence. You leave a record that someone else can follow. Legion applies that same standard to every automated investigation it runs, without exception and without manual effort. The bar doesn't move because the analyst is an AI. It stays exactly where it's always been.

See Legion Evidence Gathering in action. Request a Demo

AI SOC Accountability Starts With Evidence
April 3, 2026
10
min read

Legion automates evidence gathering during AI-driven investigations, capturing screenshots from live security tools at every step, so every conclusion is backed by visual proof.

Gili Diamant