All Articles

How to Benchmark Agentic AI in the SOC: A Practical Guide

Learn how to benchmark agentic AI solutions in your SOC effectively. This guide provides a practical approach to evaluating performance in real-world environments.

Ron Marsiano
URL copied

SOC teams today are under pressure. Alert volumes are overwhelming, investigations are piling up, and teams are short on resources. Many SOCs are forced to suppress detection rules or delay investigations just to keep pace. As the burden grows, agentic AI solutions are gaining traction as a way to reduce manual work, scale expertise, and speed up decision-making.

At the same time, not all solutions deliver the same value. With new tools emerging constantly, security leaders need a way to assess which systems actually improve outcomes. Demos may look impressive, but what matters is how well the system works in real environments, with your team, your tools, and your workflows.

This guide outlines a practical approach to benchmarking agentic AI in the SOC. The goal is to evaluate performance where it counts: in your daily operations, not in a sales environment.

What Agentic AI Means in the SOC

Agentic AI refers to systems that reason and act with autonomy across the investigation process. Rather than following fixed rules or scripts, they are designed to take a goal, such as understanding an alert or verifying a threat, and figure out the steps to achieve it. That includes retrieving evidence, correlating data, assessing risk, and initiating a response.

These systems are built for flexibility. They interpret data, ask questions, and adjust based on what they find. In the SOC, that means helping analysts triage alerts, investigate incidents, and reduce manual effort. But because they adapt to their environment, evaluating them requires more than a checklist. You have to see how they work in context.

1. Start by Understanding Your Own SOC

Before diving into use cases, you need a clear understanding of your current environment. What tools do you rely on? What types of alerts are flooding your queue? Where are your analysts spending most of their time? And just as importantly, where are they truly needed?

Ask:

  • What types of alerts do you want to automate?
  • How long does it currently take to acknowledge and investigate those alerts?
  • Where are your analysts delivering critical value through judgment and expertise?
  • Where is their time being drained by manual or repetitive tasks?
  • Which tools and systems hold key context or history that investigations rely on?

This understanding helps scope the problem and identify where agentic AI can make the most impact. For example, a user-reported phishing email that follows a predictable structure is a strong candidate for automation.

On the other hand, a suspicious identity-based alert involving cross-cloud access, irregular privileges, and unfamiliar assets may be better suited for manual investigation. These cases require analysts to think creatively, assess multiple possibilities, and make decisions based on a broader organizational context.

Benchmarking is only meaningful when it reflects your reality. Generic tests or template use cases won’t surface the same challenges your team faces daily. Evaluations must mirror your data, your processes, and your decision logic.

Otherwise, you’ll face a painful gap between what the system shows in a demo and what it delivers in production. Your SOC is not a demo environment, and your organization isn’t interchangeable with anyone else’s. You need a system that can operate effectively in your real world, not just in theory but in practice.

2. Build the Benchmark Around Real Use Cases

Once you understand where you need automation and where you don’t, the next step is selecting the right use cases to evaluate. Focus on alert types that occur frequently and drain analyst time. Avoid artificial scenarios that make the system look good but don’t test it meaningfully.

Shape the evaluation around:

  • The alerts you want to offload
  • The tools already integrated into your environment
  • The logic your analysts use to escalate or resolve investigations

If the system can’t navigate your real workflows or access the data that matters, it won’t deliver value even if it performs well in a controlled setting.

3. Understand Where Your Context Lives

Accurate investigations depend on more than just alerts. Critical context often lives in ticketing systems, identity providers, asset inventories, previous incident records, or email gateways.

Your evaluation should examine:

  • Which systems store the data your analysts need during an investigation
  • Whether the agentic system integrates directly with those systems
  • How well it surfaces and applies relevant context at decision points

It’s not enough for a system to be technically integrated. It needs to pull the right context at the right time. Otherwise, workflows may complete, but analysts still need to jump in to validate or fill gaps manually.

4. Keep Analysts in the Loop

Agentic systems are not meant to replace analysts. Their value comes from working alongside humans: surfacing reasoning, offering speed, and allowing feedback that improves performance over time.

Your evaluation should test:

  • Whether the system explains what it’s doing and why
  • If analysts can give feedback or course-correct
  • How easily logic and outcomes can be reviewed or tuned

When it comes to accuracy, two areas matter most:

  • False negatives: when real threats are missed or misclassified
  • False positives: when harmless activity is escalated unnecessarily

False negatives are a direct risk to the organization. False positives create long-term fatigue.

Critically, you should also evaluate how the system evolves over time. Is it learning from analyst feedback? Is it getting better with repeated exposure to similar cases? A system that doesn’t improve will struggle to generalize and scale across different use cases. Without measurable learning and adaptation, you can’t count on consistent value beyond the initial deployment.

5. Measure Time Saved in the Right Context

Time savings is often used to justify automation, but it only matters when tied to actual analyst workload. Don’t just look at how fast a case is resolved: consider how often that case type occurs and how much effort it typically requires.

To evaluate this, measure:

  • How long it takes today to investigate each alert type
  • How frequently those alerts happen
  • Whether the system fully resolves them or only assists

Use a simple formula to estimate potential impact:

  • Time Saved = Alert Volume × MTTR
    (where MTTR = MTTA + MTTI)

This provides a grounded view of where automation will drive real efficiency. MTTA (mean time to acknowledge) and MTTI (mean time to investigate) help capture the full response timeline and show how much manual work can be offloaded.

Some alerts are rare but time-consuming. Others are frequent and simple. Prioritize high-volume, moderately complex workflows. These are often the best candidates for automation with meaningful long-term value. Avoid chasing flashy edge cases that won’t significantly impact operational burden.

6. Prioritize Reliability

It doesn’t matter how powerful a system is if it fails regularly or requires constant oversight. Reliability is the foundation of trust, and trust is what drives adoption.

Track:

  • How often do workflows complete without breaking
  • Whether results are consistent across similar inputs
  • How often manual recovery is needed

If analysts don’t trust the output, they won’t use it. And if they constantly have to step in, the system becomes another point of friction, not relief.

Final Thoughts

Agentic AI has the potential to reshape SOC operations. But realizing that potential depends on how well the system performs in your real-world conditions. The strongest solutions adapt to your environment, support your team, and deliver consistent value over time.

When evaluating, focus on:

  • Your actual alert types, workflows, and operational goals
  • The tools and systems that store the context your team depends on
  • Analyst involvement, feedback loops, and decision transparency
  • Real time savings tied to the volume and complexity of your alerts
  • Reliability and trust in day-to-day performance

The best system is the one that fits — not just in theory, but in the reality of your SOC.

URL copied

Anthropic was right (and responsible) to release Mythos first to cybersecurity researchers and a select group of organizations through Project Glasswing. It is a genuinely remarkable model. And the security community should take it seriously. What is available to defenders today will be in the hands of attackers in a few months. That window is closing fast.

Mythos raises the ceiling on what AI can do in cybersecurity tasks. It discovers zero-day vulnerabilities in codebases that previous models could not find. It reverse-engineers complex systems. It constructs sophisticated, multi-path exploits at scale. The capabilities that were previously accessible only to well-funded nation-state actors can now be replicated by a far broader set of threat actors. No longer do you need teams of expert reverse engineers and months of reconnaissance.

The threat landscape is structurally shifting. We will be determined by our ability to shift our defense in kind. Quickly.

Where AI in defense needs to go first

The industry is converging, rightly, on vulnerability research and remediation as the priority. Scanning your own codebase with the same class of models that attackers are using is a clear first step. In many cases, defenders actually have an asymmetric advantage here, as we have better access to our own code than attackers do.

The harder problem is remediation. We already carry significant backlogs of unresolved, sometimes exploitable, vulnerabilities. Unlike an attacker who has nothing to lose, defenders cannot afford mistakes. Our systems are in production. Downtime has real costs. The asymmetry of attacker agility versus defender accountability is where the gap widens.

AI-assisted vulnerability remediation at scale is necessary. But it is not a solved problem, and any honest assessment of the landscape has to acknowledge that.

What this means for security operations

The idea of static detections designed to discover dynamic adversaries is fundamentally misguided. The future is better trip wires and an assume-breach mentality.

For SOC teams, the implications are direct. The scale and complexity of attacks is accelerating. We should expect a higher volume of sophisticated attacks that actively evade detection, that do not conform to known signatures or behavioral patterns, and that are designed from the ground up to stay invisible.

This breaks the model that most SOC programs are built on. The idea of maintaining a library of static detections to catch dynamic adversaries has always had limits. Those limits are now being exposed in real time.

What we need instead is the ability to detect a high volume of low-fidelity signals, such as anomalies in endpoint behavior, data access patterns, email activity, network flows, and identity. This requires teams to investigate each one as if it were the leading edge of a sophisticated breach. Not because every alert is a nation-state intrusion, but rather, we should expect that a higher percentage now may be.

The question is no longer whether to adopt AI in security operations. This is clearly needed. We cannot scale defenses solely on human labor. 

The question is how to do it in a way that actually works inside the operational reality that security teams live in.

The real challenge is operational reality

Enterprises have legacy and custom tools, established processes, compliance and audit requirements, escalation paths, and oversight obligations that are not optional. AI cannot simply replace this infrastructure. It has to work within it.

You cannot properly scale your defenses without giving AI access to your organizational context, including your tools, your processes, your detection logic, and your escalation criteria. AI agents need to be able to investigate with the consistency and rigor of an incredible IR analyst, operate transparently, and support human oversight at the points where it matters.

This is precisely what we built Legion to do: meet organizations where they are. Our platform learns your existing tools, processes and context and makes them accessible to the latest frontier models (now Mythos, and every model in the future). From that we create structured, repeatable workflows where consistency is required or fully agentic investigations that require depth and judgment. Every action is auditable. Human-in-the-loop controls are configurable. And the system integrates across your entire stack.

My conclusion - Assume breach, investigate everything, build for the attacker that has already found the vulnerability you have not patched yet, and is using Mythos-level models to stay ahead of your detections.

AI
The Future is Not Better Detections
April 14, 2026
10
min read

In the wake of Mythos and Project Glasswing, security operations teams need AI that meets them where they are.

Ely Abramovitch

Picture a senior analyst mid-investigation. Eight browser tabs open across CrowdStrike, VirusTotal, Defender, and Microsoft Entra. She's running a hunting query in one window, checking an IP reputation score in another. And somewhere in between, she's documenting. Taking screenshots, copying log entries into a case note, capturing context before it slips away.

This is the job. Investigations today aren't just about finding the threat. They're about moving across tools, pulling together evidence from a dozen different sources, and building a record that another analyst, or an auditor, or a manager, can actually follow. The documentation isn't a distraction from the work. It is part of the work.

Everyone in security has lived that. 

Which raises a question that's been easy to ignore until now: if we wouldn't accept an analyst who said "trust me, I looked at it"- why are we accepting that from AI agents?

Evidence Has Always Been the Standard

The reason SOC analysts document isn't distrust. It's precision. A good investigation has always meant showing your work. The summary an analyst writes is their claim, the insight they've drawn from what they saw. The screenshot is the fact. Undisputable evidence, captured at the moment of discovery. Together they tell the full story: here is what I found, and here is the proof.

Evidence gathering has always been a core part of the job. Screenshots and logs aren't bureaucratic overhead. They're how you distinguish signal from noise, how you close out audit findings, how you hand off a case without losing context.

You Wouldn't Accept "Trust Me" From an Analyst. Stop Accepting It From AI

We hold human analysts to a clear standard. When an analyst closes a case, we expect to see their work. The exact screen they reviewed, the exact query they ran, the exact result that informed their decision. A summary of what they found is a claim. The screenshot is the proof.

We should hold AI agents to the same standard.

Today, most AI SOC give you a verdict and a reason. The agent processed the alert, evaluated the indicators, and concluded it was malicious. But if you ask what it actually saw, you're directed to API logs and structured JSON responses. That's not evidence. That's a reconstruction built after the fact, from data that was never meant to be read by a human auditor in the first place.

The gap between what an AI agent did and what you can actually verify is where hallucination risk lives. A summary can sound confident and still be wrong. Without visual evidence captured at the moment of the decision, you have no way to know what the system actually encountered.

Legion operates differently. Instead of calling APIs, Legion navigates your source systems directly through the browser, the same way a human analyst would. It opens the actual system, reads the actual screen, and captures a screenshot of exactly what it sees at every step. The summary is the claim. The screenshot is the fact.

That's the standard we believe AI investigations should meet. And it's the only architecture that meets it. 

How Legion Automates Evidence Gathering

Legion Evidence Gathering captures visual proof of every action Legion takes as it navigates your source systems, automatically, in real time.

Take a malware investigation spanning CrowdStrike, VirusTotal, and Defender. Legion opens the originating ticket, reads the case, and begins investigating. As it moves through each tool, it takes a screenshot at every step. The CrowdStrike detection page as it appeared. The VirusTotal result in context. The Defender hunting query and its output. Every interface, exactly as Legion saw it.

By the time an analyst opens the case, the full evidence gallery is already there. Screenshots organized sequentially, labeled by tool, timestamped, and ready to review. Not just a summary. Not just a log. The complete picture: the analysis and the visual evidence behind every conclusion.

And it stays there. Every investigation Legion runs is stored and searchable. When an auditor asks a question, when a peer analyst picks up a handoff, when someone needs to understand why a decision was made, you go back to the session and everything is right there. Every step. Every screen. Nothing reconstructed. Nothing missing.

Different alert types. Different toolchains. The same complete evidence gallery, every time.

AI SOC Tools SOC Analyst Legion
How evidence is captured API responses logged Manual screenshots and log exports taken during investigation Screenshots captured automatically / API response logged
Session replay Not available Not available Full replay of every investigation, step by step, exactly as Legion navigated it
Time to evidence Automatic evidence creation 20–30 minutes Automatic evidence creation
Time to understand Significant analyst effort Easy to consume Easy to consume
Consistency Consistent data capture Varies by analyst Consistent data capture

This Is What Accountable AI Looks Like

We've always known what a good investigation looks like. You show your work. You back your conclusions with evidence. You leave a record that someone else can follow. Legion applies that same standard to every automated investigation it runs, without exception and without manual effort. The bar doesn't move because the analyst is an AI. It stays exactly where it's always been.

See Legion Evidence Gathering in action. Request a Demo

AI SOC Accountability Starts With Evidence
April 3, 2026
10
min read

Legion automates evidence gathering during AI-driven investigations, capturing screenshots from live security tools at every step, so every conclusion is backed by visual proof.

Gili Diamant

SOC investigations range widely. Some are highly repeatable: every step defined, every decision documented. These work well and can be fully automated. But some investigations eventually reach a point where that breaks down: where the next step depends on what you just found, and the judgment and intuition to know what it means.

You can see it clearly the moment you try to write it down. Some processes flow neatly from start to finish. But as soon as you move into more complex investigations, the cracks appear. You find yourself pulled into a spiral of edge cases, tool variations, and fallback paths. You add branches. Then branches on branches. And after all that effort, you almost always end up in the same place: where no rule applies, and only judgment, reasoning and intuition can take you further.

The Part You Can Never Quite Capture

SOC investigations don't all look the same. Some are fully deterministic: a user notification when an outgoing email gets blocked, no reasoning required. For these, consistency matters. The same steps, the same outcome, every time. Others are the opposite: novel threats with no fixed path, no known pattern, where only experience, intuition, and judgment can tell you what to do next. And many fall somewhere in between, where you start with structure and hit a point where judgment has to take over.

But even those flows have a ceiling. Take a phishing investigation. You can document the triage steps pretty cleanly: check the sender, analyze the headers, detonate the attachment, check the URLs. That part is routine and capturable. But the moment you find something suspicious, the investigation shifts. Now you need to reason about scope: is this part of a campaign, and who else was hit? That question has no fixed answer. You might search for other emails with the same subject, but any decent campaign will vary the lures across targets, changing subjects, sender names, and payload links to evade detection. You cannot match on a single field and call it done. You need to iterate: follow one thread, see what it reveals, adjust your search, go again. You are reading the environment in real time, making judgment calls at every step based on what the last one uncovered.

Those judgment points show up on every shift, on every alert that goes beyond the routine. Someone has to reason through them in the moment, with whatever context they have, under whatever pressure exists right now. Until 3am. Until a less experienced analyst picks it up. Until alert volume means there simply isn't time to think it through properly.

That reasoning is not pre-programmed. It emerges from the finding itself. It is what a senior analyst does instinctively, and until now there has been no way to replicate it at scale. Legion Investigator is built for that moment.

Your Environment. Your Logic. Your Investigator.

Legion Investigator is a goal-oriented AI agent that sits inside your investigation workflow at exactly the moments where reasoning takes over from execution, extending Legion's coverage across the full spectrum of SOC investigations, from fully deterministic workflows to complex open-ended investigations. You define its goal, you choose which tools and actions it is permitted to use, and you decide where it acts autonomously and where it checks in first. 

Which category a given investigation falls into is sometimes obvious. But often it is a deliberate choice, one that should be yours to make based on your team's needs, your risk tolerance, and how much consistency versus flexibility the situation calls for. Where on that spectrum each investigation runs is yours to decide. Every boundary is one you set in advance and can trust will be respected. This is what makes Investigator the kind of AI enterprises can actually adopt: not just powerful, but designed from the ground up to operate within your constraints, your processes, and your level of trust.

Most AI SOC tools bring their own model of how investigations should work. Legion Investigator learns from how yours actually do. It builds its understanding from your team's recorded investigation sessions, the decisions they make, the paths they take, and the patterns that emerge across real cases in your environment. Over time, Legion builds a structured knowledge base specific to your organization, capturing your processes, your tooling, and your team's accumulated expertise. That knowledge is not just stored. It is actively used to improve your captured workflows and feeds directly into how Investigator reasons, prioritizes, and investigates.

And when we say your tools, we mean all of them. Legion Investigator works the way your analysts work, through the browser, with no integrations and no APIs required. Your SIEM, your EDR, your threat intelligence platforms, your homegrown applications, your legacy dashboards, your on-prem and cloud environments. You don’t rebuild your stack to fit the tool. The tool fits your stack.

The way it works reflects how investigations actually flow. An investigation might start in your SIEM with a set of routine queries, structured, reliable, repeatable. But when it reaches one of those decision points, you hand off to an Investigator with a goal: find the scope of breach, enrich the full context of what we have so far, identify what else was impacted across endpoints and cloud assets.

The Investigator takes that goal and works toward achieving it. It invokes the relevant tools, interprets what comes back, recalculates what to do next, and invokes again. It keeps going, step by step, until the goal is met. Not a single tool call with a result handed back to you. A full reasoning loop that runs until the work is done, across your security tools, your homegrown applications, and any AI agents already running in your environment. Investigator acts as the orchestrator, pulling in whatever is needed to get there.

Multiple Investigators can work together across a single investigation. One handles enrichment. Another determines scope of breach. A third drives containment based on what was actually found, not what was anticipated when the playbook was written.

And because trust matters, Investigator operates within guardrails. It works only with the tools and actions it’s been given permission to use. For anything higher risk, it asks before acting. You stay in control by setting the boundaries in advance and knowing they’ll be respected.

What This Changes

Legion Investigator opens up three things that weren't possible before.

Pick up where deterministic processes end 

For investigations where you have structured steps, you can now embed an Investigator at exactly the points where structure runs out. The routine parts stay routine.The investigator reasons further, and by the time you step in, the groundwork is already done.

Handle your long tail of alerts

For the long tail of investigations where you never had a well-defined flow to begin with, you can now hand them off end to end. The Investigator handles enrichment before you even open the case, drives containment the moment scope is confirmed, and picks up every judgment point in between.  Give the Investigator the goal, set the guardrails, and let it run. No playbook required.

Every investigation, regardless of how well-defined it is, can now be handled with the depth of your best analyst, on every alert, on every shift. And for the first time, you control where on that spectrum each investigation runs. More structure where consistency matters. More autonomy where judgment, experience, and intuition are required. The balance is yours to set, and yours to change.

This is not about replacing analysts. It never was. There will always be moments that require human judgment, experience, and instinct, and no AI should pretend otherwise. What changes is everything around those moments. The analyst becomes the commander: setting goals, defining boundaries, sending investigators out into the environment to gather, reason, and report back. The calls that matter stay with you. The work that surrounds them no longer has to. Not because we built a smarter AI. Because we built one that learned from you.

AI
Part of the Investigation That No Playbook Can Capture
March 12, 2026
12
min read

Introducing Legion AI Investigator: AI that reasons where playbooks can't. Define the goal, set the guardrails, and let it investigate across your tools — no integrations required.

Ely Abramovitch