Legion and Optiv Partner to Deliver Agentic Security That Understands How Enterprises Work

Legion Security is now an Optiv Authorized Partner. Enterprise security teams can now deploy agentic AI for security operations that understands and optimizes agentic workflows without integrations, black boxes, or asking teams to change how they work.

Marcia Dempster

Demand for agentic security that actually works in complex enterprise environments has never been higher, and today we're excited to take a meaningful step forward in meeting it.

We're excited to announce that Legion Security has partnered with Optiv as an Authorized Partner to help enterprises stop talking about the same old problem and start putting AI to work. Security teams are under pressure that doesn't need a lot of explaining. Analysts, engineers, and practitioners are being asked to do more with less: more alerts, more tools, more threat surface, and fewer people to manage it all. AI was supposed to be the great equalizer, and the promise of the AI SOC was compelling: automate the noise, free up your people, let machines handle the volume.

The reality has been… more complicated.

Most AI security tools were built generically for a generic security team in a generic enterprise. One problem with this is… what is an average security team? Every large organization has processes that are entirely their own: workflows built around a specific stack, custom tools that were built and tuned over long stretches, tribal knowledge accumulated over years, investigation procedures tuned to their environment, their risk tolerance, their regulators, their customers.

Heavy API integrations try to stitch it together but end up slow, brittle, and context-poor (at best). And agents that operate inside a black box create exactly the kind of trust deficit that makes security leaders hesitate to hand anything off at all.

This is the gap Legion was built to close.

A Different Approach to Agentic Security in the Enterprise

The premise of Legion is straightforward: nobody knows your security operations like you do. Our platform doesn't arrive with assumptions about how your team should work. Instead, it observes and learns from how your team actually works, across your tools, your workflows, your most repetitive processes, and your most bespoke ones, and then uses that knowledge to build optimized AI agents that operate within the context of your organization.

We don’t require integrations for full contextual awareness. We’re an open book (no black box) that leans on our browser-based approach to see what your analysts see and do, learns what they know, and earns YOUR trust before taking action.

The result is agentic security that can actually scale in the enterprise — not by replacing how teams work, but by amplifying it.

The Imperative for Partnering with Optiv

Becoming an Optiv Authorized Partner matters because of what Optiv represents to the enterprise security buyer. Optiv works with organizations that have mature, complex security programs, exactly the kind of environment where Legion's approach of learning from bespoke processes is most valuable.

Enterprise security leaders look to trusted advisors to help them evaluate fit, plan implementation, and optimize outcomes over time. Optiv's position in the market as an integrator with deep relationships and deep domain expertise makes it uniquely positioned to bring best-in-breed solutions to the organizations that need them most and to help those organizations get maximum value from them.

This partnership reflects something we're hearing consistently in the market: enterprises want agentic security, but they want it on their terms. They want AI that understands their environment before it acts in it. They want partners who can help them think through where automation should start, how to build confidence in the system over time, and how to expand from their first use cases into a broader program.

That's exactly what this partnership is designed to deliver.

What It Signals More Broadly

The Optiv partnership is a data point in a larger trend. Channel partners (the integrators, MSSPs, and advisors who sit closest to enterprise security buyers) are increasingly being asked about agentic security. Their clients want to know what's real, what's ready, and what actually works in complex environments.

For Legion, this is an important milestone in building the ecosystem that enterprise agentic security requires. We're grateful to the Optiv team for their partnership and excited about what we'll build together. And for enterprise security leaders who have been watching the agentic security space and wondering what a path to trusted AI adoption actually looks like, we'd love to show you.

Interested in learning how Legion Security and Optiv can help your organization automate, scale, and elevate your security posture? Get in touch.

Legion Security is Now Available on Google Cloud Marketplace

Security operations were built around human investigators. Skilled analysts, working manually across dozens of tools, piecing together evidence, making judgment calls, closing cases. But as alert volumes outpaced human capacity, institutional knowledge became a bottleneck, and the complexity of the modern enterprise made scaling impossible. The industry responded with more headcount, more tools, more automation. None of it solved the fundamental problem.

Legion introduces a different operating model entirely. 

What Legion Does

Legion observes how your analysts operate when running real investigations, learning your organizational context, tools, past cases, playbooks, runbooks, and all other tribal knowledge in order to understand what an optimal investigation looks like for your environment. This is then turned into an easily editable and auditable workflow that can be automated when you’re ready. Powered by Google Cloud's Gemini models, each workflow is executed by AI agents that reason through the evidence, provide a verdict, and can even remediate. All of this is accomplished with no manual playbook writing and no need to document predefined rules.

But Legion goes well beyond workflow creation. As Legion builds trust in its performance, teams can choose to keep a human in the loop to approve every decision or have Legion operate fully autonomously, reducing MTTR and eliminating MTTA. This allows analysts to focus on the more novel investigations that are becoming increasingly common in the world of AI.

Memory: The Compounding Advantage

Every investigation Legion conducts makes it smarter. A persistent memory layer continuously captures context from previous cases, your SOC knowledge base, and direct analyst feedback, feeding all of it back into future investigations and decisions. Institutional knowledge that once lived in the heads of your most experienced analysts becomes a permanent, improving organizational asset. The more Legion works, the better it gets. That's not a feature. That's a compounding strategic advantage.

Zero Integrations. Immediate Value.

Most security automation platforms fail at the same hurdle: integrations. Enterprises face months of API work, custom connectors, and professional services before anything runs in production, or are forced to adopt entirely new tools and processes, something most complex enterprises simply can't do.

Legion operates natively in the browser, which means it works across your entire security stack, from threat intel platforms to legacy internal tools, without any API configuration. If your analysts can open it in a browser, Legion can learn from it, generate workflows from it, and execute investigations through it.

Proven Results at Scale

The impact Legion delivers isn't theoretical:

As the head of Security at Virgin Money put it, Legion is “like evolving from handcrafted systems to precision manufacturing aligned to our flow (except) faster, repeatable and secure”.

Legion works with the world's largest enterprises and delivers strong results:

  • A large insurance organization automated 24,000 investigations and cut mean time to respond from 20 minutes to 2 minutes.
  • WELL Health Technologies reduced investigation times by 81%, allowing existing analysts to handle significantly higher alert volumes without additional headcount.
  • The University of Tulsa cut investigation times in half, enabling their team to overcome capacity limits with the staff they already had.

Across deployments, Legion reduces mean time to investigate by up to 85% and response times by up to 90%.

Built on Google Cloud

Legion's integration with Google Cloud goes deeper than the Marketplace listing. The platform runs on Google Cloud infrastructure and leverages Gemini models to power its AI reasoning, combining Legion's browser-native architecture with Google Cloud's security, scale, and model quality.

For organizations already invested in Google Cloud and Google SecOps, Legion extends that ecosystem directly into the analyst workflow.

Who It's For

Legion is purpose-built for enterprise security operations teams: CISOs, VPs of Information Security, SOC Directors, and Security Operations Managers at organizations running in-house SOCs. If your team is dealing with any of the following, Legion was built for you:

  • Alert volumes that have outpaced your team's capacity
  • Analyst burnout from manual, repetitive investigation work
  • Institutional knowledge that walks out the door when senior analysts do
  • Automation gaps caused by complex integration requirements

Available Now on Google Cloud Marketplace

Legion Security is available today on Google Cloud Marketplace, allowing customers to apply their spend toward their annual Google contract and simplify procurement. For security teams ready to move beyond the limits of traditional operations, this is where that transformation begins.

Engineering
Legion Security Is Now Available on Google Cloud Marketplace
May 31, 2026
12
min read

Legion is officially on the Google Cloud Marketplace.

Gili Diamant

Introduction

When LLM-based agents are used to automate multi-step security investigations, a typical workflow begins with an alert - say, a reported phishing email - and the agent iteratively queries tools such as Microsoft Defender Threat Explorer, Splunk, or CrowdStrike to gather evidence, assess scope, and recommend containment actions.

At each step, the agent receives query results containing raw IOCs: sender addresses, embedded URLs, source IPs, recipient domains, and device hostnames. It must reason about these indicators, decide whether to refine its search or conclude the investigation, and return its findings as structured output.

This works well for short investigations. But as the number of steps grows, an issue emerges: the agent's context window fills with repeated, verbose indicator values, the model begins echoing raw IOCs inconsistently, and the structured outputs it produces become increasingly fragile.

[Figure 1: High-level architecture of an AI-driven investigation agent.]

Consider a phishing investigation that proceeds through four steps:

  1. Initial query: Search for emails from a reported sender to a specific recipient. The results contain the sender's email address and a handful of URLs.
  2. Scope expansion: Search for all emails from the same sender across the organization. The results return 22 emails with SharePoint URLs, tracking links, and font-file references.
  3. URL analysis: Search by specific URLs found in step 2. Additional domains and redirects surface.
  4. Conclusion: The agent summarizes its findings and lists all relevant IOCs.

By step 4, the agent's prompt contains the full history of steps 1 through 3 - including every raw URL, email address, and domain mentioned in each step's results and the agent's own reasoning. Some of these URLs are long tracking links with base64-encoded parameters, easily exceeding 200 characters each.

This accumulation created three concrete problems.

  1. Token bloat. Raw IOC values, particularly URLs with tracking parameters and encoded payloads, consumed a disproportionate share of the context window. A single newsletter email might contain 30+ URLs, each repeated in the query results, the agent's reasoning, and the indicators list, tripling the token cost per IOC, per step.
  2. Over-reporting. When asked to list relevant indicators, the model would frequently dump every IOC it had ever seen into the response - even when the current step involved only one or two. In one case, an agent listed all 145 email addresses from its registry when the current query concerned a single sender.
  3. Structural fragility. Query results from security tools sometimes contained comma-separated URL lists embedded in strings. When the model attempted to reproduce these in its JSON output, it produced malformed structures - unescaped commas, broken string boundaries, and invalid nesting. In our baseline evaluation, only approximately 80% of model responses parsed as valid JSON.

Our Approach

We addressed these problems with a three-part system:

  1. A unified IOC manager that extracts and indexes indicators.
  2. An IOC prompt adjustment that instructs the model on how to use indexed references.
  3. A preprocessing step that cleans malformed tool output before it reaches the model.

IOC Extraction and Indexing

The core of the system is an IOC manager that maintains a registry of all indicators encountered during an investigation. When new text enters the pipeline - whether from tool query results or from the agent's own prior reasoning - the manager scans it using a set of type-specific patterns covering URLs, email addresses, IPv4 addresses, file hashes, hostnames, and domains.

Each newly discovered IOC is assigned a compact symbolic reference following a consistent naming convention: the first email becomes EMAIL01, the first URL becomes URL01, the second domain becomes DOMAIN02, and so on. The original value is stored in the registry, and all occurrences in the text are replaced with the corresponding reference.

Extraction order matters. URLs are processed first, because a URL contains both a domain and potentially an IP address. By extracting URLs before domains and IPs, we prevent the system from fragmenting a single indicator into multiple overlapping entries. 

The manager also performs selective extraction. In a typical prompt, the first section contains static task instructions - tool descriptions, output format specifications, and investigation guidelines. IOC extraction is applied only to the dynamic sections (step history and query results), leaving instruction text unchanged. This prevents false positives from example IOCs embedded in the prompt template.

Deduplication is handled through a value-to-reference mapping. If the same IOC appears in step 1 and again in step 3, it receives the same reference both times, ensuring consistent tracking across the entire investigation.
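The extraction, ordering, and deduplication behavior described above can be sketched roughly as follows. This is a minimal illustration, not the production IOC manager: the class name `IOCRegistry`, the method names, and the regular expressions are all assumptions, and the patterns cover only four of the IOC types mentioned (file hashes and hostnames are omitted for brevity).

```python
import re

# Hypothetical, simplified patterns. Order matters: URLs are claimed first so
# the domains and IPs inside them are not extracted as separate indicators.
# The domain pattern in particular is naive and would match things like
# "report.pdf"; real patterns would need to be far more careful.
PATTERNS = [
    ("URL", re.compile(r"https?://[^\s,\"'<>]+")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("DOMAIN", re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)),
]


class IOCRegistry:
    """Maps raw IOC values to compact references such as EMAIL01 or URL02."""

    def __init__(self):
        self.value_to_ref = {}  # deduplication: same value, same reference
        self.ref_to_value = {}  # lookup table shown to the model later
        self.counters = {}      # next index per IOC type

    def reference_for(self, ioc_type, value):
        # Reuse the existing reference if this value was seen in any earlier step.
        if value in self.value_to_ref:
            return self.value_to_ref[value]
        self.counters[ioc_type] = self.counters.get(ioc_type, 0) + 1
        ref = f"{ioc_type}{self.counters[ioc_type]:02d}"
        self.value_to_ref[value] = ref
        self.ref_to_value[ref] = value
        return ref

    def index_text(self, text):
        # Apply each pattern in priority order, replacing matches in place.
        for ioc_type, pattern in PATTERNS:
            text = pattern.sub(
                lambda m, t=ioc_type: self.reference_for(t, m.group(0)), text
            )
        return text


registry = IOCRegistry()
step1 = registry.index_text(
    "Email from alice@example.com linked to https://example.com/login?id=1 and 10.0.0.5"
)
step3 = registry.index_text("Second mail from alice@example.com via evil.example.net")
```

In this sketch, selective extraction would amount to calling `index_text` only on the step history and query results, and concatenating the untouched instruction text afterwards.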

[Figure 3: The IOC extraction pipeline.]

IOC Prompt Adjustment

Extraction alone is not sufficient. Even when the input prompt uses symbolic references, the model may revert to generating raw IOC values in its output, particularly if the system prompt or prior conversation history contains raw values, or if the model has seen the actual value during context processing.

To address this, we developed an IOC prompt adjustment, a compact structured appendix appended to the user prompt that explicitly instructs the model on how to handle IOCs. The adjustment establishes three rules:

  1. In reasoning: Always use symbolic references. Never write raw email addresses, URLs, IP addresses, or domains. Instead of writing a raw sender address followed by a description of the campaign, use the corresponding reference identifier throughout.
  2. In the indicators field: Distinguish between known and new IOCs. For indicators already present in the registry, use the symbolic reference. For indicators being reported for the first time, newly discovered in the current step's results, use the actual value, so it can be added to the registry for subsequent steps.
  3. Relevance filtering: Only include IOCs that are directly relevant to the current investigation step. Do not copy all registry entries into every response.

The IOC prompt adjustment includes a populated copy of the current IOC registry, mapping each reference to its actual value, so the model can look up identifiers when constructing its reasoning. It also provides correct and incorrect examples, a validation checklist, and explicit rejection criteria.
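To make the shape of such an appendix concrete, here is a hedged sketch of how one might be assembled from a registry. The function name `build_ioc_prompt_adjustment`, the section titles, and the exact rule wording are illustrative assumptions; the actual appendix also includes examples, a checklist, and rejection criteria that are not reproduced here.

```python
def build_ioc_prompt_adjustment(ref_to_value):
    """Return a text appendix with the three rules and the current registry.

    ref_to_value maps references (e.g. "EMAIL01") to raw IOC values.
    The wording below is a placeholder, not the production prompt.
    """
    rules = [
        "1. In reasoning, use only symbolic references (e.g. EMAIL01); "
        "never write raw IOC values.",
        "2. In the indicators field, use the reference for IOCs already in "
        "the registry and the actual value for newly discovered IOCs.",
        "3. Include only IOCs directly relevant to the current step; "
        "do not copy the whole registry.",
    ]
    registry_lines = [
        f"{ref} = {value}" for ref, value in sorted(ref_to_value.items())
    ]
    return "\n".join(
        ["IOC HANDLING RULES", *rules, "", "IOC REGISTRY", *registry_lines]
    )


appendix = build_ioc_prompt_adjustment(
    {"URL01": "https://example.com/login", "EMAIL01": "alice@example.com"}
)
user_prompt = "...step history and query results..." + "\n\n" + appendix
```

Appending the block to the user prompt, rather than the system prompt, follows the description above; where it is placed in practice is a design choice that may vary by model.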

We tested two versions of the IOC prompt adjustment: a comprehensive version with extensive examples and redundant emphasis, and an optimized version that distills the same rules more concisely. Both achieved equivalent compliance rates, suggesting that clarity of instruction matters more than volume of repetition.

Input Preprocessing

The third component addresses a problem upstream of the model: malformed tool output. Security tool APIs sometimes return URL lists as comma-separated values within a single string field, rather than as properly structured arrays. When passed through to the model as-is, these malformed strings caused structured output generation failures.

Our preprocessing step detects comma-separated URL patterns in query results and reformats them into clean, numbered lists before the text reaches the model. This small transformation, applied before IOC extraction, resolved the structured output validity issue independently of the other components.
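A minimal sketch of this kind of preprocessing is shown below. The function name `split_url_lists` and the regular expression are assumptions for illustration; in particular, this version would wrongly split a single URL that itself contains a comma.

```python
import re

# Two or more http(s) URLs separated by commas (optionally with whitespace).
URL_LIST = re.compile(r"https?://[^\s,]+(?:\s*,\s*https?://[^\s,]+)+")


def split_url_lists(text):
    """Rewrite comma-separated URL runs as a numbered list, one URL per line."""

    def to_numbered(match):
        urls = [u.strip() for u in match.group(0).split(",")]
        numbered = "\n".join(f"{i}. {u}" for i, u in enumerate(urls, 1))
        return "\n" + numbered + "\n"

    return URL_LIST.sub(to_numbered, text)


raw = "urls: https://a.example/x,https://b.example/y, https://c.example/z"
cleaned = split_url_lists(raw)
```

A lone URL is left untouched, because the pattern only fires when at least one comma-separated URL follows the first.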

Evaluation

Setup

We evaluated the system using 10 real-world investigation traces captured from production. Each trace represents a complete phishing investigation conducted through several security tools, containing the system prompt, user prompt with step history, and the raw query results that the model must reason about.

For each trace, we ran 10 iterations with the same prompt configuration, measuring two metrics:

  • JSON validity: Whether the model's response parsed as valid structured output.
  • IOC reference compliance: Whether the response used symbolic references exclusively in its reasoning field (no raw IOC values) and correctly distinguished between known references and new actual values in its indicators field.
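The two metrics could be computed per response along the following lines. This is a sketch under stated assumptions: the JSON keys `reasoning` and `indicators`, the function name `score_response`, and the detection patterns are hypothetical, and the raw-IOC check is deliberately simple.

```python
import json
import re

# Simplified detectors for raw values (URL, email, IPv4) and for references.
RAW_IOC = re.compile(
    r"https?://\S+|[\w.+-]+@[\w-]+(?:\.[\w-]+)+|\b(?:\d{1,3}\.){3}\d{1,3}\b"
)
REFERENCE = re.compile(r"^(?:URL|EMAIL|IP|DOMAIN|HASH|HOST)\d{2,}$")


def score_response(raw_response, registry):
    """Score one model response; registry maps reference -> raw value."""
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        return {"json_valid": False, "ioc_compliant": False}
    if not isinstance(parsed, dict):
        return {"json_valid": True, "ioc_compliant": False}

    # Reasoning must contain no raw IOC values at all.
    reasoning_ok = RAW_IOC.search(parsed.get("reasoning", "")) is None

    # Indicators: references must exist in the registry, and raw values are
    # only acceptable when they are new (not already registered).
    known_values = set(registry.values())
    indicators_ok = all(
        (item in registry) if REFERENCE.match(item) else (item not in known_values)
        for item in parsed.get("indicators", [])
    )
    return {"json_valid": True, "ioc_compliant": reasoning_ok and indicators_ok}


registry = {"EMAIL01": "alice@example.com", "URL01": "https://example.com/login"}
good = json.dumps({
    "reasoning": "EMAIL01 sent URL01 to one recipient.",
    "indicators": ["EMAIL01", "https://new.example/path"],
})
result = score_response(good, registry)
```

Because compliance is reported as a single boolean, this mirrors a binary metric: one raw value anywhere in the reasoning marks the whole response as non-compliant.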

We tested five configurations, including the baseline, to isolate the contribution of each component.

Results

| Configuration | JSON Validity | IOC Compliance |
|---|---|---|
| Baseline (no IOC system) | ~80% | 0% |
| IOC Prompt Adjustment only | 100% | 100% |
| IOC Manager + Prompt Adjustment | 100% | 100% |
| URL Cleaning only (no prompt adjustment) | ~100% | 0% |
| Full system (Manager + Prompt Adjustment + URL Cleaning) | 100% | 100% |

JSON validity. The baseline configuration produced valid JSON in approximately 80% of responses. Adding URL preprocessing alone brought this to nearly 100%, confirming that malformed tool output - not model capability - was the root cause of parsing failures. All configurations that included the IOC prompt adjustment achieved 100% validity.

IOC compliance. Without the IOC prompt adjustment, the model never spontaneously adopted symbolic references: compliance was 0% regardless of whether the input text had been processed by the IOC manager. With the prompt adjustment, compliance jumped to 100% across all traces and iterations. This held for both the comprehensive and optimized prompt adjustment variants.

Component independence. The results reveal a clean separation of concerns: URL preprocessing fixes JSON validity, the IOC prompt adjustment fixes IOC compliance, and the IOC manager provides the underlying registry and extraction infrastructure that makes both possible.

Qualitative Observations

Beyond the quantitative metrics, we observed several qualitative improvements:

  • Reduced prompt size. Replacing verbose URLs (some exceeding 200 characters) with compact references meaningfully reduced token consumption in the step history, particularly for investigations involving newsletter or marketing emails with numerous tracking links.
  • Consistent cross-step tracking. The registry ensured that the same IOC received the same reference throughout the investigation, which is particularly helpful for IOCs that recur across multiple steps.
  • Focused indicator reporting. With the prompt adjustment's relevance-filtering instruction, the model stopped dumping entire registries into its responses. Indicator lists became proportional to the current step's scope rather than the investigation's total history.

Discussion

Why Extraction Without the IOC Prompt Adjustment Fails

A natural question is why input-side extraction alone does not work. If the prompt already contains a symbolic reference instead of a raw email address, why does the model still generate raw values in its output?

The answer lies in how LLMs process context. The model has access to the full prompt, including sections where the actual IOC value may still appear: task-specific inputs, quoted alert descriptions, or the registry itself. More fundamentally, the model's training distribution contains overwhelmingly more examples of raw IOC values than of symbolic reference systems. Without explicit instruction, the model defaults to the more familiar pattern.

This finding has a broader implication for LLM-based agent design: transforming the input is necessary but not sufficient when you need the model to adopt a non-default output convention. Explicit behavioral instruction, the IOC prompt adjustment, bridges the gap.

Limitations

Our evaluation has several limitations worth noting. All traces were drawn from a single investigation type (phishing via specific security tools). While the IOC types encountered are representative of broader security operations, the system has not yet been evaluated on other investigation types.

The evaluation was conducted with a single model (GPT-4.1). Different models may exhibit different compliance characteristics, and the prompt mechanism may need tuning for models with different instruction-following tendencies.

Finally, our compliance metric is binary - a response either uses references correctly or it does not. A more granular metric could capture partial compliance and might reveal subtler performance trends across model versions or investigation complexities.

Conclusion

We presented an IOC indexing system for AI-driven security investigations that addresses three interrelated problems: token bloat from repeated raw indicator values, inconsistent IOC tracking across investigation steps, and structural fragility in model-generated structured outputs.

The system combines automated IOC extraction with symbolic reference assignment, explicit behavioral guidance through prompt engineering, and input preprocessing to handle malformed tool output. Across 100 evaluation runs on 10 production investigation traces, the full system achieved 100% JSON validity and 100% IOC reference compliance, up from approximately 80% and 0% respectively at baseline.

The key insight is that managing IOCs in the context of LLM-based agents requires intervention at both the input and output stages. Extraction and indexing normalize the input, but only explicit prompt-level guidance ensures the model adopts the reference convention in its generated output. Neither component alone is sufficient; together, they eliminate the problem entirely.

As SOC automation platforms handle increasingly complex, multi-step investigations, structured approaches to managing the information that flows through the agent's context window become essential. IOC indexing is one instance of a more general pattern: giving the agent a well-organized working memory that scales with investigation complexity rather than against it.

Cybersecurity
IOC Chaos
May 26, 2026
12
min read

When LLMs investigate security alerts autonomously, they encounter dozens, sometimes hundreds, of IOCs: emails, URLs, IPs, domains, and hostnames. Left unmanaged, these inflate token costs, produce inconsistent references, and break structured output. Our IOC indexing system replaces raw indicators with compact symbolic references the model reuses throughout its reasoning. Across 100 evaluation runs, it took JSON validity from ~80% to 100% and IOC reference compliance to 100%.

Nadav Kahanowich

Abstract & Data Summary

We gathered and manually annotated a dataset of 196 hard triage decisions from real-world security investigations, covering a wide range of outcomes, including benign, malicious, and false positives. After cleaning the dataset by removing mock runs and cases with missing information or incorrect workflow execution, the remaining 163 examples were grouped into use case categories to form a high-quality cohort. We then evaluated LLMs on the dataset overall and per use-case category and found that Gemini 3 Pro performs best overall, though the best LLM varies by use case category. 

Model performance by use case category:

| Use case category | Best model(s) |
|---|---|
| Phishing | Gemini 3 Pro |
| Account Takeover | Sonnet 4 / GPT-4.1 |
| Network | Opus 4.5 / GPT-4.1 |
| Overall | Gemini 3 Pro |

If you’d like to understand our full research methodology, read on.

*Note: since this blog was authored, several new model families have been released. While the results have remained broadly stable, particularly among the best and worst performers, updated research may be required for a nuanced understanding of the performance differences amongst the rest.

Data Collection

The dataset was constructed from security investigations from eight US-based customers. The evaluation was conducted in a secure, federated way, without mixing customer data and reporting only summary statistics from each customer tenant.

To create a challenging evaluation, we over-weighted cases in which the analyst disagreed with the model, so the error rate reported here is inflated.

The investigations were conducted automatically according to predefined, customer-specific workflows, each of which contained at least one triage decision node. A triage decision node is a decision point within a workflow, where an LLM chooses a decision from among a list of provided decision options, given the information that was gathered in the workflow up until that point.  

At each decision node, the LLM used in production selected a classification decision from a list of workflow-specific decision options and provided the reasoning for its decision, based on a summary of the steps completed until that point in the investigation.

For each investigation containing at least one decision node, we collected the following information from production session logs:

  • A summary of the workflow steps up until the decision node, including tool name, step description, and step outputs
  • Organization-specific knowledge, written by the customer and containing a title, description, and data
  • The set of available decision options at the decision node
  • The model's selected decision in production, as well as the reasoning and detailed reasoning for the decision
  • The decision option selected by the customer
  • Feedback text written by the customer for the decision

[Figure: Example workflow diagram.]

Quality Control

An expert cybersecurity analyst annotated the 196 decision examples with reasoning tags to explain the production and customer decisions, and to label whether each disagreement is explained by analyst error, mistaken reasoning by the AI, or missing data or steps in the workflow.

| Term | Definition |
|---|---|
| Good Reasoning | The LLM reasoned correctly about the decision options given the input data |
| Bad Reasoning | The LLM reasoned incorrectly about the decision options given the input data |
| Workflow Ran Correctly | The workflow had complete inputs and outputs, and did not get interrupted |
| Workflow Ran Incorrectly | The workflow had empty or partial data, or was interrupted |
| Customer was Aligned | Customer agreed with the expert analyst decision |
| Customer was Mistaken | Customer disagreed with the expert analyst decision |
| Missing Information | The LLM made the correct decision given the available information, but information relevant to making the decision was missing from its input |

Examples tagged with "Workflow ran correctly but missing information" or "Workflow ran incorrectly" were removed from the dataset. Two additional examples with the use case titled "Workshop" were removed, as these were mock runs. For the remaining examples, the workflow ran correctly and was not missing information.
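The filtering rule can be expressed as a small predicate over the annotation tags. The sketch below is illustrative only: the `DecisionExample` structure, its field names, and the toy records are assumptions, not the actual dataset schema or data.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionExample:
    example_id: str
    use_case: str
    tags: set = field(default_factory=set)


# Tags and use-case titles that disqualify an example, per the rule above.
EXCLUDED_TAGS = {"Workflow Ran Incorrectly", "Missing Information"}
MOCK_USE_CASES = {"Workshop"}


def keep(example):
    """True if the workflow ran correctly, had full information, and is not a mock run."""
    has_excluded_tag = bool(example.tags & EXCLUDED_TAGS)
    return not has_excluded_tag and example.use_case not in MOCK_USE_CASES


def filter_dataset(examples):
    return [e for e in examples if keep(e)]


# Toy records for illustration only.
toy = [
    DecisionExample("a", "Phishing", {"Workflow Ran Correctly", "Good Reasoning"}),
    DecisionExample("b", "Phishing", {"Workflow Ran Correctly", "Missing Information"}),
    DecisionExample("c", "Network", {"Workflow Ran Incorrectly"}),
    DecisionExample("d", "Workshop", {"Workflow Ran Correctly", "Good Reasoning"}),
]
kept = filter_dataset(toy)
```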

Triage Decision Distribution

By Label

Across the filtered dataset, the workflows contained 27 distinct normalized decision labels, which we grouped into the following buckets: False Positive, True Positive, Requires Review, and Other. The distribution of the labels is shown below: 

[Figure: Distribution of tags in triage dataset. False Positive 91, Requires Review 32, True Positive 27, Other 13.]

The final evaluation dataset contains data from eight customers. The table below shows the number of annotated decision examples per customer and the tools used in each environment.

| Customer | Tools used in each environment | # of decisions made |
|---|---|---|
| SOC Environment #1 | Defender, CrowdStrike, Splunk, VirusTotal, AbuseIPDB, URLScan | 48 |
| SOC Environment #2 | IP Quality Score, Zscaler, Confluence, ServiceNow, Splunk, Proofpoint TAP, Microsoft Entra, Cortex XSOAR, VMRay, Wiz, VirusTotal, URLScan | 32 |
| SOC Environment #3 | Defender, Google SecOps, Silent Push, Microsoft Entra, IPLocation, Axonius, IPinfo, Nodedata | 29 |
| SOC Environment #4 | Confluence, ServiceNow, Excel Online, Cortex XSOAR, VirusTotal, AbuseIPDB, URLScan | 27 |
| SOC Environment #5 | Defender, Zscaler, Cortex XSOAR, Microsoft Entra, Abnormal Security, IPLocation, VirusTotal, AbuseIPDB, Shodan, IPinfo | 16 |
| SOC Environment #6 | Defender, CiscoTalos, MxToolBox, AbuseIPDB, TeamDynamix, URLScan | 5 |
| SOC Environment #7 | Splunk, Microsoft Entra, Mimecast, VirusTotal, AbuseIPDB, Spur | 3 |
| SOC Environment #8 | Joe Sandbox, Proofpoint TAP, VirusTotal | 3 |
| Total | | 163 |


Use Case Distribution

We grouped the original use cases into three categories to simplify our findings. Below is the mapping from the consolidated categories to the original use cases, along with the distribution of the dataset over the consolidated categories.

| Use cases | Counts |
|---|---|
| Phishing | 96 |
| Account Takeover | 38 |
| Network/Infrastructure | 26 |

Confusion Matrix

Below is a confusion matrix between the expert analyst annotations and the recommendations our system makes. We prompt the models to be careful and escalate when they are not sure. 

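Confusion matrix: Selection (Actual) vs Recommendation (Predicted)

| Actual \ Predicted | False Positive | True Positive | Requires Review | Other |
|---|---|---|---|---|
| False Positive | 60 (65.9%) | 16 (17.6%) | 15 (16.5%) | 0 (0.0%) |
| True Positive | 2 (6.9%) | 27 (93.1%) | 0 (0.0%) | 0 (0.0%) |
| Requires Review | 1 (3.1%) | 4 (12.5%) | 27 (84.4%) | 0 (0.0%) |
| Other | 1 (7.7%) | 0 (0.0%) | 0 (0.0%) | 12 (92.3%) |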
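The percentages in the confusion matrix are row-normalized: each cell is a count followed by that count's share of all examples with the same actual label. As a quick check, the short sketch below recomputes the shares from the counts (the variable and function names are ours, chosen for illustration).

```python
# Columns are predicted labels in the order:
# False Positive, True Positive, Requires Review, Other.
COUNTS = {
    "False Positive": [60, 16, 15, 0],
    "True Positive": [2, 27, 0, 0],
    "Requires Review": [1, 4, 27, 0],
    "Other": [1, 0, 0, 12],
}


def row_percentages(counts):
    """Convert each row of counts to percentages of that row's total."""
    result = {}
    for actual, row in counts.items():
        total = sum(row)
        result[actual] = [round(100 * n / total, 1) for n in row]
    return result


percentages = row_percentages(COUNTS)
# e.g. percentages["False Positive"] is [65.9, 17.6, 16.5, 0.0]
```

Reading the matrix this way, the largest off-diagonal cells are in the False Positive row, which is consistent with models being prompted to escalate when unsure.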

Results

Over all use cases (including those without a use case name), Gemini 3 Pro had the highest performance at 74.8%, with GPT-4.1 and Opus 4.5 tied for second.

[Figure: Triage performance by model, all use cases.]

Phishing Results:

On the phishing use cases, Gemini 3 Pro performed the best, followed by Opus 4.5. 

[Figure: Triage performance by model, phishing use cases.]

Account Takeover Results:

Sonnet 4 and GPT-4.1 were tied for best on the account takeover use cases. 

[Figure: Triage performance by model, account takeover use cases.]

Network Results:

Opus 4.5 and GPT-4.1 were tied for best on the network use cases.

[Figure: Triage performance by model, network use cases.]

Conclusion & Recommendation

We gathered and annotated 163 triage decisions from real-world security investigations. We characterized the use case distribution and grouped the use cases into common categories. We then benchmarked large language models across each use case category and the full dataset. We found that Gemini 3 Pro performs best overall. Per use case category, Gemini 3 Pro gives the best performance on phishing, Sonnet 4 and GPT-4.1 are tied for best on account takeover, and Opus 4.5 and GPT-4.1 are tied for best on network. Based on our results, we recommend that security teams test models on different scenarios to find the one that works best for their use case. Different models are good at different things, and the only way to know which model works best for your use cases is to run a formal evaluation. Or you can trust us! Our research team at Legion is constantly evaluating new models and improvements to our triage pipelines.

AI
Benchmarking Large Language Models for Automated Security Triage
May 25, 2026
12
min read

We benchmarked leading LLMs on 163 real-world security triage decisions across phishing, account takeover, and network use cases. See which models performed best and why the answer depends on your use case.

Alyssa Mensch