Part of the Investigation That No Playbook Can Capture

Introducing Legion AI Investigator: AI that reasons where playbooks can't. Define the goal, set the guardrails, and let it investigate across your tools — no integrations required.

Ely Abramovitch

March 12, 2026

0 min read

SOC investigations range widely. Some are highly repeatable: every step defined, every decision documented. These work well and can be fully automated. But some investigations eventually reach a point where that breaks down: where the next step depends on what you just found, and the judgment and intuition to know what it means.

You can see it clearly the moment you try to write it down. Some processes flow neatly from start to finish. But as soon as you move into more complex investigations, the cracks appear. You find yourself pulled into a spiral of edge cases, tool variations, and fallback paths. You add branches. Then branches on branches. And after all that effort, you almost always end up in the same place: where no rule applies, and only judgment, reasoning and intuition can take you further.

The Part You Can Never Quite Capture

SOC investigations don't all look the same. Some are fully deterministic: a user notification when an outgoing email gets blocked, no reasoning required. For these, consistency matters. The same steps, the same outcome, every time. Others are the opposite: novel threats with no fixed path, no known pattern, where only experience, intuition, and judgment can tell you what to do next. And many fall somewhere in between, where you start with structure and hit a point where judgment has to take over.

But even those flows have a ceiling. Take a phishing investigation. You can document the triage steps pretty cleanly: check the sender, analyze the headers, detonate the attachment, check the URLs. That part is routine and capturable. But the moment you find something suspicious, the investigation shifts. Now you need to reason about scope: is this part of a campaign, and who else was hit? That question has no fixed answer. You might search for other emails with the same subject, but any decent campaign will vary the lures across targets, changing subjects, sender names, and payload links to evade detection. You cannot match on a single field and call it done. You need to iterate: follow one thread, see what it reveals, adjust your search, go again. You are reading the environment in real time, making judgment calls at every step based on what the last one uncovered.

Those judgment points show up on every shift, on every alert that goes beyond the routine. Someone has to reason through them in the moment, with whatever context they have, under whatever pressure exists right now. Until 3am. Until a less experienced analyst picks it up. Until alert volume means there simply isn't time to think it through properly.

That reasoning is not pre-programmed. It emerges from the finding itself. It is what a senior analyst does instinctively, and until now there has been no way to replicate it at scale. Legion Investigator is built for that moment.

Your Environment. Your Logic. Your Investigator.

Legion Investigator is a goal-oriented AI agent that sits inside your investigation workflow at exactly the moments where reasoning takes over from execution, extending Legion's coverage across the full spectrum of SOC investigations, from fully deterministic workflows to complex open-ended investigations. You define its goal, you choose which tools and actions it is permitted to use, and you decide where it acts autonomously and where it checks in first.

Which category a given investigation falls into is sometimes obvious. But often it is a deliberate choice, one that should be yours to make based on your team's needs, your risk tolerance, and how much consistency versus flexibility the situation calls for. Where on that spectrum each investigation runs is yours to decide. Every boundary is one you set in advance and can trust will be respected. This is what makes Investigator the kind of AI enterprises can actually adopt: not just powerful, but designed from the ground up to operate within your constraints, your processes, and your level of trust.

Most AI SOC tools bring their own model of how investigations should work. Legion Investigator learns from how yours actually do. It builds its understanding from your team's recorded investigation sessions, the decisions they make, the paths they take, and the patterns that emerge across real cases in your environment. Over time, Legion builds a structured knowledge base specific to your organization, capturing your processes, your tooling, and your team's accumulated expertise. That knowledge is not just stored. It is actively used to improve your captured workflows and feeds directly into how Investigator reasons, prioritizes, and investigates.

And when we say your tools, we mean all of them. Legion Investigator works the way your analysts work, through the browser, with no integrations and no APIs required. Your SIEM, your EDR, your threat intelligence platforms, your homegrown applications, your legacy dashboards, your on-prem and cloud environments. You don’t rebuild your stack to fit the tool. The tool fits your stack.

The way it works reflects how investigations actually flow. An investigation might start in your SIEM with a set of routine queries, structured, reliable, repeatable. But when it reaches one of those decision points, you hand off to an Investigator with a goal: find the scope of breach, enrich the full context of what we have so far, identify what else was impacted across endpoints and cloud assets.

The Investigator takes that goal and works toward achieving it. It invokes the relevant tools, interprets what comes back, recalculates what to do next, and invokes again. It keeps going, step by step, until the goal is met. Not a single tool call with a result handed back to you. A full reasoning loop that runs until the work is done, across your security tools, your homegrown applications, and any AI agents already running in your environment. Investigator acts as the orchestrator, pulling in whatever is needed to get there.

Multiple Investigators can work together across a single investigation. One handles enrichment. Another determines scope of breach. A third drives containment based on what was actually found, not what was anticipated when the playbook was written.

And because trust matters, Investigator operates within guardrails. It works only with the tools and actions it’s been given permission to use. For anything higher risk, it asks before acting. You stay in control by setting the boundaries in advance and knowing they’ll be respected.

What This Changes

Legion Investigator opens up three things that weren't possible before.

Pick up where deterministic processes end

For investigations where you have structured steps, you can now embed an Investigator at exactly the points where structure runs out. The routine parts stay routine.The investigator reasons further, and by the time you step in, the groundwork is already done.

Handle your long tail of alerts

For the long tail of investigations where you never had a well-defined flow to begin with, you can now hand them off end to end. The Investigator handles enrichment before you even open the case, drives containment the moment scope is confirmed, and picks up every judgment point in between. Give the Investigator the goal, set the guardrails, and let it run. No playbook required.

Every investigation, regardless of how well-defined it is, can now be handled with the depth of your best analyst, on every alert, on every shift. And for the first time, you control where on that spectrum each investigation runs. More structure where consistency matters. More autonomy where judgment, experience, and intuition are required. The balance is yours to set, and yours to change.

This is not about replacing analysts. It never was. There will always be moments that require human judgment, experience, and instinct, and no AI should pretend otherwise. What changes is everything around those moments. The analyst becomes the commander: setting goals, defining boundaries, sending investigators out into the environment to gather, reason, and report back. The calls that matter stay with you. The work that surrounds them no longer has to. Not because we built a smarter AI. Because we built one that learned from you.

‍

Legion Security is Now Available on Google Cloud Marketplace

Security operations were built around human investigators. Skilled analysts, working manually across dozens of tools, piecing together evidence, making judgment calls, closing cases. But as alert volumes outpaced human capacity, institutional knowledge became a bottleneck, and the complexity of the modern enterprise made scaling impossible. The industry responded with more headcount, more tools, more automation. None of it solved the fundamental problem.

Legion introduces a different operating model entirely.

What Legion Does

Legion observes how your analysts operate when running real investigations, learning your organizational context, tools, past cases, playbooks, runbooks and all other tribal knowledge in order to understand what an optimal investigation looks like for your environment. This is then turned into an easily editable and audible workflow which can be automated when you’re ready. Powered by Google Cloud's Gemini models, each workflow is executed by AI agents that reason through the evidence and provide a verdict and even remediate. This is all accomplished with no manual playbook writing or need to document predefined rules.

But legion goes well beyond workflow creation. As Legion builds trust in its performance, teams can choose to keep a human in the loop to approve every decision or have Legion operate fully autonomously reducing MTTR eliminating MTTA, allowing analysts to focus on more novel investigations that are becoming more and more common in the world of AI.

Memory: The Compounding Advantage

Every investigation Legion conducts makes it smarter. A persistent memory layer continuously captures context from previous cases, your SOC knowledge base, and direct analyst feedback, feeding all of it back into future investigations and decisions. Institutional knowledge that once lived in the heads of your most experienced analysts becomes a permanent, improving organizational asset. The more Legion works, the better it gets. That's not a feature. That's a compounding strategic advantage.

Zero Integrations. Immediate Value.

Most security automation platforms fail at the same hurdle: integrations. Enterprises face months of API work, custom connectors, and professional services before anything runs in production, or are forced to adopt entirely new tools and processes, something most complex enterprises simply can't do.

Legion operates natively in the browser, which means it works across your entire security stack, from threat intel platforms to legacy internal tools, without any API configuration. If your analysts can open it in a browser, Legion can learn from it, generate workflows from it, and execute investigations through it.

Proven Results at Scale

The impact Legion delivers isn't theoretical:

As the head of Security at Virgin Money put it, Legion is “like evolving from handcrafted systems to precision manufacturing aligned to our flow (except) faster, repeatable and secure”.

Legion works with the worlds largest enterprises and delivers strong results:

A large insurance organization automated 24,000 investigations and cut mean time to respond from 20 minutes to 2 minutes.
WELL Health Technologies reduced investigation times by 81%, allowing existing analysts to handle significantly higher alert volumes without additional headcount.
The University of Tulsa cut investigation times in half, enabling their team to overcome capacity limits with the staff they already had.

Across deployments, Legion reduces mean time to investigate by up to 85% and response times by up to 90%.

Built on Google Cloud

Legion's integration with Google Cloud goes deeper than the Marketplace listing. The platform runs on Google Cloud infrastructure and leverages Gemini models to power its AI reasoning, combining Legion's browser-native architecture with Google Cloud's security, scale, and model quality.

For organizations already invested in Google Cloud and Google SecOps, Legion extends that ecosystem directly into the analyst workflow.

Who It's For

Legion is purpose-built for enterprise security operations teams, CISOs, VPs of Information Security, SOC Directors, and Security Operations Managers at organizations running in-house SOCs. If your team is dealing with any of the following, Legion was built for you:

Alert volumes that have outpaced your team's capacity
Analyst burnout from manual, repetitive investigation work
Institutional knowledge that walks out the door when senior analysts do
Automation gaps caused by complex integration requirements

Available Now on Google Cloud Marketplace

Legion Security is available today on Google Cloud Marketplace, allowing customers to apply their spend toward their annual Google contract and simplify procurement. For security teams ready to move beyond the limits of traditional operations, this is where that transformation begins.

‍

Engineering

Legion Security Is Now Available on Google Cloud Marketplace

May 31, 2026

min read

Legion is officially on the Google Cloud Marketplace.

Gili Diamant

Abstract & Data Summary

We gathered and manually annotated a dataset of 196 hard triage decisions from real-world security investigations, covering a wide range of outcomes, including benign, malicious, and false positives. After cleaning the dataset by removing mock runs and cases with missing information or incorrect workflow execution, the remaining 163 examples were grouped into use case categories to form a high-quality cohort. We then evaluated LLMs on the dataset overall and per use-case category and found that Gemini 3 Pro performs best overall, though the best LLM varies by use case category.

Model performance by use case category:

Use case category	Best model(s)
Phishing	Gemini 3 Pro
Account Takeover	Sonnet 4 / GPT-4.1
Network	Opus 4.5 / GPT-4.1
Overall	Gemini 3 Pro

If you’d like to understand our full research methodology, read on.

*Note: since this blog was authored, several new model families have been released. While the results have remained broadly stable, particularly among the best and worst performers, updated research may be required for a nuanced understanding of the performance differences amongst the rest.

Data Collection

The dataset was constructed from security investigations from eight US-based customers.The evaluation is conducted in a secure, federated way, without mixing customer data, only reporting summary statistics from each customer tenant.

To create a challenging evaluation, we over-weighted cases in which the analyst dis-agreed with the model - so the error rate is inflated here.

The investigations were conducted automatically according to predefined, customer-specific workflows, each of which contained at least one triage decision node. A triage decision node is a decision point within a workflow, where an LLM chooses a decision from among a list of provided decision options, given the information that was gathered in the workflow up until that point.

At each decision node, the LLM used in production selected a classification decision from a list of workflow-specific decision options and provided the reasoning for its decision, based on a summary of the steps completed until that point in the investigation.

For each investigation containing at least one decision node, we collected the following information from production session logs:

A summary of the workflow steps up until the decision node, including tool name, step description, and step outputs
Organization-specific knowledge, written by the customer and containing a title, description, and data
The set of available decision options at the decision node
The model's selected decision in production, as well as the reasoning and detailed reasoning for the decision
The decision option selected by the customer
Feedback text written by the customer for the decision

Here is an example workflow diagram:

Quality Control

An expert cybersecurity analyst annotated the 196 decision examples with reasoning tags to explain the production and customer decisions, and label whether disagreements are explained by an analyst-error, mistaken reasoning by the AI or missing data / steps in the workflow.

Term	Definition
Good Reasoning	The LLM reasoned correctly about the decision options given the input data
Bad Reasoning	The LLM reasoned incorrectly about the decision options given the input data
Workflow Ran Correctly	The workflow had complete inputs and outputs, and did not get interrupted
Workflow Ran Incorrectly	The workflow had empty or partial data, or was interrupted
Customer was Aligned	Customer agreed with the expert analyst decision
Customer was Mistaken	Customer disagreed with the expert analyst decision
Missing Information	The LLM made the correct decision given the available information, but information relevant to making the decision was missing from its input

Examples tagged with "Workflow ran correctly but missing information" or "Workflow ran incorrectly" were removed from the dataset. Two additional examples with the use case titled "Workshop" were removed, as these were mock runs. For the remaining examples, the workflow ran correctly and was not missing information.

Triage Decision Distribution

By Label

Across the filtered dataset, the workflows contained 27 distinct normalized decision labels, which we grouped into the following buckets: False Positive, True Positive, Requires Review, and Other. The distribution of the labels is shown below:

Distribution of tags in triage dataset

False Positive Requires Review True Positive Other

The final evaluation dataset contains data from eight customers. The table below shows the number of annotated decision examples per customer and the tools used in each environment.

Customer	Tools used in each environment	# of decisions made
SOC Environment #1	Defender, CrowdStrike, Splunk, VirusTotal, AbuseIPDB, URLScan	48
SOC Environment #2	IP Quality Score, Zscaler, Confluence, ServiceNow, Splunk, Proofpoint TAP, Microsoft Entra, Cortex XSOAR, VMRay, Wiz, VirusTotal, URLScan	32
SOC Environment #3	Defender, Google SecOps, Silent Push, Microsoft Entra, IPLocation, Axonius, IPinfo, Nodedata	29
SOC Environment #4	Confluence, ServiceNow, Excel Online, Cortex XSOAR, VirusTotal, AbuseIPDB, URLScan	27
SOC Environment #5	Defender, Zscaler, Cortex XSOAR, Microsoft Entra, Abnormal Security, IPLocation, VirusTotal, AbuseIPDB, Shodan, IPinfo	16
SOC Environment #6	Defender, CiscoTalos, MxToolBox, AbuseIPDB, TeamDynamix, URLScan	5
SOC Environment #7	Splunk, Microsoft Entra, Mimecast, VirusTotal, AbuseIPDB, Spur	3
SOC Environment #8	Joe Sandbox, Proofpoint TAP, VirusTotal	3
Total	—	163

Use Case Distribution

We consolidated the use cases into 3 categories to consolidate our findings. Below is the map from the consolidated categories to the original use cases, as well as the distribution of the dataset over the consolidated categories.

Use cases	Counts
Phishing	96
Account Takeover	38
Network/Infrastructure	26

‍

Confusion Matrix

Below is a confusion matrix between the expert analyst annotations and the recommendations our system makes. We prompt the models to be careful and escalate when they are not sure.

Confusion matrix

Selection (Actual) vs Recommendation (Predicted)

		Predicted
		FP	TP	Requires Review	Other
Actual	False Positive	6065.9%	1617.6%	1516.5%	00.0%
	True Positive	26.9%	2793.1%	00.0%	00.0%
	Requires Review	13.1%	412.5%	2784.4%	00.0%
	Other	17.7%	00.0%	00.0%	1292.3%

Results

Over all use cases (including those without a use case name), Gemini 3 Pro had the highest performance at 74.8%, with GPT-4.1 and Opus 4.5 tied for second.

Triage performance by model

Phishing Results:

On the phishing use cases, Gemini 3 Pro performed the best, followed by Opus 4.5.

Triage performance by model

Account Takeover Results:

Sonnet 4 and GPT-4.1 were tied for best on the account takeover use cases.

Triage performance by model

‍

Network Results:

Opus 4.5 and GPT-4.1 were tied for best on the network use cases.

Triage performance by model

‍

Conclusion & Recommendation

We gathered and annotated 163 triage decisions from real-world security investigations. We characterized the use case distribution, and grouped the use cases according to common categories. We then benchmarked large language models across each use case category and the full dataset. We found that Gemini 3 Pro performs best overall. Per use case category, Gemini 3 Pro gives the best performance on phishing, Sonnet 4 and GPT-4.1 are tied for best on account takeover, and Opus 4.5 and GPT-4.1 are tied for best on network. Based on our results, we recommend that security teams test models for different scenarios to find the solution that works best for their use case, different models are good at different things and the only way to know which model works best for your use-cases it to run formal evaluation - or, you can trust us! Our research team in Legion is constantly evaluating new models and improvements to our triage pipelines.

‍

Benchmarking Large Language Models for Automated Security Triage

May 25, 2026

min read

We benchmarked leading LLMs on 163 real-world security triage decisions across phishing, account takeover, and network use cases. See which models performed best and why the answer depends on your use case

Alyssa Mensch

The security industry spent years debating when attackers would gain capabilities once out of reach — nation-state-level offensive tooling, zero-day discovery at scale, exploits built and iterated in minutes.

That gap was real. And it gave organizations the impression that the decision about which AI to bring into security operations, and how to do it right, could wait until the picture was clearer.

Mythos ended that assumption.

Not because of the model's size or strength, but because by the time Anthropic announced it, Mythos had already found thousands of high-severity vulnerabilities across every major operating system and browser in use today, without being told where to look. The decision not to release is the signal everyone was looking for.

That changes the implementation question. It was never acceptable to deploy AI badly in the SOC. Now it's not acceptable to deploy it slowly either. The organizations that will come out on top in the next 12 months are the ones that move fast and get it right, and most of the industry is about to discover that those aren't the same thing.

Level set: defenders have always been behind

The average breach lifecycle was already 258 days before AI-assisted attacks became the norm. This has nothing to do with the capabilities of analysts. Human-speed defense against machine-speed offense was always a losing equation.

Mythos-class models will almost certainly expand this breach lifecycle delta.

Most Implementations Are Getting It Wrong

87% of organizations experienced an AI-driven cyberattack in the past year. Security teams know they need AI. Most are already moving. But most implementations are failing for the same reason, and it is not the technology. It is a missing critical datapoint.

You. The context that shapes your business.

Most AI SOC tools treat every organization as interchangeable. They integrate with your SIEM, your EDR, your threat intel platforms, and assume that is enough. It is not. What determines whether AI actually works in your environment has nothing to do with the list of integrations. It is the organizational context that no integration can capture.

How is your organization structured? Where does data actually live versus where it is supposed to live? Who owns what, and how does that map to investigation and response when something goes wrong? How do escalation paths work in practice, not on paper? And critically, how do you enable the business without interrupting it?

The difference shows up clearly in practice. A heavily regulated enterprise running investigations across proprietary internal platforms looks nothing like a technology company. The organizational context that shapes every investigation, every escalation decision, and every response action is invisible to a system that only sees tool outputs.

Closing that gap is the foundational requirement that most implementations skip entirely.

Org Context Is Not a One-Time Setup

This is where most implementations fail, even when they start well.

Organizational context is not a configuration you complete on day one. Your organization is a living thing. Teams change. Tools get added. Processes evolve. New subsidiaries appear. Risk posture shifts with every acquisition, every regulatory update, every new product line the business launches.

An AI system that ingested your context six months ago and stopped learning is already drifting from your reality. It is making decisions based on an organization that no longer exists.

The right model is not a one-off ingestion. It is a continuous learning system that stays embedded in how your organization actually operates, tracks how investigations unfold, incorporates analyst feedback, and updates its understanding as your environment changes.

Not a snapshot.

A persistent model of your specific organization that evolves with it.

What Good Implementation Actually Looks Like

First, AI systems needs to understand how your organization actually operates. Not how it is documented, but how investigations really unfold, where data actually lives, and how decisions get made under pressure. The gap between what is written down and what actually happens is where most AI systems fail.

Second, that understanding cannot be static. Organizations change constantly. New teams, new tools, new processes, new risk priorities. Any system that relies on a snapshot of your environment will drift from reality and degrade over time. The AI working in your environment needs to keep learning it, not just learn it once.

Third, it needs to operate within that context, not around it. Producing technically correct outputs is not enough. The system needs to produce outcomes that are actionable within your organization as it exists today. That means working within your existing workflows, tools, and constraints without asking you to change how you operate to accommodate it.

That is the standard. Systems built around this model behave differently from the start. They do not ask organizations to adapt to them. They adapt to the organization. That distinction is where most implementations succeed or fail, and it is where the industry is slowly converging.

The Only Durable Path

The organizations getting AI right in the SOC aren't the ones with the longest integration lists or the biggest models. They're the ones that treated organizational context as the foundation rather than the afterthought, and built systems that keep learning their environment rather than freezing it in place on day one.

That is a harder implementation. It requires more from the vendor and more from the buyer. But Mythos made the timeline for getting there non-negotiable. The organizations that move fast on the wrong implementation will spend the next year rebuilding. The ones that move slowly on the right one will spend it exposed. The only durable path is moving quickly on the version that actually holds up. Systems built on continuous organizational context, deployed now rather than after the next incident, force the question.

The gap that used to buy time for deliberation is gone. What's left is the quality of the decision you make in its absence.

Fast and Right: Implementing AI in the SOC After Mythos

April 23, 2026

min read

Mythos ended the debate on whether AI belongs in the SOC. The new question is how to deploy it right and why organizational context is the foundation most implementations skip.

Ron Marsiano