Your AI SOC Should Automate How Your Team Actually Works
Your AI SOC should automate your team's real workflows, not theoretical playbooks. Learn how Legion's approach helps security teams improve processes for faster, more effective security operations.

The vision of a fully automated SOC is finally within reach, piece by piece. But as we see it, building automation around a theoretical SOC, not a real one, is the wrong approach for enterprise companies.
Some approaches focus on Tier 1 alerts: the repeatable, mundane tasks like standard phishing playbooks. Others tackle Tier 2+ investigations, where human judgment is still essential. Both are valid, but they miss the core reality of how a SOC operates today.
The goal of any AI SOC analyst isn't to replace your team, but to automate and improve the way they actually work.
Right now, your analysts are stuck in browser tabs, pivoting between consoles, copying data, and piecing together the truth manually. This isn’t scalable or efficient. It's why we founded Legion. Our vision is simple: your SOC lives in the browser. Your AI SOC analyst should build automation that reflects exactly that.
How SOC Workflows Actually Work Today
The modern SOC runs on people, browsers, and disconnected tools. Here's what that looks like in practice:
- Data Ingestion: Data (IPs, threat intel, logs, etc.) is pulled from multiple sources and correlated.
- Detection Engineering: Rules are written, tested, and updated based on what was missed or what created noise.
- Alert Triage: Analysts spend their day pulling data from different systems to figure out if an alert is real or just noise.
- Threat Hunting: Proactive hunts are a mix of experience and manual queries. Results are often shared ad hoc in Slack or documents, rarely in a repeatable format.
- Deeper Investigations: When an alert is valid, the manual pivot begins. Analysts jump between logs, threat intel feeds, and internal assets to gain context. Every jump between tools and content loses context.
- Remediation Actions: Depending on the validity of the alert, remediation actions are completed, and/or the ticket is closed out.
- Reporting & Incident Summarization: Building an incident timeline and report is a manual process of collecting screenshots, logs, and notes stitched together by hand.
- Process Hand-Offs: Shift changes and escalations often drop critical context because investigations aren’t documented in a structured way.

Author: Filip Stojkovski, Cybersec Automation
The main point is that most SOC workflows today are repetitive but lack standardization. Even if organizations have created playbooks within their SOAR or workflow automation tools, those playbooks are likely outdated or incorrect, because the automation is built by engineers rather than by the analysts who do the work.
An Honest Introduction to Automating the SOC
I’ve spent the last 90+ days digging into the AI SOC Analyst and SOAR market, talking to customers, analysts, and more.
Automating the SOC is not an easy problem to solve. Anyone who tells you their tool will work magically out of the box on day one is selling you a fantasy, and Legion is not here to make that promise either.
- Some alerts are predictable, but many are context-dependent and demand human judgment.
- Integrations break. APIs can make things easier, but still need to be managed.
- And through it all, your good analysts remain your most valuable asset. Automation should make them faster and more effective, not try to replace them.
Legion's approach is built on this reality.
How Legion Security Automates SOC Workflows
Legion’s approach is built on one simple principle: the SOC lives in the browser. Analysts do their real work inside SaaS consoles, cloud admin panels, EDR dashboards, and threat intel portals, all in the browser. That’s where detections are reviewed, logs are queried and analyzed, and decisions are made.
Instead of forcing your team into an abstract "playbook tool" built on API connections, Legion instruments the browser itself. This gives you a clear view of what an analyst clicks, searches, copies, and correlates. This is the actual audit trail of how investigations and responses are conducted. This visibility is (we believe) the best way to automate workflows that actually match how your team operates.
Legion breaks this down into three practical, trust-based modes:
- Recording Mode: Legion captures every step your best analysts take. It watches how they handle triage, pull context, enrich data, and close cases. This builds a bank of proven workflows, not theoretical runbooks. These recordings become reusable playbooks grounded in real analyst behavior.
- Guided Mode: Next, Guided Mode turns those recordings into automations. When a new alert comes in, the analyst runs the investigation with AI in the loop: Legion completes the investigation steps and recommends next steps at each decision node. Junior analysts don’t have to start from scratch. The guidance is readily available, right inside their workflow. This closes skill gaps and standardizes how your team works.
- Autonomous Mode: Finally, Legion can run trusted workflows end-to-end in Autonomous Mode. But only for well-understood, repeatable scenarios you've already vetted. When a ticket is opened, Legion executes the steps your team already does manually. There's no black-box decision-making or surprise actions outside what you’ve already proven works.
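As a rough illustration of how the three modes could gate one another, here is a minimal Python sketch. The `Workflow` fields and the `allowed_mode` function are hypothetical names invented for this example; they are not Legion's actual implementation or API.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    RECORDING = "recording"
    GUIDED = "guided"
    AUTONOMOUS = "autonomous"


@dataclass
class Workflow:
    # Hypothetical fields, used only for this illustration.
    name: str
    recorded: bool = False  # a real analyst session has been captured
    vetted: bool = False    # the team has approved unattended execution


def allowed_mode(wf: Workflow) -> Mode:
    """Return the most autonomous mode this workflow has earned so far."""
    if not wf.recorded:
        return Mode.RECORDING   # nothing to replay yet, so keep observing
    if not wf.vetted:
        return Mode.GUIDED      # replay with an analyst at each decision node
    return Mode.AUTONOMOUS      # well-understood, repeatable, already proven


phishing = Workflow("phishing triage", recorded=True, vetted=False)
print(allowed_mode(phishing).value)
```

The point of the sketch is the ordering: a workflow only reaches Autonomous Mode after it has been recorded and explicitly vetted.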

By focusing on how your real analysts work and only automating what they’ve shown to be effective, Legion enables you to build true automation that adapts and improves over time.
Use Cases for the Legion AI SOC Analyst
- Workflow Documentation: Create comprehensive workflow maps of how your SOC analysts handle alert triage and investigations.
- Alert Triage & Investigations: Automate noisy Tier 1 triage, enrich alerts with context, and auto-close junk. These can include cloud, identity, phishing, vulnerability management, and more. Because we are not limited by integrations, Legion can automate any SOC workflow.
- Reporting & Incident Summarization: Generate incident timelines and report on key metrics such as MTTA/MTTR.
- Process Improvement: Spot process gaps and bottlenecks, and optimize workflows across analysts.
- SOC Training: Don’t let your tribal knowledge leave with your best analysts. By mapping out your processes, your junior analysts can train by “looking over the shoulder” of Legion in Guided Mode.
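For the reporting use case above, MTTA and MTTR are usually computed as simple averages over incident timestamps. A minimal sketch, assuming each incident record carries `created`, `acknowledged`, and `resolved` timestamps (hypothetical field names, not a Legion schema). MTTR is taken here as mean time to resolve; some teams define it as time to respond or recover instead.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average elapsed minutes between two timestamps across incidents."""
    return mean(
        (i[end_key] - i[start_key]).total_seconds() / 60 for i in incidents
    )


# Toy data for illustration only.
incidents = [
    {
        "created": datetime(2025, 1, 1, 9, 0),
        "acknowledged": datetime(2025, 1, 1, 9, 5),
        "resolved": datetime(2025, 1, 1, 10, 0),
    },
    {
        "created": datetime(2025, 1, 1, 12, 0),
        "acknowledged": datetime(2025, 1, 1, 12, 15),
        "resolved": datetime(2025, 1, 1, 14, 0),
    },
]

mtta = mean_minutes(incidents, "created", "acknowledged")  # 10.0 minutes
mttr = mean_minutes(incidents, "created", "resolved")      # 90.0 minutes
print(f"MTTA={mtta:.1f} min, MTTR={mttr:.1f} min")
```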

Final Thoughts
SOC automation shouldn’t be magic (even if it feels like it sometimes). It's grounded in observing, guiding, and learning from your real workflows.
Legion’s AI SOC analyst doesn’t pretend to replace humans. It records what your best people do, guides new analysts, and automates the repeatable. Over time, your analysts can focus on improving workflows, upleveling their security skills, improving detections, and more. Automate your SOC the way your team actually works with Legion.
Abstract & Data Summary
We gathered and manually annotated a dataset of 196 hard triage decisions from real-world security investigations, covering a wide range of outcomes, including benign, malicious, and false positives. After cleaning the dataset by removing mock runs and cases with missing information or incorrect workflow execution, the remaining 163 examples were grouped into use case categories to form a high-quality cohort. We then evaluated LLMs on the dataset overall and per use-case category and found that Gemini 3 Pro performs best overall, though the best LLM varies by use case category.
Model performance by use case category:
If you’d like to understand our full research methodology, read on.
*Note: since this blog was authored, several new model families have been released. While the results have remained broadly stable, particularly among the best and worst performers, updated research may be required for a nuanced understanding of the performance differences amongst the rest.
Data Collection
The dataset was constructed from security investigations from eight US-based customers. The evaluation was conducted in a secure, federated way: customer data was never mixed, and only summary statistics were reported from each customer tenant.
To create a challenging evaluation, we over-weighted cases in which the analyst disagreed with the model, so the error rate reported here is inflated.
The investigations were conducted automatically according to predefined, customer-specific workflows, each of which contained at least one triage decision node. A triage decision node is a decision point within a workflow, where an LLM chooses a decision from among a list of provided decision options, given the information that was gathered in the workflow up until that point.
At each decision node, the LLM used in production selected a classification decision from a list of workflow-specific decision options and provided the reasoning for its decision, based on a summary of the steps completed until that point in the investigation.
For each investigation containing at least one decision node, we collected the following information from production session logs:
- A summary of the workflow steps up until the decision node, including tool name, step description, and step outputs
- Organization-specific knowledge, written by the customer and containing a title, description, and data
- The set of available decision options at the decision node
- The model's selected decision in production, as well as the reasoning and detailed reasoning for the decision
- The decision option selected by the customer
- Feedback text written by the customer for the decision
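One way to picture a single collected example is as a typed record. The sketch below mirrors the fields listed above; the class and attribute names are assumptions made for illustration, not the schema used in production.

```python
from dataclasses import dataclass


@dataclass
class WorkflowStep:
    tool_name: str
    description: str
    output: str


@dataclass
class OrgKnowledge:
    title: str
    description: str
    data: str


@dataclass
class DecisionExample:
    steps: list[WorkflowStep]          # summary up to the decision node
    org_knowledge: list[OrgKnowledge]  # written by the customer
    decision_options: list[str]        # options offered at the node
    model_decision: str                # what the production LLM chose
    model_reasoning: str
    model_detailed_reasoning: str
    customer_decision: str             # what the customer selected
    customer_feedback: str

    def is_well_formed(self) -> bool:
        """The model may only choose from the options offered at the node."""
        return self.model_decision in self.decision_options

    def is_disagreement(self) -> bool:
        return self.model_decision != self.customer_decision


example = DecisionExample(
    steps=[WorkflowStep("email-gateway", "Fetch headers", "SPF=fail")],
    org_knowledge=[OrgKnowledge("VIP list", "Executives", "ceo@example.com")],
    decision_options=["False Positive", "True Positive", "Escalate"],
    model_decision="Escalate",
    model_reasoning="SPF failure on a VIP mailbox",
    model_detailed_reasoning="(longer explanation)",
    customer_decision="True Positive",
    customer_feedback="Known phishing campaign",
)
print(example.is_well_formed(), example.is_disagreement())
```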
Here is an example workflow diagram:

Quality Control
An expert cybersecurity analyst annotated the 196 decision examples with reasoning tags to explain the production and customer decisions, and labeled whether each disagreement was explained by analyst error, mistaken reasoning by the AI, or missing data or steps in the workflow.
Examples tagged with "Workflow ran correctly but missing information" or "Workflow ran incorrectly" were removed from the dataset. Two additional examples with the use case titled "Workshop" were removed, as these were mock runs. For the remaining examples, the workflow ran correctly and was not missing information.
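The filtering rule described above can be sketched in a few lines of Python. The two tag strings and the "Workshop" use case title come from the text; the `tags` and `use_case` field names and the toy records are assumptions for illustration.

```python
# Tags that cause an example to be dropped, as named in the text.
EXCLUDED_TAGS = {
    "Workflow ran correctly but missing information",
    "Workflow ran incorrectly",
}
MOCK_USE_CASE = "Workshop"  # mock runs


def keep(example: dict) -> bool:
    """True if the workflow ran correctly, had full information, and was real."""
    if example.get("use_case") == MOCK_USE_CASE:
        return False
    return not (set(example.get("tags", [])) & EXCLUDED_TAGS)


# Toy records, not real data.
examples = [
    {"id": 1, "use_case": "Phishing", "tags": ["AI reasoning error"]},
    {"id": 2, "use_case": "Phishing", "tags": ["Workflow ran incorrectly"]},
    {"id": 3, "use_case": "Workshop", "tags": []},
    {"id": 4, "use_case": "Network",
     "tags": ["Workflow ran correctly but missing information"]},
]

cleaned = [e for e in examples if keep(e)]
print([e["id"] for e in cleaned])
```

Applied to the real annotations, a filter of this shape is what takes the 196 collected examples down to the 163 used for evaluation.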
Triage Decision Distribution
By Label
Across the filtered dataset, the workflows contained 27 distinct normalized decision labels, which we grouped into the following buckets: False Positive, True Positive, Requires Review, and Other. The distribution of the labels is shown below:
The final evaluation dataset contains data from eight customers. The table below shows the number of annotated decision examples per customer and the tools used in each environment.
Use Case Distribution
We consolidated the use cases into three categories to summarize our findings. Below is the map from the consolidated categories to the original use cases, as well as the distribution of the dataset over the consolidated categories.
Confusion Matrix
Below is a confusion matrix between the expert analyst annotations and the recommendations our system makes. We prompt the models to be careful and escalate when they are not sure.
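For readers who want to reproduce this kind of analysis, here is a minimal sketch of bucketing normalized labels and counting a confusion matrix. The four bucket names come from the text; the label strings in `LABEL_TO_BUCKET` are hypothetical stand-ins, since the 27 real labels are not listed here.

```python
from collections import Counter

BUCKETS = ["False Positive", "True Positive", "Requires Review", "Other"]

# Hypothetical normalized labels, for illustration only.
LABEL_TO_BUCKET = {
    "false_positive": "False Positive",
    "confirmed_malicious": "True Positive",
    "needs_analyst_review": "Requires Review",
}


def to_bucket(label: str) -> str:
    """Map a normalized label to its bucket; unknown labels fall into Other."""
    return LABEL_TO_BUCKET.get(label.strip().lower(), "Other")


def confusion_matrix(pairs):
    """pairs: (expert_label, model_label). Rows = expert, columns = model."""
    counts = Counter((to_bucket(e), to_bucket(m)) for e, m in pairs)
    return [[counts[(row, col)] for col in BUCKETS] for row in BUCKETS]


# Toy annotations, not real data.
pairs = [
    ("false_positive", "false_positive"),
    ("confirmed_malicious", "needs_analyst_review"),
    ("unmapped_label", "false_positive"),
]
matrix = confusion_matrix(pairs)
for name, row in zip(BUCKETS, matrix):
    print(f"{name:>16}: {row}")
```

Because the models are prompted to escalate when unsure, off-diagonal mass in the "Requires Review" column is the cautious kind of error, which is worth reading separately from a missed True Positive.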
Results
Over all use cases (including those without a use case name), Gemini 3 Pro had the highest performance at 74.8%, with GPT-4.1 and Opus 4.5 tied for second.
Phishing Results:
On the phishing use cases, Gemini 3 Pro performed the best, followed by Opus 4.5.
Account Takeover Results:
Sonnet 4 and GPT-4.1 were tied for best on the account takeover use cases.
Network Results:
Opus 4.5 and GPT-4.1 were tied for best on the network use cases.
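The per-category comparison above can be reproduced with a small aggregation. This sketch assumes "performance" means the fraction of decisions that match the expert annotation, which the text does not define precisely; the model names, categories, and records below are toy values.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """records: iterable of (model, category, predicted, expected).

    Returns {(model, category): accuracy}, plus an "overall" entry per model.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for model, category, predicted, expected in records:
        for key in ((model, category), (model, "overall")):
            totals[key] += 1
            hits[key] += int(predicted == expected)
    return {key: hits[key] / totals[key] for key in totals}


# Toy records, not the real evaluation data.
records = [
    ("model-a", "phishing", "True Positive", "True Positive"),
    ("model-a", "phishing", "False Positive", "True Positive"),
    ("model-a", "network", "False Positive", "False Positive"),
    ("model-b", "phishing", "True Positive", "True Positive"),
]

scores = accuracy_by_category(records)
for (model, category), acc in sorted(scores.items()):
    print(f"{model:8} {category:10} {acc:.1%}")
```

Keeping the per-category and overall scores side by side is what makes ties and category-specific winners visible.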
Conclusion & Recommendation
We gathered and annotated 163 triage decisions from real-world security investigations. We characterized the use case distribution and grouped the use cases into common categories. We then benchmarked large language models on each use case category and on the full dataset. We found that Gemini 3 Pro performs best overall. Per use case category, Gemini 3 Pro gives the best performance on phishing, Sonnet 4 and GPT-4.1 are tied for best on account takeover, and Opus 4.5 and GPT-4.1 are tied for best on network. Based on our results, we recommend that security teams test models on different scenarios to find the one that works best for their use case. Different models are good at different things, and the only way to know which model works best for your use cases is to run a formal evaluation. Or you can trust us! Our research team at Legion is constantly evaluating new models and improvements to our triage pipelines.

We benchmarked leading LLMs on 163 real-world security triage decisions across phishing, account takeover, and network use cases. See which models performed best and why the answer depends on your use case.
The security industry spent years debating when attackers would gain capabilities once out of reach: nation-state-level offensive tooling, zero-day discovery at scale, exploits built and iterated in minutes.
That gap was real. And it gave organizations the impression that the decision about which AI to bring into security operations, and how to do it right, could wait until the picture was clearer.
Mythos ended that assumption.
Not because of the model's size or strength, but because by the time Anthropic announced it, Mythos had already found thousands of high-severity vulnerabilities across every major operating system and browser in use today, without being told where to look. The decision not to release is the signal everyone was looking for.
That changes the implementation question. It was never acceptable to deploy AI badly in the SOC. Now it's not acceptable to deploy it slowly either. The organizations that will come out on top in the next 12 months are the ones that move fast and get it right, and most of the industry is about to discover that those aren't the same thing.
Level set: defenders have always been behind
The average breach lifecycle was already 258 days before AI-assisted attacks became the norm. This has nothing to do with the capabilities of analysts. Human-speed defense against machine-speed offense was always a losing equation.
Mythos-class models will almost certainly widen the gap between attacker speed and defender response, lengthening that breach lifecycle further.
Most Implementations Are Getting It Wrong
87% of organizations experienced an AI-driven cyberattack in the past year. Security teams know they need AI. Most are already moving. But most implementations are failing for the same reason, and it is not the technology. It is a missing critical datapoint.
You. The context that shapes your business.
Most AI SOC tools treat every organization as interchangeable. They integrate with your SIEM, your EDR, your threat intel platforms, and assume that is enough. It is not. What determines whether AI actually works in your environment has nothing to do with the list of integrations. It is the organizational context that no integration can capture.
How is your organization structured? Where does data actually live versus where it is supposed to live? Who owns what, and how does that map to investigation and response when something goes wrong? How do escalation paths work in practice, not on paper? And critically, how do you enable the business without interrupting it?
The difference shows up clearly in practice. A heavily regulated enterprise running investigations across proprietary internal platforms looks nothing like a technology company. The organizational context that shapes every investigation, every escalation decision, and every response action is invisible to a system that only sees tool outputs.
Closing that gap is the foundational requirement that most implementations skip entirely.
Org Context Is Not a One-Time Setup
This is where most implementations fail, even when they start well.
Organizational context is not a configuration you complete on day one. Your organization is a living thing. Teams change. Tools get added. Processes evolve. New subsidiaries appear. Risk posture shifts with every acquisition, every regulatory update, every new product line the business launches.
An AI system that ingested your context six months ago and stopped learning is already drifting from your reality. It is making decisions based on an organization that no longer exists.
The right model is not a one-off ingestion. It is a continuous learning system that stays embedded in how your organization actually operates, tracks how investigations unfold, incorporates analyst feedback, and updates its understanding as your environment changes.
Not a snapshot.
A persistent model of your specific organization that evolves with it.
What Good Implementation Actually Looks Like
First, the AI system needs to understand how your organization actually operates. Not how it is documented, but how investigations really unfold, where data actually lives, and how decisions get made under pressure. The gap between what is written down and what actually happens is where most AI systems fail.
Second, that understanding cannot be static. Organizations change constantly. New teams, new tools, new processes, new risk priorities. Any system that relies on a snapshot of your environment will drift from reality and degrade over time. The AI working in your environment needs to keep learning it, not just learn it once.
Third, it needs to operate within that context, not around it. Producing technically correct outputs is not enough. The system needs to produce outcomes that are actionable within your organization as it exists today. That means working within your existing workflows, tools, and constraints without asking you to change how you operate to accommodate it.
That is the standard. Systems built around this model behave differently from the start. They do not ask organizations to adapt to them. They adapt to the organization. That distinction is where most implementations succeed or fail, and it is where the industry is slowly converging.
The Only Durable Path
The organizations getting AI right in the SOC aren't the ones with the longest integration lists or the biggest models. They're the ones that treated organizational context as the foundation rather than the afterthought, and built systems that keep learning their environment rather than freezing it in place on day one.
That is a harder implementation. It requires more from the vendor and more from the buyer. But Mythos made the timeline for getting there non-negotiable. The organizations that move fast on the wrong implementation will spend the next year rebuilding. The ones that move slowly on the right one will spend it exposed. The only durable path is moving quickly on the version that actually holds up. Systems built on continuous organizational context, deployed now rather than after the next incident, force the question.
The gap that used to buy time for deliberation is gone. What's left is the quality of the decision you make in its absence.
Mythos ended the debate on whether AI belongs in the SOC. The new question is how to deploy it right and why organizational context is the foundation most implementations skip.
Anthropic was right (and responsible) to release Mythos first to cybersecurity researchers and a select group of organizations through Project Glasswing. It is a genuinely remarkable model. And the security community should take it seriously. What is available to defenders today will be in the hands of attackers in a few months. That window is closing fast.
Mythos raises the ceiling on what AI can do in cybersecurity tasks. It discovers zero-day vulnerabilities in codebases that previous models could not find. It reverse-engineers complex systems. It constructs sophisticated, multi-path exploits at scale. The capabilities that were previously accessible only to well-funded nation-state actors can now be replicated by a far broader set of threat actors. No longer do you need teams of expert reverse engineers and months of reconnaissance.
The threat landscape is structurally shifting. Our success will be determined by our ability to shift our defenses in kind. Quickly.
Where AI in defense needs to go first
The industry is converging, rightly, on vulnerability research and remediation as the priority. Scanning your own codebase with the same class of models that attackers are using is a clear first step. In many cases, defenders actually have an asymmetric advantage here, as we have better access to our own code than attackers do.
The harder problem is remediation. We already carry significant backlogs of unresolved, sometimes exploitable, vulnerabilities. Unlike an attacker who has nothing to lose, defenders cannot afford mistakes. Our systems are in production. Downtime has real costs. The asymmetry of attacker agility versus defender accountability is where the gap widens.
AI-assisted vulnerability remediation at scale is necessary. But it is not a solved problem, and any honest assessment of the landscape has to acknowledge that.
What this means for security operations
The idea of static detections designed to discover dynamic adversaries is fundamentally misguided. The future is better trip wires and an assume-breach mentality.
For SOC teams, the implications are direct. The scale and complexity of attacks are accelerating. We should expect a higher volume of sophisticated attacks that actively evade detection, that do not conform to known signatures or behavioral patterns, and that are designed from the ground up to stay invisible.
This breaks the model that most SOC programs are built on. The idea of maintaining a library of static detections to catch dynamic adversaries has always had limits. Those limits are now being exposed in real time.
What we need instead is the ability to detect a high volume of low-fidelity signals, such as anomalies in endpoint behavior, data access patterns, email activity, network flows, and identity. This requires teams to investigate each one as if it were the leading edge of a sophisticated breach. Not because every alert is a nation-state intrusion, but because we should expect that a higher percentage now may be.
The question is no longer whether to adopt AI in security operations. This is clearly needed. We cannot scale defenses solely on human labor.
The question is how to do it in a way that actually works inside the operational reality that security teams live in.
The real challenge is operational reality
Enterprises have legacy and custom tools, established processes, compliance and audit requirements, escalation paths, and oversight obligations that are not optional. AI cannot simply replace this infrastructure. It has to work within it.
You cannot properly scale your defenses without giving AI access to your organizational context, including your tools, your processes, your detection logic, and your escalation criteria. AI agents need to be able to investigate with the consistency and rigor of an incredible IR analyst, operate transparently, and support human oversight at the points where it matters.
This is precisely what we built Legion to do: meet organizations where they are. Our platform learns your existing tools, processes, and context and makes them accessible to the latest frontier models (now Mythos, and every model in the future). From that, we create structured, repeatable workflows where consistency is required, or fully agentic investigations where depth and judgment are required. Every action is auditable. Human-in-the-loop controls are configurable. And the system integrates across your entire stack.
My conclusion: assume breach, investigate everything, and build for the attacker who has already found the vulnerability you have not patched yet and is using Mythos-level models to stay ahead of your detections.

In the wake of Mythos and Project Glasswing, security operations teams need AI that meets them where they are.



