
IOC Chaos

When LLMs investigate security alerts autonomously, they encounter dozens, sometimes hundreds, of IOCs: emails, URLs, IPs, domains, and hostnames. Left unmanaged, these inflate token costs, produce inconsistent references, and break structured output. Our IOC indexing system replaces raw indicators with compact symbolic references the model reuses throughout its reasoning. Across 100 evaluation runs, it took JSON validity from ~80% to 100% and IOC reference compliance to 100%.

Nadav Kahanowich

Introduction

When LLM-based agents conduct multi-step security investigations, a typical workflow begins with an alert - say, a reported phishing email. The agent then iteratively queries tools such as Microsoft Defender Threat Explorer, Splunk, or CrowdStrike to gather evidence, assess scope, and recommend containment actions.

At each step, the agent receives query results containing raw IOCs: sender addresses, embedded URLs, source IPs, recipient domains, and device hostnames. It must reason about these indicators, decide whether to refine its search or conclude the investigation, and return its findings as structured output.

This works well for short investigations. But as the number of steps grows, an issue emerges: the agent's context window fills with repeated, verbose indicator values, the model begins echoing raw IOCs inconsistently, and the structured outputs it produces become increasingly fragile.

[Figure 1: High-level architecture of an AI-driven investigation agent.]

Consider a phishing investigation that proceeds through four steps:

  1. Initial query: Search for emails from a reported sender to a specific recipient. The results contain the sender's email address and a handful of URLs.
  2. Scope expansion: Search for all emails from the same sender across the organization. The results return 22 emails with SharePoint URLs, tracking links, and font-file references.
  3. URL analysis: Search by specific URLs found in step 2. Additional domains and redirects surface.
  4. Conclusion: The agent summarizes its findings and lists all relevant IOCs.

By step 4, the agent's prompt contains the full history of steps 1 through 3 - including every raw URL, email address, and domain mentioned in each step's results and the agent's own reasoning. Some of these URLs are long tracking links with base64-encoded parameters, easily exceeding 200 characters each.

This accumulation created three concrete problems.

  1. Token bloat. Raw IOC values, particularly URLs with tracking parameters and encoded payloads, consumed a disproportionate share of the context window. A single newsletter email might contain 30+ URLs, each repeated in the query results, the agent's reasoning, and the indicators list, tripling the token cost per IOC, per step.
  2. Over-reporting. When asked to list relevant indicators, the model would frequently dump every IOC it had ever seen into the response - even when the current step involved only one or two. In one case, an agent listed all 145 email addresses from its registry when the current query concerned a single sender.
  3. Structural fragility. Query results from security tools sometimes contained comma-separated URL lists embedded in strings. When the model attempted to reproduce these in its JSON output, it produced malformed structures - unescaped commas, broken string boundaries, and invalid nesting. In our baseline evaluation, only approximately 80% of model responses parsed as valid JSON.

Our Approach

We addressed these problems with a three-part system:

  1. A unified IOC manager that extracts and indexes indicators.
  2. An IOC prompt adjustment that instructs the model on how to use indexed references.
  3. A preprocessing step that cleans malformed tool output before it reaches the model.

IOC Extraction and Indexing

The core of the system is an IOC manager that maintains a registry of all indicators encountered during an investigation. When new text enters the pipeline - whether from tool query results or from the agent's own prior reasoning - the manager scans it using a set of type-specific patterns covering URLs, email addresses, IPv4 addresses, file hashes, hostnames, and domains.

Each newly discovered IOC is assigned a compact symbolic reference following a consistent naming convention: the first email becomes EMAIL01, the first URL becomes URL01, the second domain becomes DOMAIN02, and so on. The original value is stored in the registry, and all occurrences in the text are replaced with the corresponding reference.

Extraction order matters. URLs are processed first, because a URL contains both a domain and potentially an IP address. By extracting URLs before domains and IPs, we prevent the system from fragmenting a single indicator into multiple overlapping entries. 

The manager also performs selective extraction. In a typical prompt, the first section contains static task instructions - tool descriptions, output format specifications, and investigation guidelines. IOC extraction is applied only to the dynamic sections (step history and query results), leaving instruction text unchanged. This prevents false positives from example IOCs embedded in the prompt template.

Deduplication is handled through a value-to-reference mapping. If the same IOC appears in step 1 and again in step 3, it receives the same reference both times, ensuring consistent tracking across the entire investigation.
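The indexing, extraction-order, and deduplication behavior described above can be sketched roughly as follows. This is a minimal illustration, not the production implementation: the class name `IOCManager`, its method names, and the regular expressions are all assumptions made for the example, and the patterns are deliberately simplified (file hashes and hostnames are omitted for brevity).

```python
import re

# Hypothetical, simplified patterns. The list order is the extraction order:
# URLs are handled first so their embedded domains/IPs are not indexed separately.
PATTERNS = [
    ("URL", re.compile(r"https?://[^\s,\"'<>]+")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("IP", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("DOMAIN", re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)),
]


class IOCManager:
    """Keeps a registry of indicators and replaces them with symbolic references."""

    def __init__(self):
        self.value_to_ref = {}  # raw value -> reference (used for deduplication)
        self.ref_to_value = {}  # reference -> raw value (the registry)
        self.counters = {}      # per-type running counter

    def _reference_for(self, ioc_type, value):
        # A value seen in an earlier step keeps the reference it already has.
        if value in self.value_to_ref:
            return self.value_to_ref[value]
        self.counters[ioc_type] = self.counters.get(ioc_type, 0) + 1
        ref = f"{ioc_type}{self.counters[ioc_type]:02d}"
        self.value_to_ref[value] = ref
        self.ref_to_value[ref] = value
        return ref

    def index_text(self, text):
        # Apply each pattern in order, substituting every match with its reference.
        for ioc_type, pattern in PATTERNS:
            text = pattern.sub(
                lambda m, t=ioc_type: self._reference_for(t, m.group(0)), text
            )
        return text


manager = IOCManager()
print(manager.index_text("Mail from attacker@evil.example links to https://evil.example/login?id=1"))
# prints: Mail from EMAIL01 links to URL01
print(manager.index_text("Later mail from attacker@evil.example mentions cdn.evil.example"))
# prints: Later mail from EMAIL01 mentions DOMAIN01
```

Because the URL pattern runs first, the domain inside the URL never reaches the domain pattern, and the second call reuses EMAIL01 rather than minting a new reference.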

[Figure 3: The IOC extraction pipeline.]

IOC Prompt Adjustment

Extraction alone is not sufficient. Even when the input prompt uses symbolic references, the model may revert to generating raw IOC values in its output, particularly if the system prompt or prior conversation history contains raw values, or if the model has seen the actual value during context processing.

To address this, we developed an IOC prompt adjustment, a compact structured appendix appended to the user prompt that explicitly instructs the model on how to handle IOCs. The adjustment establishes three rules:

  1. In reasoning: Always use symbolic references. Never write raw email addresses, URLs, IP addresses, or domains. Instead of writing a raw sender address followed by a description of the campaign, use the corresponding reference identifier throughout.
  2. In the indicators field: Distinguish between known and new IOCs. For indicators already present in the registry, use the symbolic reference. For indicators being reported for the first time, newly discovered in the current step's results, use the actual value, so it can be added to the registry for subsequent steps.
  3. Relevance filtering: Only include IOCs that are directly relevant to the current investigation step. Do not copy all registry entries into every response.

The IOC prompt adjustment includes a populated copy of the current IOC registry, mapping each reference to its actual value, so the model can look up identifiers when constructing its reasoning. It also provides correct and incorrect examples, a validation checklist, and explicit rejection criteria.
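As a rough illustration of what such an appendix might look like when rendered from the registry, consider the sketch below. The function name `build_ioc_prompt_adjustment`, the section titles, and the exact rule wording are assumptions for the example; the actual prompt text is not given in this article, and the examples, checklist, and rejection criteria it mentions are left out here.

```python
def build_ioc_prompt_adjustment(registry):
    """Render a hypothetical IOC appendix from a {reference: raw value} registry."""
    registry_lines = [f"{ref} = {value}" for ref, value in sorted(registry.items())]
    return "\n".join([
        "IOC HANDLING RULES",
        "1. In reasoning, use only symbolic references (e.g. EMAIL01); never write raw values.",
        "2. In the indicators field, use the reference for IOCs already in the registry "
        "and the actual value for IOCs first seen in this step.",
        "3. List only IOCs directly relevant to the current step; do not copy the whole registry.",
        "",
        "IOC REGISTRY",
        *registry_lines,
    ])


appendix = build_ioc_prompt_adjustment({
    "EMAIL01": "attacker@evil.example",
    "URL01": "https://evil.example/login?id=1",
})
print(appendix)
```

The appendix would be regenerated at every step so that the registry section always reflects the indicators collected so far.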

We tested two versions of the IOC prompt adjustment: a comprehensive version with extensive examples and redundant emphasis, and an optimized version that distills the same rules more concisely. Both achieved equivalent compliance rates, suggesting that clarity of instruction matters more than volume of repetition.

Input Preprocessing

The third component addresses a problem upstream of the model: malformed tool output. Security tool APIs sometimes return URL lists as comma-separated values within a single string field, rather than as properly structured arrays. When passed through to the model as-is, these malformed strings caused structured output generation failures.

Our preprocessing step detects comma-separated URL patterns in query results and reformats them into clean, numbered lists before the text reaches the model. This small transformation, applied before IOC extraction, resolved the structured output validity issue independently of the other components.
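A minimal sketch of this kind of transformation is shown below. The function name `clean_url_lists` and the regular expression are illustrative assumptions, and the sketch assumes that the URLs themselves contain no commas; a real implementation would need to handle that case.

```python
import re

# Simplified URL pattern (assumption): a URL runs until whitespace or a comma.
URL = r"https?://[^\s,]+"
# Two or more URLs joined by commas, with optional whitespace after each comma.
URL_LIST = re.compile(rf"{URL}(?:,\s*{URL})+")


def clean_url_lists(text):
    """Rewrite comma-separated URL runs as a numbered list, one URL per line."""
    def to_numbered_list(match):
        urls = [u.strip() for u in match.group(0).split(",")]
        lines = [f"{i}. {u}" for i, u in enumerate(urls, start=1)]
        return "\n" + "\n".join(lines)

    return URL_LIST.sub(to_numbered_list, text)


raw = "Urls: https://a.example/x,https://b.example/y, https://c.example/z"
print(clean_url_lists(raw))
```

A lone URL is left untouched, since the pattern only fires when at least two URLs are joined by a comma.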

Evaluation

Setup

We evaluated the system using 10 real-world investigation traces captured from production. Each trace represents a complete phishing investigation conducted through several security tools, containing the system prompt, user prompt with step history, and the raw query results that the model must reason about.

For each trace, we ran 10 iterations with the same prompt configuration, measuring two metrics:

  • JSON validity: Whether the model's response parsed as valid structured output.
  • IOC reference compliance: Whether the response used symbolic references exclusively in its reasoning field (no raw IOC values) and correctly distinguished between known references and new actual values in its indicators field.
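The two metrics could be computed per response along the following lines. This is a sketch under stated assumptions: the function name `score_response`, the JSON field names `reasoning` and `indicators`, and the detection patterns are hypothetical, and bare domains are not checked in order to keep the patterns short.

```python
import json
import re

# Simplified detectors for raw indicators in free text (assumption).
RAW_IOC = re.compile(
    r"https?://\S+"                    # URLs
    r"|[\w.+-]+@[\w-]+(?:\.[\w-]+)+"   # email addresses
    r"|\b(?:\d{1,3}\.){3}\d{1,3}\b"    # IPv4 addresses
)
# A symbolic reference: uppercase type prefix followed by at least two digits.
REFERENCE = re.compile(r"^[A-Z]+\d{2,}$")


def score_response(raw_response, known_values):
    """Return binary JSON-validity and IOC-compliance flags for one response.

    known_values is the set of raw IOC values already present in the registry.
    """
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        return {"json_valid": False, "ioc_compliant": False}
    if not isinstance(parsed, dict):
        return {"json_valid": True, "ioc_compliant": False}

    reasoning = parsed.get("reasoning", "")
    indicators = parsed.get("indicators", [])

    # Reasoning must contain no raw indicator values at all.
    reasoning_ok = RAW_IOC.search(reasoning) is None
    # Each indicator is either a reference, or a raw value not yet in the registry.
    indicators_ok = all(
        bool(REFERENCE.match(item)) or item not in known_values
        for item in indicators
    )
    return {"json_valid": True, "ioc_compliant": reasoning_ok and indicators_ok}


known = {"attacker@evil.example"}
good = '{"reasoning": "EMAIL01 sent URL01.", "indicators": ["EMAIL01", "https://new.example/a"]}'
print(score_response(good, known))
# prints: {'json_valid': True, 'ioc_compliant': True}
```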

We tested five configurations, including a baseline with no IOC system, to isolate the contribution of each component.

Results

| Configuration | JSON Validity | IOC Compliance |
|---|---|---|
| Baseline (no IOC system) | ~80% | 0% |
| IOC Prompt Adjustment only | 100% | 100% |
| IOC Manager + Prompt Adjustment | 100% | 100% |
| URL Cleaning only (no prompt adjustment) | ~100% | 0% |
| Full system (Manager + Prompt Adjustment + URL Cleaning) | 100% | 100% |

JSON validity. The baseline configuration produced valid JSON in approximately 80% of responses. Adding URL preprocessing alone brought this to nearly 100%, confirming that malformed tool output - not model capability - was the root cause of parsing failures. All configurations that included the IOC prompt adjustment achieved 100% validity.

IOC compliance. Without the IOC prompt adjustment, the model never spontaneously adopted symbolic references: compliance was 0% regardless of whether the input text had been processed by the IOC manager. With the prompt adjustment, compliance jumped to 100% across all traces and iterations. This held for both the comprehensive and optimized prompt adjustment variants.

Component independence. The results reveal a clean separation of concerns: URL preprocessing fixes JSON validity, the IOC prompt adjustment fixes IOC compliance, and the IOC manager provides the underlying registry and extraction infrastructure that makes both possible.

Qualitative Observations

Beyond the quantitative metrics, we observed several qualitative improvements:

  • Reduced prompt size. Replacing verbose URLs (some exceeding 200 characters) with compact references meaningfully reduced token consumption in the step history, particularly for investigations involving newsletter or marketing emails with numerous tracking links.
  • Consistent cross-step tracking. The registry ensured that the same IOC received the same reference throughout the investigation, which is particularly helpful for IOCs that recur across multiple steps.
  • Focused indicator reporting. With the prompt adjustment's relevance-filtering instruction, the model stopped dumping entire registries into its responses. Indicator lists became proportional to the current step's scope rather than the investigation's total history.

Discussion

Why Extraction Without the IOC Prompt Adjustment Fails

A natural question is why input-side extraction alone does not work. If the prompt already contains a symbolic reference instead of a raw email address, why does the model still generate raw values in its output?

The answer lies in how LLMs process context. The model has access to the full prompt, including sections where the actual IOC value may still appear: task-specific inputs, quoted alert descriptions, or the registry itself. More fundamentally, the model's training distribution contains overwhelmingly more examples of raw IOC values than of symbolic reference systems. Without explicit instruction, the model defaults to the more familiar pattern.

This finding has a broader implication for LLM-based agent design: transforming the input is necessary but not sufficient when you need the model to adopt a non-default output convention. Explicit behavioral instruction, the IOC prompt adjustment, bridges the gap.

Limitations

Our evaluation has several limitations worth noting. All traces were drawn from a single investigation type (phishing via specific security tools). While the IOC types encountered are representative of broader security operations, other investigation types remain to be evaluated.

The evaluation was conducted with a single model (GPT-4.1). Different models may exhibit different compliance characteristics, and the prompt mechanism may need tuning for models with different instruction-following tendencies.

Finally, our compliance metric is binary - a response either uses references correctly or it does not. A more granular metric could capture partial compliance and might reveal subtler performance trends across model versions or investigation complexities.

Conclusion

We presented an IOC indexing system for AI-driven security investigations that addresses three interrelated problems: token bloat from repeated raw indicator values, inconsistent IOC tracking across investigation steps, and structural fragility in model-generated structured outputs.

The system combines automated IOC extraction with symbolic reference assignment, explicit behavioral guidance through prompt engineering, and input preprocessing to handle malformed tool output. Across 100 evaluation runs on 10 production investigation traces, the full system achieved 100% JSON validity and 100% IOC reference compliance, up from approximately 80% and 0% respectively at baseline.

The key insight is that managing IOCs in the context of LLM-based agents requires intervention at both the input and output stages. Extraction and indexing normalize the input, but only explicit prompt-level guidance ensures the model adopts the reference convention in its generated output. Neither component alone is sufficient; together, they eliminate the problem entirely.

As SOC automation platforms handle increasingly complex, multi-step investigations, structured approaches to managing the information that flows through the agent's context window become essential. IOC indexing is one instance of a more general pattern: giving the agent a well-organized working memory that scales with investigation complexity rather than against it.


Legion Security is Now Available on Google Cloud Marketplace

Security operations were built around human investigators. Skilled analysts, working manually across dozens of tools, piecing together evidence, making judgment calls, closing cases. But as alert volumes outpaced human capacity, institutional knowledge became a bottleneck, and the complexity of the modern enterprise made scaling impossible. The industry responded with more headcount, more tools, more automation. None of it solved the fundamental problem.

Legion introduces a different operating model entirely. 

What Legion Does

Legion observes how your analysts operate when running real investigations, learning your organizational context, tools, past cases, playbooks, runbooks, and all other tribal knowledge in order to understand what an optimal investigation looks like for your environment. This is then turned into an easily editable and auditable workflow which can be automated when you’re ready. Powered by Google Cloud's Gemini models, each workflow is executed by AI agents that reason through the evidence, provide a verdict, and even remediate. This is all accomplished with no manual playbook writing or need to document predefined rules.

But Legion goes well beyond workflow creation. As Legion builds trust in its performance, teams can choose to keep a human in the loop to approve every decision or have Legion operate fully autonomously, reducing MTTR and eliminating MTTA. This allows analysts to focus on the novel investigations that are becoming more and more common in the world of AI.

Memory: The Compounding Advantage

Every investigation Legion conducts makes it smarter. A persistent memory layer continuously captures context from previous cases, your SOC knowledge base, and direct analyst feedback, feeding all of it back into future investigations and decisions. Institutional knowledge that once lived in the heads of your most experienced analysts becomes a permanent, improving organizational asset. The more Legion works, the better it gets. That's not a feature. That's a compounding strategic advantage.

Zero Integrations. Immediate Value.

Most security automation platforms fail at the same hurdle: integrations. Enterprises face months of API work, custom connectors, and professional services before anything runs in production, or are forced to adopt entirely new tools and processes, something most complex enterprises simply can't do.

Legion operates natively in the browser, which means it works across your entire security stack, from threat intel platforms to legacy internal tools, without any API configuration. If your analysts can open it in a browser, Legion can learn from it, generate workflows from it, and execute investigations through it.

Proven Results at Scale

The impact Legion delivers isn't theoretical:

As the head of Security at Virgin Money put it, Legion is “like evolving from handcrafted systems to precision manufacturing aligned to our flow (except) faster, repeatable and secure”.

Legion works with the world's largest enterprises and delivers strong results:

  • A large insurance organization automated 24,000 investigations and cut mean time to respond from 20 minutes to 2 minutes.
  • WELL Health Technologies reduced investigation times by 81%, allowing existing analysts to handle significantly higher alert volumes without additional headcount.
  • The University of Tulsa cut investigation times in half, enabling their team to overcome capacity limits with the staff they already had.

Across deployments, Legion reduces mean time to investigate by up to 85% and response times by up to 90%.

Built on Google Cloud

Legion's integration with Google Cloud goes deeper than the Marketplace listing. The platform runs on Google Cloud infrastructure and leverages Gemini models to power its AI reasoning, combining Legion's browser-native architecture with Google Cloud's security, scale, and model quality.

For organizations already invested in Google Cloud and Google SecOps, Legion extends that ecosystem directly into the analyst workflow.

Who It's For

Legion is purpose-built for enterprise security operations teams: CISOs, VPs of Information Security, SOC Directors, and Security Operations Managers at organizations running in-house SOCs. If your team is dealing with any of the following, Legion was built for you:

  • Alert volumes that have outpaced your team's capacity
  • Analyst burnout from manual, repetitive investigation work
  • Institutional knowledge that walks out the door when senior analysts do
  • Automation gaps caused by complex integration requirements

Available Now on Google Cloud Marketplace

Legion Security is available today on Google Cloud Marketplace, allowing customers to apply their spend toward their annual Google contract and simplify procurement. For security teams ready to move beyond the limits of traditional operations, this is where that transformation begins.

Engineering
Legion Security Is Now Available on Google Cloud Marketplace
May 31, 2026
12 min read

Legion is officially on the Google Cloud Marketplace.

Gili Diamant

Abstract & Data Summary

We gathered and manually annotated a dataset of 196 hard triage decisions from real-world security investigations, covering a wide range of outcomes, including benign, malicious, and false positives. After cleaning the dataset by removing mock runs and cases with missing information or incorrect workflow execution, the remaining 163 examples were grouped into use case categories to form a high-quality cohort. We then evaluated LLMs on the dataset overall and per use-case category and found that Gemini 3 Pro performs best overall, though the best LLM varies by use case category. 

Model performance by use case category:

| Use case category | Best model(s) |
|---|---|
| Phishing | Gemini 3 Pro |
| Account Takeover | Sonnet 4 / GPT-4.1 |
| Network | Opus 4.5 / GPT-4.1 |
| Overall | Gemini 3 Pro |

If you’d like to understand our full research methodology, read on.

*Note: since this blog was authored, several new model families have been released. While the results have remained broadly stable, particularly among the best and worst performers, updated research may be required for a nuanced understanding of the performance differences amongst the rest.

Data Collection

The dataset was constructed from security investigations from eight US-based customers. The evaluation was conducted in a secure, federated way: customer data was never mixed, and only summary statistics were reported from each customer tenant.

To create a challenging evaluation, we over-weighted cases in which the analyst disagreed with the model, so the error rate reported here is inflated.

The investigations were conducted automatically according to predefined, customer-specific workflows, each of which contained at least one triage decision node. A triage decision node is a decision point within a workflow, where an LLM chooses a decision from among a list of provided decision options, given the information that was gathered in the workflow up until that point.  

At each decision node, the LLM used in production selected a classification decision from a list of workflow-specific decision options and provided the reasoning for its decision, based on a summary of the steps completed until that point in the investigation.

For each investigation containing at least one decision node, we collected the following information from production session logs:

  • A summary of the workflow steps up until the decision node, including tool name, step description, and step outputs
  • Organization-specific knowledge, written by the customer and containing a title, description, and data
  • The set of available decision options at the decision node
  • The model's selected decision in production, as well as the reasoning and detailed reasoning for the decision
  • The decision option selected by the customer
  • Feedback text written by the customer for the decision

Here is an example workflow diagram:

Quality Control

An expert cybersecurity analyst annotated the 196 decision examples with reasoning tags to explain the production and customer decisions, and labeled whether each disagreement was explained by analyst error, mistaken reasoning by the AI, or missing data or steps in the workflow.

| Term | Definition |
|---|---|
| Good Reasoning | The LLM reasoned correctly about the decision options given the input data |
| Bad Reasoning | The LLM reasoned incorrectly about the decision options given the input data |
| Workflow Ran Correctly | The workflow had complete inputs and outputs, and did not get interrupted |
| Workflow Ran Incorrectly | The workflow had empty or partial data, or was interrupted |
| Customer was Aligned | Customer agreed with the expert analyst decision |
| Customer was Mistaken | Customer disagreed with the expert analyst decision |
| Missing Information | The LLM made the correct decision given the available information, but information relevant to making the decision was missing from its input |

Examples tagged with "Workflow ran correctly but missing information" or "Workflow ran incorrectly" were removed from the dataset. Two additional examples with the use case titled "Workshop" were removed, as these were mock runs. For the remaining examples, the workflow ran correctly and was not missing information.

Triage Decision Distribution

By Label

Across the filtered dataset, the workflows contained 27 distinct normalized decision labels, which we grouped into the following buckets: False Positive, True Positive, Requires Review, and Other. The distribution of the labels is shown below: 

[Figure: Distribution of tags in triage dataset. False Positive: 91, Requires Review: 32, True Positive: 27, Other: 13.]

The final evaluation dataset contains data from eight customers. The table below shows the number of annotated decision examples per customer and the tools used in each environment.

| Customer | Tools used in each environment | # of decisions made |
|---|---|---|
| SOC Environment #1 | Defender, CrowdStrike, Splunk, VirusTotal, AbuseIPDB, URLScan | 48 |
| SOC Environment #2 | IP Quality Score, Zscaler, Confluence, ServiceNow, Splunk, Proofpoint TAP, Microsoft Entra, Cortex XSOAR, VMRay, Wiz, VirusTotal, URLScan | 32 |
| SOC Environment #3 | Defender, Google SecOps, Silent Push, Microsoft Entra, IPLocation, Axonius, IPinfo, Nodedata | 29 |
| SOC Environment #4 | Confluence, ServiceNow, Excel Online, Cortex XSOAR, VirusTotal, AbuseIPDB, URLScan | 27 |
| SOC Environment #5 | Defender, Zscaler, Cortex XSOAR, Microsoft Entra, Abnormal Security, IPLocation, VirusTotal, AbuseIPDB, Shodan, IPinfo | 16 |
| SOC Environment #6 | Defender, CiscoTalos, MxToolBox, AbuseIPDB, TeamDynamix, URLScan | 5 |
| SOC Environment #7 | Splunk, Microsoft Entra, Mimecast, VirusTotal, AbuseIPDB, Spur | 3 |
| SOC Environment #8 | Joe Sandbox, Proofpoint TAP, VirusTotal | 3 |
| Total | | 163 |


Use Case Distribution

We grouped the use cases into three categories to simplify our findings. Below is the map from the consolidated categories to the original use cases, as well as the distribution of the dataset over the consolidated categories.

| Use cases | Counts |
|---|---|
| Phishing | 96 |
| Account Takeover | 38 |
| Network/Infrastructure | 26 |

Confusion Matrix

Below is a confusion matrix between the expert analyst annotations and the recommendations our system makes. We prompt the models to be careful and escalate when they are not sure. 

Confusion matrix: Selection (Actual) vs Recommendation (Predicted)

| Actual \ Predicted | FP | TP | Requires Review | Other |
|---|---|---|---|---|
| False Positive | 60 (65.9%) | 16 (17.6%) | 15 (16.5%) | 0 (0.0%) |
| True Positive | 2 (6.9%) | 27 (93.1%) | 0 (0.0%) | 0 (0.0%) |
| Requires Review | 1 (3.1%) | 4 (12.5%) | 27 (84.4%) | 0 (0.0%) |
| Other | 1 (7.7%) | 0 (0.0%) | 0 (0.0%) | 12 (92.3%) |
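Reading each cell of the matrix as a count followed by a percentage, the percentages appear to be normalized by the row (actual label) total. The short sketch below recomputes them from the counts under that assumption; the variable and function names are illustrative only.

```python
# Counts per actual label, in predicted order: FP, TP, Requires Review, Other.
COUNTS = {
    "False Positive": [60, 16, 15, 0],
    "True Positive": [2, 27, 0, 0],
    "Requires Review": [1, 4, 27, 0],
    "Other": [1, 0, 0, 12],
}


def row_percentages(counts):
    """Express each cell as a percentage of its row (actual label) total."""
    return {
        actual: [round(100 * cell / sum(row), 1) for cell in row]
        for actual, row in counts.items()
    }


for actual, percentages in row_percentages(COUNTS).items():
    print(actual, percentages)
# First line printed: False Positive [65.9, 17.6, 16.5, 0.0]
```

Under this reading, each diagonal cell is the share of that actual label the system recommended correctly, for example 27 of 29 true positives (93.1%).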

Results

Over all use cases (including those without a use case name), Gemini 3 Pro had the highest performance at 74.8%, with GPT-4.1 and Opus 4.5 tied for second.

[Figure: Triage performance by model.]

Phishing Results:

On the phishing use cases, Gemini 3 Pro performed the best, followed by Opus 4.5. 

[Figure: Triage performance by model.]

Account Takeover Results:

Sonnet 4 and GPT-4.1 were tied for best on the account takeover use cases. 

[Figure: Triage performance by model.]

Network Results:

Opus 4.5 and GPT-4.1 were tied for best on the network use cases.

[Figure: Triage performance by model.]

Conclusion & Recommendation

We gathered and annotated 163 triage decisions from real-world security investigations. We characterized the use case distribution, and grouped the use cases according to common categories. We then benchmarked large language models across each use case category and the full dataset. We found that Gemini 3 Pro performs best overall. Per use case category, Gemini 3 Pro gives the best performance on phishing, Sonnet 4 and GPT-4.1 are tied for best on account takeover, and Opus 4.5 and GPT-4.1 are tied for best on network. Based on our results, we recommend that security teams test models for different scenarios to find the solution that works best for their use case. Different models are good at different things, and the only way to know which model works best for your use cases is to run a formal evaluation - or, you can trust us! Our research team at Legion is constantly evaluating new models and improvements to our triage pipelines.

AI
Benchmarking Large Language Models for Automated Security Triage
May 25, 2026
12 min read

We benchmarked leading LLMs on 163 real-world security triage decisions across phishing, account takeover, and network use cases. See which models performed best and why the answer depends on your use case.

Alyssa Mensch

The security industry spent years debating when attackers would gain capabilities once out of reach — nation-state-level offensive tooling, zero-day discovery at scale, exploits built and iterated in minutes.

That gap was real. And it gave organizations the impression that the decision about which AI to bring into security operations, and how to do it right, could wait until the picture was clearer.

Mythos ended that assumption. 

Not because of the model's size or strength, but because by the time Anthropic announced it, Mythos had already found thousands of high-severity vulnerabilities across every major operating system and browser in use today, without being told where to look. The decision not to release is the signal everyone was looking for.

That changes the implementation question. It was never acceptable to deploy AI badly in the SOC. Now it's not acceptable to deploy it slowly either. The organizations that will come out on top in the next 12 months are the ones that move fast and get it right, and most of the industry is about to discover that those aren't the same thing.

Level set: defenders have always been behind

The average breach lifecycle was already 258 days before AI-assisted attacks became the norm. This has nothing to do with the capabilities of analysts. Human-speed defense against machine-speed offense was always a losing equation.

Mythos-class models will almost certainly expand this breach lifecycle delta. 

Most Implementations Are Getting It Wrong

87% of organizations experienced an AI-driven cyberattack in the past year. Security teams know they need AI. Most are already moving. But most implementations are failing for the same reason, and it is not the technology. It is a missing critical datapoint.

You. The context that shapes your business.

Most AI SOC tools treat every organization as interchangeable. They integrate with your SIEM, your EDR, your threat intel platforms, and assume that is enough. It is not. What determines whether AI actually works in your environment has nothing to do with the list of integrations. It is the organizational context that no integration can capture.

How is your organization structured? Where does data actually live versus where it is supposed to live? Who owns what, and how does that map to investigation and response when something goes wrong? How do escalation paths work in practice, not on paper? And critically, how do you enable the business without interrupting it?

The difference shows up clearly in practice. A heavily regulated enterprise running investigations across proprietary internal platforms looks nothing like a technology company. The organizational context that shapes every investigation, every escalation decision, and every response action is invisible to a system that only sees tool outputs. 

Closing that gap is the foundational requirement that most implementations skip entirely.

Org Context Is Not a One-Time Setup

This is where most implementations fail, even when they start well.

Organizational context is not a configuration you complete on day one. Your organization is a living thing. Teams change. Tools get added. Processes evolve. New subsidiaries appear. Risk posture shifts with every acquisition, every regulatory update, every new product line the business launches. 

An AI system that ingested your context six months ago and stopped learning is already drifting from your reality. It is making decisions based on an organization that no longer exists.

The right model is not a one-off ingestion. It is a continuous learning system that stays embedded in how your organization actually operates, tracks how investigations unfold, incorporates analyst feedback, and updates its understanding as your environment changes. 

Not a snapshot. 

A persistent model of your specific organization that evolves with it.

What Good Implementation Actually Looks Like

First, AI systems need to understand how your organization actually operates. Not how it is documented, but how investigations really unfold, where data actually lives, and how decisions get made under pressure. The gap between what is written down and what actually happens is where most AI systems fail.

Second, that understanding cannot be static. Organizations change constantly. New teams, new tools, new processes, new risk priorities. Any system that relies on a snapshot of your environment will drift from reality and degrade over time. The AI working in your environment needs to keep learning it, not just learn it once.

Third, it needs to operate within that context, not around it. Producing technically correct outputs is not enough. The system needs to produce outcomes that are actionable within your organization as it exists today. That means working within your existing workflows, tools, and constraints without asking you to change how you operate to accommodate it.

That is the standard. Systems built around this model behave differently from the start. They do not ask organizations to adapt to them. They adapt to the organization. That distinction is where most implementations succeed or fail, and it is where the industry is slowly converging.

The Only Durable Path

The organizations getting AI right in the SOC aren't the ones with the longest integration lists or the biggest models. They're the ones that treated organizational context as the foundation rather than the afterthought, and built systems that keep learning their environment rather than freezing it in place on day one.

That is a harder implementation. It requires more from the vendor and more from the buyer. But Mythos made the timeline for getting there non-negotiable. The organizations that move fast on the wrong implementation will spend the next year rebuilding. The ones that move slowly on the right one will spend it exposed. The only durable path is moving quickly on the version that actually holds up. Systems built on continuous organizational context, deployed now rather than after the next incident, force the question.

The gap that used to buy time for deliberation is gone. What's left is the quality of the decision you make in its absence.

AI
Fast and Right: Implementing AI in the SOC After Mythos
April 23, 2026
12 min read

Mythos ended the debate on whether AI belongs in the SOC. The new question is how to deploy it right and why organizational context is the foundation most implementations skip.

Ron Marsiano