Picking an AI Model for Cloud Networking: What the Vendors Don’t Tell You

A guide built from 30 real investigations (10 use cases × 3 models), with what to look for if you are starting from scratch.


Nobody tells you that one model will recommend iptables -D INPUT to remove a fail2ban ban, and that fail2ban will then silently lose track of the actual firewall state, still believing the IP is banned while the protection is gone, because the model didn’t know to work through the daemon. Nobody tells you that another model will correctly form a traffic shaping hypothesis and then use a firewall inspection tool to verify it, get a null result, and conclude there’s no traffic shaping, because iptables and tc are different kernel subsystems and the model conflated them.

You find these things out by running models on your actual problems and watching what happens on your infrastructure.

That’s how this guide was written. I ran 30 network fault investigations across 3 models—Gemini 2.5 Flash, Claude Sonnet 4.6, and Claude Haiku 4.5—using the same system prompt, the same Azure test infrastructure, and 10 fault scenarios ranging from simple NSG denies to multi-fault environments with two independent failures running simultaneously. This article shares what I found and, more usefully, how to think about model selection for cloud and infrastructure work when you’re doing the same thing for your own domain.


The Gap No Vendor Fills

If you go looking for guidance on which AI model to use for cloud network forensics, you won’t find it. What you will find: context window sizes, benchmark scores on standardized tests, pricing tiers, and speed comparisons.

These are all real and worth knowing. They just don’t tell you how a model will behave when it has to make a judgment call in a specific domain.

Does it know that az network vnet subnet update --service-endpoints is a full list replacement, not an append, and that omitting an existing endpoint silently deletes it? Does it know to use the effective route table at the NIC rather than a sender-side packet capture when evaluating a routing blackhole hypothesis? Does it know to work through fail2ban when fail2ban owns the iptables rule it’s been asked to remove?

These behavioral properties aren’t documented anywhere, neither on the product page nor in the benchmarks. This isn’t a failure on the part of the vendors either. General-purpose LLMs are trained on general-purpose data. The behavioral contract that matters for your domain, with your tools and your failure patterns, only becomes visible when you deploy the model on that domain and observe what it does.

  What vendors publish:              What determines safety in your domain:
  ─────────────────────              ──────────────────────────────────────
  ✓  Context window size             ?  How the model handles adjacent config
  ✓  Benchmark scores                ?  Whether it picks the right tool per hypothesis
  ✓  API format + model ID           ?  Whether it reads reference context in the prompt
  ✓  Pricing tiers                   ?  Whether structured output matches the narrative
  ✓  Deprecation timeline            ?  What a fix does to the surrounding system

  The left column is in the documentation.
  The right column only surfaces once you deploy.

That’s the uncomfortable reality: the only way to understand a model’s behavioral contract for your specific work is to run it on your specific work.


What Criteria Actually Matter

Before getting to what I found, it’s worth naming the criteria that turned out to matter most. They’re the dimensions where the models actually diverged.

Remediation depth. Does the model understand what the fix does to things adjacent to what it’s fixing? A correct diagnosis paired with an incomplete fix is arguably more dangerous than a wrong diagnosis, because the operator has more reason to trust it. The gap between “the model found the right fault” and “the model gave a safe fix” was the single biggest surprise across all 30 runs.

Tool-to-hypothesis accuracy. When the model forms a hypothesis about a specific layer (OS firewall vs. traffic shaping vs. routing), does it reach for the right inspection tool to verify it? If your domain has multiple subsystems that look superficially similar but require different tools, this is the criterion that will trip models up in ways that are hard to catch without domain knowledge.

Multi-symptom investigation behavior. When you give the model two or more symptoms, does it investigate both? What I found: every model found the primary fault and stopped—even when the secondary symptom was explicitly stated in the prompt. This isn’t a model-specific failure; it’s a general investigation closure pattern that requires a deliberate system prompt instruction to override.

Forensic context usage. If you provide pre-incident baselines, session snapshots, or other reference points in the prompt, does the model use them? Or does it investigate manually from scratch? This matters both for speed and for the quality of the evidence trail it can produce.

Structural consistency. Do the structured output fields (hypothesis table, root cause, recommended actions) agree with the narrative body of the same report? This sounds like a formatting concern, but it’s actually a reliability signal — if the hypothesis table says one thing and the narrative says another, downstream processing built on structured fields will quietly work from wrong data.

  Remediation depth
    The question it answers:  Does the fix account for what it touches?
    Why it matters:           Correct diagnosis + unsafe fix = more trusted, harder to catch

  Tool-to-hypothesis
    The question it answers:  Right inspection tool for each hypothesis type?
    Why it matters:           Wrong tool produces null result → correct hypothesis falsely refuted

  Multi-symptom coverage
    The question it answers:  Does it investigate all stated symptoms?
    Why it matters:           Investigation closure on primary fault is universal; secondary requires explicit instruction

  Forensic context usage
    The question it answers:  Does it use baselines you provided in the prompt?
    Why it matters:           Bypassing available evidence produces slower, weaker findings

  Structural consistency
    The question it answers:  Do structured fields and narrative body agree?
    Why it matters:           Field/narrative divergence silently corrupts downstream automated processing

What I Found Across the Three Models

Here are the behavioral profiles, based on what actually happened across 30 runs. These are observations, not verdicts — and they’re specific to these model versions at the time of testing.

Sonnet 4.6 — goes deepest on remediation

Sonnet was the most consistent at understanding what a fix does to the surrounding system. When the scenario required knowing that a CLI command treats its argument as a full list replacement, Sonnet included all existing configuration values in the fix and explicitly explained why omitting any of them would cause a secondary failure. When the fix required persistence across reboots, Sonnet added the persistence step. When a wildcard NSG rule was the target for removal, Sonnet flagged the SSH lockout risk before recommending the deletion.
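To make the list-replacement hazard concrete, here is a minimal Python sketch of how a fix could be assembled so that existing endpoints survive the update. The helper names, the resource names, and the endpoint values are all hypothetical; only the az network vnet subnet update --service-endpoints command comes from the scenario itself. This illustrates the idea and is not the fix any model actually produced.

```python
from typing import List


def merged_endpoints(existing: List[str], to_add: List[str]) -> List[str]:
    """Return existing endpoints plus new ones, order-preserving, no duplicates."""
    merged = list(existing)
    for endpoint in to_add:
        if endpoint not in merged:
            merged.append(endpoint)
    return merged


def build_subnet_update(resource_group: str, vnet: str, subnet: str,
                        endpoints: List[str]) -> List[str]:
    """Build the argv for the update; the list passed REPLACES what is configured."""
    return [
        "az", "network", "vnet", "subnet", "update",
        "--resource-group", resource_group,
        "--vnet-name", vnet,
        "--name", subnet,
        "--service-endpoints", *endpoints,
    ]


# Hypothetical state captured before the fix: one endpoint already configured.
existing = ["Microsoft.Storage"]
wanted = ["Microsoft.Sql"]

full_list = merged_endpoints(existing, wanted)
command = build_subnet_update("rg-demo", "vnet-demo", "subnet-demo", full_list)
print(" ".join(command))
```

Passing only the wanted list would have produced a command that looks correct and quietly drops the storage endpoint, which is why the merge step has to read the current state first.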

It also proved negatives explicitly — not just confirming the fault it found, but documenting why the competing hypotheses were ruled out. For teams that need the investigation to hold up in a post-incident review or drive a process change, this depth matters.

Where Sonnet struggled: it made a category error when verifying a traffic shaping hypothesis, using the firewall inspection tool instead of tc qdisc show. When no iptables rules were found, it concluded there was no traffic shaping—conflating two different kernel subsystems. And like the other models, it stopped investigating after finding the primary fault, even when a secondary symptom was explicitly in the prompt.
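One way to guard against this kind of category error is to make the layer-to-tool pairing explicit. The Python sketch below is an assumption-laden illustration: the layer names, the function, and the placeholder interface and resource names are mine, and iptables -S and the effective route table command are my choice of representative inspection commands. Only tc qdisc show is named in the finding above.

```python
from enum import Enum


class Layer(Enum):
    OS_FIREWALL = "os_firewall"
    TRAFFIC_SHAPING = "traffic_shaping"
    ROUTING = "routing"


# One inspection command per layer. Interface and resource names are placeholders.
INSPECTION_COMMAND = {
    Layer.OS_FIREWALL: "iptables -S",
    Layer.TRAFFIC_SHAPING: "tc qdisc show dev eth0",
    Layer.ROUTING: "az network nic show-effective-route-table -g <rg> -n <nic>",
}


def disposition(hypothesis: Layer, tool_used: Layer, found_evidence: bool) -> str:
    """Decide what a tool result is allowed to say about a hypothesis."""
    if tool_used is not hypothesis:
        # A result from another subsystem says nothing about this one.
        return "UNVERIFIABLE"
    return "CONFIRMED" if found_evidence else "REFUTED"


# Empty iptables output cannot refute a traffic shaping hypothesis.
print(disposition(Layer.TRAFFIC_SHAPING, Layer.OS_FIREWALL, found_evidence=False))
print(disposition(Layer.TRAFFIC_SHAPING, Layer.TRAFFIC_SHAPING, found_evidence=False))
```

The point of the design is that a null result only counts as a refutation when it comes from the tool that can actually see the layer in question.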

Good fit for: Scenarios where remediation completeness matters more than speed. Compound fix commands (anything that writes a configuration list). Post-incident documentation. Scenarios where you need the agent to reason about what the fix does to things it didn’t set out to change.

Gemini 2.5 Flash — fastest when it picks the right instrument first

Gemini’s great strength is speed when everything lines up: right fault layer, right tool, named baseline in the prompt. It used pre-incident baselines immediately when they were referenced — no coaxing — and on simple NSG and baseline-comparison scenarios, it reached correct answers in under 30 seconds with a single effective tool call.

It’s also the only model that expressed genuine uncertainty when it couldn’t find a satisfying answer. In the one scenario where a model reached a wrong diagnosis (a routing blackhole misread as an application layer failure), Gemini eventually lowered its confidence to low in the final report. Every other model on every other scenario reported confidence: high, right or wrong.

The flip side of Gemini’s directness: when it picks the wrong instrument early, it goes deep in the wrong direction. On the routing blackhole case, it ran a sender-side packet capture to verify a routing hypothesis—which is the wrong tool for that question, because a sender-side PCAP shows the VM transmitting packets normally with no anomaly, actively obscuring the fault. The Azure fabric drops packets silently after they leave the NIC; the sender sees nothing wrong. It ran 18 commands over 230 seconds and reached a wrong answer. The correct path was one call to the effective route table.

It also has knowledge gaps in domain-specific operational contexts — specifically, anything where a management daemon owns the configuration being modified. It knew fail2ban was responsible for the iptables block. It didn’t connect that to the requirement to work through fail2ban rather than manipulating the iptables rule directly.
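A review step can catch this pattern before a fix is executed. The following Python sketch is a deliberately naive illustration: the function names, the jail name sshd, and the documentation-range IP address are assumptions of mine, and the check only recognizes the most obvious form of a direct rule deletion. The fail2ban-client set <jail> unbanip <ip> form is the daemon's own unban command.

```python
import ipaddress
from typing import List


def unban_via_daemon(jail: str, ip: str) -> List[str]:
    """Build the unban argv that goes through fail2ban instead of iptables."""
    ipaddress.ip_address(ip)  # raises ValueError on a malformed address
    return ["fail2ban-client", "set", jail, "unbanip", ip]


def bypasses_daemon(argv: List[str]) -> bool:
    """Flag a proposed fix that deletes an iptables rule directly."""
    return bool(argv) and argv[0] == "iptables" and "-D" in argv


# Hypothetical fix proposed by a model for a rule that fail2ban owns.
proposed = ["iptables", "-D", "INPUT", "-s", "203.0.113.7", "-j", "REJECT"]
if bypasses_daemon(proposed):
    proposed = unban_via_daemon("sshd", "203.0.113.7")
print(" ".join(proposed))
```

In practice the rewrite would need to know which jail issued the ban, which is exactly the kind of system state a reference snapshot should record.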

Good fit for: Fast investigation on predictable fault types. Scenarios with named baselines. Time-sensitive investigations where the fault layer is known upfront.

Worth being cautious about: Any scenario involving daemon-managed configuration. Routing layer faults (it tends toward PCAP before the effective route table). Any fix command that takes a list as a full replacement.

Haiku 4.5 — honest, capable on simple cases, inconsistent under complexity

Haiku was correct on every single-fault, clear-symptom scenario—and on one scenario (the routing blackhole), it was actually the fastest model by a significant margin, beating both Gemini and Sonnet. It also produces the most honest hypothesis disposition when evidence genuinely isn’t there: rather than false-refuting a hypothesis it couldn’t verify, it marked it UNVERIFIABLE, which is more accurate.

The challenge with Haiku is that its reliability drops as scenario complexity increases. In two consecutive use cases that explicitly named a pre-incident baseline session ID in the prompt, Haiku ignored the baseline and investigated manually—taking 3–10 times as long and producing weaker evidence. This wasn’t a capability gap (it used the baseline tool correctly on other scenarios); it was a context-reading gap. The prompt said the baseline was available. Haiku didn’t act on it.

There’s also a recurring structural inconsistency: the hypothesis log and the narrative root cause section disagree in a pattern that shows up across multiple scenarios. The table marks a hypothesis as confirmed while the narrative describes a different mechanism. For anyone building automated processing on top of Haiku’s structured output, this is a reliability problem that’s hard to detect without reading the full report.
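A cheap cross-check between the structured fields and the narrative can surface this before downstream automation consumes the report. The Python sketch below assumes a hypothetical report shape (a hypotheses list with name, status, and keywords, plus a narrative_root_cause string); none of these field names come from the evaluation, and keyword matching is only a rough proxy for real agreement.

```python
from typing import Dict, List


def inconsistent_hypotheses(report: Dict) -> List[str]:
    """Return confirmed hypotheses whose keywords never appear in the narrative."""
    narrative = report["narrative_root_cause"].lower()
    flagged = []
    for hyp in report["hypotheses"]:
        if hyp["status"] != "CONFIRMED":
            continue
        if not any(kw.lower() in narrative for kw in hyp["keywords"]):
            flagged.append(hyp["name"])
    return flagged


# Made-up report where the table and the narrative disagree.
report = {
    "hypotheses": [
        {"name": "NSG deny rule", "status": "CONFIRMED", "keywords": ["nsg", "deny"]},
        {"name": "Route blackhole", "status": "REFUTED", "keywords": ["route", "udr"]},
    ],
    "narrative_root_cause": "Traffic is dropped by a user-defined route with no next hop.",
}
print(inconsistent_hypotheses(report))
```

A flagged hypothesis is not proof of an error, only a signal that a human should read the full report before trusting the structured fields.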

Good fit for: Single-fault NSG and routing scenarios where speed matters. Cost-sensitive workloads with simple, well-defined fault types. Scenarios where you want honest uncertainty signals.

Worth being cautious about: Anything that requires the model to act on reference context in the prompt. Compound fix command semantics. Downstream systems that parse structured output fields without cross-referencing the narrative.


How the three models compare across the five criteria

  Remediation depth
    Sonnet 4.6:        ✓ deepest: accounts for adjacent config, daemon ownership, persistence
    Gemini 2.5 Flash:  ⚠ gaps on daemon-managed config and list-replace semantics
    Haiku 4.5:         ⚠ partial: misses atomicity on compound cleanup

  Tool-to-hypothesis
    Sonnet 4.6:        ⚠ one category error: used firewall inspector to verify a tc qdisc hypothesis
    Gemini 2.5 Flash:  ⚠ used sender-side PCAP to evaluate a routing blackhole
    Haiku 4.5:         ✗ irrelevant tool calls in G (AWS probe in an Azure investigation, stale snapshot reads);
                       dismissed explicit latency symptom as measurement artifact without investigating (Q)

  Forensic context usage
    Sonnet 4.6:        ✓ uses named baselines consistently
    Gemini 2.5 Flash:  ✓ uses named baselines immediately
    Haiku 4.5:         ✗ bypassed baseline in two consecutive cases where it was explicitly provided

  Structural consistency
    Sonnet 4.6:        ⚠ some fix commands use placeholder names requiring secondary lookup
    Gemini 2.5 Flash:  ✓ clean
    Haiku 4.5:         ✗ hypothesis table contradicts narrative body across multiple scenarios

  Multi-symptom coverage
    Sonnet 4.6:        ✗ stopped after the primary fault
    Gemini 2.5 Flash:  ✗ stopped after the primary fault
    Haiku 4.5:         ✗ stopped after the primary fault

✓ reliable · ⚠ situational · ✗ unreliable across test cases

Multi-symptom coverage is ✗ for all three. This is not a model selection criterion — it is a system prompt requirement. The model you choose will not fix it.


The One Finding That Applies to All Three

No model reliably investigated secondary faults. Across every two-fault scenario—three of them—every model found the primary fault and stopped. In one case, the secondary symptom was explicitly in the prompt. All three models measured the relevant evidence. None of them identified the root cause of the secondary fault.

  Prompt:  "Primary symptom—[fault description]"
           "Also seeing: [secondary symptom]"        ← explicitly stated

           │                          │
           ▼                          ▼
   Investigated fully          Measured (evidence collected)
   Primary fault found         Then: wrong attribution,
   Report closes               wrong tool, or dismissed as noise
   confidence: high            ─────────────────────────────
                                       │
                                       ▼
                                Secondary fault missed
                                0/3 models · 3/3 scenarios

This isn’t something you can solve by picking a different model. It’s a system prompt problem. If your domain regularly involves concurrent failures, you need an explicit instruction that treats each stated symptom as a separate investigation branch. Without it, the model will stop when it finds a satisfying answer to the primary complaint, regardless of what else you told it.
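As a rough illustration of what such an instruction could look like, the Python sketch below renders one investigation branch per stated symptom and then checks whether a report mentions every branch. The wording, the function names, and the example symptoms are hypothetical; this is not the instruction used or validated in this evaluation.

```python
from typing import List


def symptom_branch_instruction(symptoms: List[str]) -> str:
    """Render a system prompt section with one investigation branch per symptom."""
    lines = [
        "Treat each symptom below as a separate investigation branch.",
        "Do not close the report until every branch has its own root cause",
        "or is explicitly marked UNVERIFIABLE with the evidence you checked.",
        "",
    ]
    for index, symptom in enumerate(symptoms, start=1):
        lines.append(f"Branch {index}: {symptom}")
    return "\n".join(lines)


def uncovered_branches(report_text: str, symptoms: List[str]) -> List[int]:
    """Return branch numbers the final report never mentions."""
    return [
        index for index, _ in enumerate(symptoms, start=1)
        if f"Branch {index}:" not in report_text
    ]


symptoms = ["connection timeouts to the database VM", "elevated latency on the web tier"]
print(symptom_branch_instruction(symptoms))
print(uncovered_branches("Branch 1: root cause is an NSG deny rule.", symptoms))
```

Pairing the instruction with a coverage check matters because the instruction alone gives no signal when the model ignores it.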


What Happens When the Model Changes

Here’s the part that’s easy to overlook until it’s too late.

Everything described above is specific to the model versions I tested: Gemini 2.5 Flash, Sonnet 4.6, Haiku 4.5, at the time these experiments ran. When a vendor deprecates a model and recommends a replacement, the migration guide they publish will cover API compatibility—request format, response format, parameter names, model ID changes. It won’t cover whether the replacement model still knows to include existing service endpoints in a list-replacement fix command. It won’t cover whether the replacement model still uses forensic baselines when they’re referenced in the prompt.

The behavioral contract you built through testing is valid for the model you tested it on. When that model changes, the contract changes with it. Sometimes the change makes things better. Sometimes it introduces new failure patterns. You won’t know which until you run your scenarios again.

This is why treating model selection as a one-time decision is risky. The more accurate mental model: you’re entering into an ongoing relationship with a model’s behavioral profile, and that profile can shift without notice.


Building Your Own Understanding

The practical question: if you’re deploying an AI agent for cloud networking or infrastructure work, how do you develop the same kind of behavioral picture for your specific domain?

  Community priors
       │  Start with what practitioners report — narrow to 2–3 candidates
       ▼
  Define failure cases
       │  Hard domain-specific scenarios first; easy cases are not the test
       ▼
  Capture system state snapshot
       │  Before each run — this is what makes remediation scoring possible
       ▼
  Run + score separately
       │  Diagnosis accuracy  ←──────────────────→  Remediation safety
       ▼
  Version-tag every result
       │  Model ID on every run — your behavioral changelog starts here
       ▼
  Re-run before cutover
       │  Re-run hard cases against replacement model before switching traffic
       ▼
  Monitor behavioral signal
         Regular cadence against live model — catch drift before production

Start with what the community already knows. Before running a single test, look at what practitioners in your space are reporting—deployment write-ups, posts from builders who’ve shipped similar systems, community forums. Experienced builders don’t arrive at their first eval with a blank slate. They arrive with hypotheses: this model tends to be stronger on structured retrieval, that one handles multi-step tool chaining better, the third is more conservative about expressing uncertainty. Domain testing validates and refines those priors. It doesn’t replace them. Using community knowledge as your starting point also narrows your candidate list before you invest in building test scenarios—you’re comparing two or three plausible options, not the entire frontier.

Start with the failure cases that matter most in your domain. Not the easy scenarios — the ones that require the model to know something operationally specific. For cloud networking, that’s compound fix commands, daemon-managed configuration, and routing layer diagnosis. For your domain, you probably already know what the operationally tricky scenarios are. Build your test set around those.

Score diagnosis and remediation separately. An evaluation that only asks “did the model find the root cause” will systematically miss the class of failures that matters most in production. In this evaluation, diagnosis was correct 90% of the time. Remediation was safe 73% of the time. A 90% headline would have hidden the gap that actually poses the risk.
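A minimal sketch of keeping the two scores apart might look like the following. The record shape and names are hypothetical, and the placeholder data uses 27 and 22 out of 30 only because those counts are consistent with the 90% and 73% figures; they are my inference, not the per-run records.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunScore:
    scenario: str
    model_id: str
    diagnosis_correct: bool
    remediation_safe: bool


def rates(scores: List[RunScore]) -> dict:
    """Compute the two headline rates independently, as rounded percentages."""
    total = len(scores)
    return {
        "diagnosis_pct": round(100 * sum(s.diagnosis_correct for s in scores) / total),
        "remediation_pct": round(100 * sum(s.remediation_safe for s in scores) / total),
    }


# Placeholder runs: counts chosen to match 90% and 73% of 30,
# not the actual per-run results from the evaluation.
scores = [
    RunScore(f"case-{i}", "model-x", diagnosis_correct=i < 27, remediation_safe=i < 22)
    for i in range(30)
]
print(rates(scores))
```

Storing both booleans on every run is what lets the gap show up at all; a single pass/fail column would collapse it.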

Use eval frameworks for the scaffolding; bring your own domain rubrics. Tools like Promptfoo and Braintrust handle the infrastructure work—prompt versioning, result storage, scoring pipelines, regression tracking across model versions. Building that from scratch is significant engineering you don’t need to do. What you bring to the table is the domain-specific scoring logic: the rubric that knows what “correct remediation” means for your specific infrastructure. That judgment can’t be automated without domain knowledge—it’s the one piece the frameworks can’t supply. Think of the split as: framework handles the harness, you handle the oracle.

Capture what correct remediation looks like before you run. The hardest part of scoring remediation safety is that you need to know what state the system was in at test time. What endpoints were already present? What daemon owned the rule being modified? What persistence requirement did the fix need to satisfy? Building a reference snapshot before each test run is the engineering work that makes remediation scoring possible.

Run before you migrate, not after. When a model is deprecated and you need to move to a replacement, run your test scenarios against the replacement before switching production traffic. The scenarios that diverge between old and new model behavior are the ones you need to know about before they affect real infrastructure.
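The comparison itself can be very small once results are stored per scenario. This Python sketch assumes a hypothetical mapping from scenario name to a (diagnosis correct, remediation safe) pair; the scenario names and outcomes are invented for illustration.

```python
from typing import Dict, List, Tuple

# scenario name -> (diagnosis_correct, remediation_safe)
Outcomes = Dict[str, Tuple[bool, bool]]


def diverging_scenarios(old: Outcomes, new: Outcomes) -> List[str]:
    """List scenarios where the replacement scores differently or was never run."""
    return sorted(name for name in old if new.get(name) != old[name])


old_model = {"list-replace-fix": (True, True), "daemon-owned-rule": (True, True)}
new_model = {"list-replace-fix": (True, False), "daemon-owned-rule": (True, True)}
print(diverging_scenarios(old_model, new_model))
```

Treating a missing scenario as a divergence is a deliberate choice here: an untested case should block the cutover just like a regressed one.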

Track model version as a discipline, not an afterthought. Log which model version produced which outputs in every eval run. When behavior shifts—after a deprecation, a silent backend update, or a version bump—you want to diff against the specific version that was working, not against a vague memory of “how it used to behave.” Over time, version-tagged eval results become a behavioral changelog for your domain: a record of what each model version could and couldn’t do reliably, built from your own test cases rather than vendor documentation. This is something experienced builders do as a matter of course, and it pays off when a model update quietly changes behavior you depended on. Without version tagging, you’re debugging in the dark.
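One lightweight way to enforce this is to refuse to write a result that has no model ID attached. The sketch below is an assumption about how such a record could look; the field names and the example model ID are hypothetical, and an eval framework would normally own this storage.

```python
import json
from datetime import datetime, timezone


def tagged_record(model_id: str, scenario: str, diagnosis_correct: bool,
                  remediation_safe: bool) -> str:
    """Serialize one eval result as a JSON line, always carrying the model ID."""
    if not model_id:
        raise ValueError("refusing to log a result without a model ID")
    record = {
        "model_id": model_id,
        "scenario": scenario,
        "diagnosis_correct": diagnosis_correct,
        "remediation_safe": remediation_safe,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record, sort_keys=True)


line = tagged_record("example-model-2025-01", "daemon-owned-rule", True, False)
print(line)
```

Appending lines like this to a file per eval run gives you something concrete to diff when behavior shifts.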

Monitor the behavioral signal, not just the API signal. Uptime monitoring and latency dashboards tell you whether the API is responding. They don’t tell you whether the model’s behavioral contract shifted in the latest silent backend update. A small set of representative scenarios run on a regular cadence against the live model is the only way to catch behavioral drift before it shows up as a production incident.


Quick Reference

  If you need:  Deepest remediation and side-effect awareness
    Consider:   Sonnet 4.6
    Watch for:  Fix command completeness sometimes requires a secondary lookup

  If you need:  Fastest time-to-answer on clear, single-fault scenarios
    Consider:   Gemini 2.5 Flash
    Watch for:  Wrong instrument on routing layer; daemon dependency gaps

  If you need:  Honest uncertainty signals and simple-fault speed
    Consider:   Haiku 4.5
    Watch for:  Hypothesis log / narrative inconsistency; bypasses baseline tools

  If you need:  Multi-fault investigation
    Consider:   Any model + explicit system prompt instruction
    Watch for:  All three stopped at Fault 1 without an explicit mandate

  If you need:  Safe-to-execute remediation
    Consider:   Any model + human review
    Watch for:  confidence: high appeared on every unsafe fix in this evaluation

A Note on What This Isn’t

This is not a ranking. The models aren’t competing on a single axis. Each has a behavioral profile that fits some scenarios better than others, and those profiles are specific to the versions tested and the domain tested in.

What I hope is useful here isn’t the specific profiles, which will shift as models evolve, but the criteria and the approach: the criteria for evaluating a model’s behavioral contract in your domain, and the approach of testing diagnosis and remediation separately, capturing system state at test time, and re-running when models change.

The vendors won’t tell you which model is right for cloud networking. Only you can learn that, by running your scenarios and watching what happens.


This guide is a companion to The Contract You Never Signed: What Changes When You Swap AI Models, which covers the behavioral contract problem in depth. Data from: Ghost Agent model-swap evaluation, 30 runs across 10 use cases. Full evaluation record at eval-findings-behavioral-model-swap.md. Infrastructure: Azure eastus, agentic-network-tools on GitHub.