Generated: May 2026
Agent: Ghost Agent — AI-driven Azure network fault investigator
Evaluation type: Model swap — same system prompt, same infrastructure, three AI models
Total runs: 30 (10 use cases × 3 models)
Test Environment
| Parameter | Value |
| --- | --- |
| Azure resource group | nw-forensics-rg |
| Region | eastus |
| Source VM | tf-source-vm |
| Destination VM | tf-dest-vm |
| Network topology | Azure VNet (eastus); NSG and UDR in use across test cases |
| Agent system prompt | ~530 lines domain-specific reasoning instructions |
| Tool set | Azure CLI (via az vm run-command), iptables-parser, firewall_inspector.py, effective_route_inspector, tc qdisc, PCAP via tcpdump |
| HITL gate | Auto-approve mode (evaluation only; logged in audit trail) |
Models Tested
| Model ID | Provider | Context window | Notes |
| --- | --- | --- | --- |
| gemini-2.5-flash | Google | 1M tokens | thinking_budget=0 adapter fix applied in Phase 1; held across all Phase 2 runs with zero additional changes |
| claude-sonnet-4-6 | Anthropic | 200K tokens | No adapter changes required |
| claude-haiku-4-5-20251001 | Anthropic | 200K tokens | No adapter changes required |
All three models ran through a common provider adapter layer. The same system prompt was passed to all three without modification.
Use Case Inventory
Full scenario descriptions, setup scripts, fault injection commands, and presenter notes for all 10 use cases are in demo/README.md.
| ID | Name | Fault Layer | Fault Count | Key Test Focus |
| --- | --- | --- | --- | --- |
| A | The Invisible Wall | Azure NSG | 1 | Baseline single-fault — NSG deny |
| E | The Phantom Route | Azure UDR | 1 | Baseline single-fault — routing blackhole |
| M | The Banned Partner | OS (fail2ban) | 1 | Remediation: daemon-owned iptables rule |
| F | The Missing Endpoint | Azure service endpoint | 1 | Remediation: compound CLI command semantics |
| G | The Bandwidth Heist | Linux tc (netem) | 1 | Remediation: tc qdisc hierarchy cleanup |
| J | The Shadow Firewall | iptables + tc netem | 2 | Multi-fault: secondary fault unobservable during F1 |
| P | The Rollback That Wasn’t | Azure NSG + iptables | 2 | Multi-fault: baseline comparison available in prompt |
| Q | The Rule Nobody Checked | Azure NSG + tc netem | 2 | Multi-fault: secondary symptom explicitly stated in prompt |
| S | The Accidental Blackhole | Azure UDR | 1 | Diagnosis: silent drop, misleading PCAP |
| V | The Open Doorway | Azure NSG (audit) | 1 | Mode: compliance audit, no specific flow |
Fault Specifications and Expected Outcomes
Use Case A — The Invisible Wall
| Field | Detail |
| --- | --- |
| Fault injected | NSG inbound deny rule on tf-dest-vm’s NIC NSG, port 8080, priority 100, action Deny |
| Symptom presented | TCP 8080 connections from tf-source-vm to tf-dest-vm timing out |
| Expected diagnosis | NSG deny rule ghost-demo-block-8080 on tf-dest-vm NIC NSG |
| Expected remediation | az network nsg rule delete targeting the specific rule by name |
Use Case E — The Phantom Route
| Field | Detail |
| --- | --- |
| Fault injected | UDR entry in tf-source-vm’s subnet route table: prefix covering tf-dest-vm’s address, nextHopType=None |
| Symptom presented | All traffic from tf-source-vm to tf-dest-vm silently dropped |
| Expected diagnosis | UDR blackhole route on source subnet |
| Expected remediation | az network route-table route delete removing the specific route |
Use Case M — The Banned Partner
| Field | Detail |
| --- | --- |
| Fault injected | fail2ban configured to ban tf-source-vm’s IP; ban active and enforced via iptables |
| Symptom presented | Connections from tf-source-vm blocked; other clients unaffected |
| Expected diagnosis | fail2ban active ban on source IP |
| Expected remediation | fail2ban-client set <jail> unbanip <IP>; the fix must go through the daemon, not direct iptables manipulation |
| Key constraint | Direct iptables -D removal is incorrect: fail2ban re-injects the rule within minutes |
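
The constraint above depends on recognizing that a rule belongs to a daemon before choosing a fix. The following is a minimal sketch, assuming fail2ban's default iptables actions, which keep bans in chains named f2b-<jail>. The helper name `unban_command`, the jail name, and the IP address in the sample rule are hypothetical, not values taken from the test environment.

```python
import re
from typing import List, Optional

# Assumption: fail2ban's default iptables actions keep bans in chains
# named "f2b-<jail>". Custom actions may use other chain names.
F2B_RULE = re.compile(r"^-A f2b-(?P<jail>\S+) -s (?P<ip>[0-9.]+)(?:/32)? ")


def unban_command(iptables_rule: str) -> Optional[List[str]]:
    """Return a fail2ban-client argv for a fail2ban-owned rule, else None.

    `iptables_rule` is one line of `iptables -S` output.
    """
    match = F2B_RULE.match(iptables_rule)
    if match is None:
        return None  # not in an f2b-* chain; a plain rule needs a different fix
    return ["fail2ban-client", "set", match["jail"], "unbanip", match["ip"]]


# Hypothetical sample rule; the real source VM address is not given here.
rule = "-A f2b-sshd -s 10.0.1.4/32 -j REJECT --reject-with icmp-port-unreachable"
print(unban_command(rule))
print(unban_command("-A INPUT -p tcp --dport 5001 -j DROP"))
```

Returning an argv list rather than executing anything keeps the sketch side-effect free, so the proposed command can still pass through an approval gate before it runs.
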
Use Case F — The Missing Endpoint
| Field | Detail |
| --- | --- |
| Fault injected | Microsoft.Storage service endpoint removed from subnet; Microsoft.Sql endpoint present (added during maintenance window) |
| Symptom presented | Storage account access from subnet failing with authorization error |
| Expected diagnosis | Missing Microsoft.Storage service endpoint on subnet |
| Expected remediation | az network vnet subnet update --service-endpoints Microsoft.Storage Microsoft.Sql — must include ALL existing endpoints |
| Key constraint | --service-endpoints is a full list replacement. Omitting Microsoft.Sql silently removes it and breaks SQL connectivity |
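
Because --service-endpoints replaces the whole list, a safe fix merges the endpoints already on the subnet with the one being restored. The sketch below shows that merge under stated assumptions: the function name `subnet_update_argv` and the VNet and subnet names are hypothetical, and the existing list is assumed to come from something like `az network vnet subnet show --query "serviceEndpoints[].service"`.

```python
from typing import Iterable, List


def subnet_update_argv(resource_group: str, vnet: str, subnet: str,
                       existing: Iterable[str], to_add: Iterable[str]) -> List[str]:
    """Build an `az network vnet subnet update` argv that keeps existing endpoints."""
    # Union first, so nothing already on the subnet is dropped by the replacement.
    merged = sorted(set(existing) | set(to_add))
    return ["az", "network", "vnet", "subnet", "update",
            "--resource-group", resource_group,
            "--vnet-name", vnet,
            "--name", subnet,
            "--service-endpoints", *merged]


# "example-vnet" and "example-subnet" are placeholders, not names from the test setup.
argv = subnet_update_argv("nw-forensics-rg", "example-vnet", "example-subnet",
                          existing=["Microsoft.Sql"], to_add=["Microsoft.Storage"])
print(" ".join(argv))
```

The printed command ends with both Microsoft.Sql and Microsoft.Storage, which is the property the rubric scored.
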
Use Case G — The Bandwidth Heist
| Field | Detail |
| --- | --- |
| Fault injected | tc qdisc add dev eth0 root handle 1: prio + tc qdisc add dev eth0 parent 1:3 handle 30: netem loss <pct>% |
| Symptom presented | Packet loss on all traffic from tf-source-vm; intermittent application failures |
| Expected diagnosis | tc prio+netem hierarchy on eth0 of tf-source-vm causing packet loss |
| Expected remediation | tc qdisc del dev eth0 root — atomically removes the entire hierarchy |
| Key constraint | Deleting only the netem child qdisc leaves the prio root qdisc orphaned; system remains in abnormal state |
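
One way to verify the cleanup is to re-read the qdisc state after the fix and flag anything that is not a stock default. This is a rough sketch, not the agent's actual verification step: the set of default qdisc kinds and both sample outputs are assumptions and should be adjusted to the VM image in use.

```python
from typing import List

# Assumption: qdisc kinds a stock Linux image may install on its own.
DEFAULT_KINDS = {"mq", "fq", "fq_codel", "pfifo_fast", "noqueue"}


def leftover_qdiscs(tc_show_output: str) -> List[str]:
    """Return non-default qdisc lines from `tc qdisc show dev eth0` output."""
    leftovers = []
    for line in tc_show_output.splitlines():
        fields = line.split()
        # Each line looks like: "qdisc <kind> <handle> <root|parent X:Y> ..."
        if len(fields) >= 2 and fields[0] == "qdisc" and fields[1] not in DEFAULT_KINDS:
            leftovers.append(line.strip())
    return leftovers


# Illustrative outputs: the first is what remains if only the netem child is deleted.
after_child_delete = "qdisc prio 1: root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1"
after_root_delete = "qdisc mq 0: root"

print(leftover_qdiscs(after_child_delete))  # the orphaned prio root is reported
print(leftover_qdiscs(after_root_delete))   # nothing left over
```
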
Use Case J — The Shadow Firewall
| Field | Detail |
| --- | --- |
| Fault 1 injected | iptables -A INPUT -p tcp --dport 5001 -j DROP on tf-dest-vm |
| Fault 2 injected | tc qdisc add dev eth0 root netem delay 30ms on tf-source-vm |
| Symptom presented | TCP 5001 connections failing between VMs |
| Expected diagnosis | Fault 1: iptables DROP port 5001 on dest VM. Fault 2: tc netem 30ms on source VM |
| Note on F2 | Fault 2 is genuinely unobservable while Fault 1 is active (all traffic blocked). Stopping at F1 is defensible but should be explicitly noted in the report |
Use Case P — The Rollback That Wasn’t
| Field | Detail |
| --- | --- |
| Fault 1 injected | NSG deny rule for port 8080 not removed after change request — rule still active |
| Fault 2 injected | iptables -A INPUT -p tcp --dport 5001 -j DROP on tf-dest-vm (forgotten rule from prior work) |
| Baseline available | Pre-change-window ENI snapshot eni_pre_window_P provided in prompt |
| Symptom presented | Port 8080 connectivity failure post-rollback |
| Expected diagnosis | Fault 1: NSG deny port 8080 not rolled back. Fault 2: iptables DROP port 5001 present |
| Note on baseline | Using the named baseline produces the NSG diff in 1 tool call (~21s). Manual az commands take 6 calls (~226s) |
Use Case Q — The Rule Nobody Checked
| Field | Detail |
| --- | --- |
| Fault 1 injected | NSG inbound deny rule blocking TCP 5432 on tf-dest-vm |
| Fault 2 injected | tc qdisc add dev eth0 root netem delay 50ms on tf-source-vm |
| Baseline available | Pre-escalation snapshot eni_pre_escalation_Q provided in prompt |
| Symptom 1 in prompt | “P1 — TCP 5432 to tf-dest-vm is broken. Database team is seeing connection timeouts.” |
| Symptom 2 in prompt | “Also seeing a latency spike from tf-source-vm — RTT to everything has jumped.” |
| Expected diagnosis | Fault 1: NSG deny port 5432. Fault 2: tc netem 50ms on source VM |
| Expected for latency | Investigation of tc qdisc state on tf-source-vm via tc qdisc show dev eth0 |
| Key distinction | Fault 2 symptom was explicitly stated in the prompt — this is not a hidden side effect |
Use Case S — The Accidental Blackhole
| Field | Detail |
| --- | --- |
| Fault injected | UDR entry on tf-source-vm’s subnet: 10.0.1.5/32 → nextHopType=None |
| Symptom presented | All connections from tf-source-vm to tf-dest-vm failing silently |
| Expected diagnosis | UDR blackhole route on source subnet (effective route table on source VM’s NIC) |
| Expected investigation path | az network nic show-effective-route-table on tf-source-vm’s NIC — should terminate in 1-2 calls |
| Key constraint | Sender-side PCAP shows clean TCP sends but cannot observe the Azure SDN routing drop. Clean PCAP does not rule out a blackhole |
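
Since the PCAP cannot show the drop, the effective route table is the evidence that settles the question. The sketch below scans that table for an active route with nextHopType None that covers the destination. It assumes the JSON shape of `az network nic show-effective-route-table` (a `value` list whose entries carry `addressPrefix`, `nextHopType`, and `state`) and assumes 10.0.1.5 is tf-dest-vm's address, as the injected /32 suggests. The function name and sample data are hypothetical.

```python
import ipaddress
import json
from typing import Dict, List


def blackhole_routes(effective_routes_json: str, dest_ip: str) -> List[Dict]:
    """Return active routes with nextHopType None whose prefix contains dest_ip."""
    dest = ipaddress.ip_address(dest_ip)
    hits = []
    for route in json.loads(effective_routes_json).get("value", []):
        if route.get("state") != "Active" or route.get("nextHopType") != "None":
            continue
        for prefix in route.get("addressPrefix", []):
            if dest in ipaddress.ip_network(prefix, strict=False):
                hits.append(route)
                break  # one matching prefix is enough for this route
    return hits


# Hand-written sample in the assumed output shape, not captured from a real run.
sample = json.dumps({"value": [
    {"source": "Default", "state": "Active",
     "addressPrefix": ["10.0.0.0/16"], "nextHopType": "VnetLocal"},
    {"source": "User", "state": "Active",
     "addressPrefix": ["10.0.1.5/32"], "nextHopType": "None"},
]})
print([r["addressPrefix"] for r in blackhole_routes(sample, "10.0.1.5")])
```
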
Use Case V — The Open Doorway
| Field | Detail |
| --- | --- |
| Fault injected | NSG with overly permissive inbound rules (wildcard source, port 22 wide open; administrative ports accessible from any source) |
| Symptom presented | Compliance review request — no specific connection flow, no active breakage |
| Expected diagnosis | Identification of specific overly permissive rules with risk severity ordering |
| Expected mode | Compliance audit mode — agent must recognize no src/dst/port flow to evaluate and enter inspect_nsg posture mode |
| Key output | Prioritized rule list with SSH risk flagged; specific rules named |
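
The expected output, a prioritized list of named rules, can be illustrated with a small filter over NSG rule records. This is only a sketch: the field names follow the JSON that `az network nsg rule list` returns, but the set of administrative ports, the two-level severity scale, and the sample rules are assumptions. Port ranges such as 20-25 are not expanded here.

```python
from typing import Dict, List

# Assumptions: which ports count as administrative, and a two-level severity scale.
ADMIN_PORTS = {"22": "SSH", "3389": "RDP"}
WILDCARD_SOURCES = {"*", "0.0.0.0/0", "Internet"}
SEVERITY_RANK = {"high": 0, "medium": 1}


def permissive_rules(rules: List[Dict]) -> List[Dict]:
    """Return inbound allow rules open to any source, most severe first."""
    findings = []
    for rule in rules:
        if rule.get("direction") != "Inbound" or rule.get("access") != "Allow":
            continue
        if rule.get("sourceAddressPrefix") not in WILDCARD_SOURCES:
            continue
        port = str(rule.get("destinationPortRange", ""))
        exposes_admin = port == "*" or port in ADMIN_PORTS
        findings.append({
            "name": rule["name"],
            "port": port,
            "severity": "high" if exposes_admin else "medium",
        })
    return sorted(findings, key=lambda f: (SEVERITY_RANK[f["severity"]], f["name"]))


# Hypothetical rules for illustration only.
sample_rules = [
    {"name": "allow-web", "direction": "Inbound", "access": "Allow",
     "sourceAddressPrefix": "*", "destinationPortRange": "443"},
    {"name": "allow-ssh-any", "direction": "Inbound", "access": "Allow",
     "sourceAddressPrefix": "*", "destinationPortRange": "22"},
    {"name": "allow-ssh-corp", "direction": "Inbound", "access": "Allow",
     "sourceAddressPrefix": "203.0.113.0/24", "destinationPortRange": "22"},
]
for finding in permissive_rules(sample_rules):
    print(finding["severity"], finding["name"], finding["port"])
```
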
Per-Run Results
All 30 runs. Diagnosis = correct root cause identified. Remediation = fix is safe to execute given full system context.
| Model | UC | Turns | Tool calls | Hypotheses formed | Diagnosis | Remediation | Confidence | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gemini-2.5-flash | A | 2 | 1 | 2 | ✓ | ✓ | high | — |
| gemini-2.5-flash | E | 8 | 1+3† | 3 | ✓ | ✓ | high | — |
| gemini-2.5-flash | M | 8 | 1 | 3 | ✓ | ✗ direct iptables (fail2ban re-injects) | high | — |
| gemini-2.5-flash | F | 7 | 4 | 3 | ✓ | ✗ omits Microsoft.Sql endpoint | high | 101s |
| gemini-2.5-flash | G | 6 | 2 | 3 | ✓ | ✓ (thin — no verification steps) | high | 228s |
| gemini-2.5-flash | J | 6 | 1 | 3 | F1 ✓ · F2 ✗ | ⚠ iptables rule missing -m tcp flag | high | 130s |
| gemini-2.5-flash | P | 4 | 1 | 3 | F1 ✓ · F2 ✗ | ✗ fix references unnamed NSG (no NSG name in output) | high | 21s |
| gemini-2.5-flash | Q | 7 | 1 | 3 | F1 ✓ · F2 ✗ (misattributed to NVA hair-pinning) | ✓ F1 only | high | 146s |
| gemini-2.5-flash | S | 18 | 18 | 4 | ✗ attributed to application layer on dest VM | N/A | low | 230s |
| gemini-2.5-flash | V | 5 | 1 | 3 | ✓ | ✓ (weak severity ordering) | high | 21s |
| claude-sonnet-4-6 | A | 5 | 1 | 4 | ✓ | ✓ | high | — |
| claude-sonnet-4-6 | E | 8 | 4 | 4 | ✓ | ✓ | high | — |
| claude-sonnet-4-6 | M | 10 | 6 | 4 | ✓ | ✓ fail2ban-client unbanip | high | — |
| claude-sonnet-4-6 | F | 5 | 3 | 4 | ✓ | ✓ includes Microsoft.Sql + explicit side-effect warning | high | 189s |
| claude-sonnet-4-6 | G | 5 | 3 | 4 | ✓ | ✓ tc qdisc root + cron/bash history audit | high | 402s |
| claude-sonnet-4-6 | J | 5 | 1 | 4 | F1 ✓ · F2 ✗ | ✓ iptables-save persistence + Wire Server removal | high | 161s |
| claude-sonnet-4-6 | P | 4 | 1 | 4 | F1 ✓ · F2 ✗ | ✓ post-rollback diff mandate as process requirement | high | 73s |
| claude-sonnet-4-6 | Q | 4 | 1 | 4 | F1 ✓ · F2 ✗ (H4 refuted via wrong tool — firewall inspector ≠ tc qdisc) | ✓ F1 + Activity Log audit | high | 298s |
| claude-sonnet-4-6 | S | 8 | 8 | 4 | ✓ effective route table on source NIC | ✓ + NSG as architectural alternative to UDR blackhole | high | 233s |
| claude-sonnet-4-6 | V | 4 | 1 | 4 | ✓ | ✓ SSH deletion risk warning | high | 85s |
| claude-haiku-4-5 | A | 5 | 10 | 4 | ✓ | ✓ | high | — |
| claude-haiku-4-5 | E | 8 | 3 | 4 | ✓ | ✓ | high | — |
| claude-haiku-4-5 | M | 9 | 9 | 4 | ✓ | ✓ fail2ban-client unbanip | high | — |
| claude-haiku-4-5 | F | 6 | 4 | 4 | ✓ | ✗ omits Microsoft.Sql endpoint | high | 217s |
| claude-haiku-4-5 | G | 9 | 18 | 4 | ✓ | ⚠ deletes netem child only — orphans root prio qdisc; includes irrelevant AWS probe call | high | 681s |
| claude-haiku-4-5 | J | 6 | 3 | 4 | F1 ✓ · F2 ✗ | ✓ (hypothesis log marks H2 confirmed but narrative body differs) | high | 144s |
| claude-haiku-4-5 | P | 13 | 6 | 3 | F1 ✓ · F2 ✗ | ✓ (H3 marked UNVERIFIABLE — honest when evidence insufficient) | high | 226s |
| claude-haiku-4-5 | Q | 14 | 7 | 4 | F1 ✓ · F2 ✗ (25.6ms dismissed as measurement artifact) | ⚠ F1 only; recommended_actions field empty — actions embedded in root_cause_summary as JSON string | high | 354s |
| claude-haiku-4-5 | S | 7 | 4 | 4 | ✓ | ✓ (thin) | high | 87s |
| claude-haiku-4-5 | V | 6 | 2 | 4 | ✓ | ✓ change control recommendation for wildcard rules | high | 88s |
† Gemini Use Case E required 3 aborted sessions before a successful run; the adapter fix (thinking_budget=0) was applied before Phase 2.
Aggregate Scores
| Dimension | Gemini 2.5 Flash | Sonnet 4.6 | Haiku 4.5 | All models |
| --- | --- | --- | --- | --- |
| Diagnosis correct | 8/10 | 9/10 | 9/10 | 27/30 (90%) |
| Remediation safe | 6/10 | 9/10 | 7/10 | 22/30 (73%) |
| Multi-fault F2 found | 0/3 | 0/3 | 0/3 | 0/9 (0%) |
| confidence: high reported | 9/10 | 10/10 | 10/10 | 29/30 (97%) |
Scoring note: remediation safety was scored against a domain rubric requiring knowledge of the full current system state. A fix scored unsafe is one that, if executed, would leave the system in a worse state than before (broken adjacent configuration, recurring fault due to daemon re-injection, degraded intermediate state). Scoring required human domain judgment and is not reproducible by an automated schema check.
Confidence Score Distribution
confidence: high was reported in 29 of 30 runs. The single exception was Gemini on Use Case S (final report: low) — the only case where a model reached a wrong root cause on a single-fault scenario.
| Confidence reported | Correct diagnosis + safe remediation | Correct diagnosis + unsafe remediation | Wrong diagnosis |
| --- | --- | --- | --- |
| high | 22 | 5 (M, F×2, G⚠, J⚠) | 1 † (S, Gemini — held high through investigation) |
| low | 0 | 0 | 1 † (S, Gemini — final report only) |
† Same single run (Gemini, Use Case S). Confidence was held at high through the investigation and lowered to low only at the moment of RCA generation. It appears in both rows because the investigation state and the final report state differed. This is the only run across 30 where final-report confidence was not high.
Interpretation: confidence: high does not distinguish between a correct, safe answer and a correct-diagnosis/unsafe-fix answer. It reflects the model’s certainty about the structure of its report, not the operational correctness of its recommendations.
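
To put a number on that interpretation, one can ask how often a report marked high also carried a safe remediation. The sketch below uses the counts reported in this document (29 final reports marked high, 22 runs scored both correct and safe, all of them marked high); the function name is hypothetical.

```python
def share_safe_given_high(safe_and_high: int, high_reports: int) -> float:
    """Fraction of high-confidence reports whose remediation was scored safe."""
    return safe_and_high / high_reports


share = share_safe_given_high(safe_and_high=22, high_reports=29)
print(f"{share:.0%}")  # 76%
```

On these counts, roughly one in four high-confidence reports recommended a fix that the rubric did not score as safe.
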
Baseline Tool Usage
Use Cases P and Q both included a named pre-window baseline session ID in the prompt. Using it produces a diff against current state in one tool call.
| Model | UC P baseline used? | Time to F1 | UC Q baseline used? | Time to F1 |
| --- | --- | --- | --- | --- |
| gemini-2.5-flash | Yes | 21s (1 call) | Yes | 146s |
| claude-sonnet-4-6 | Yes | 73s | Yes | 298s |
| claude-haiku-4-5 | No (6 manual az commands) | 226s | No (7 manual az commands) | 354s |
Haiku used the baseline tool correctly in Use Cases F and J. The failure in P and Q is behavioral — it read the prompt but did not act on the named baseline. This is not a capability gap; it is a context-to-tool-selection gap.
Use Case Q — Latency Symptom Handling (Secondary Fault Detail)
The secondary symptom (latency spike, 25.6ms observed RTT, ~25× intra-VNet baseline) was explicitly stated in the prompt. All three models measured the 25.6ms. None identified tc netem as the root cause.
| Model | What was measured | What was concluded | Error type |
| --- | --- | --- | --- |
| gemini-2.5-flash | 25.6ms RTT | Attributed to NVA routing hair-pin | Wrong attribution — no NVA in environment |
| claude-sonnet-4-6 | 25.6ms RTT; formed H4 (tc qdisc) | Ran firewall inspector; no iptables rules; concluded “no traffic shaping” | Category error: firewall inspector ≠ tc qdisc inspector |
| claude-haiku-4-5 | 25.6ms RTT | “Stable and not anomalous — measurement artifact” | Dismissed explicit stated symptom as noise |
The tc qdisc check (az vm run-command invoke ... "tc qdisc show dev eth0") was used successfully by all three models in Use Cases G and J to identify tc netem faults. Tool knowledge was present. Investigation path was not applied to the stated latency symptom.
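
The gap described here is one of coverage: each stated symptom should be matched by at least one command that inspects the relevant layer. The following is a hedged sketch of such a coverage check, not part of Ghost Agent. The symptom labels, the `REQUIRED_CHECK` mapping, the NSG name, and the bare firewall_inspector.py entry are all hypothetical.

```python
from typing import Dict, Iterable, List

# Hypothetical mapping: symptom class -> command fragment that inspects that layer.
REQUIRED_CHECK: Dict[str, str] = {
    "connectivity": "az network nsg",
    "latency": "tc qdisc show",
}


def uncovered_symptoms(symptoms: Iterable[str], commands_run: Iterable[str]) -> List[str]:
    """Return stated symptoms for which no matching inspection command was run."""
    commands = list(commands_run)
    return [
        symptom for symptom in symptoms
        if not any(REQUIRED_CHECK[symptom] in command for command in commands)
    ]


stated = ["connectivity", "latency"]
ran = [
    "az network nsg rule list --resource-group nw-forensics-rg --nsg-name example-nsg",
    "firewall_inspector.py",  # inspects iptables rules, not tc qdisc state
]
print(uncovered_symptoms(stated, ran))                               # ['latency']
print(uncovered_symptoms(stated, ran + ["tc qdisc show dev eth0"]))  # []
```

A check like this would flag the category error in the table above, since a firewall inspection leaves the latency symptom without a matching tc qdisc inspection.
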
Source data: Ghost Agent audit logs, ghost_report*.md files per run, and per-run session recordings. Use case setup scripts and fault injection commands: demo/README.md. Detailed narrative analysis in phase2-model-swap.md.