Ghost Agent — AI Model Behavioral Evaluation: Evidence Record

Generated: May 2026
Agent: Ghost Agent (AI-driven Azure network fault investigator)
Evaluation type: Model swap (same system prompt, same infrastructure, three AI models)
Total runs: 30 (10 use cases × 3 models)


Test Environment

| Parameter | Value |
|---|---|
| Azure resource group | nw-forensics-rg |
| Region | eastus |
| Source VM | tf-source-vm |
| Destination VM | tf-dest-vm |
| Network topology | Azure VNet (eastus); NSG and UDR in use across test cases |
| Agent system prompt | ~530 lines of domain-specific reasoning instructions |
| Tool set | Azure CLI (via az vm run-command), iptables-parser, firewall_inspector.py, effective_route_inspector, tc qdisc, PCAP via tcpdump |
| HITL gate | Auto-approve mode (evaluation only; logged in audit trail) |

Models Tested

| Model ID | Provider | Context window | Notes |
|---|---|---|---|
| gemini-2.5-flash | Google | 1M tokens | thinking_budget=0 adapter fix applied in Phase 1; held across all Phase 2 runs with zero additional changes |
| claude-sonnet-4-6 | Anthropic | 200K tokens | No adapter changes required |
| claude-haiku-4-5-20251001 | Anthropic | 200K tokens | No adapter changes required |

All three models ran through a common provider adapter layer. The same system prompt was passed to all three without modification.


Use Case Inventory

Full scenario descriptions, setup scripts, fault injection commands, and presenter notes for all 10 use cases are in demo/README.md.

| ID | Name | Fault layer | Fault count | Key test focus |
|---|---|---|---|---|
| A | The Invisible Wall | Azure NSG | 1 | Baseline single-fault: NSG deny |
| E | The Phantom Route | Azure UDR | 1 | Baseline single-fault: routing blackhole |
| M | The Banned Partner | OS (fail2ban) | 1 | Remediation: daemon-owned iptables rule |
| F | The Missing Endpoint | Azure service endpoint | 1 | Remediation: compound CLI command semantics |
| G | The Bandwidth Heist | Linux tc (netem) | 1 | Remediation: tc qdisc hierarchy cleanup |
| J | The Shadow Firewall | iptables + tc netem | 2 | Multi-fault: secondary fault unobservable during F1 |
| P | The Rollback That Wasn’t | Azure NSG + iptables | 2 | Multi-fault: baseline comparison available in prompt |
| Q | The Rule Nobody Checked | Azure NSG + tc netem | 2 | Multi-fault: secondary symptom explicitly stated in prompt |
| S | The Accidental Blackhole | Azure UDR | 1 | Diagnosis: silent drop, misleading PCAP |
| V | The Open Doorway | Azure NSG (audit) | 1 | Mode: compliance audit, no specific flow |

Fault Specifications and Expected Outcomes

Use Case A — The Invisible Wall

| Field | Detail |
|---|---|
| Fault injected | NSG inbound deny rule on tf-dest-vm’s NIC NSG, port 8080, priority 100, action Deny |
| Symptom presented | TCP 8080 connections from tf-source-vm to tf-dest-vm timing out |
| Expected diagnosis | NSG deny rule ghost-demo-block-8080 on tf-dest-vm NIC NSG |
| Expected remediation | az network nsg rule delete targeting the specific rule by name |

Use Case E — The Phantom Route

| Field | Detail |
|---|---|
| Fault injected | UDR entry in tf-source-vm’s subnet route table: prefix covering tf-dest-vm’s address, nextHopType=None |
| Symptom presented | All traffic from tf-source-vm to tf-dest-vm silently dropped |
| Expected diagnosis | UDR blackhole route on source subnet |
| Expected remediation | az network route-table route delete removing the specific route |

Use Case M — The Banned Partner

| Field | Detail |
|---|---|
| Fault injected | fail2ban configured to ban tf-source-vm’s IP; ban active and enforced via iptables |
| Symptom presented | Connections from tf-source-vm blocked; other clients unaffected |
| Expected diagnosis | fail2ban active ban on source IP |
| Expected remediation | fail2ban-client unbanip <IP>; the fix must go through the daemon, not direct iptables manipulation |
| Key constraint | Direct iptables -D removal is incorrect: fail2ban re-injects the rule within minutes |
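
The daemon-ownership constraint can be sketched as a check on which chain holds the blocking rule. This is a minimal illustration, assuming fail2ban's stock iptables actions (which place bans in chains named f2b-<jail>) and the per-jail command form fail2ban-client set <jail> unbanip <ip>. The jail name sshd, the sample address, and the function name plan_unban are hypothetical and are not taken from the evaluation runs.

```python
import re

# Assumption: fail2ban's stock iptables actions create one chain per jail,
# named "f2b-<jail>", and add one "-s <ip>" rule per banned address.
F2B_RULE = re.compile(r"^-A f2b-(?P<jail>\S+) .*-s (?P<ip>[\d.]+)(?:/32)? ")

def plan_unban(iptables_save_lines, banned_ip):
    """Return an argv list that lifts the ban through the daemon.

    Returns None when no fail2ban-owned rule matches banned_ip, which signals
    that the rule is not daemon-owned and needs a different remediation.
    """
    for line in iptables_save_lines:
        match = F2B_RULE.match(line)
        if match and match.group("ip") == banned_ip:
            return ["fail2ban-client", "set", match.group("jail"),
                    "unbanip", banned_ip]
    return None

# Illustrative iptables-save excerpt (hypothetical jail and address).
sample = [
    "-A INPUT -p tcp -m multiport --dports 22 -j f2b-sshd",
    "-A f2b-sshd -s 10.0.1.4/32 -j REJECT --reject-with icmp-port-unreachable",
    "-A f2b-sshd -j RETURN",
]
print(plan_unban(sample, "10.0.1.4"))
```

Routing the unban through the daemon updates fail2ban's own ban list, which is what prevents the rule from being re-injected.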

Use Case F — The Missing Endpoint

| Field | Detail |
|---|---|
| Fault injected | Microsoft.Storage service endpoint removed from subnet; Microsoft.Sql endpoint present (added during maintenance window) |
| Symptom presented | Storage account access from subnet failing with authorization error |
| Expected diagnosis | Missing Microsoft.Storage service endpoint on subnet |
| Expected remediation | az network vnet subnet update --service-endpoints Microsoft.Storage Microsoft.Sql; the command must include ALL existing endpoints |
| Key constraint | --service-endpoints is a full list replacement. Omitting Microsoft.Sql silently removes it and breaks SQL connectivity |
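
Because --service-endpoints replaces the whole list, a safe fix has to merge the endpoint being restored with whatever is already configured. The sketch below shows one way to build that command and to detect a proposed list that would drop an endpoint. The VNet name tf-vnet, the subnet name tf-subnet, and both function names are hypothetical; only the resource group and the two service names come from this record.

```python
def service_endpoint_update_cmd(resource_group, vnet, subnet, existing, to_add):
    """Build an `az network vnet subnet update` argv that keeps existing endpoints."""
    merged = list(existing)
    for service in to_add:
        if service not in merged:
            merged.append(service)
    return [
        "az", "network", "vnet", "subnet", "update",
        "--resource-group", resource_group,
        "--vnet-name", vnet,
        "--name", subnet,
        "--service-endpoints", *merged,
    ]

def dropped_endpoints(existing, proposed):
    """Endpoints that a full-list replacement with `proposed` would remove."""
    return sorted(set(existing) - set(proposed))

existing = ["Microsoft.Sql"]  # current subnet state in Use Case F
cmd = service_endpoint_update_cmd(
    "nw-forensics-rg", "tf-vnet", "tf-subnet", existing, ["Microsoft.Storage"]
)
print(" ".join(cmd))

# The unsafe variant recorded in the results passes only Microsoft.Storage.
print(dropped_endpoints(existing, ["Microsoft.Storage"]))
```

Running dropped_endpoints against the proposed list before executing the command is a cheap guard: a non-empty result means the fix would break an adjacent service.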

Use Case G — The Bandwidth Heist

| Field | Detail |
|---|---|
| Fault injected | tc qdisc add dev eth0 root handle 1: prio, then tc qdisc add dev eth0 parent 1:3 handle 30: netem loss <pct>% |
| Symptom presented | Packet loss on all traffic from tf-source-vm; intermittent application failures |
| Expected diagnosis | tc prio+netem hierarchy on eth0 of tf-source-vm causing packet loss |
| Expected remediation | tc qdisc del dev eth0 root, which atomically removes the entire hierarchy |
| Key constraint | Deleting only the netem child qdisc leaves the prio root qdisc orphaned; system remains in abnormal state |
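
A post-remediation check for this constraint can be sketched by parsing the output of tc qdisc show dev eth0 and listing any qdisc kinds from the injected hierarchy that are still present. The three sample outputs are illustrative, not captured from the runs: the 10% loss figure and the mq/fq_codel default are assumptions about what a VM might show. Treating every prio or netem qdisc as injected is also an assumption that holds only for this test setup.

```python
import re

# Matches lines such as "qdisc netem 30: parent 1:3 limit 1000 loss 10%".
QDISC_LINE = re.compile(
    r"^qdisc (?P<kind>\S+) (?P<handle>\S+) (?P<attach>root|parent \S+)"
)

def parse_qdiscs(tc_output):
    """Return (kind, handle, attachment) for each qdisc line in tc output."""
    found = []
    for line in tc_output.splitlines():
        match = QDISC_LINE.match(line.strip())
        if match:
            found.append((match.group("kind"), match.group("handle"),
                          match.group("attach")))
    return found

def injected_remnants(tc_output, injected_kinds=("prio", "netem")):
    """Qdiscs from the injected hierarchy that are still attached."""
    return [q for q in parse_qdiscs(tc_output) if q[0] in injected_kinds]

PRIO_ROOT = "qdisc prio 1: root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1"
FULL_HIERARCHY = PRIO_ROOT + "\nqdisc netem 30: parent 1:3 limit 1000 loss 10%"
CHILD_ONLY_DELETED = PRIO_ROOT  # netem removed, prio root left behind
ROOT_DELETED = "qdisc mq 0: root\nqdisc fq_codel 0: parent :1 limit 10240p flows 1024"

for label, output in [("full", FULL_HIERARCHY),
                      ("child only deleted", CHILD_ONLY_DELETED),
                      ("root deleted", ROOT_DELETED)]:
    print(label, injected_remnants(output))
```

An empty result after remediation indicates the hierarchy is gone; a remaining prio root is the orphaned state described in the key constraint.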

Use Case J — The Shadow Firewall

| Field | Detail |
|---|---|
| Fault 1 injected | iptables -A INPUT -p tcp --dport 5001 -j DROP on tf-dest-vm |
| Fault 2 injected | tc qdisc add dev eth0 root netem delay 30ms on tf-source-vm |
| Symptom presented | TCP 5001 connections failing between VMs |
| Expected diagnosis | Fault 1: iptables DROP port 5001 on dest VM. Fault 2: tc netem 30ms on source VM |
| Note on F2 | Fault 2 is genuinely unobservable while Fault 1 is active (all traffic blocked). Stopping at F1 is defensible but should be explicitly noted in the report |

Use Case P — The Rollback That Wasn’t

| Field | Detail |
|---|---|
| Fault 1 injected | NSG deny rule for port 8080 not removed after change request; rule still active |
| Fault 2 injected | iptables -A INPUT -p tcp --dport 5001 -j DROP on tf-dest-vm (forgotten rule from prior work) |
| Baseline available | Pre-change-window ENI snapshot eni_pre_window_P provided in prompt |
| Symptom presented | Port 8080 connectivity failure post-rollback |
| Expected diagnosis | Fault 1: NSG deny port 8080 not rolled back. Fault 2: iptables DROP port 5001 present |
| Note on baseline | Using the named baseline produces the NSG diff in 1 tool call (~21s). Manual az commands take 6 calls (~226s) |

Use Case Q — The Rule Nobody Checked

| Field | Detail |
|---|---|
| Fault 1 injected | NSG inbound deny rule blocking TCP 5432 on tf-dest-vm |
| Fault 2 injected | tc qdisc add dev eth0 root netem delay 50ms on tf-source-vm |
| Baseline available | Pre-escalation snapshot eni_pre_escalation_Q provided in prompt |
| Symptom 1 in prompt | “P1 — TCP 5432 to tf-dest-vm is broken. Database team is seeing connection timeouts.” |
| Symptom 2 in prompt | “Also seeing a latency spike from tf-source-vm — RTT to everything has jumped.” |
| Expected diagnosis | Fault 1: NSG deny port 5432. Fault 2: tc netem 50ms on source VM |
| Expected for latency | Investigation of tc qdisc state on tf-source-vm via tc qdisc show dev eth0 |
| Key distinction | Fault 2 symptom was explicitly stated in the prompt; this is not a hidden side effect |

Use Case S — The Accidental Blackhole

| Field | Detail |
|---|---|
| Fault injected | UDR entry on tf-source-vm’s subnet: 10.0.1.5/32 → nextHopType=None |
| Symptom presented | All connections from tf-source-vm to tf-dest-vm failing silently |
| Expected diagnosis | UDR blackhole route on source subnet (effective route table on source VM’s NIC) |
| Expected investigation path | az network nic show-effective-route-table on tf-source-vm’s NIC; should terminate in 1-2 calls |
| Key constraint | Sender-side PCAP shows clean TCP sends but cannot observe the Azure SDN routing drop. Clean PCAP does not rule out a blackhole |
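
The expected investigation path reduces to a longest-prefix match over the effective routes, followed by a check for nextHopType None on the winning route. The sketch below assumes the effective route table has been loaded as a list of dictionaries with addressPrefix, nextHopType, and state keys, which is the shape the Azure CLI is understood to return under its value field. The 10.0.0.0/16 VNet range and the function names are hypothetical; only the 10.0.1.5/32 blackhole route comes from the fault specification. Tie-breaking between routes of equal prefix length is omitted.

```python
import ipaddress

def winning_route(routes, dest_ip):
    """Return the active route with the longest prefix covering dest_ip, or None."""
    dest = ipaddress.ip_address(dest_ip)
    best = None  # (prefix length, route)
    for route in routes:
        if route.get("state") != "Active":
            continue
        for prefix in route.get("addressPrefix", []):
            network = ipaddress.ip_network(prefix, strict=False)
            if dest in network and (best is None or network.prefixlen > best[0]):
                best = (network.prefixlen, route)
    return best[1] if best else None

def is_blackholed(routes, dest_ip):
    """True when the winning route silently drops traffic (nextHopType None)."""
    route = winning_route(routes, dest_ip)
    return route is not None and route.get("nextHopType") == "None"

# Illustrative effective routes; only the /32 entry is from the fault spec.
routes = [
    {"addressPrefix": ["10.0.0.0/16"], "nextHopType": "VnetLocal",
     "source": "Default", "state": "Active"},
    {"addressPrefix": ["10.0.1.5/32"], "nextHopType": "None",
     "source": "User", "state": "Active"},
    {"addressPrefix": ["0.0.0.0/0"], "nextHopType": "Internet",
     "source": "Default", "state": "Active"},
]
print(is_blackholed(routes, "10.0.1.5"))
print(is_blackholed(routes, "10.0.1.6"))
```

This check works from control-plane state only, which is why it can succeed where a sender-side PCAP cannot.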

Use Case V — The Open Doorway

| Field | Detail |
|---|---|
| Fault injected | NSG with overly permissive inbound rules (wildcard source, port 22 wide open; administrative ports accessible from any source) |
| Symptom presented | Compliance review request; no specific connection flow, no active breakage |
| Expected diagnosis | Identification of specific overly permissive rules with risk severity ordering |
| Expected mode | Compliance audit mode: agent must recognize there is no src/dst/port flow to evaluate and enter inspect_nsg posture mode |
| Key output | Prioritized rule list with SSH risk flagged; specific rules named |
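
One way to produce the prioritized rule list is sketched below. The severity rubric (wildcard source plus an administrative or wildcard port is high, wildcard source plus any other port is medium) is a hypothetical stand-in for the evaluation's actual rubric, and the rule names are invented. The dictionary keys follow the field names the Azure CLI uses for NSG rules.

```python
# Assumption: these ports count as "administrative" for the audit.
ADMIN_PORTS = {"22": "SSH", "3389": "RDP"}
WILDCARD_SOURCES = {"*", "0.0.0.0/0", "Internet"}
SEVERITY_ORDER = {"high": 0, "medium": 1}

def severity(rule):
    """Classify one NSG rule; None means the rule is not flagged."""
    if rule["direction"] != "Inbound" or rule["access"] != "Allow":
        return None
    if rule["sourceAddressPrefix"] not in WILDCARD_SOURCES:
        return None
    port = rule["destinationPortRange"]
    if port == "*" or port in ADMIN_PORTS:
        return "high"
    return "medium"

def prioritize(rules):
    """Return (severity, rule name) pairs, most severe first, then by NSG priority."""
    flagged = [(severity(r), r) for r in rules]
    flagged = [(sev, r) for sev, r in flagged if sev is not None]
    flagged.sort(key=lambda item: (SEVERITY_ORDER[item[0]], item[1]["priority"]))
    return [(sev, r["name"]) for sev, r in flagged]

# Hypothetical rules for illustration only.
rules = [
    {"name": "allow-web-any", "direction": "Inbound", "access": "Allow",
     "sourceAddressPrefix": "*", "destinationPortRange": "80", "priority": 200},
    {"name": "allow-ssh-any", "direction": "Inbound", "access": "Allow",
     "sourceAddressPrefix": "*", "destinationPortRange": "22", "priority": 300},
    {"name": "allow-db-internal", "direction": "Inbound", "access": "Allow",
     "sourceAddressPrefix": "10.0.0.0/16", "destinationPortRange": "5432",
     "priority": 400},
]
print(prioritize(rules))
```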

Per-Run Results

All 30 runs. Diagnosis = correct root cause identified. Remediation = fix is safe to execute given full system context.

Model UC Turns Tool calls Hypotheses formed Diagnosis Remediation Confidence Duration
gemini-2.5-flash A 2 1 2 high
gemini-2.5-flash E 8 1+3† 3 high
gemini-2.5-flash M 8 1 3 direct iptables (fail2ban re-injects) high
gemini-2.5-flash F 7 4 3 omits Microsoft.Sql endpoint high 101s
gemini-2.5-flash G 6 2 3 ✓ (thin — no verification steps) high 228s
gemini-2.5-flash J 6 1 3 F1 ✓ · F2 ✗ iptables rule missing -m tcp flag high 130s
gemini-2.5-flash P 4 1 3 F1 ✓ · F2 ✗ fix references unnamed NSG (no NSG name in output) high 21s
gemini-2.5-flash Q 7 1 3 F1 ✓ · F2 (misattributed to NVA hair-pinning) ✓ F1 only high 146s
gemini-2.5-flash S 18 18 4 attributed to application layer on dest VM N/A low 230s
gemini-2.5-flash V 5 1 3 ✓ (weak severity ordering) high 21s
claude-sonnet-4-6 A 5 1 4 high
claude-sonnet-4-6 E 8 4 4 high
claude-sonnet-4-6 M 10 6 4 ✓ fail2ban-client unbanip high
claude-sonnet-4-6 F 5 3 4 ✓ includes Microsoft.Sql + explicit side-effect warning high 189s
claude-sonnet-4-6 G 5 3 4 ✓ tc qdisc root + cron/bash history audit high 402s
claude-sonnet-4-6 J 5 1 4 F1 ✓ · F2 ✗ ✓ iptables-save persistence + Wire Server removal high 161s
claude-sonnet-4-6 P 4 1 4 F1 ✓ · F2 ✗ ✓ post-rollback diff mandate as process requirement high 73s
claude-sonnet-4-6 Q 4 1 4 F1 ✓ · F2 (H4 refuted via wrong tool — firewall inspector ≠ tc qdisc) ✓ F1 + Activity Log audit high 298s
claude-sonnet-4-6 S 8 8 4 ✓ effective route table on source NIC ✓ + NSG as architectural alternative to UDR blackhole high 233s
claude-sonnet-4-6 V 4 1 4 ✓ SSH deletion risk warning high 85s
claude-haiku-4-5 A 5 10 4 high
claude-haiku-4-5 E 8 3 4 high
claude-haiku-4-5 M 9 9 4 ✓ fail2ban-client unbanip high
claude-haiku-4-5 F 6 4 4 omits Microsoft.Sql endpoint high 217s
claude-haiku-4-5 G 9 18 4 deletes netem child only — orphans root prio qdisc; includes irrelevant AWS probe call high 681s
claude-haiku-4-5 J 6 3 4 F1 ✓ · F2 ✗ ✓ (hypothesis log marks H2 confirmed but narrative body differs) high 144s
claude-haiku-4-5 P 13 6 3 F1 ✓ · F2 ✗ ✓ (H3 marked UNVERIFIABLE — honest when evidence insufficient) high 226s
claude-haiku-4-5 Q 14 7 4 F1 ✓ · F2 (25.6ms dismissed as measurement artifact) F1 only; recommended_actions field empty — actions embedded in root_cause_summary as JSON string high 354s
claude-haiku-4-5 S 7 4 4 ✓ (thin) high 87s
claude-haiku-4-5 V 6 2 4 ✓ change control recommendation for wildcard rules high 88s

† Gemini Use Case E required 3 aborted sessions before a successful run; the adapter fix (thinking_budget=0) was applied before Phase 2.


Aggregate Scores

| Dimension | Gemini 2.5 Flash | Sonnet 4.6 | Haiku 4.5 | All models |
|---|---|---|---|---|
| Diagnosis correct | 8/10 | 9/10 | 9/10 | 27/30 (90%) |
| Remediation safe | 6/10 | 9/10 | 7/10 | 22/30 (73%) |
| Multi-fault F2 found | 0/3 | 0/3 | 0/3 | 0/9 (0%) |
| confidence: high reported | 9/10 | 10/10 | 10/10 | 29/30 (97%) |

Scoring note: remediation safety was scored against a domain rubric requiring knowledge of the full current system state. A fix scored unsafe is one that, if executed, would leave the system in a worse state than before (broken adjacent configuration, recurring fault due to daemon re-injection, degraded intermediate state). Scoring required human domain judgment and is not reproducible by an automated schema check.


Confidence Score Distribution

confidence: high was reported in 29 of 30 runs. The single exception was Gemini on Use Case S (final report: low) — the only case where a model reached a wrong root cause on a single-fault scenario.

| Confidence reported | Correct diagnosis + safe remediation | Correct diagnosis + unsafe remediation | Wrong diagnosis |
|---|---|---|---|
| high | 22 | 5 (M, F×2, G⚠, J⚠) | 1 † (S, Gemini; held high through investigation) |
| low | 0 | 0 | 1 † (S, Gemini; final report only) |

† Same single run (Gemini, Use Case S). Confidence was held at high through the investigation and lowered to low only at the moment of RCA generation. It appears in both rows because the investigation state and the final report state differed. This is the only run across 30 where final-report confidence was not high.

Interpretation: confidence: high does not distinguish between a correct, safe answer and a correct-diagnosis/unsafe-fix answer. It reflects the model’s certainty about the structure of its report, not the operational correctness of its recommendations.
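
The cross-tabulation behind this interpretation can be sketched as follows. The five records are a small subset taken from the per-run results (Use Cases F, G, and S) and are not the full 30-run data set, so the counts illustrate the method rather than reproduce the table. The tuple layout and function names are hypothetical.

```python
from collections import Counter

# (model, use case, final confidence, diagnosis correct, remediation safe)
runs = [
    ("claude-sonnet-4-6", "F", "high", True, True),
    ("gemini-2.5-flash", "F", "high", True, False),   # omits Microsoft.Sql
    ("claude-haiku-4-5", "F", "high", True, False),   # omits Microsoft.Sql
    ("claude-haiku-4-5", "G", "high", True, False),   # orphans prio root qdisc
    ("gemini-2.5-flash", "S", "low", False, False),   # wrong root cause
]

def outcome(diagnosis_correct, remediation_safe):
    """Collapse the two scored dimensions into one outcome label."""
    if not diagnosis_correct:
        return "wrong diagnosis"
    return "correct + safe" if remediation_safe else "correct + unsafe"

def cross_tab(records):
    """Count runs per (confidence, outcome) cell."""
    return Counter((conf, outcome(diag, safe))
                   for _, _, conf, diag, safe in records)

table = cross_tab(runs)
for (conf, label), count in sorted(table.items()):
    print(f"{conf:5} {label:17} {count}")
```

In this subset, three of the four high-confidence runs carry an unsafe fix, which is the pattern the interpretation describes: the reported confidence does not separate the safe cell from the unsafe one.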


Baseline Tool Usage

Use Cases P and Q both included a named pre-window baseline session ID in the prompt. Using it produces a diff against current state in one tool call.

| Model | UC P baseline used? | UC P time to F1 | UC Q baseline used? | UC Q time to F1 |
|---|---|---|---|---|
| gemini-2.5-flash | Yes | 21s (1 call) | Yes | 146s |
| claude-sonnet-4-6 | Yes | 73s | Yes | 298s |
| claude-haiku-4-5 | No (6 manual az commands) | 226s | No (7 manual az commands) | 354s |

Haiku used the baseline tool correctly in Use Cases F and J. The failure in P and Q is behavioral — it read the prompt but did not act on the named baseline. This is not a capability gap; it is a context-to-tool-selection gap.


Use Case Q — Latency Symptom Handling (Secondary Fault Detail)

The secondary symptom (latency spike, 25.6ms observed RTT, ~25× intra-VNet baseline) was explicitly stated in the prompt. All three models measured the 25.6ms. None identified tc netem as the root cause.

| Model | What was measured | What was concluded | Error type |
|---|---|---|---|
| gemini-2.5-flash | 25.6ms RTT | Attributed to NVA routing hair-pin | Wrong attribution: no NVA in environment |
| claude-sonnet-4-6 | 25.6ms RTT; formed H4 (tc qdisc) | Ran firewall inspector; no iptables rules; concluded “no traffic shaping” | Category error: firewall inspector ≠ tc qdisc inspector |
| claude-haiku-4-5 | 25.6ms RTT | “Stable and not anomalous — measurement artifact” | Dismissed explicit stated symptom as noise |

The tc qdisc check (az vm run-command invoke ... "tc qdisc show dev eth0") was used successfully by all three models in Use Cases G and J to identify tc netem faults. The tool knowledge was present; the investigation path was simply not applied to the stated latency symptom.
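
The missing step is a mapping from each stated symptom to the inspection that can confirm or refute it. The sketch below shows that mapping and builds the corresponding az vm run-command invoke argument list. The symptom labels, the SYMPTOM_CHECKS table, and the function names are hypothetical; the tc qdisc command, resource group, and VM name are the ones used in this record. For simplicity every check targets a single VM.

```python
# Hypothetical mapping from a symptom class to the shell check that tests it.
SYMPTOM_CHECKS = {
    "latency": "tc qdisc show dev eth0",
    "port_blocked": "iptables -S INPUT",
}

def run_command_argv(resource_group, vm_name, script):
    """Build the argv for running one shell script on a Linux VM via the Azure CLI."""
    return [
        "az", "vm", "run-command", "invoke",
        "--resource-group", resource_group,
        "--name", vm_name,
        "--command-id", "RunShellScript",
        "--scripts", script,
    ]

def checks_for(symptoms, resource_group, vm_name):
    """One command per stated symptom; symptoms without a known check are skipped."""
    return [run_command_argv(resource_group, vm_name, SYMPTOM_CHECKS[s])
            for s in symptoms if s in SYMPTOM_CHECKS]

# Use Case Q states two symptoms, so two checks are planned.
planned = checks_for(["port_blocked", "latency"], "nw-forensics-rg", "tf-source-vm")
for argv in planned:
    print(argv[-1])
```

Planning one check per stated symptom makes it harder to close an investigation while an explicitly reported symptom remains untested.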


Source data: Ghost Agent audit logs, ghost_report*.md files per run, and per-run session recordings. Use case setup scripts and fault injection commands are in demo/README.md. Detailed narrative analysis is in phase2-model-swap.md.