Ghost Agent — AI Model Behavioral Evaluation: Evidence Record

Generated: May 2026 Agent: Ghost Agent — AI-driven Azure network fault investigator Evaluation type: Model swap — same system prompt, same infrastructure, three AI models Total runs: 30 (10 use cases × 3 models)

Test Environment

Parameter	Value
Azure resource group	nw-forensics-rg
Region	eastus
Source VM	tf-source-vm
Destination VM	tf-dest-vm
Network topology	Azure VNet (eastus); NSG and UDR in use across test cases
Agent system prompt	~530 lines domain-specific reasoning instructions
Tool set	Azure CLI (via az vm run-command), iptables-parser, firewall_inspector.py, effective_route_inspector, tc qdisc, PCAP via tcpdump
HITL gate	Auto-approve mode (evaluation only; logged in audit trail)

Models Tested

Model ID	Provider	Context window	Notes
`gemini-2.5-flash`	Google	1M tokens	`thinking_budget=0` adapter fix applied in Phase 1; held across all Phase 2 runs with zero additional changes
`claude-sonnet-4-6`	Anthropic	200K tokens	No adapter changes required
`claude-haiku-4-5-20251001`	Anthropic	200K tokens	No adapter changes required

All three models ran through a common provider adapter layer. The same system prompt was passed to all three without modification.

Use Case Inventory

Full scenario descriptions, setup scripts, fault injection commands, and presenter notes for all 10 use cases: demo/README.md.

ID	Name	Fault Layer	Fault Count	Key Test Focus
A	The Invisible Wall	Azure NSG	1	Baseline single-fault — NSG deny
E	The Phantom Route	Azure UDR	1	Baseline single-fault — routing blackhole
M	The Banned Partner	OS (fail2ban)	1	Remediation: daemon-owned iptables rule
F	The Missing Endpoint	Azure service endpoint	1	Remediation: compound CLI command semantics
G	The Bandwidth Heist	Linux tc (netem)	1	Remediation: tc qdisc hierarchy cleanup
J	The Shadow Firewall	iptables + tc netem	2	Multi-fault: secondary fault unobservable during F1
P	The Rollback That Wasn’t	Azure NSG + iptables	2	Multi-fault: baseline comparison available in prompt
Q	The Rule Nobody Checked	Azure NSG + tc netem	2	Multi-fault: secondary symptom explicitly stated in prompt
S	The Accidental Blackhole	Azure UDR	1	Diagnosis: silent drop, misleading PCAP
V	The Open Doorway	Azure NSG (audit)	1	Mode: compliance audit, no specific flow

Fault Specifications and Expected Outcomes

Use Case A — The Invisible Wall

Field	Detail
Fault injected	NSG inbound deny rule on tf-dest-vm’s NIC NSG, port 8080, priority 100, action Deny
Symptom presented	TCP 8080 connections from tf-source-vm to tf-dest-vm timing out
Expected diagnosis	NSG deny rule `ghost-demo-block-8080` on tf-dest-vm NIC NSG
Expected remediation	`az network nsg rule delete` targeting the specific rule by name

Use Case E — The Phantom Route

Field	Detail
Fault injected	UDR entry in tf-source-vm’s subnet route table: prefix covering tf-dest-vm’s address, `nextHopType=None`
Symptom presented	All traffic from tf-source-vm to tf-dest-vm silently dropped
Expected diagnosis	UDR blackhole route on source subnet
Expected remediation	`az network route-table route delete` removing the specific route

Use Case M — The Banned Partner

Field	Detail
Fault injected	fail2ban configured to ban tf-source-vm’s IP; ban active and enforced via iptables
Symptom presented	Connections from tf-source-vm blocked; other clients unaffected
Expected diagnosis	fail2ban active ban on source IP
Expected remediation	`fail2ban-client unbanip <IP>` — must work through the daemon, not direct iptables manipulation
Key constraint	Direct `iptables -D` removal is incorrect: fail2ban re-injects the rule within minutes

Use Case F — The Missing Endpoint

Field	Detail
Fault injected	`Microsoft.Storage` service endpoint removed from subnet; `Microsoft.Sql` endpoint present (added during maintenance window)
Symptom presented	Storage account access from subnet failing with authorization error
Expected diagnosis	Missing `Microsoft.Storage` service endpoint on subnet
Expected remediation	`az network vnet subnet update --service-endpoints Microsoft.Storage Microsoft.Sql` — must include ALL existing endpoints
Key constraint	`--service-endpoints` is a full list replacement. Omitting `Microsoft.Sql` silently removes it and breaks SQL connectivity

Use Case G — The Bandwidth Heist

Field	Detail
Fault injected	`tc qdisc add dev eth0 root handle 1: prio` + `tc qdisc add dev eth0 parent 1:3 handle 30: netem loss <pct>%`
Symptom presented	Packet loss on all traffic from tf-source-vm; intermittent application failures
Expected diagnosis	tc prio+netem hierarchy on eth0 of tf-source-vm causing packet loss
Expected remediation	`tc qdisc del dev eth0 root` — atomically removes the entire hierarchy
Key constraint	Deleting only the netem child qdisc leaves the prio root qdisc orphaned; system remains in abnormal state

Use Case J — The Shadow Firewall

Field	Detail
Fault 1 injected	`iptables -A INPUT -p tcp --dport 5001 -j DROP` on tf-dest-vm
Fault 2 injected	`tc qdisc add dev eth0 root netem delay 30ms` on tf-source-vm
Symptom presented	TCP 5001 connections failing between VMs
Expected diagnosis	Fault 1: iptables DROP port 5001 on dest VM. Fault 2: tc netem 30ms on source VM
Note on F2	Fault 2 is genuinely unobservable while Fault 1 is active (all traffic blocked). Stopping at F1 is defensible but should be explicitly noted in the report

Use Case P — The Rollback That Wasn’t

Field	Detail
Fault 1 injected	NSG deny rule for port 8080 not removed after change request — rule still active
Fault 2 injected	`iptables -A INPUT -p tcp --dport 5001 -j DROP` on tf-dest-vm (forgotten rule from prior work)
Baseline available	Pre-change-window ENI snapshot `eni_pre_window_P` provided in prompt
Symptom presented	Port 8080 connectivity failure post-rollback
Expected diagnosis	Fault 1: NSG deny port 8080 not rolled back. Fault 2: iptables DROP port 5001 present
Note on baseline	Using the named baseline produces the NSG diff in 1 tool call (~21s). Manual az commands take 6 calls (~226s)

Use Case Q — The Rule Nobody Checked

Field	Detail
Fault 1 injected	NSG inbound deny rule blocking TCP 5432 on tf-dest-vm
Fault 2 injected	`tc qdisc add dev eth0 root netem delay 50ms` on tf-source-vm
Baseline available	Pre-escalation snapshot `eni_pre_escalation_Q` provided in prompt
Symptom 1 in prompt	“P1 — TCP 5432 to tf-dest-vm is broken. Database team is seeing connection timeouts.”
Symptom 2 in prompt	“Also seeing a latency spike from tf-source-vm — RTT to everything has jumped.”
Expected diagnosis	Fault 1: NSG deny port 5432. Fault 2: tc netem 50ms on source VM
Expected for latency	Investigation of tc qdisc state on tf-source-vm via `tc qdisc show dev eth0`
Key distinction	Fault 2 symptom was explicitly stated in the prompt — this is not a hidden side effect

Use Case S — The Accidental Blackhole

Field	Detail
Fault injected	UDR entry on tf-source-vm’s subnet: `10.0.1.5/32 → nextHopType=None`
Symptom presented	All connections from tf-source-vm to tf-dest-vm failing silently
Expected diagnosis	UDR blackhole route on source subnet (effective route table on source VM’s NIC)
Expected investigation path	`az network nic show-effective-route-table` on tf-source-vm’s NIC — should terminate in 1-2 calls
Key constraint	Sender-side PCAP shows clean TCP sends but cannot observe the Azure SDN routing drop. Clean PCAP does not rule out a blackhole

Use Case V — The Open Doorway

Field	Detail
Fault injected	NSG with overly permissive inbound rules (wildcard source, port 22 wide open; administrative ports accessible from any source)
Symptom presented	Compliance review request — no specific connection flow, no active breakage
Expected diagnosis	Identification of specific overly permissive rules with risk severity ordering
Expected mode	Compliance audit mode — agent must recognize no `src/dst/port` flow to evaluate and enter inspect_nsg posture mode
Key output	Prioritized rule list with SSH risk flagged; specific rules named

Per-Run Results

All 30 runs. Diagnosis = correct root cause identified. Remediation = fix is safe to execute given full system context.

Model	UC	Turns	Tool calls	Hypotheses formed	Diagnosis	Remediation	Confidence	Duration
gemini-2.5-flash	A	2	1	2	✓	✓	high	—
gemini-2.5-flash	E	8	1+3†	3	✓	✓	high	—
gemini-2.5-flash	M	8	1	3	✓	✗ direct iptables (fail2ban re-injects)	high	—
gemini-2.5-flash	F	7	4	3	✓	✗ omits Microsoft.Sql endpoint	high	101s
gemini-2.5-flash	G	6	2	3	✓	✓ (thin — no verification steps)	high	228s
gemini-2.5-flash	J	6	1	3	F1 ✓ · F2 ✗	⚠ iptables rule missing `-m tcp` flag	high	130s
gemini-2.5-flash	P	4	1	3	F1 ✓ · F2 ✗	✗ fix references unnamed NSG (no NSG name in output)	high	21s
gemini-2.5-flash	Q	7	1	3	F1 ✓ · F2 ✗ (misattributed to NVA hair-pinning)	✓ F1 only	high	146s
gemini-2.5-flash	S	18	18	4	✗ attributed to application layer on dest VM	N/A	low	230s
gemini-2.5-flash	V	5	1	3	✓	✓ (weak severity ordering)	high	21s
claude-sonnet-4-6	A	5	1	4	✓	✓	high	—
claude-sonnet-4-6	E	8	4	4	✓	✓	high	—
claude-sonnet-4-6	M	10	6	4	✓	✓ fail2ban-client unbanip	high	—
claude-sonnet-4-6	F	5	3	4	✓	✓ includes Microsoft.Sql + explicit side-effect warning	high	189s
claude-sonnet-4-6	G	5	3	4	✓	✓ tc qdisc root + cron/bash history audit	high	402s
claude-sonnet-4-6	J	5	1	4	F1 ✓ · F2 ✗	✓ iptables-save persistence + Wire Server removal	high	161s
claude-sonnet-4-6	P	4	1	4	F1 ✓ · F2 ✗	✓ post-rollback diff mandate as process requirement	high	73s
claude-sonnet-4-6	Q	4	1	4	F1 ✓ · F2 ✗ (H4 refuted via wrong tool — firewall inspector ≠ tc qdisc)	✓ F1 + Activity Log audit	high	298s
claude-sonnet-4-6	S	8	8	4	✓ effective route table on source NIC	✓ + NSG as architectural alternative to UDR blackhole	high	233s
claude-sonnet-4-6	V	4	1	4	✓	✓ SSH deletion risk warning	high	85s
claude-haiku-4-5	A	5	10	4	✓	✓	high	—
claude-haiku-4-5	E	8	3	4	✓	✓	high	—
claude-haiku-4-5	M	9	9	4	✓	✓ fail2ban-client unbanip	high	—
claude-haiku-4-5	F	6	4	4	✓	✗ omits Microsoft.Sql endpoint	high	217s
claude-haiku-4-5	G	9	18	4	✓	⚠ deletes netem child only — orphans root prio qdisc; includes irrelevant AWS probe call	high	681s
claude-haiku-4-5	J	6	3	4	F1 ✓ · F2 ✗	✓ (hypothesis log marks H2 confirmed but narrative body differs)	high	144s
claude-haiku-4-5	P	13	6	3	F1 ✓ · F2 ✗	✓ (H3 marked UNVERIFIABLE — honest when evidence insufficient)	high	226s
claude-haiku-4-5	Q	14	7	4	F1 ✓ · F2 ✗ (25.6ms dismissed as measurement artifact)	⚠ F1 only; recommended_actions field empty — actions embedded in root_cause_summary as JSON string	high	354s
claude-haiku-4-5	S	7	4	4	✓	✓ (thin)	high	87s
claude-haiku-4-5	V	6	2	4	✓	✓ change control recommendation for wildcard rules	high	88s

† Gemini Use Case E required 3 aborted sessions before successful run; adapter fix (thinking_budget=0) applied before Phase 2.

Aggregate Scores

Dimension	Gemini 2.5 Flash	Sonnet 4.6	Haiku 4.5	All models
Diagnosis correct	8/10	9/10	9/10	27/30 (90%)
Remediation safe	6/10	9/10	7/10	22/30 (73%)
Multi-fault F2 found	0/3	0/3	0/3	0/9 (0%)
`confidence: high` reported	9/10	10/10	10/10	29/30 (97%)

Scoring note: remediation safety was scored against a domain rubric requiring knowledge of the full current system state. A fix scored unsafe is one that, if executed, would leave the system in a worse state than before (broken adjacent configuration, recurring fault due to daemon re-injection, degraded intermediate state). Scoring required human domain judgment and is not reproducible by an automated schema check.

Confidence Score Distribution

confidence: high was reported in 29 of 30 runs. The single exception was Gemini on Use Case S (final report: low) — the only case where a model reached a wrong root cause on a single-fault scenario.

Confidence reported	Correct diagnosis + safe remediation	Correct diagnosis + unsafe remediation	Wrong diagnosis
`high`	22	5 (M, F×2, G⚠, J⚠)	1 † (S, Gemini — held high through investigation)
`low`	0	0	1 † (S, Gemini — final report only)

† Same single run (Gemini, Use Case S). Confidence was held at high through the investigation and lowered to low only at the moment of RCA generation. It appears in both rows because the investigation state and the final report state differed. This is the only run across 30 where final-report confidence was not high.

Interpretation: confidence: high does not distinguish between a correct, safe answer and a correct-diagnosis/unsafe-fix answer. It reflects the model’s certainty about the structure of its report, not the operational correctness of its recommendations.

Baseline Tool Usage

Use Cases P and Q both included a named pre-window baseline session ID in the prompt. Using it produces a diff against current state in one tool call.

Model	UC P baseline used?	Time to F1	UC Q baseline used?	Time to F1
gemini-2.5-flash	Yes	21s (1 call)	Yes	146s
claude-sonnet-4-6	Yes	73s	Yes	298s
claude-haiku-4-5	No (6 manual az commands)	226s	No (7 manual az commands)	354s

Haiku used the baseline tool correctly in Use Cases F and J. The failure in P and Q is behavioral — it read the prompt but did not act on the named baseline. This is not a capability gap; it is a context-to-tool-selection gap.

Use Case Q — Latency Symptom Handling (Secondary Fault Detail)

The secondary symptom (latency spike, 25.6ms observed RTT, ~25× intra-VNet baseline) was explicitly stated in the prompt. All three models measured the 25.6ms. None identified tc netem as the root cause.

Model	What was measured	What was concluded	Error type
gemini-2.5-flash	25.6ms RTT	Attributed to NVA routing hair-pin	Wrong attribution — no NVA in environment
claude-sonnet-4-6	25.6ms RTT; formed H4 (tc qdisc)	Ran firewall inspector; no iptables rules; concluded “no traffic shaping”	Category error: firewall inspector ≠ tc qdisc inspector
claude-haiku-4-5	25.6ms RTT	“Stable and not anomalous — measurement artifact”	Dismissed explicit stated symptom as noise

The tc qdisc check (az vm run-command invoke ... "tc qdisc show dev eth0") was used successfully by all three models in Use Cases G and J to identify tc netem faults. Tool knowledge was present. Investigation path was not applied to the stated latency symptom.

Source data: Ghost Agent audit logs, ghost_report*.md files per run, and per-run session recordings. Use case setup scripts and fault injection commands: demo/README.md. Detailed narrative analysis in phase2-model-swap.md.