The hardest part of building a Network Forensics AI wasn’t the network. It was the reasoning. A senior network engineer knows what to check next. Teaching that to an AI is harder than it sounds.
The Problem With Expertise
When a senior network engineer sits down to diagnose a cloud connectivity failure, they follow a mental model built over years of incident response. NSG clean? Check routing — configured route table first, then the effective route table at the NIC, which is the computed state that actually determines where the packet goes. Still clean? Look at the OS firewall. Still nothing? Escalate to packet capture.
That mental model is not written down anywhere. It lives in the engineer’s head. It is the accumulated result of hundreds of investigations, dozens of war rooms, and enough late nights that certain failure patterns are recognisable before the first command is run.
Building an AI network forensics agent — Ghost Agent — meant making that implicit expertise explicit. The system prompt is where that happens. It is not configuration. It is a formal specification of how an expert investigates.
This post is about the journey I went through writing that specification: what accumulated, what broke, how I fixed it, and what I learned about the difference between an AI that passes test cases and one that reasons correctly.
Starting From First Principles
The initial system prompt for Ghost Agent was written around a clear investigation framework: start with local diagnostics, escalate to Azure control-plane reads, escalate further to packet capture only when the lower layers were inconclusive. Form hypotheses first. Gate every risky action through a human.
┌──────────────────────────────────────────────────────────────┐
│ INVESTIGATION ESCALATION │
│ │
│ 1. Local diagnostics │
│ OS-level checks, reachability, tool availability │
│ │ │
│ │ inconclusive │
│ ▼ │
│ 2. Azure control-plane reads │
│ NSG rules · routing tables · effective route at NIC │
│ │ │
│ │ inconclusive │
│ ▼ │
│ 3. Packet capture │
│ Wire-level evidence — independent of OS-reported status │
│ │
│ ⚠ Every risky action → human approval gate │
└──────────────────────────────────────────────────────────────┘
That structure held up well for the first set of scenarios — NSG misconfiguration, routing anomalies, packet capture investigations. The agent formed hypotheses, ran the right queries, found the faults, and produced coherent RCA reports. For these use cases, the system prompt read like a well-written runbook: general enough to apply to a class of problems, specific enough to give the model real guidance.
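The escalation ladder above can be sketched as a control loop. This is a minimal illustration, not the Ghost Agent implementation: the `run_checks` and `request_approval` callables, and the level names, are hypothetical stand-ins for the agent's actual tools.

```python
from enum import Enum, auto

class Level(Enum):
    LOCAL = auto()           # OS-level checks, reachability, tool availability
    CONTROL_PLANE = auto()   # NSG rules, route tables, effective route at NIC
    PACKET_CAPTURE = auto()  # wire-level evidence (risky: human gate applies)

def investigate(problem, run_checks, request_approval):
    """Escalate one layer at a time; gate the risky layer on a human.

    run_checks(level, problem) returns a finding or None (inconclusive);
    request_approval(level) is the human approval gate. Both are
    illustrative signatures, not real Ghost Agent APIs.
    """
    for level in Level:
        if level is Level.PACKET_CAPTURE and not request_approval(level):
            return {"status": "blocked", "level": level.name}
        finding = run_checks(level, problem)
        if finding is not None:  # conclusive evidence at this layer
            return {"status": "confirmed", "level": level.name, "finding": finding}
    return {"status": "inconclusive"}  # all layers exhausted
```

The point of the structure is that each layer is only entered when the previous one was inconclusive, and the expensive, risky layer never runs without approval.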
The first sign of trouble was subtle. As I added more demo scenarios, I started adding more prescriptive instructions. Not because the agent was broken, but because each new scenario surfaced a new edge case that the system prompt did not explicitly cover. The natural response was to add a rule. Then another. Then a more specific one.
The system prompt was getting longer. It was also getting more specific. And that is where the problems started.
The Curve-Fitting Trap
By the time I had twelve demo scenarios covering NSG misconfigurations, routing anomalies, packet captures, and VM-to-VM performance degradation, the system prompt contained instructions like this:
“When pipe_meter shows degraded performance and Azure control-plane checks are clean, the most likely fault class is OS-level traffic control on one of the VMs. Run: az vm run-command invoke tc qdisc show on BOTH source and dest VMs.”
Read that carefully. It is not a principle. It is a script for a specific scenario. It tells the agent exactly which command to run, in which situation, on which VMs. It was written that way because it made a specific demo scenario work correctly.
This is curve fitting. The system prompt was no longer encoding a methodology for network investigation. It was encoding solutions to the specific problems I had tested it against.
The risk is not obvious until you ask: what happens when a fault that looks like OS-level traffic shaping is actually an iptables rule? The agent runs tc qdisc show, finds nothing unexpected, and has no principled guidance for what to do next. The prescriptive instruction covered the scenario it was written for. It was not general enough to cover the adjacent one.
I took an explicit decision to refactor. The goal was to replace prescriptive commands with reasoning principles — instructions that would equip the agent to navigate a broader range of problems, not just the twelve I had scripted.
The rewrite changed “run tc qdisc show” into: “pipe_meter results showing degradation are the trigger for investigating OS-level traffic shaping or filtering on the source and/or destination VMs. Investigate both — the fault may be on either endpoint.” No specific command. The agent’s knowledge of OS-level traffic tooling, combined with the guidance to look at both endpoints, does more work than the prescribed command ever could.
BEFORE (prescriptive command) AFTER (reasoning principle)
────────────────────────────── ───────────────────────────
"Run tc qdisc show on BOTH "pipe_meter degradation is the
source and dest VMs" trigger for investigating OS-level
traffic shaping or filtering on
source and dest"
Scenarios it handles: Scenarios it handles:
✓ tc-tbf bandwidth rate limiter ✓ tc-tbf bandwidth rate limiter
✗ iptables DROP rule ✓ iptables DROP rule
✗ nftables filtering ✓ nftables filtering
✗ net_sched htb shaper ✓ net_sched htb shaper
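The difference between the two columns can be made concrete. A hedged sketch, assuming nothing about the real prompt beyond what is quoted above: the principle maps one trigger to a class of checks rather than to one command, so every mechanism in the class is covered. The command strings are real Linux tools, but the `checks_for` helper is purely illustrative.

```python
# The fault class "OS-level traffic shaping or filtering" spans several
# distinct mechanisms; a single prescribed command covers only one of them.
TRAFFIC_CONTROL_CHECKS = [
    "tc qdisc show",      # net_sched shapers (tbf, htb, ...)
    "iptables -L -n -v",  # legacy netfilter rules (e.g. a silent DROP)
    "nft list ruleset",   # nftables filtering
]

def checks_for(symptom: str, azure_clean: bool) -> list:
    """Return the check class the principle selects, for BOTH endpoints."""
    if "degraded throughput" in symptom and azure_clean:
        return [cmd + "  # run on source AND dest" for cmd in TRAFFIC_CONTROL_CHECKS]
    return []
```

A prescriptive instruction is equivalent to hard-coding the first list entry; the principle lets the agent's own knowledge of the tool family fill in the rest.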
This refactoring extended to other parts of the system prompt as well — removing phrases that were essentially decision rules for demo scenarios, replacing them with the network engineering principles that would have led an expert to the same place.
The Multi-Fault Attribution Problem
I thought the refactoring had resolved the main issues. Then I tested a scenario with two independent faults injected on the same VM: a bandwidth rate limiter on the source VM, and a separate iptables rule silently dropping ICMP traffic.
The problem statement the agent received described two symptoms: significantly degraded throughput, and complete ping failures between the VMs.
The agent investigated thoroughly. It found the bandwidth rate limiter. The evidence was solid — pipe_meter showed consistent throughput well below the expected baseline, and the rate limiter was visible in the OS traffic control configuration. The agent confirmed that hypothesis with appropriate confidence.
Then it called complete_investigation.
The ping failures were attributed to the same cause. The reasoning was plausible-sounding: “The rate limiter is causing significant packet loss, which explains both the throughput degradation and the ping failures.”
This was wrong. A bandwidth rate limiter degrades throughput proportionally across all traffic. It does not selectively drop ICMP while allowing TCP. The ping failures had a different cause — the iptables rule — which the agent never looked for because it believed the first confirmed finding explained everything.
SYMPTOMS OBSERVED FAULTS INJECTED (UNKNOWN TO AGENT)
───────────────── ───────────────────────────────────
S1: Degraded throughput F1: tc-tbf bandwidth rate limiter
(source VM)
S2: Complete ping failures F2: iptables ICMP DROP rule
(source VM)
AGENT'S ATTRIBUTION (wrong) CORRECT ATTRIBUTION
─────────────────────────── ───────────────────
S1 ──→ F1 ✓ S1 ──→ F1 ✓
S2 ──→ F1 ✗ S2 ──→ F2 ✓
Why it was wrong:
A rate limiter degrades ALL traffic proportionally.
It does not selectively drop ICMP while passing TCP.
S2 requires a mechanically distinct cause — which was never investigated.
The confidence level was high. The attribution was wrong. And critically, the agent did not hedge. It did not say “likely” or “possibly.” It reached a confident, well-reasoned, factually incorrect conclusion.

Why the First Fix Failed
My first attempt to address this used a linguistic gate: instructions in the system prompt to catch hedging language before completing an investigation.
“Before calling complete_investigation, verify: does each observed symptom have its own confirmed, evidence-backed cause? If any symptom is still unexplained, register additional hypotheses.”
I also added a HARD GATE: “If you find yourself writing ‘likely caused by’ or ‘probably explains’ in your conclusions, stop — this is a signal that a symptom is not yet confirmed.”
This approach was wrong in design. The agent’s failure was not that it hedged — it was that it was confident. Confident and wrong. A linguistic gate catches uncertainty. It does not catch incorrect certainty. The agent stated its attribution with full confidence, used no hedging language, and the gate never fired.
The lesson: linguistic instructions cannot fix structural reasoning failures. If the agent can reach a wrong conclusion while being confident, a rule that catches uncertain language is not a safety net for that failure class.
The Structural Fix
The second attempt did not check the agent’s language. It required the agent to execute a specific reasoning process before completion was permitted — regardless of confidence level.
I called it the MANDATORY PRE-COMPLETION CHECKLIST. Before calling complete_investigation, the agent must explicitly work through every distinct symptom from the problem statement:
(a) Name the symptom as stated.
(b) State the specific mechanism identified as its cause.
(c) Verify mechanism-symptom consistency: ask — if only this mechanism were present and nothing else, would this specific symptom occur? If the answer is no or uncertain, the symptom is not yet explained.
(d) Cite the specific audit ID that provides direct evidence for this symptom.
For each symptom in the problem statement:
│
▼
(a) Name the symptom exactly as stated
│
▼
(b) State the mechanism identified as its cause
│
▼
(c) Would THIS mechanism ALONE produce THIS specific symptom?
│
┌─────┴──────────────────┐
YES NO or UNCERTAIN
│ │
▼ ▼
(d) Cite the audit ID Register new hypothesis
as direct evidence Continue investigating ──→ back to (a)
for this symptom
│
▼
All symptoms cleared through (a)–(d)?
┌─────┴───┐
YES NO
│ └──────────────→ Continue investigating
▼
complete_investigation permitted
Only after completing (a) through (d) for every symptom may the agent call complete_investigation. If any symptom fails step (c) or lacks a direct evidence reference for step (d), the agent must register a new hypothesis and continue investigating.
Step (c) is the critical one. Applied to the failing scenario: “If only a bandwidth rate limiter were present and nothing else, would complete ping failures occur?” The answer is no. A rate limiter does not selectively drop ICMP. The symptom is not explained. Investigation must continue.
This works not because it checks the agent’s language, but because it forces a specific piece of domain reasoning that the agent was not performing spontaneously. The checklist is not a linguistic guardrail — it is a required reasoning step.
I also made the scope of this requirement explicit: “This checklist applies regardless of confidence level — a confident but incorrect attribution is still an incorrect attribution.” The gate fires for certain conclusions just as it fires for uncertain ones.
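The checklist is enforceable because it is structural, not linguistic. A minimal sketch, assuming a simple data model: `Attribution` records one symptom's (a)-(d) answers, and `may_complete` is the gate in front of complete_investigation. The audit IDs and field names are hypothetical, not from the Ghost Agent codebase.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Attribution:
    symptom: str                     # (a) symptom exactly as stated
    mechanism: str                   # (b) mechanism identified as its cause
    mechanism_alone_suffices: bool   # (c) would THIS mechanism ALONE produce it?
    audit_id: Optional[str]          # (d) direct evidence reference, or None

def may_complete(attributions, symptoms) -> bool:
    """Permit complete_investigation only if every symptom clears (a)-(d)."""
    covered = {a.symptom for a in attributions}
    if set(symptoms) - covered:
        return False  # an unexplained symptom remains
    return all(a.mechanism_alone_suffices and a.audit_id for a in attributions)
```

Applied to the two-fault scenario: attributing the ping failures to the rate limiter sets `mechanism_alone_suffices=False` at step (c), so the gate denies completion regardless of how confidently the conclusion is phrased.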
What This Teaches
The system prompt journey for Ghost Agent produced three generalisable lessons for anyone building LLM-based diagnostic systems.
First: the system prompt is a formal specification, not a configuration file. Implicit expertise — the “next step” that every senior engineer would know to take — does not transfer to a model unless it is written down explicitly. Treat the system prompt as a living document. Audit investigation transcripts. Every case where an expert would have done something the agent did not is a gap in the specification.
Second: optimising for test cases is not the same as encoding methodology. Prescriptive instructions that make specific scenarios work can actively prevent the agent from navigating adjacent ones. Replace commands with principles. Replace “run this specific command” with “the reasoning that would lead an expert to run it.” A system prompt that encodes the expert’s reasoning process will generalise; one that encodes solutions to known test cases will not.
Third: for structural reasoning failures, structural fixes are required. A confident wrong answer cannot be caught by a gate that detects uncertain language. The multi-fault attribution problem needed a mandatory reasoning step, not better words. When you find a class of reasoning failure in an LLM agent — not just a wrong answer, but a wrong reasoning pattern — the fix is a structural requirement embedded in the system prompt, not a linguistic hint.
The system prompt for Ghost Agent is now a specification with teeth: explicit investigation ordering, required hypothesis management, a mandatory checklist before any conclusion. Not because I added more rules — but because I replaced the rules that encoded my specific test cases with the principles that would have generated the right answers in the first place.
Ghost Agent is open source. The full system prompt, all demo scenarios, and the test suite are in the repository. → https://github.com/ranga-sampath/agentic-network-tools
