Network Ghost Agent: An Agentic Network Forensics Investigator for Cloud Infrastructure

For network engineers, cloud architects, and the product leaders who support them.


The Problem

Cloud networks are opaque by design. You get a control plane — route tables, NSG rules, peering state, effective routes — and a data plane you can only observe indirectly, through packet captures that take minutes to provision and reports that take hours to read.

When something breaks, the failure is rarely where it appears. A connectivity outage looks like a firewall problem. The firewall is clean. It’s a routing problem. The route table looks correct. The effective route on the NIC is overriding it. The portal shows all green throughout.

The senior engineer who has seen this pattern before will know to look at the effective route table. The junior engineer will file a ticket. The team without that senior engineer available will spend two hours in the wrong layer.

This is the problem the Ghost Agent was built to solve: encoding the investigation methodology of a senior network forensics engineer into an autonomous CLI that requires your explicit approval before any risky action, maintains a full audit trail, and works directly against your cloud infrastructure.


How Teams Debug Today

Here is what a real cloud network investigation looks like today:

Engineer opens the cloud portal
  → Checks firewall / security group rules manually (UI, no audit record)
      → "Looks fine"
  → Checks route tables (maybe — if they think of it)
      → Misses the effective route divergence
  → Opens a ticket: "network issue, NSG checked, routing unknown"
      → Second engineer takes over cold, no context transfer
  → Hours later: someone with packet capture experience runs tcpdump
      → 500-line output, interpreted manually
  → Root cause: stale UDR pointing to a decommissioned NVA
  → Total time: 2-4 hours, 2 engineers, no durable audit trail

Three failures compound: investigation stops at the first clean layer; local probe evidence is weighted the same as cloud API evidence; and the escalation path exists in a senior engineer’s head and nowhere else.


What Ghost Agent Does

Ghost Agent is a conversational CLI. You describe a symptom in plain English. It forms hypotheses, runs diagnostics autonomously, stops and waits for your approval before anything mutative, captures wire traffic when needed, and produces a forensic RCA with a full command audit trail.

╔══════════════════════════════════════════════════════════════╗
║                  GHOST AGENT (ghost_agent.py)                ║
║  Startup Handler  │  Tool-Use Loop  │  RCA Report Generator  ║
║  • Orphan detect  │  • Gemini API   │  • Reads audit JSONL   ║
║  • Session resume │  • Dispatch     │  • Writes RCA .md      ║
╚═════╤═════════════╧═════════════════╧══════════════╤═════════╝
      │                                               │ (read-only)
shell.execute()   orchestrate()                   ./audit/
      │                 │
      ▼                 ▼
╔══════════════╗  ╔══════════════════╗
║ Safety Shell ║  ║ Cloud            ║
║              ║  ║ Orchestrator     ║
║ 4-tier       ║  ║                  ║
║ classify     ║  ║ Capture          ║
║ Approval     ║  ║ lifecycle        ║
║ gate         ║  ║ Blob download    ║
║ Audit JSONL  ║  ╚════════╤═════════╝
╚══════╤═══════╝           │ invokes
       │           ╔════════╧════════╗
       │           ║ PCAP Forensic   ║
       │           ║ Engine          ║
       │           ║                 ║
       │           ║ tshark extract  ║
       │           ║ Semantic JSON   ║
       │           ║ Gemini RCA      ║
       │           ╚═════════════════╝
       │           (also standalone)
       ▼
az CLI, ping, dig, traceroute, ss

The escalation ladder is encoded, not assumed:

  Level 1 — Local probes (auto-approved, runs on engineer's machine)
     ping, dig, traceroute, ss, curl
          │  inconclusive → escalate
          ▼
  Level 2 — Cloud API reads (auto-approved, read-only verb)
     az nsg rule list, az route-table list,
     az nic show-effective-route-table, az vm show
          │  packet-level evidence needed → escalate
          ▼
  Level 3 — Network Watcher packet capture (approval required)
     capture_traffic → burst-poll → download → PCAP forensic analysis
          │  investigation complete
          ▼
  Level 4 — RCA generation with full evidence citation
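"Encoded, not assumed" can be made concrete with a small data structure. The sketch below is illustrative — the rung names and the `escalate` helper are assumptions, not the shipped implementation — but it shows the shape of a ladder the agent can walk deterministically instead of relying on a senior engineer's memory:

```python
# A sketch of the escalation ladder as data. Names and fields are
# illustrative, not the actual Ghost Agent internals.
ESCALATION_LADDER = [
    {"level": 1, "name": "local probes",    "approval": "auto",
     "tools": ["ping", "dig", "traceroute", "ss", "curl"]},
    {"level": 2, "name": "cloud API reads", "approval": "auto",
     "tools": ["az ... list", "az ... show"]},
    {"level": 3, "name": "packet capture",  "approval": "required",
     "tools": ["az network watcher packet-capture create"]},
    {"level": 4, "name": "RCA generation",  "approval": "auto",
     "tools": []},
]

def escalate(level: int) -> dict:
    """Return the next rung when the current level is inconclusive,
    or raise once the investigation is already at the final level."""
    for rung in ESCALATION_LADDER:
        if rung["level"] == level + 1:
            return rung
    raise ValueError("investigation already at the final level")
```

Because the ladder is data, the approval requirement at Level 3 is a property of the rung itself — the LLM cannot talk its way past it.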

Every action is classified before it executes. Read-only cloud operations are auto-approved. Anything mutative — update, create, delete — stops and waits for you. You see the exact command, the reasoning behind it, and you approve, deny, or modify it. Your decision is written to the audit trail.

A Production Investigation, Step by Step

The engineer’s prompt:

“We’re seeing intermittent TCP connection failures from tf-source-vm to tf-dest-vm on port 8080. Control plane looks clean — NSGs and routes have been checked. We need wire-level evidence.”

[Ghost Agent] Forming hypotheses:
  H1: NSG deny rule on port 8080 — not yet ruled out despite reported check
  H2: Routing anomaly causing intermittent path failures
  H3: TCP-level issue — MTU mismatch, retransmission storm, or RST injection

[Shell] SAFE — auto-approved: az network nsg rule list --nsg-name tf-dest-vm-nsg ...
[Shell] SAFE — auto-approved: az network nic show-effective-route-table --resource-group nw-forensics-rg ...

[Ghost Agent] NSG is clean — all rules ALLOW on port 8080. H1 REFUTED.
Effective route table confirms VNetLocal path — no UDR overrides. H2 REFUTED.
Control plane is entirely clean. H3 requires wire-level evidence.
Escalating to packet capture.

┌─────────────────────────────────────────────────────────────────────┐
│  SAFETY SHELL ALERT                                                 │
│  TIER: 2  │  CLASSIFICATION: RISKY                                  │
│  COMMAND:    az network watcher packet-capture create --vm tf-sour… │
│  RISK:       mutative operation — create verb (Tier 2)              │
│  REASONING:  H3 requires wire-level evidence. All control-plane …  │
│                                                                     │
│  [A]pprove   [D]eny   [M]odify command                             │
└─────────────────────────────────────────────────────────────────────┘
Your choice: A
[Shell] APPROVED: az network watcher packet-capture create --vm tf-sour…

[Cloud Orchestrator] Capture created: ghost-tf-source-vm_20260222T091543
  Polling status ... Succeeded
  Downloading ghost-tf-source-vm_20260222T091543.cap from nwlogs080613 ...
  Running PCAP forensic analysis ...

[Ghost Agent] Forensic report received. Key findings:
  TCP retransmission rate: 38%  (threshold: 5%)
  ICMP type 3 code 4 (Fragmentation Needed / PMTUD failure): 47 packets
  Oversized frames silently dropped — source not adjusting segment size

  Root cause: tf-source-vm has jumbo frames enabled (MTU 9000).
  The network path MTU is 1500. Every TCP session that produces segments
  larger than 1500 bytes fails intermittently. 

══════════════════════════════════════════════════════
  RCA REPORT WRITTEN: ./audit/ghost_rca_ghost_20260222_091831.md
  Confidence: high  |  Turns: 5
══════════════════════════════════════════════════════

Five turns. The Safety Shell blocked on the packet-capture create, waited for the engineer’s approval, logged the decision, then continued. The RCA names the exact failure — an MTU mismatch invisible to the control plane — with the specific ICMP and retransmission evidence cited by audit_id. Every command, every approval decision, and every hypothesis state transition is in the audit trail.


Use Cases, All Tested

Each scenario represents a distinct class of production failure, requiring a different investigation strategy. These four were selected to show the range of what the system handles — from pure control-plane analysis to wire-level forensics to multi-component relational failures.

  Use Case B — The Wire Doesn’t Lie
    What breaks:      Intermittent TCP issue, control plane clean
    What it reveals:  Wire-level PCAP report: retransmission rate, DNS latency,
                      ICMP unreachables — evidence that no control-plane query
                      can produce

  Use Case D — The Two-Headed Hydra
    What breaks:      Two services fail after an NSG maintenance window
    What it reveals:  Two deny rules at different priorities — two engineers,
                      two separate changes, attributed individually in the RCA

  Use Case E — The Phantom Route
    What breaks:      NSG clean, portal green, traffic vanishes
    What it reveals:  Stale UDR pointing to an NVA that was planned but never
                      provisioned

  Use Case F — The Silent Gatekeeper
    What breaks:      Storage unreachable, all control plane clean
    What it reveals:  Service endpoint removed during routine subnet
                      maintenance — invisible unless both sides of the
                      relationship are checked in the same step

Use Case F deserves a specific call-out. The storage account firewall correctly allows traffic from the subnet. The subnet itself has no error condition. Neither component shows a problem in isolation. The failure is only visible when you check both sides of the service endpoint relationship in the same diagnostic step — something that requires knowing to look for it. Ghost Agent finds it in under five minutes.


What It Takes to Run

Scope: Currently targets Azure. The Safety Shell and PCAP Forensic Engine are cloud-agnostic — extending the investigation layer to AWS, GCP, or OCI requires replacing the Azure CLI calls with the equivalent cloud CLI; the safety classification, hypothesis tracking, and forensic analysis pipeline are unchanged.

Prerequisites:

  • Python 3.12+ with uv
  • Azure CLI authenticated: az login
  • A Gemini API key from aistudio.google.com — the free tier is sufficient for most investigations
  • NetworkWatcherAgentLinux extension on target VMs (for packet capture use cases only)

Configuration: Copy demo/sample_config.env to demo/config.env and fill in your Azure resource names and Gemini API key. All credentials are loaded from this file at runtime — no environment variable setup required beyond that.

Cost per investigation: A typical control-plane investigation (Use Cases D, E, F) costs under $0.05 in Gemini API calls at default model settings. Packet capture runs add Azure Network Watcher costs (~$0.10/capture-hour) plus blob storage egress.

Start here: Use Case E if you want pure control-plane diagnosis with no storage account needed — it runs to a confirmed root cause in under five minutes. Use Case B if you have Azure Network Watcher and a storage account configured — Ghost Agent handles the full capture lifecycle automatically: creates the capture, polls for completion, downloads the file, runs forensic analysis, and generates the RCA report.


Four Independent Tools

Ghost Agent is the investigation layer. It is assembled from three reusable components that each stand alone — you can drop any of them into your own tooling without taking the rest.

🛡️ Agentic Safety Shell

The deterministic guardrail between any AI agent and your infrastructure. Every proposed command passes through a four-tier classification pipeline before executing:

  Tier 0 — Forbidden list     rm -rf /, mkfs, fork bombs → unconditionally blocked
  Tier 1 — Always-safe list   ping, dig, traceroute, az list/show → auto-approved
  Tier 2 — Verb matching      update, create, delete → requires approval
  Tier 3 — Dangerous patterns sudo, &&, $(...) injection → requires approval
  Default — Unknown input     anything unrecognised → requires approval

The default tier is the critical design decision: anything unrecognised is treated as requiring approval, not as safe. Classification is deterministic — no LLM involved. Every tier is independently unit-testable with adversarial inputs.

Drop it between your agent and any shell execution path.
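The tier logic can be sketched in a few lines. The patterns below are illustrative stand-ins for the real rule sets, and one ordering choice is worth noting: dangerous patterns are checked before the allowlist, so a compound command cannot piggyback a mutation onto an allowlisted probe:

```python
import re

# Illustrative rule sets — not the shipped lists.
FORBIDDEN      = [r"rm\s+-rf\s+/", r"\bmkfs\b"]           # Tier 0
ALWAYS_SAFE    = [r"^ping\b", r"^dig\b", r"^traceroute\b",
                  r"^az\s+\S+(\s+\S+)*\s+(list|show)\b"]  # Tier 1
MUTATIVE_VERBS = [r"\b(update|create|delete)\b"]          # Tier 2
DANGEROUS      = [r"\bsudo\b", r"&&", r"\$\("]            # Tier 3

def classify(command: str) -> str:
    """Deterministic classification: 'blocked', 'auto-approved',
    or 'needs-approval'. No LLM involved."""
    if any(re.search(p, command) for p in FORBIDDEN):
        return "blocked"            # Tier 0: unconditionally refused
    if any(re.search(p, command) for p in DANGEROUS):
        return "needs-approval"     # Tier 3: checked before the allowlist,
                                    # so "ping x && rm y" never auto-approves
    if any(re.search(p, command) for p in MUTATIVE_VERBS):
        return "needs-approval"     # Tier 2: mutative verb
    if any(re.search(p, command) for p in ALWAYS_SAFE):
        return "auto-approved"      # Tier 1: allowlisted read-only probe
    return "needs-approval"         # Default: unknown is never safe
```

Because every branch is a plain pattern match, each tier can be unit-tested with adversarial inputs in isolation.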

🔍 Agentic PCAP Forensic Engine

AI-powered packet analysis. Takes a .pcap or .cap file, runs tshark to extract structured per-protocol metrics (TCP, DNS, ICMP, ARP), compresses the result to a Semantic JSON summary at up to 95% data reduction, and runs Gemini forensic reasoning over it. Produces a Markdown report with executive summary, ranked anomaly table, and actionable remediation commands.

What it detects: TCP retransmission storms, PMTUD failures, ARP spoofing, DNS DGA patterns, NXDOMAIN spikes, ICMP unreachable correlation, latency percentile regressions.

Run it standalone: python pcap_forensics.py your-capture.pcap
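The compression step is the interesting part: thousands of packets become a summary small enough to reason over. A minimal sketch of that step, assuming the per-packet records have already been flattened out of tshark’s `-T json` output (the field names here are illustrative, not tshark’s own):

```python
from collections import Counter

def summarize(packets: list[dict]) -> dict:
    """Compress flat per-packet records into a small Semantic JSON summary.
    `packets` is assumed to be pre-extracted from `tshark -T json`;
    keys like 'tcp_retransmission' are illustrative stand-ins."""
    total = len(packets)
    retrans = sum(1 for p in packets if p.get("tcp_retransmission"))
    icmp_unreach = Counter(
        (p["icmp_type"], p["icmp_code"]) for p in packets if "icmp_type" in p
    )
    return {
        "packet_count": total,
        "tcp_retransmission_rate": round(retrans / total, 3) if total else 0.0,
        "icmp_unreachable_counts": {
            f"type {t} code {c}": n for (t, c), n in icmp_unreach.items()
        },
    }
```

A summary like this is what lets the LLM cite "retransmission rate: 38%, ICMP type 3 code 4: 47 packets" instead of drowning in raw frames.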

☁️ Agentic Cloud Orchestrator

Azure Network Watcher packet capture lifecycle manager. Creates captures, burst-polls provisioning status, downloads the .cap blob, invokes the PCAP Forensic Engine, and cleans up Azure resources — all as a single audited task. Handles Azure’s platform constraints directly: one active capture per VM, --location required for all non-create operations, orphan detection and cleanup across sessions.

Wire it into any Python automation that needs to create, monitor, analyze, and clean up Azure Network Watcher captures as a single operation.
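The burst-poll step can be sketched as a small loop that keeps the status source pluggable. In the real orchestrator the status would come from `az network watcher packet-capture show`; here `get_status` is an injected callable (an assumption made for testability), and the terminal-state names follow Azure’s provisioning-state convention:

```python
import time

def burst_poll(get_status, timeout_s: float = 300, interval_s: float = 5) -> str:
    """Poll until the capture reaches a terminal provisioning state.

    `get_status` is a callable standing in for the az CLI status query;
    injecting it keeps the lifecycle logic testable without Azure access.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status in ("Succeeded", "Failed", "Error"):
            return status           # terminal state: stop polling
        time.sleep(interval_s)      # still provisioning: back off and retry
    raise TimeoutError("packet capture did not reach a terminal state")
```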


Technical Challenges That Made This Hard

1. The Cloud API Is Not One Interface

Azure’s CLI and REST API diverge on write operations. The CLI flattens what the REST API expects nested, and the difference is silent:

  az subnet update --route-table ""

  What you expect:    route table association → null
  What the CLI sends: /subscriptions/.../routeTables/  ← empty name, rejected silently

For “remove association” operations: az rest GET the resource in its raw nested form, strip the field in one line of Python, az rest PUT it back. The CLI abstraction misleads; step around it.
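The pattern looks roughly like this in Python. The API version and resource path are illustrative placeholders; the point is that the PUT body is the raw nested form the REST API returned, minus one field:

```python
import json
import subprocess

def strip_route_table(subnet_body: dict) -> dict:
    """The one-line strip: remove the association from the raw nested body."""
    subnet_body["properties"].pop("routeTable", None)
    return subnet_body

def detach_route_table(url: str) -> None:
    """GET the subnet as the REST API sees it, strip the field, PUT it back.
    `url` is the full ARM resource URL including an api-version query —
    the exact version string is deployment-specific."""
    raw = subprocess.check_output(["az", "rest", "--method", "get", "--url", url])
    body = strip_route_table(json.loads(raw))
    subprocess.run(
        ["az", "rest", "--method", "put", "--url", url, "--body", json.dumps(body)],
        check=True,
    )
```

Nothing here guesses at how the CLI flattens the payload — the round trip happens entirely in the REST API’s own representation.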

The same command group can require different mandatory parameters per subcommand. az network watcher packet-capture create requires --resource-group. Every other subcommand in that group — show, delete, list — requires --location instead. The only way to discover this is to test every operation end-to-end, not just creation.

2. LLM State Machine Reliability

A transient rate-limit error on turn 1 can silently skip state initialization. The agent’s working hypothesis list is never created. On turn 2, the agent checks “are all hypotheses resolved?” — yes, the list is empty — and signals investigation complete with zero evidence collected.

  Turn 1: [API rate limit — response dropped]
          → hypothesis list never initialized: []

  Turn 2: "Are all hypotheses resolved?" → yes (empty list)
          → "Investigation complete" — no evidence gathered

The fix is a recovery invariant in the system prompt: “If you arrive at a turn where the hypothesis list is empty and the investigation has not concluded, re-initialize it before any other action.” This rule must be written into the specification — it cannot be assumed.
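The failure and the guard are both easy to express in code. This sketch uses hypothetical field names for the agent state; the key line is the vacuous `all(...)` over an empty list, which is exactly the bug described above:

```python
def all_resolved(state: dict) -> bool:
    """The buggy completion check: `all()` over an empty list is
    vacuously True, so a lost turn-1 init looks like a finished run."""
    return all(h["status"] != "open" for h in state["hypotheses"])

def enforce_hypothesis_invariant(state: dict) -> dict:
    """Recovery invariant: an empty hypothesis list on a non-concluded
    turn means initialization was lost — re-seed before anything else.
    Field names ('concluded', 'hypotheses') are illustrative."""
    if not state.get("concluded") and not state.get("hypotheses"):
        state["hypotheses"] = [
            {"id": "H0", "text": "re-derive hypotheses from symptom", "status": "open"}
        ]
        state["recovered"] = True
    return state
```

In Ghost Agent the invariant lives in the system prompt rather than in code, but the check it encodes is the same.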

3. Deterministic Safety Over Probabilistic Safety

Asking the LLM to classify its own proposed commands is tempting — the model understands semantics. But it makes the safety gate dependent on the most unpredictable component in the system.

  LLM proposes action
          │
          ▼
  ┌─────────────────────────────────┐
  │     Deterministic classifier    │  ← allowlist, verb match, pattern rules
  └─────────────────────────────────┘
          │                   │
          ▼                   ▼
      APPROVED              DENIED
    (action runs)       (LLM notified;
                         never the decider)

The LLM reasons about what to do. Deterministic logic decides whether it is safe to do it. These are different questions that belong in different parts of the system.

4. Context Pollution in Agentic Prompts

Any resource name present in the investigation prompt becomes a candidate for the agent’s reasoning — including as the target of its own operational actions.

  Prompt: "Storage account nwlogs080613 is unreachable..."
                              ↑
                    Agent uses this for:
                    [1] investigation target ✓
                    [2] packet capture destination ✗

In one scenario, the investigation target storage account was the same account used to store packet captures. The agent, reading the prompt, routed its own capture outputs to the locked-down account and failed. The fix: separate the subject under investigation from the agent’s operational infrastructure at the naming level, the prompt level, and via CLI argument injection that the agent cannot confuse with user-provided context.
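One way to enforce that separation at the CLI layer — flag names here are hypothetical, not Ghost Agent’s actual interface — is to make the operational storage account a required argument that never appears in the prompt text the LLM sees:

```python
import argparse

def parse_args(argv: list[str]) -> argparse.Namespace:
    """Operational infrastructure arrives as an explicit flag; the prompt
    carries only the symptom. The agent can therefore never mistake the
    account under investigation for its own capture destination."""
    p = argparse.ArgumentParser()
    p.add_argument("--capture-storage-account", required=True,
                   help="agent-owned account for capture blobs (operational)")
    p.add_argument("prompt",
                   help="symptom description; may name the target account")
    return p.parse_args(argv)
```

The capture destination is then injected into the orchestrator’s `az` calls directly, bypassing the LLM’s context entirely.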


Conclusion

What Ghost Agent removes is the overhead of remembering which layer to check next and the risk of stopping too early. It does not replace the engineer — it requires your explicit approval before every risky action and produces a full audit trail of every decision. What it replaces is the manual, sequential, expertise-dependent process that produces a different outcome depending on who is on call.

The tool is open source, built with standard components (Python 3.12, Gemini via google-genai, Azure CLI), and deployable against any Azure subscription where you have read access to network resources. The field lessons that emerged from building it are published alongside the code.


GitHub: github.com/ranga-sampath/agentic-network-tools

Clone the repo. Point it at a resource group. Describe a symptom.

I’d love to hear what you’d like to read more about.