Diagnosing Compound Faults: Network Ghost Agent To The Rescue

For network engineers, cloud architects, and the product leaders who support them.


The Azure portal shows green. The NSG shows port 5001 allowed. The effective route table confirms a valid VNetLocal path. And every iperf connection to that VM is timing out. The block is a single iptables DROP rule, inserted by an OS hardening script at midnight — invisible to every Azure API. The first fix most engineers would reach for, checking and re-checking NSG rules, does not touch the cause.


The Problem

An earlier post on Ghost Agent covered the first set of tools built for autonomous cloud network investigation: the Agentic Safety Shell (deterministic approval gate for every command), the PCAP Forensic Engine (wire-level packet analysis), and the Cloud Orchestrator (Azure Network Watcher capture lifecycle). Together, they handle investigations at the Azure control plane and wire level: NSG rules, effective routes, VNet peering, packet captures.

That first set of tools handles the cases where a single layer is at fault and that layer is visible to the Azure API. The harder class of problem is the compound fault: two independent failures at different investigation layers, each of which reports clean when inspected in isolation, and where resolving the first fault does not surface the second.

Compound faults aren’t rare. They arise naturally from the way real infrastructure changes: a change window touches an NSG and also runs OS hardening scripts; a field engineer adds a diagnostic iptables rule during an incident and forgets to remove it; Docker or a CNI plugin rewrites the netfilter chain while a separate traffic-shaping rule has been left on the interface. Each change is recorded somewhere, or nowhere in the case of iptables and tc, but no existing tool correlates them across layers.

The three new tools added to the Ghost Agent toolkit target the investigation layers that compound faults most commonly span:

  • Netfilter Inspector — retrieves iptables / ip6tables / nftables rulesets from Linux VMs (via SSH or Azure run-command), stores point-in-time baseline snapshots, and diffs against a prior baseline. The --explain-diff flag sends the diff to the LLM for traffic-impact analysis. Covers both IPv4 and IPv6 in a single operation.

  • Effective Network Inspector — snapshots two per-NIC computed states: the effective route table (az network nic show-effective-route-table) and combined NSG evaluation result (az network nic list-effective-nsg). Diffs two snapshots to detect BGP route withdrawal, UDR changes, and NSG evaluation drift that produce no ARM resource change and therefore appear in no Azure audit tool. Stores artifacts with SHA-256 verification.

  • Agentic Pipe Meter — runs qperf (TCP latency) and iperf2 (throughput, 8 parallel streams) between two Azure VMs over configurable iterations, computes P90 statistics, checks NSG ports pre-flight, compares against a stored baseline, and uploads a structured JSON artifact to Azure Blob Storage.
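To make the P90-and-baseline step concrete, here is a minimal sketch. It assumes a nearest-rank percentile, which may not be the method Pipe Meter actually uses. The function names, sample values, and stored baseline are all hypothetical.

```python
import math

def p90(samples):
    """Nearest-rank 90th percentile: the smallest sample with at least
    90% of all samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(0.9 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

def percent_change(current, baseline):
    """Signed change of current relative to baseline, in percent."""
    return (current - baseline) / baseline * 100.0

# Hypothetical latency samples (microseconds) from 8 iterations.
latency_us = [118.0, 121.5, 119.2, 124.5, 120.1, 122.8, 117.9, 130.0]
baseline_p90_us = 121.9  # hypothetical stored baseline

current_p90 = p90(latency_us)
delta = percent_change(current_p90, baseline_p90_us)
print(f"Latency P90: {current_p90} us ({delta:+.1f}% vs baseline)")
```

With only 8 iterations, nearest-rank P90 is simply the slowest sample, so a single outlier moves the reported figure. More iterations make the statistic less sensitive to one bad run.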

╔══════════════════════════════════════════════════════════════════╗
║               GHOST AGENT — Investigation Layers                 ║
╠══════════════════════════════════════════════════════════════════╣
║  LAYER 3: OS-Layer Firewall                                      ║
║  Netfilter Inspector                                             ║
║    iptables / nftables snapshot → baseline diff → explain-diff   ║
║    detect_config_drift (Ghost Agent tool)                        ║
╠══════════════════════════════════════════════════════════════════╣
║  LAYER 2: Cloud Computed State                                   ║
║  Effective Network Inspector                                     ║
║    Effective routes + combined NSG evaluation → SHA-256 diff     ║
║    detect_effective_network_drift (Ghost Agent tool)             ║
║  Effective Route Inspector — deterministic Azure LPM engine      ║
║  Security Rule Inspector — dual-gate inbound/outbound verdict    ║
╠══════════════════════════════════════════════════════════════════╣
║  LAYER 1: Performance Baseline                                   ║
║  Agentic Pipe Meter                                              ║
║    qperf latency + iperf2 throughput → P90 → baseline compare    ║
║    run_pipe_meter (Ghost Agent tool)                             ║
╠══════════════════════════════════════════════════════════════════╣
║  ORIGINAL TOOLS (unchanged)                                      ║
║  PCAP Forensic Engine    Cloud Orchestrator                      ║
║  Agentic Safety Shell  — gates every command at all layers       ║
╚══════════════════════════════════════════════════════════════════╝

The Safety Shell’s four-tier deterministic classifier gates every command at every layer. Nothing changes about how approvals work — read-only operations are auto-approved, mutative operations require your explicit confirmation. The investigation stack grows deeper; the safety model is unchanged.
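The gating idea can be illustrated with a deliberately simplified sketch. It models only two outcomes (auto-approve or ask the human), not the four tiers, whose rules are not described in this post. The allowlist, the function name, and the metacharacter check are all assumptions made for illustration.

```python
import shlex

# Hypothetical allowlist of read-only command prefixes.
READ_ONLY_PREFIXES = [
    ("tc", "qdisc", "show"),
    ("iptables", "-S"),
    ("az", "network", "nic", "show-effective-route-table"),
    ("az", "network", "nic", "list-effective-nsg"),
]

# Characters that could chain or redirect a second command.
SHELL_METACHARACTERS = set(";|&`$<>\n")

def classify(command):
    """Deterministic default-deny gate: anything not provably
    read-only is routed to the human for approval."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return "needs-approval"
    try:
        tokens = tuple(shlex.split(command))
    except ValueError:  # unbalanced quotes and similar
        return "needs-approval"
    for prefix in READ_ONLY_PREFIXES:
        if tokens[:len(prefix)] == prefix:
            return "auto-approve"
    return "needs-approval"

print(classify("tc qdisc show dev eth0"))
print(classify("tc qdisc del dev eth0 root"))
```

The point of the default-deny shape is that the classifier never has to recognize a dangerous command. It only has to recognize a safe one, and everything else waits for confirmation.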


Three Compound Fault Investigations

The Shadow Firewall

Symptom: iperf throughput tests from source to destination are completely failing. Every connection attempt hangs until it times out. The Azure NSG on the destination VM shows port 5001 allowed. Both VMs are running and healthy. The symptom appeared after the infrastructure team applied OS hardening scripts overnight.

What was injected: Two independent faults on separate VMs.

  • An iptables DROP rule on the destination VM, inbound TCP port 5001. The NSG shows this port allowed; the drop happens after the packet passes the NSG, inside the OS network stack.
  • A tc netem 30ms delay on the source VM’s primary interface, with SSH (port 22) exempted.

The investigation:

  run_pipe_meter
    → CONNECTIVITY_DROP on TCP 5001
         │
  NSG effective rule evaluation — destination VM
    → TCP 5001: ALLOW  ← control-plane is clean; first dead end
         │
  Packet capture on destination VM (PCAP Forensic Engine)
    → TCP SYNs arrive at the destination VM NIC
    → No SYN-ACK returned
    ← Azure Network Watcher captures after NSG enforcement, before
       iptables. SYNs arriving with no response points to OS-layer drop.
         │
  detect_config_drift — destination VM
    → iptables INPUT: DROP rule for TCP port 5001  ← root cause
         │
  tc qdisc show — source VM
    → netem delay 30ms active (SSH exempt)  ← second fault
    ← tc qdisc show is read-only via az vm run-command; detectable at any
       point in the investigation. Latency impact is only measurable via
       Pipe Meter once the iptables block on port 5001 is resolved.

The evidence that motivates crossing the Azure→OS boundary is the packet capture. SYNs reach the destination VM NIC, so the NSG is allowing them, but the OS generates no response. That asymmetry is visible only at the wire level, not at the control plane. The tc netem delay on the source VM can be detected at any point with a read-only tc qdisc show; once the iptables rule is removed, its latency impact shows up as a secondary finding on the next Pipe Meter run.

Neither fault appears in any Azure audit log. Both are found in a single Ghost Agent session.
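The packet-capture reasoning in this case can be sketched as a small decision function. The packet record layout (dicts with src, dst, sport, dport, flags) and the function name are hypothetical; a real capture would first need to be decoded into something like this.

```python
def diagnose_handshake(packets, server_ip, server_port):
    """Classify a capture taken at the destination NIC by what
    happens to inbound SYNs for one server port."""
    syn_seen = synack_seen = rst_seen = False
    for p in packets:
        flags = set(p["flags"])
        to_server = p["dst"] == server_ip and p["dport"] == server_port
        from_server = p["src"] == server_ip and p["sport"] == server_port
        if to_server and flags == {"SYN"}:
            syn_seen = True
        elif from_server and flags == {"SYN", "ACK"}:
            synack_seen = True
        elif from_server and "RST" in flags:
            rst_seen = True
    if not syn_seen:
        return "no SYN at NIC: look at NSG, routing, or the sender"
    if synack_seen:
        return "handshake answered: port is reachable"
    if rst_seen:
        return "RST returned: port closed or a reject-with-tcp-reset rule"
    return "SYN arrives, no reply: suspect an OS-layer drop"

# Hypothetical capture: two SYN retries, nothing coming back.
capture = [
    {"src": "10.0.0.4", "dst": "10.0.0.5", "sport": 40000, "dport": 5001, "flags": ["SYN"]},
    {"src": "10.0.0.4", "dst": "10.0.0.5", "sport": 40000, "dport": 5001, "flags": ["SYN"]},
]
print(diagnose_handshake(capture, "10.0.0.5", 5001))
```

Only the last branch justifies crossing from the Azure layer into the guest OS, which is why the capture comes before the iptables inspection in the investigation.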


The Double Lock

Symptom: iperf connectivity between two VMs is completely broken after a change window that applied “an NSG update and OS-level performance tuning” to the destination VM. The investigation prompt explicitly notes: there may be more than one issue.

What was injected: Two independent faults, both on the destination VM, at different layers.

  • An NSG DENY rule, inbound TCP port 5001, priority 200. This blocks iperf connectivity immediately.
  • A tc netem delay=80ms loss=8% applied to the destination VM’s primary interface, with SSH exempted. This degrades the TCP ACK return path — with 8% ACK loss and 80ms added delay, TCP throughput degrades and becomes unstable across iterations.

The investigation:

  run_pipe_meter
    → CONNECTIVITY_DROP  ← iperf blocked
         │
  NSG audit — destination VM
    → DENY TCP 5001 at priority 200  ← first finding
         │
  tc qdisc show — destination VM
    → netem delay=80ms loss=8% active  ← second finding
    ← az vm run-command runs through the Azure management plane,
       independent of TCP 5001. The NSG block does not prevent it.
         │
  tc qdisc show — source VM
    → clean
         │
  RCA: two independent faults on destination VM;
       NSG rule blocks iperf; tc netem degrades throughput once unblocked.
       Both require separate remediation.

Both faults are found via read-only operations in the same investigation session, without removing the NSG rule. tc qdisc show executes through az vm run-command, which uses the Azure management plane and so does not depend on TCP 5001 being reachable. The non-obvious element is where the tc fault is placed: on the destination VM, not the source. An inspection limited to the source VM would find neither fault. Ghost Agent checks both VMs before closing the investigation.
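Spotting the netem qdisc in tc qdisc show output is a small text-parsing job. The sketch below is an illustration, not Ghost Agent's parser. The sample output is hand-written to resemble a prio qdisc with netem on one band, and exact formatting varies by iproute2 version.

```python
import re

# Matches any line describing a netem qdisc.
NETEM_LINE = re.compile(r"^qdisc netem\b(?P<rest>.*)$", re.MULTILINE)

def find_netem(tc_output):
    """Return one dict per netem qdisc with its delay and loss, if set."""
    findings = []
    for match in NETEM_LINE.finditer(tc_output):
        rest = match.group("rest")
        delay = re.search(r"\bdelay (\S+)", rest)
        loss = re.search(r"\bloss (\S+)", rest)
        findings.append({
            "delay": delay.group(1) if delay else None,
            "loss": loss.group(1) if loss else None,
        })
    return findings

# Illustrative output only; not captured from a real VM.
sample = (
    "qdisc prio 1: dev eth0 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1\n"
    "qdisc netem 30: dev eth0 parent 1:3 limit 1000 delay 80ms loss 8%\n"
)
print(find_netem(sample))
```

Running the same function against the output from both VMs is what keeps a fault like this one, placed on the destination rather than the source, from being missed.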


The Rollback That Wasn’t

Symptom: A change management window closes. The lead engineer confirms: “All changes applied and verified. Rollback complete.” Ninety minutes later, port 8080 to the destination VM is unreachable, and a separate service reports degraded connectivity on port 5001. The symptom started during the window.

What was injected: Two independent faults that the rollback failed to clean up.

  • An NSG DENY rule for TCP port 8080, priority 150, added during the window and never removed. The portal also shows an allow rule for the port, but with a higher priority number. NSG rules are evaluated in ascending priority-number order, so the deny at 150 matches first and the allow is never reached.
  • An iptables DROP rule for TCP port 5001, added by a field engineer as a temporary diagnostic measure during the window and not removed.
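The NSG half of this fault comes down to first-match evaluation by ascending priority number. A minimal sketch follows. The rule dictionaries are a simplified, hypothetical shape, and the allow rule's priority of 300 is invented for the example.

```python
def evaluate_inbound(rules, protocol, port):
    """Return (access, rule_name) for the first matching custom rule,
    scanning in ascending priority-number order. Returns None when no
    custom rule matches and Azure's default rules would decide."""
    for rule in sorted(rules, key=lambda r: r["priority"]):
        protocol_ok = rule["protocol"] in (protocol, "*")
        port_ok = rule["port"] in (port, "*")
        if protocol_ok and port_ok:
            return rule["access"], rule["name"]
    return None

# Hypothetical rules: the allow exists, but the leftover deny wins.
rules = [
    {"name": "allow-app-8080", "priority": 300, "protocol": "Tcp", "port": 8080, "access": "Allow"},
    {"name": "deny-8080-window", "priority": 150, "protocol": "Tcp", "port": 8080, "access": "Deny"},
]
print(evaluate_inbound(rules, "Tcp", 8080))
```

Seeing an allow rule in the portal is therefore not evidence that traffic is allowed. Only the evaluated result answers that question.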

A pre-window effective network state baseline was captured before the window opened.

The investigation:

  detect_effective_network_drift
    → baseline: eni_pre_window  (captured before window)
    → current state diff:
        security_rule_change: TCP 8080 DENY added at priority 150
    → drift_detected: true
    SHA-256 verified — rollback claim is contradicted by the diff
    ← effective NSG diff is the primary evidence for the TCP 8080 block;
       no separate connectivity test needed to confirm a rule change
         │
  detect_config_drift — destination VM OS layer
    → iptables INPUT: DROP rule for TCP 5001
    ← inspection runs via az vm run-command; the rule itself has no ARM
       record and is invisible to Azure Change Analysis
         │
  run_pipe_meter — iperf2 port 5001
    → CONNECTIVITY_DROP  ← confirms the OS-layer iptables finding
    ← Pipe Meter uses iperf2 (port 5001) and qperf (port 19765);
       it tests these ports specifically, not arbitrary application ports
         │
  RCA: two artifacts left behind from the window;
       NSG deny at priority 150 (would have been visible in Azure);
       iptables DROP (no Azure record of any kind)

The Effective Network Inspector’s role here is precise: it isn’t running a general audit. It’s comparing two snapshots with cryptographic verification (the pre-window state and the current state) and reporting the delta. The rollback claim is that no delta exists. The diff says otherwise.

The drift_detected: false result, when the diff is truly empty, is equally useful. It’s a machine-readable negative confirmation, verifiable by SHA-256, that the effective network state at a NIC is unchanged from the baseline. This is the artifact a post-incident review needs when the question is whether the change window caused the problem.
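A snapshot diff with digest verification can be sketched in a few lines. This is an illustration under assumptions: the snapshot layout, the canonical-JSON hashing, and every name below are hypothetical and not the inspector's actual artifact format.

```python
import hashlib
import json

def digest(snapshot):
    """SHA-256 over a canonical JSON encoding (sorted keys, no spaces)."""
    encoded = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()

def detect_drift(baseline, baseline_sha256, current):
    """Verify the baseline is untouched, then diff its rules against current."""
    if digest(baseline) != baseline_sha256:
        raise ValueError("baseline digest mismatch: modified after capture")
    before = {json.dumps(r, sort_keys=True) for r in baseline["rules"]}
    after = {json.dumps(r, sort_keys=True) for r in current["rules"]}
    added = [json.loads(r) for r in sorted(after - before)]
    removed = [json.loads(r) for r in sorted(before - after)]
    return {"drift_detected": bool(added or removed),
            "added": added, "removed": removed}

# Hypothetical pre-window and post-window snapshots for one NIC.
allow = {"name": "allow-app-8080", "priority": 300, "access": "Allow", "port": 8080}
deny = {"name": "deny-8080-window", "priority": 150, "access": "Deny", "port": 8080}
baseline = {"nic": "dest-nic", "rules": [allow]}
current = {"nic": "dest-nic", "rules": [allow, deny]}

baseline_sha = digest(baseline)  # recorded at capture time
result = detect_drift(baseline, baseline_sha, current)
print(result["drift_detected"], result["added"])
```

Hashing a canonical encoding rather than the raw file means two snapshots with the same content always produce the same digest, regardless of key order.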


Technical Challenges That Made This Hard

The Azure→OS Boundary Requires Two Different Retrieval Paths

Netfilter Inspector retrieves iptables state from VMs accessible via SSH directly or via Azure’s az vm run-command. The retrieval backends produce different output structures — run-command wraps the shell output in Azure platform metadata; SSH returns raw stdout. The design decision: --provider azure and --provider ssh route to different backends, but both converge on the same parsed structure before any diff or explain operation runs. The calling layer — including Ghost Agent — has no knowledge of which path was used.
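The convergence of the two retrieval paths can be sketched as follows. It assumes the rules were listed with iptables -S and that the run-command result carries the shell output in a message field between [stdout] and [stderr] markers, as az vm run-command invoke typically returns for Linux VMs. The function names are hypothetical, and this is not the inspector's actual parser.

```python
import json

def stdout_from_run_command(raw_json):
    """Strip the Azure wrapper and return only the shell's stdout."""
    message = json.loads(raw_json)["value"][0]["message"]
    after_marker = message.split("[stdout]", 1)[1]
    return after_marker.split("[stderr]", 1)[0].strip("\n")

def parse_rules(text):
    """Keep the rule and policy lines (-P, -N, -A) from `iptables -S`."""
    return [ln.strip() for ln in text.splitlines() if ln.strip().startswith("-")]

def diff_rules(baseline, current):
    """Rules present on only one side of the comparison."""
    return {"added": [r for r in current if r not in baseline],
            "removed": [r for r in baseline if r not in current]}

# The same ruleset as raw SSH stdout and as a run-command envelope.
ssh_out = ("-P INPUT ACCEPT\n-P FORWARD ACCEPT\n-P OUTPUT ACCEPT\n"
           "-A INPUT -p tcp -m tcp --dport 5001 -j DROP\n")
az_out = json.dumps({"value": [{
    "code": "ProvisioningState/succeeded",
    "level": "Info",
    "message": "Enable succeeded: \n[stdout]\n" + ssh_out + "\n[stderr]\n",
}]})

via_ssh = parse_rules(ssh_out)
via_azure = parse_rules(stdout_from_run_command(az_out))
baseline = via_ssh[:3]  # pretend the baseline held only the policies
print(diff_rules(baseline, via_azure))
```

Because both paths produce the same list of rule strings, the diff step never needs to know which provider was used.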

iptables-legacy and iptables-nft Are Different Binaries

They share the same command-line interface but program different kernel backends: iptables-legacy uses the older xtables interface, while iptables-nft translates rules into nf_tables. A rule written via iptables-legacy is invisible to iptables-nft, and vice versa. The active variant depends on update-alternatives configuration, which varies across Ubuntu LTS versions. Calling the wrong binary silently returns an empty or incomplete ruleset: no error, just missing rules.

# The inspector checks this before retrieving rules
update-alternatives --query iptables
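Interpreting that query output might look like the sketch below. It reads the Value: field, which names the currently selected alternative. The sample text is abbreviated and hand-written, and the function name is hypothetical.

```python
def active_iptables_backend(query_output):
    """Map the selected alternative to 'nft', 'legacy', or 'unknown'."""
    for line in query_output.splitlines():
        if line.startswith("Value:"):
            path = line.split(":", 1)[1].strip()
            if path.endswith("iptables-nft"):
                return "nft"
            if path.endswith("iptables-legacy"):
                return "legacy"
            return "unknown"
    return "unknown"  # no Value: line, e.g. alternatives not configured

# Abbreviated, illustrative output of `update-alternatives --query iptables`.
sample = (
    "Name: iptables\n"
    "Link: /usr/sbin/iptables\n"
    "Status: auto\n"
    "Best: /usr/sbin/iptables-nft\n"
    "Value: /usr/sbin/iptables-nft\n"
    "\n"
    "Alternative: /usr/sbin/iptables-legacy\n"
    "Priority: 10\n"
)
print(active_iptables_backend(sample))
```

Returning "unknown" rather than guessing lets the caller stop with an explicit warning, which is preferable to retrieving a ruleset that is silently empty.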

Effective State Snapshots Have a Timing Dependency

The Effective Network Inspector is a before/after tool. Its value depends on the baseline being captured before the change window opens. If the baseline captures the faulted state, the diff is empty and the investigation has no reference. The --is-baseline / --compare-baseline workflow must be enforced at the process level — the tool compares what it was given and cannot detect whether the baseline was taken too late.

The SHA-256 verification in the diff artifact addresses a different question: whether the baseline file was modified between capture and comparison. This matters for incident reports where the integrity of the pre-change snapshot may be contested.
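Since the tool cannot tell whether a baseline was taken too late, the check has to live in the surrounding process. One way to sketch such a gate, assuming the capture time and the window start are recorded as ISO 8601 timestamps with explicit UTC offsets (the function name and both timestamps are hypothetical):

```python
from datetime import datetime

def check_baseline_timing(captured_at_iso, window_opens_iso):
    """Raise if the baseline was not captured before the window opened.
    Returns how far ahead of the window it was taken."""
    captured = datetime.fromisoformat(captured_at_iso)
    opens = datetime.fromisoformat(window_opens_iso)
    if captured >= opens:
        raise ValueError("baseline captured at or after window start")
    return opens - captured

# Hypothetical timestamps, both with explicit UTC offsets.
lead_time = check_baseline_timing("2026-03-03T16:45:00+00:00",
                                  "2026-03-03T17:00:00+00:00")
print(f"Baseline taken {lead_time} before the window opened")
```

A gate like this belongs in the change-management runbook or pipeline, before the window is allowed to open.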


What It Takes to Run

All three tools are part of the agentic-network-tools repository and integrate into Ghost Agent via demo/config.env.

Netfilter Inspector: Python 3.9+. For --provider azure: Azure CLI authenticated with Microsoft.Compute/virtualMachines/runCommand/action on the target VMs. For --provider ssh: SSH key access. No third-party Python packages required.

Effective Network Inspector: Azure CLI authenticated. Permissions required: Microsoft.Network/networkInterfaces/effectiveRouteTable/action and Microsoft.Network/networkInterfaces/effectiveNetworkSecurityGroups/action on the target NICs. Python 3.12+, uv.

Agentic Pipe Meter: qperf and iperf2 on both VMs. Pipe Meter checks for them at startup and offers to install them if missing, pending your approval through the human-in-the-loop (HITL) gate. Azure CLI authenticated. A storage account for artifact upload. Python 3.12+, uv.

Pipe Meter console output — --test-type both --compare-baseline:

[preflight] Checking NSG ports (5001, 19765) between 10.0.0.4 and 10.0.0.5...
[preflight] NSG ports OK. Checking tools on VMs...
[preflight]   Checking 10.0.0.4...
[preflight]   Checking 10.0.0.5...
[preflight] Tools OK. Preflight passed.

=====================================
=== Agentic Pipe Meter — Results ===
=====================================
Session:    pmeter_20260303T175346
Source:     10.0.0.4  →  10.0.0.5
Test:       both  |  8 iterations
Status:     SUCCESS

Latency  (P90):     124.5 µs  ← +2.1% vs baseline (slower)
Throughput (P90):     9.40 Gbps  ← -0.5% vs baseline (lower)

Stability:  STABLE
Audit:      ./audit/pmeter_20260303T175346_result.json
Blob:       https://mystorageaccount.blob.core.windows.net/pipe-meter-results/...
=====================================

Quick start for each tool standalone:

# Netfilter Inspector — capture baseline, then compare
cd netfilter-inspector/firewall-inspector
python3 firewall_inspector.py --config config.env --is-baseline --session-id pre_change
python3 firewall_inspector.py --config config.env --compare-baseline pre_change

# Effective Network Inspector — capture baseline, then compare
cd effective-network-inspector
python effective_network_inspector.py --scope vm --vm-name <name> \
  --resource-group <rg> --is-baseline --session-id pre_window
python effective_network_inspector.py --scope vm --vm-name <name> \
  --resource-group <rg> --compare-baseline pre_window

# Pipe Meter — record baseline, then measure and compare
cd agentic-pipe-meter
uv run python pipe_meter.py --config config.env --is-baseline
uv run python pipe_meter.py --config config.env --compare-baseline

Cost: Pipe Meter adds little Azure cost beyond the VMs already running; the only extra is Blob Storage for the small JSON artifacts it uploads. Netfilter Inspector and Effective Network Inspector make read-only API calls. Ghost Agent investigations using these tools cost approximately the same in Gemini API calls as a control-plane-only investigation: under $0.05 for typical sessions.


Conclusion

The earlier Ghost Agent work focused on Azure control-plane and wire-level investigation. The new tools extend the investigation stack to the OS firewall layer, to computed network state that leaves no ARM record, and to VM-to-VM performance as a first-class diagnostic signal. The investigation stack goes deeper. The philosophy is unchanged: form hypotheses, escalate through layers with evidence, gate every risky action through human approval, produce an audit trail.

The compound fault cases are the ones that matter most. Single-layer failures are findable with the right tool in isolation. The cases that stretch investigations across hours are the ones where each layer looks clean individually, and where the second fault's symptoms stay masked until the first is resolved. That’s what these tools were built to address.

Each tool is usable as a standalone CLI and as an integrated Ghost Agent component. Start with a baseline before your next change window. Compare after.


GitHub: github.com/ranga-sampath/agentic-network-tools

Clone the repo. Capture a baseline before the next change window. Describe a symptom when something breaks.