The Control Plane Stops at the OS Boundary. My Agent Doesn’t Anymore. 

How adding live measurement to the Ghost Agent covers a fault class that Azure’s native tooling does not surface.


The Azure portal shows both VMs healthy. The NSG allows the traffic. The route table is correct. Your engineer runs the throughput test anyway and gets 5 Mbps on a path that should deliver 2 Gbps. No alert fired. No platform event. Nothing in the logs.

Azure’s native monitoring did not surface that failure. It happened at the OS layer — below the visibility horizon of the control plane.


The Problem

Cloud network investigations have a structural blind spot. Every native Azure diagnostic tool — NSG flow logs, effective route tables, Network Watcher, Connection Monitor — operates at the control plane. They tell you what Azure configured. They cannot tell you what the OS is actually doing.

That gap is where a class of faults lives permanently:

  • A tc netem rule injecting 100 ms delay and 10% packet loss on a VM’s egress interface
  • An iptables DROP rule an OS hardening script installed after the last security audit
  • A token bucket filter rate-limiting all TCP traffic to 5 Mbps while the Azure NSG reports port open and healthy
  • A combination of packet loss and corruption that makes throughput collapse intermittently, with no deterministic signature to chase

Ghost Agent could already investigate at the Azure control-plane layer — NSGs, route tables, effective NIC routes, BGP route sources — and escalate to wire-level evidence via packet captures. But when the investigation reached an OS-level fault, it stopped. The agent could look at everything Azure exposes and find nothing — because the fault was below it.

The integration with the Agentic Pipe Meter closes that gap. It gives Ghost Agent a live measurement capability: run actual throughput and latency tests between two VMs, compute P90 statistics with anomaly detection, compare against a stored baseline, and return a structured result the agent can reason over.

The core argument is simple: if the measurement is degraded and the control plane is clean, the fault is OS-level. That inference requires a number. Until now, Ghost Agent had no way to get one.


How Engineers Investigate Performance Today

A typical degraded-throughput investigation looks like this:

Ticket arrives: "file transfers between VM-A and VM-B are slow"
  → Engineer checks NSG rules
      → "Port open, looks clean"
  → Engineer checks effective route table on source NIC
      → "Routes correct"
  → Opens escalation: "Azure control plane clean, unknown cause"
      → Network team takes over cold
          → Reruns same checks
          → Eventually SSHs into source VM
          → Finds a traffic shaping rule from last week's chaos engineering
            run that nobody cleaned up

The answer was one command: tc qdisc show.

The problem is not that the command is hard. The problem is structural: the tooling points you to the control plane first, and the control plane is clean, so the investigation stalls there. OS-level inspection requires someone to decide to SSH into the VM and look — which only happens after everything else has been ruled out.

A structured agentic investigation that includes live measurement from the start short-circuits this loop. You measure first. If measurement is clean, the fault is upstream of the network path. If measurement is degraded and the control plane is clean, you go to the OS layer immediately, not after exhausting Azure’s tooling.


What the Integration Does

Ghost Agent calls the Agentic Pipe Meter as a tool in its investigation pipeline:

run_pipe_meter(config: PipelineConfig, shell: SafeExecShell, provider: CloudProvider) → PipelineResult

The pipe meter runs a structured pipeline — not an ad hoc test:

┌────────────────────────────────────────────────────────────────────────┐
│  Caller                                                                │
│  CLI args  ──or──  Ghost Agent tool call                               │
└──────────────────────────┬─────────────────────────────────────────────┘
                           │ PipelineConfig
                           ▼
┌────────────────────────────────────────────────────────────────────────┐
│  pipe_meter.py  ─  Pipeline Orchestrator                               │
│                                                                        │
│  validate → preflight → measure → compute → compare → report          │
│                                                                        │
│  Each stage writes one intermediate artifact to {audit_dir}/:         │
│    _preflight.json  _raw.json  _computed.json  _comparison.json        │
│    _result.json  (final artifact, also uploaded to blob)               │
│                                                                        │
│  ┌─────────────────────────────┐  ┌──────────────────────────────┐    │
│  │  SafeExecShell              │  │  CloudProvider (providers.py) │    │
│  │  (sibling library)          │  │  Protocol + AzureProvider     │    │
│  │  classify → gate → execute  │  │  AzureProvider(shell=...)     │    │
│  │  → audit                    │  │  - effective NSG queries       │    │
│  │                             │  │  - blob read / write          │    │
│  │  All SSH and az CLI calls   │  │  (all az calls via shell)     │    │
│  │  flow through this boundary │  │                               │    │
│  └──────────┬──────────────────┘  └──────────────┬───────────────┘    │
└─────────────┼────────────────────────────────────┼────────────────────┘
              │                                     │
   ┌──────────▼──────────┐            ┌─────────────▼──────────────┐
   │   Source VM (SSH)   │            │   Azure Control Plane       │
   │   qperf / iperf2    │            │   az network nic            │
   │   client            │            │   az storage blob           │
   └──────────┬──────────┘            └────────────────────────────┘
              │ network under test
   ┌──────────▼──────────┐
   │   Dest VM (SSH)     │
   │   qperf / iperf2    │
   │   server            │
   └─────────────────────┘

Every SSH command and every az CLI call flows through Agentic Safety Shell — the same human-in-the-loop gate the Ghost Agent uses for all risky actions. The measurement tool is not operating outside the safety boundary; it is inside it.

The pipeline stages:

  • validate — config schema check before anything hits the network
  • preflight — checks measurement ports 5001 and 19765 using effective NSG rules (not configured rules) on both VMs’ NICs; confirms qperf and iperf2 are present; installs them if absent, gated
  • measure — runs N iterations (default: 8) of qperf (latency) and iperf2 (throughput) via SSH
  • compute — calculates P90, min, max per metric; applies anomaly detection
  • compare — loads the stored baseline from Azure Blob Storage and computes delta percentages
  • report — writes _result.json and uploads to blob; prints a structured console summary
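The stage loop above can be sketched in a few lines. This is an illustrative reimplementation, not the actual pipe_meter.py API — function and stage names are assumptions, and the real orchestrator also runs its validate step before any artifact is written:

```python
# Minimal sketch of the pipeline orchestrator (names are illustrative).
import json
from pathlib import Path

def run_pipeline(config: dict, stages: dict, audit_dir: str = "./audit") -> dict:
    """Run stages in order, persisting one intermediate artifact per stage."""
    session = config["session_id"]
    out = Path(audit_dir)
    out.mkdir(parents=True, exist_ok=True)
    artifact: dict = {}
    for name in ("preflight", "raw", "computed", "comparison", "result"):
        artifact = stages[name](config, artifact)   # each stage builds on the last
        (out / f"{session}_{name}.json").write_text(json.dumps(artifact))
    return artifact   # the final _result.json is also uploaded to blob
```

The per-stage artifacts are what make a failed run debuggable: whichever stage broke, the last artifact written shows the pipeline's state at that point.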

The anomaly detection rules are deterministic:

  • CONNECTIVITY_DROP: any sample equals 0.0 (timeout or dropped connection). Takes priority over variance check.
  • HIGH_VARIANCE: (max − min) / min > 0.50. Path is connected but inconsistent.
  • is_stable=True / anomaly_type=null: all samples within a 50% band. Path is consistent.

Here is the full result artifact from a clean run (session pmeter_20260305T051746):

{
  "test_metadata": {
    "session_id": "pmeter_20260305T051746",
    "source_ip": "10.0.1.4",
    "destination_ip": "10.0.1.5",
    "ssh_user": "azureuser",
    "test_type": "both",
    "is_baseline": false,
    "timestamp": "2026-03-05T05:20:40.165854+00:00",
    "iterations": 4,
    "resource_group": "nw-forensics-rg",
    "storage_account": "nwlogs080613",
    "container": "pktcaptures"
  },
  "preflight": {
    "ports_open": true,
    "tools_ready": true,
    "actions_taken": [],
    "blocked_ports": []
  },
  "results": {
    "is_stable": true,
    "anomaly_type": null,
    "latency_p90": 139.0,
    "latency_min": 123.0,
    "latency_max": 139.0,
    "throughput_p90": 1.95,
    "throughput_min": 1.95,
    "throughput_max": 1.95,
    "units": {
      "latency": "us",
      "throughput": "Gbps"
    },
    "iteration_data": [
      {"iteration": 1, "latency_us": 123.0, "throughput_gbps": 1.95},
      {"iteration": 2, "latency_us": 125.0, "throughput_gbps": 1.95},
      {"iteration": 3, "latency_us": 139.0, "throughput_gbps": 1.95},
      {"iteration": 4, "latency_us": 134.0, "throughput_gbps": 1.95}
    ]
  },
  "comparison": {
    "baseline_found": true,
    "baseline_timestamp": "2026-03-04T11:27:13.336706+00:00",
    "baseline_latency_p90": 136.0,
    "baseline_throughput_p90": 2.01,
    "delta_pct_latency": 2.2058823529411766,
    "delta_pct_throughput": -2.9850746268656634
  }
}
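The comparison block in this artifact is plain percentage arithmetic against the stored baseline. A minimal reproduction of the delta computation, using the numbers above:

```python
def delta_pct(current: float, baseline: float) -> float:
    """Signed percentage change of the current P90 versus the baseline P90."""
    return (current - baseline) / baseline * 100.0

# latency:    delta_pct(139.0, 136.0) ≈ +2.21  (slower than baseline)
# throughput: delta_pct(1.95, 2.01)   ≈ -2.99  (lower than baseline)
```

Positive latency deltas and negative throughput deltas both indicate degradation, which is how the console summary annotates them.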

The console summary for the same run, derived from the result artifact above:

=====================================
=== Agentic Pipe Meter — Results ===
=====================================
Session:    pmeter_20260305T051746
Source:     10.0.1.4  →  10.0.1.5
Test:       both  |  4 iterations
Status:     SUCCESS

Latency  (P90):    139.0 µs  ← +2.2% vs baseline (slower)
Throughput (P90):    1.95 Gbps  ← -3.0% vs baseline (lower)

Stability:  STABLE
Audit:      ./audit/pmeter_20260305T051746_result.json
Blob:       https://nwlogs080613.blob.core.windows.net/pktcaptures/...
=====================================

Every result artifact is uploaded to Azure Blob Storage. Every command that produced it is in the Agentic Safety Shell audit log, timestamped and attributed.


Use Cases, All Tested

The integration was validated across six fault scenarios (G through L). Each has a setup script that injects a real fault into a live Azure environment, and a teardown script that removes it.

  G — Bandwidth Heist
      Fault:    tc netem loss 20% on source VM egress
      Signal:   HIGH_VARIANCE on throughput
      Finding:  TCP retransmits cause variable collapse; tc qdisc show confirms

  H — Latency Landmine
      Fault:    tc netem delay 100ms loss 10% on source VM
      Signal:   HIGH_VARIANCE on latency
      Finding:  10% loss forces TCP RTO on ~2 probes/iteration; iteration averages vary well beyond the 50% threshold; tc qdisc show confirms

  I — Packet Grinder
      Fault:    tc netem loss 10% delay 10ms corrupt 2% on source VM
      Signal:   HIGH_VARIANCE or CONNECTIVITY_DROP on both metrics
      Finding:  Combined loss and corruption: loss forces retransmits, corruption triggers checksum failures and additional retransmits

  J — The Shadow Firewall
      Fault:    iptables DROP on dest VM port 5001 + tc netem delay 30ms on source
      Signal:   CONNECTIVITY_DROP; NSG shows port allowed
      Finding:  Agent crosses the Azure→OS boundary using PCAP evidence

  K — The Bandwidth Thief
      Fault:    tc tbf rate 5mbit on source + iptables DROP ICMP on dest
      Signal:   is_stable=True, throughput_p90=0.005 Gbps
      Finding:  Two independent root causes; the anomaly flag alone is insufficient

  L — The Double Lock
      Fault:    NSG DENY TCP 5001 at priority 200 + tc netem delay 80ms loss 8% on dest VM
      Signal:   CONNECTIVITY_DROP; netem on dest, not source
      Finding:  Removing only the NSG rule leaves the latency fault active

Two scenarios are worth unpacking.

Use Case K — “The Bandwidth Thief”

The pipe meter returns is_stable=True, anomaly_type=null. No anomaly fires. A naive reading says the path is healthy.

It is not. throughput_p90=0.005 Gbps is 5 Mbps. A healthy Azure accelerated networking VM delivers 2+ Gbps. The path is stably broken — perfectly consistent at an absolutely wrong value.

This is the key design constraint of the measurement contract: is_stable means the samples are statistically consistent with each other. It does not mean they are within Azure platform expectations. The agent reasons over the absolute value. It is not the measurement tool’s job to know what 5 Mbps means on an Azure VNet; it is the agent’s.

Meanwhile, ICMP from source to dest is completely failing — iptables -I INPUT 1 -p icmp -j DROP on the dest VM. The agent must not conflate that with the throughput fault. They share no root cause. The ICMP drop is a red herring in the throughput investigation, and a real second finding that needs its own remediation.
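In code terms, the agent-side check this scenario demands might look like the sketch below. EXPECTED_MIN_GBPS is an assumed illustrative floor, not part of the measurement contract — the real agent derives its expectation from context rather than a hard-coded constant:

```python
# Hypothetical agent-side interpretation step: the stability flag alone
# is not a verdict; the absolute throughput value must also be judged.
EXPECTED_MIN_GBPS = 1.0   # assumed floor; healthy accelerated networking delivers 2+ Gbps

def needs_os_layer_check(result: dict) -> bool:
    """True when the measurement warrants an OS-level investigation."""
    r = result["results"]
    if r["anomaly_type"] is not None:          # CONNECTIVITY_DROP or HIGH_VARIANCE
        return True
    return r["throughput_p90"] < EXPECTED_MIN_GBPS   # stable, but stably wrong
```

Against the Use Case K result (is_stable=True, throughput_p90=0.005), this check still routes the investigation to the OS layer — which is exactly the step a stability-flag-only reading skips.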

Use Case L — “The Double Lock”

Previous use cases (G, H, I) inject tc netem on the source VM. Use Case L injects it on the dest VM. This is intentional: it tests whether the agent checks both VMs, not just the one it starts with.

There are two independent faults: an NSG DENY rule at priority 200 blocking TCP port 5001 inbound on the dest VM (Azure control-plane fault), and a tc netem delay 80ms loss 8% rule on the dest VM’s egress interface (OS-level fault, SSH port exempt).

With the NSG rule in place, the pipe meter returns CONNECTIVITY_DROP — iperf cannot establish a connection to port 5001. Remove the NSG rule and connectivity is restored, but the tc netem fault on the dest VM’s egress remains. Now the pipe meter returns HIGH_VARIANCE on latency. The connection works; the performance does not.

Both faults require their own remediation. Removing only the NSG rule and closing the ticket is the maintenance window trap — the latency fault survives and the post-incident review has an incomplete picture.


What the Agent Gets Wrong

Here are some failure cases to be aware of.

The measurement result is deterministic and trustworthy — that is the point of keeping LLMs out of the pipeline. Where the agent can go wrong is in reasoning over the result. In Use Case K, the pipe meter returns is_stable=True with throughput=0.005 Gbps. A correct agent reasons: “5 Mbps is anomalously low for this path.” An incorrect agent might read is_stable=True and terminate the throughput investigation without checking the absolute value. Whether it does the right thing depends on how the Ghost Agent’s system prompt frames the measurement interpretation step — which is a system prompt engineering problem, not a measurement problem.

Similarly, Use Case L tests whether the agent checks tc qdisc show on the dest VM when prior use cases always found the fault on the source VM. If the agent develops a pattern of checking source VM first and stopping when it finds something, it will miss the second fault in L. The structured RCA prompt guards against this, but “the prompt guards against it” is different from “it is architecturally impossible to miss.”

The measurement layer does what it claims: it returns a correct, audited, deterministic result. What the agent does with that result is where judgment — and error — lives.


Two Design Decisions Worth Explaining

1 — A healthy-looking result is not the same as a healthy path

The anomaly detection flags two conditions: CONNECTIVITY_DROP (connection failed) and HIGH_VARIANCE (results are inconsistent across iterations). What it does not flag is a path that is consistently degraded — one where every measurement comes back at the same, wrong value.

Use Case K demonstrates this directly. A rate-limiting rule caps throughput at 5 Mbps. Every iteration measures 5 Mbps. The samples are perfectly consistent: no anomaly fires, is_stable=True. A reading of the result that stops at the stability flag misses the fault entirely.

The design decision was to keep absolute-value thresholds out of the measurement tool. Platform throughput expectations vary by VM SKU and accelerated networking configuration — embedding those assumptions in the measurement layer would mean the tool breaks silently when deployed against a different environment. Instead, the measurement tool reports what it observed. The agent reasons about whether that observation is anomalous given what it knows about the path. Measurement and diagnosis stay in separate layers.

The same principle applies to latency. A constant delay added to every probe averages out across iterations and produces a stable result even on a severely degraded path. Use Case H injects both delay and packet loss — the loss introduces retransmit timeouts that vary across iterations, producing the spread that HIGH_VARIANCE is designed to detect. Pure delay alone would not trigger it.
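The constant-delay blind spot is easy to see numerically. With illustrative sample values (a fixed netem delay inflating every probe to ~100 ms against a ~0.14 ms baseline), the spread check stays quiet:

```python
# Why a constant injected delay slips past HIGH_VARIANCE: the spread
# check compares samples to one another, never to a healthy reference.
samples_ms = [100.1, 100.3, 100.0, 100.2]   # every probe inflated by the same fixed delay
spread = (max(samples_ms) - min(samples_ms)) / min(samples_ms)
print(spread <= 0.50)   # no anomaly fires, though latency is ~700x the baseline
```

Only variance across iterations — such as the retransmit timeouts that packet loss introduces in Use Case H — produces a spread the check can catch.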

2 — No AI in the measurement pipeline, by design

Every other component in the Ghost Agent stack involves LLM reasoning. The measurement pipeline deliberately excludes it. P90 computation, anomaly detection, and baseline delta calculation are deterministic operations. An LLM in the measurement layer would make correctness probabilistic where it must be exact.

The JSON result Ghost Agent reads from pipe_meter is ground truth. The agent reasons over it — it does not produce it. If the measurement layer used an LLM to interpret or summarize results, the agent would be reasoning over a probabilistic summary of a measurement rather than the measurement itself. That breaks the evidentiary chain.

Measurement is deterministic. Interpretation is the agent’s job.


What It Takes to Run

Dependencies:

  • Python 3.12
  • qperf and iperf2 on both VMs (the preflight stage installs them if absent, gated through Agentic Safety Shell)
  • SSH access to both VMs from the operator machine (ssh-agent or VM identity)
  • GEMINI_API_KEY for Ghost Agent
  • Azure CLI authenticated against the target subscription

Configuration (demo/config.env):

RESOURCE_GROUP
SOURCE_VM_NAME
SOURCE_VM_PRIVATE_IP
DEST_VM_NAME
DEST_VM_PRIVATE_IP
DEST_VM_NSG_NAME
STORAGE_ACCOUNT
STORAGE_CONTAINER

Running an investigation (use cases G–L):

# Inject fault
./demo/use_case_k/setup.sh

# Run investigation
python ghost_agent.py --config demo/config.env

# Remove fault
./demo/use_case_k/teardown.sh

Test coverage:

194 tests across 16 test files cover all pipeline stages, NSG parsing, HITL callbacks, SSH command templates, blob routing, and Azure provider logic.


Conclusion

The Agentic Pipe Meter integration gives Ghost Agent visibility into the one fault class it previously could not reach: OS-level degradation that is invisible to the Azure control plane.

The design has deliberate constraints:

  • No LLM in any measurement stage — all statistics are computed deterministically
  • No auto-remediation — every write action flows through the HITL gate
  • TCP only — UDP is not supported (qperf port 19765, iperf2 port 5001)
  • One source/destination pair per invocation
  • Azure only — the CloudProvider Protocol exists for future providers; only AzureProvider is implemented
  • No time-series storage or trending — individual JSON blobs per run

A measurement tool that embeds LLM reasoning is less trustworthy, not more capable. An agent that auto-remediates OS-level faults it found autonomously is an incident, not an investigation.

The six use cases — from simple tc netem delay injection to compound faults spanning both Azure NSG and OS-level iptables — cover the scenarios where this class of tooling proves its worth: when the control plane is clean, the ticket is still open, and someone needs a number.


Source and documentation: https://github.com/ranga-sampath/agentic-network-tools

If you are building agentic infrastructure tooling and working through the same design questions — how to gate autonomy, how to keep measurement deterministic, how to structure HITL for high-frequency operations — I am happy to compare notes. The field is new enough that most of the interesting decisions are not yet written down anywhere.