For senior network engineers, SREs, and cloud architects who need a repeatable, audited measurement practice — and for engineering leaders evaluating whether that practice is ready for operational adoption.
Before every maintenance window, someone says “performance is fine.” After the window, someone else says “performance is fine.” Neither of them has a number. When the ticket reopens three days later, nobody can prove the change did or did not cause the regression — because there is no before measurement to compare against.
That is the problem Agentic Pipe Meter is built to solve.
The Problem
Ad hoc iperf is how most teams measure network performance today. Run it when something breaks. Run it again when you think you fixed it. Interpret the output from memory. Forget the exact flags you used last time.
There is no standard parameter set. No baseline to compare against. No record that the test ran, what it found, or who approved the remediation step. If two engineers run iperf a week apart using different iteration counts, different parallelism flags, and different measurement windows, their numbers are not comparable — even on the same path.
The problem compounds during maintenance windows. The standard practice is:
- Measure after the change and declare success if the number looks reasonable.
- Skip the before measurement entirely, because it requires coordination to run it before the window opens.
When a regression surfaces later, the investigation starts from zero. There is no pre-change measurement to anchor the comparison. The post-change measurement could be degraded by 20% and nobody would know, because "20% of what" was never established.
How Engineers Measure Performance Today
Ticket: "performance degraded between VM-A and VM-B"
→ Engineer SSHs into source VM
→ Runs: iperf3 -c <dest-ip> -t 10
→ Gets a number
→ "Looks about right"
→ Closes the measurement tab
→ No record of flags used
→ No record of baseline for this path
→ No agreement on what "about right" means
→ Change applied
→ Engineer SSHs in again
→ Runs iperf3 with slightly different flags
→ Gets a different number
→ "Looks fine"
→ Ticket closed
→ No before/after comparison
→ No audit trail
→ No artifact to share with the next engineer
When the regression returns in a different form two weeks later, the investigation restarts with nothing.
What Agentic Pipe Meter Does
Agentic Pipe Meter runs structured latency and throughput measurements between two Azure VMs via SSH, computes P90 statistics with anomaly detection, compares results against a stored baseline, and produces six audited JSON artifacts per run — all commands flowing through a human-in-the-loop safety gate.
The pipeline is fixed and sequenced. There is no ad hoc path:
┌────────────────────────────────────────────────────────────────────────┐
│ Caller │
│ CLI args ──or── Ghost Agent tool call │
└──────────────────────────┬─────────────────────────────────────────────┘
│ PipelineConfig
▼
┌───────────────────────────────────────────────────────────────────────┐
│ pipe_meter.py ─ Pipeline Orchestrator │
│ │
│ validate → preflight → measure → compute → compare → report │
│ │
│ Each stage writes one intermediate artifact to {audit_dir}/: │
│ _preflight.json _raw.json _computed.json _comparison.json │
│ _result.json (final artifact, also uploaded to blob) │
│ │
│ ┌─────────────────────────────┐ ┌──────────────────────────────┐ │
│ │ SafeExecShell │ │ CloudProvider (providers.py)│ │
│ │ (sibling library) │ │ Protocol + AzureProvider │ │
│ │ classify → gate → execute │ │ AzureProvider(shell=...) │ │
│ │ → audit │ │ - effective NSG queries │ │
│ │ │ │ - blob read / write │ │
│ │ All SSH and az CLI calls │ │ (all az calls via shell) │ │
│ │ flow through this boundary │ │ │ │
│ └──────────┬──────────────────┘ └──────────────┬───────────────┘ │
└─────────────┼────────────────────────────────────┼────────────────────┘
│ │
┌──────────▼──────────┐ ┌─────────────▼──────────────┐
│ Source VM (SSH) │ │ Azure Control Plane │
│ qperf / iperf2 │ │ az network nic │
│ client │ │ az storage blob │
└──────────┬──────────┘ └────────────────────────────┘
│ network under test
┌──────────▼──────────┐
│ Dest VM (SSH) │
│ qperf / iperf2 │
│ server │
└─────────────────────┘
Every SSH command and every az CLI call flows through Agentic Safety Shell — the same human-in-the-loop gate used across the agentic-network-tools stack. Read-only commands run immediately. Write commands stop for operator confirmation before anything executes.
The pipeline stages:
- validate — config schema check before anything touches the network
- preflight — checks NSG port rules (effective, not configured) and confirms qperf and iperf2 are present on both VMs; installs them if absent, gated
- measure — runs N iterations (default: 8) of qperf (latency, port 19765) and iperf2 (throughput, port 5001) via SSH
- compute — calculates P90, min, max per metric; applies deterministic anomaly detection
- compare — loads the stored baseline from Azure Blob and computes delta percentages
- report — writes _result.json to disk, uploads to blob, prints the console summary
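The fixed stage sequence can be sketched as a small orchestrator loop. This skeleton is illustrative only — it is not the actual pipe_meter.py implementation, and the context dict and stage-callable signature are assumptions:

```python
# Illustrative orchestrator skeleton (not the actual pipe_meter.py code).
# The stage order is fixed; each stage receives the accumulated context and
# returns the one artifact payload it owns. A denied HITL gate can simply
# raise inside its stage to abort the run at that point.
STAGE_ORDER = ["validate", "preflight", "measure", "compute", "compare", "report"]

def run_pipeline(config, stage_fns):
    """stage_fns: ordered mapping of stage name -> callable(context) -> dict."""
    context = {"config": config}
    for name, fn in stage_fns.items():  # dicts preserve insertion order
        context[name] = fn(context)     # each stage writes exactly one artifact
    return context
```

Because each stage only appends its own key to the context, a failure in any stage leaves the earlier artifacts intact on disk — which is what makes a partial run still auditable.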
No LLM is involved in any measurement stage. P90 computation, anomaly detection, and baseline delta calculation are deterministic operations. A language model in the measurement pipeline would make correctness probabilistic where it must be exact.
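The deterministic core is small enough to sketch. This is an illustrative reimplementation, not the tool's source; in particular, the nearest-rank percentile method is an assumption that happens to be consistent with the sample artifacts in this post (four latency samples 123/125/134/139 µs yielding P90 = 139.0):

```python
import math

def p90(samples):
    # Nearest-rank percentile -- an assumption about the tool's method.
    # With 4 samples, rank = ceil(0.9 * 4) = 4, i.e. the largest sample.
    ordered = sorted(samples)
    rank = math.ceil(0.9 * len(ordered))
    return ordered[rank - 1]

def detect_anomaly(samples):
    # The two documented conditions, checked in order.
    if any(v == 0.0 for v in samples):
        return "CONNECTIVITY_DROP"   # connection timed out or dropped
    lo, hi = min(samples), max(samples)
    if (hi - lo) / lo > 0.50:
        return "HIGH_VARIANCE"       # connected but inconsistent
    return None                      # stable -- NOT necessarily healthy
```

Note that a division-by-zero in the variance check is impossible: any 0.0 sample already returned CONNECTIVITY_DROP on the line above.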
The Baseline Workflow
The baseline is the centrepiece. Everything else is a comparison against it.
Step 1 — Establish the baseline on a known-healthy path
On a day when the path is clean — before any planned change, after a confirmed recovery, or as part of initial infrastructure commissioning — run the tool with --is-baseline:
python pipe_meter.py \
--source-ip 10.0.1.4 \
--dest-ip 10.0.1.5 \
--ssh-user azureuser \
--test-type both \
--storage-account nwlogs080613 \
--container pktcaptures \
--resource-group nw-forensics-rg \
--iterations 8 \
--is-baseline
The report stage writes the result to two blob locations: the run artifact (pmeter_{session-id}_result.json) and the baseline slot (10_0_1_4_10_0_1_5_baseline.json). The baseline blob name is derived from the IP pair with dots replaced by underscores. There is one baseline slot per ordered IP pair.
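The naming rule is simple enough to express directly — a sketch of the documented convention, not the tool's code:

```python
def baseline_blob_name(source_ip: str, dest_ip: str) -> str:
    # One baseline slot per ordered (source, dest) pair; dots become underscores.
    return f"{source_ip}_{dest_ip}".replace(".", "_") + "_baseline.json"
```

Because the pair is ordered, reversing source and destination addresses a different slot — an A→B baseline never collides with a B→A baseline for the same two VMs.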
Step 2 — The overwrite gate
If a baseline already exists when --is-baseline is passed, the tool does not silently overwrite it. It passes the blob upload command to Agentic Safety Shell, which classifies it as RISKY and surfaces it for operator confirmation. The prompt shows the command, the reasoning (a baseline already exists for this IP pair recorded at a specific timestamp, overwriting it will replace the known-good reference), and the risk classification (Azure CLI state-change, cannot be undone). The operator chooses to approve, deny, or modify before anything executes.
If the operator declines, the run is saved as a regular (non-baseline) run. The existing baseline is preserved. This matters: a known-good reference recorded on the day infrastructure was last confirmed healthy is worth protecting from an accidental replacement triggered by a flag passed during a routine comparison run.
Step 3 — Run comparisons against the baseline
Every subsequent run without --is-baseline loads the stored baseline and computes deltas:
python pipe_meter.py \
--source-ip 10.0.1.4 \
--dest-ip 10.0.1.5 \
--ssh-user azureuser \
--test-type both \
--storage-account nwlogs080613 \
--container pktcaptures \
--resource-group nw-forensics-rg \
--iterations 4
The console output for session pmeter_20260305T051746:
=====================================
=== Agentic Pipe Meter — Results ===
=====================================
Session: pmeter_20260305T051746
Source: 10.0.1.4 → 10.0.1.5
Test: both | 4 iterations
Status: SUCCESS
Latency (P90): 139.0 µs ← +2.2% vs baseline (slower)
Throughput (P90): 1.95 Gbps ← -3.0% vs baseline (lower)
Stability: STABLE
Audit: ./audit/pmeter_20260305T051746_result.json
Blob: https://nwlogs080613.blob.core.windows.net/pktcaptures/...
=====================================
The full _result.json for that run:
{
"test_metadata": {
"session_id": "pmeter_20260305T051746",
"source_ip": "10.0.1.4",
"destination_ip": "10.0.1.5",
"ssh_user": "azureuser",
"test_type": "both",
"is_baseline": false,
"timestamp": "2026-03-05T05:20:40.165854+00:00",
"iterations": 4,
"resource_group": "nw-forensics-rg",
"storage_account": "nwlogs080613",
"container": "pktcaptures"
},
"preflight": {
"ports_open": true,
"tools_ready": true,
"actions_taken": [],
"blocked_ports": []
},
"results": {
"is_stable": true,
"anomaly_type": null,
"latency_p90": 139.0,
"latency_min": 123.0,
"latency_max": 139.0,
"throughput_p90": 1.95,
"throughput_min": 1.95,
"throughput_max": 1.95,
"units": { "latency": "us", "throughput": "Gbps" },
"iteration_data": [
{"iteration": 1, "latency_us": 123.0, "throughput_gbps": 1.95},
{"iteration": 2, "latency_us": 125.0, "throughput_gbps": 1.95},
{"iteration": 3, "latency_us": 139.0, "throughput_gbps": 1.95},
{"iteration": 4, "latency_us": 134.0, "throughput_gbps": 1.95}
]
},
"comparison": {
"baseline_found": true,
"baseline_timestamp": "2026-03-04T11:27:13.336706+00:00",
"baseline_latency_p90": 136.0,
"baseline_throughput_p90": 2.01,
"delta_pct_latency": 2.2058823529411766,
"delta_pct_throughput": -2.9850746268656634
}
}
The baseline was recorded on 2026-03-04 at 11:27 UTC. This run on 2026-03-05 shows +2.2% latency and -3.0% throughput against that baseline. Both are within normal variance. The path is healthy.
Here is a second run taken four minutes after the baseline was recorded (session pmeter_20260304T112845):
{
"test_metadata": {
"session_id": "pmeter_20260304T112845",
"source_ip": "10.0.1.4",
"destination_ip": "10.0.1.5",
"ssh_user": "azureuser",
"test_type": "both",
"is_baseline": false,
"timestamp": "2026-03-04T11:31:37.651368+00:00",
"iterations": 4,
"resource_group": "nw-forensics-rg",
"storage_account": "nwlogs080613",
"container": "pktcaptures"
},
"preflight": {
"ports_open": true,
"tools_ready": true,
"actions_taken": [],
"blocked_ports": []
},
"results": {
"is_stable": true,
"anomaly_type": null,
"latency_p90": 145.0,
"latency_min": 135.0,
"latency_max": 145.0,
"throughput_p90": 1.95,
"throughput_min": 1.94,
"throughput_max": 1.95,
"units": { "latency": "us", "throughput": "Gbps" }
},
"comparison": {
"baseline_found": true,
"baseline_timestamp": "2026-03-04T11:27:13.336706+00:00",
"baseline_latency_p90": 136.0,
"baseline_throughput_p90": 2.01,
"delta_pct_latency": 6.61764705882353,
"delta_pct_throughput": -2.9850746268656634
}
}
Latency P90 is 145 µs vs the baseline 136 µs (+6.6%). Throughput is 1.95 Gbps vs 2.01 Gbps (-3.0%). Both within normal variance. Taken four minutes apart on the same path, the two runs bracket the baseline cleanly.
That is what a healthy comparison looks like. When a maintenance window degrades the path, the deltas tell you immediately — and the artifact is already in blob storage for whoever needs to review it.
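The delta arithmetic in the comparison blocks above is the plain relative difference against the baseline P90. This sketch reproduces the values from the first artifact:

```python
def delta_pct(current: float, baseline: float) -> float:
    # Positive latency delta = slower; negative throughput delta = lower.
    return (current - baseline) / baseline * 100.0

latency_delta = delta_pct(139.0, 136.0)   # ~ +2.21% (slower)
throughput_delta = delta_pct(1.95, 2.01)  # ~ -2.99% (lower)
```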
Pre/Post Change Verification
The primary operational use pattern is the maintenance window bracket:
──────────────────────────────────────────────────────────────────────────
MAINTENANCE WINDOW BRACKET
──────────────────────────────────────────────────────────────────────────
T=0 Pre-change run (--is-baseline)
Stores baseline: 10_0_1_4_10_0_1_5_baseline.json
Saves artifact: pre_change_20260310_result.json
│
│ ◀──── change window ────────────────────────────────────▶
│
T=1 Post-change run (comparison)
Loads baseline automatically from blob
Computes delta: latency +X% throughput −Y%
Saves artifact: post_change_20260310_result.json
Uploads to blob before the post-change review meeting
──────────────────────────────────────────────────────────────────────────
Before the window opens:
# Establish the pre-change measurement as the new baseline
python pipe_meter.py \
--source-ip 10.0.1.4 \
--dest-ip 10.0.1.5 \
--ssh-user azureuser \
--test-type both \
--storage-account nwlogs080613 \
--container pktcaptures \
--resource-group nw-forensics-rg \
--iterations 8 \
--is-baseline \
--session-id pre_change_20260310
Apply the change.
After the window closes:
# Compare against the stored baseline
python pipe_meter.py \
--source-ip 10.0.1.4 \
--dest-ip 10.0.1.5 \
--ssh-user azureuser \
--test-type both \
--storage-account nwlogs080613 \
--container pktcaptures \
--resource-group nw-forensics-rg \
--iterations 8 \
--session-id post_change_20260310
The post-change run loads the pre-change baseline automatically from blob storage and computes deltas. The console shows the before/after delta immediately. The _result.json artifact — including the full comparison block with baseline timestamp and delta percentages — is in blob storage before the change review meeting starts. Nobody needs to re-run anything to share the evidence.
The Preflight Auto-Remediation Path
Before measurement begins, the preflight stage checks two things using effective NSG rules — not configured rules — on both VMs’ NICs:
- Whether measurement ports 5001 (iperf2) and 19765 (qperf) are open
- Whether qperf and iperf2 binaries are present on both VMs
preflight stage
│
▼
Check effective NSG (source VM + dest VM)
├── Port blocked?
│ ├── YES ──▶ Generate az network nsg rule create command
│ │ Agentic Safety Shell RISKY gate
│ │ ├── Approve ──▶ Rule created ──▶ continue
│ │ └── Deny ──▶ RUN ABORTS
│ └── NO ──▶ continue
│
▼
Check qperf / iperf2 binaries (source VM + dest VM)
├── Missing?
│ ├── YES ──▶ Generate install command
│ │ Agentic Safety Shell RISKY gate
│ │ ├── Approve ──▶ Installed ──▶ continue
│ │ └── Deny ──▶ RUN ABORTS
│ └── NO ──▶ continue
│
▼
measure stage begins
If a port is blocked, the tool does not abort silently. It generates the az network nsg rule create command needed to open the port — with priority computed at runtime to avoid conflicts with existing NSG rules — and passes it to Agentic Safety Shell as a RISKY command. Agentic Safety Shell surfaces it for operator confirmation, showing the full command, the reasoning (which port is blocked on which VM’s effective NSG, which measurement tool it affects), and the risk classification (Azure CLI state-change, creates a new NSG rule). The operator chooses to approve, deny, or modify.
If the operator approves, the rule is created and measurement proceeds. If declined, the run aborts — the tool does not attempt to measure through a blocked port. A measurement taken through a partial connectivity state would produce a result that looks like degradation when it is actually a preflight gap.
The same gate applies to tool installation. If qperf or iperf2 is missing on either VM, the install command stops for approval. If declined, the run aborts.
The check uses effective rules (az network nic list-effective-nsg) because configured rules represent intent and effective rules represent enforcement. Checking the wrong layer here would produce false positives and miss real blocks.
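What evaluating an effective rule set might look like, as a minimal sketch — the field names here are assumptions modelled on `az network nic list-effective-nsg -o json` output, so verify them against your CLI version before relying on this shape:

```python
def port_allowed_inbound(effective_rules, port: int) -> bool:
    """Evaluate rules the way the platform does: ascending priority,
    first matching rule wins, implicit deny if nothing matches.
    Field names are an assumption based on the az CLI's effective-NSG JSON."""
    for rule in sorted(effective_rules, key=lambda r: r.get("priority", 65500)):
        if rule.get("direction") != "Inbound":
            continue
        ranges = rule.get("destinationPortRanges") or [rule.get("destinationPortRange", "*")]
        if any(_matches(r, port) for r in ranges):
            return rule.get("access") == "Allow"
    return False  # implicit deny

def _matches(port_range, port: int) -> bool:
    if port_range in ("*", None):
        return True
    lo, _, hi = str(port_range).partition("-")
    return int(lo) <= port <= int(hi or lo)
```

First-match-wins on priority is the key property: a low-priority Allow for port 5001 sitting above a catch-all Deny means the port is open, and evaluating the rules in any other order would get that wrong.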
The Six JSON Artifacts
Each run writes six artifacts to {audit_dir}/:
| Artifact | Content |
|---|---|
| {session_id}_manifest.json | Config snapshot: source IP, dest IP, test type, iterations, is_baseline flag, session ID, timestamp |
| {session_id}_preflight.json | Port status, tool check results, actions taken (NSG rules created, packages installed) |
| {session_id}_raw.json | Per-iteration latency (µs) and throughput (Gbps) samples |
| {session_id}_computed.json | P90, min, max, is_stable, anomaly_type per metric |
| {session_id}_comparison.json | Delta vs baseline, baseline P90 values and timestamp |
| {session_id}_result.json | Full merged artifact; also uploaded to blob storage |
These artifacts are a durable audit record. Every command that produced them is in the Agentic Safety Shell audit JSONL — timestamped, attributed, with human decision recorded for each RISKY step. An engineer on a different team in a different timezone can read the result without re-running anything. The next time a ticket opens on this path, the baseline comparison history is already in blob storage.
No artifact is modified after its stage writes it. Each stage owns one write.
Use Cases
| Use Case | Measurement type | What the result shows |
|---|---|---|
| Maintenance window bracket | Both (pre-change + post-change comparison) | Delta percentages against the pre-change baseline; confirms whether the change affected the path |
| Infrastructure commissioning | Both (--is-baseline) | Establishes the healthy-path reference on day one; every future run compares against it |
| Periodic path health verification | Both (scheduled comparison runs) | Drift detection — catches degradation that accumulated between maintenance windows without a triggering incident |
| Incident triage | Both (comparison run during active degradation) | Quantifies the degradation and confirms whether it was present before the most recent change |
One non-obvious finding worth naming: is_stable=True does not mean the path is healthy.
Each iteration sample collected
│
▼
Any sample = 0.0?
├── YES ──▶ CONNECTIVITY_DROP (connection timed out or dropped)
└── NO
│
▼
(max − min) / min > 0.50?
├── YES ──▶ HIGH_VARIANCE (connected but inconsistent)
└── NO
│
▼
is_stable=True, anomaly_type=null
⚠ This does NOT mean the path is healthy.
A rate-limited path returning 5 Mbps every iteration
passes both checks. The baseline delta catches it.
The anomaly detection flags two conditions: CONNECTIVITY_DROP (any sample is 0.0 — connection timed out or dropped) and HIGH_VARIANCE ((max − min) / min > 0.50 — the path is connected but inconsistent). What it does not flag is a path that is consistently degraded — one where every measurement returns the same, wrong value.
Consider a token bucket filter rate-limiting all TCP traffic to 5 Mbps. Every iteration measures 5 Mbps. The samples are perfectly consistent: no anomaly fires, is_stable=True, anomaly_type=null. The result looks clean. It is not — 5 Mbps is not a healthy throughput on an Azure accelerated networking VM path.
The design decision was to keep absolute-value thresholds out of the measurement tool. Platform throughput expectations vary by VM SKU and accelerated networking configuration. Embedding those assumptions in the measurement layer would mean the tool breaks silently when deployed against a different environment. The tool reports what it observed. The operator — or the agent calling the tool — reasons about whether that observation is anomalous given what they know about the path.
The baseline comparison is the correct mechanism for catching this case. If the baseline was established when the path was healthy at 2+ Gbps, a subsequent run returning 0.005 Gbps will show a delta of roughly -99.75% regardless of what is_stable says.
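The arithmetic behind that claim, using the hypothetical numbers from the rate-limit scenario above:

```python
# Hypothetical rate-limited path: baseline P90 2.01 Gbps, every sample 0.005 Gbps.
samples = [0.005] * 4
variance_ratio = (max(samples) - min(samples)) / min(samples)  # 0.0 -> no HIGH_VARIANCE
delta = (0.005 - 2.01) / 2.01 * 100.0                          # ~ -99.75% vs baseline
```

The identical samples sail through both anomaly checks, but the baseline delta is unmissable.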
What It Takes to Run
Prerequisites:
- Python 3.12
- SSH access to both VMs from the operator machine (ssh-agent or VM identity configured)
- Azure CLI authenticated against the target subscription
- Azure storage account and container for baseline and result blobs
- qperf and iperf2 on both VMs (preflight installs them if absent, gated through Agentic Safety Shell)
Establish the baseline:
python pipe_meter.py \
--source-ip 10.0.1.4 \
--dest-ip 10.0.1.5 \
--ssh-user azureuser \
--test-type both \
--storage-account nwlogs080613 \
--container pktcaptures \
--resource-group nw-forensics-rg \
--iterations 8 \
--is-baseline
Run a comparison:
python pipe_meter.py \
--source-ip 10.0.1.4 \
--dest-ip 10.0.1.5 \
--ssh-user azureuser \
--test-type both \
--storage-account nwlogs080613 \
--container pktcaptures \
--resource-group nw-forensics-rg \
--iterations 8
Full CLI reference:
python pipe_meter.py
--source-ip IP IP of the client VM
--dest-ip IP IP of the server VM
--ssh-user USER SSH username valid on both VMs
--test-type {latency,throughput,both}
--storage-account NAME Azure storage account for artifacts
--container NAME Azure blob container for artifacts
--resource-group RG Azure resource group (required for NSG remediation)
[--iterations N] Default: 8
[--is-baseline] Flag: mark this run as the baseline for this IP pair
[--session-id ID] Default: auto-generated pmeter_{YYYYMMDDTHHMMSS}
[--audit-dir PATH] Default: ./audit
What the tool does not do — intentional omissions:
- No LLM or AI in any measurement stage — all operations are deterministic
- No auto-remediation — every write action (NSG rule creation, package install) flows through the Agentic Safety Shell HITL gate
- UDP not supported — TCP only: qperf port 19765, iperf2 port 5001
- One source/destination pair per invocation
- Azure only — the CloudProvider Protocol exists for future providers; only AzureProvider is built
- No time-series storage or trending — individual JSON blobs per run
- No parallel multi-pair testing
- No silent retry on measurement failure — a parse error is a run failure
The tool is also a library. Ghost Agent calls it as a tool in its investigation pipeline when it needs live measurement evidence during an active investigation. That integration is described separately. The standalone CLI path documented here does not require Ghost Agent or any other caller.
Conclusion
Ad hoc iperf is not a measurement practice. It is a command that produces a number with no anchor, no audit trail, and no repeatability across engineers or time.
Agentic Pipe Meter replaces that with a structured, sequenced pipeline that produces a fixed set of artifacts on every run, stores a reference baseline in blob storage, and computes deltas automatically. The operator gets a number that means something — not because it is big or small in isolation, but because it is measured against a known-good reference using the same parameters every time.
What the tool does not replace is expert judgment. is_stable=True with a delta of -99% still requires someone to read that number and act on it. The measurement layer reports what the path delivered. What that means for a specific VM SKU, application protocol, or traffic profile is the engineer’s call.
What becomes consistent is the evidence. The before measurement exists. The after measurement exists. The delta is computed automatically. The artifacts are in blob storage before the post-change review meeting. The Agentic Safety Shell audit log records every command that produced them.
That is the practice. The tool just enforces it.
GitHub: https://github.com/ranga-sampath/agentic-network-tools
Clone the repo, establish a baseline on any two Azure VMs with SSH access, and run a comparison. The preflight stage handles port rules and tool installation. The first delta is ready in minutes.
If you are building measurement tooling for other cloud platforms or working through the same HITL gate design for write operations in your own agentic tooling, I am happy to compare notes.
