What the Azure Toolkit Doesn’t Tell You About Your Effective Routes

For network engineers, cloud architects, and the product leaders who support them.


The on-call engineer checked the NSG. Port 5432 is allowed. Port 5432 is blocked. Both statements are true.


The Problem

Two classes of Azure network failure are invisible to the diagnostic views most teams check first, including Change Analysis, Network Watcher topology, and per-resource NSG views.

BGP route withdrawal. When a VPN gateway or ExpressRoute circuit withdraws routes — due to a session flap, a maintenance event, or the “propagate gateway routes” flag being disabled — no ARM resource changes. Azure Change Analysis records nothing. The routes disappear silently from every NIC’s effective route table while the configured route table resource stays intact.

NSG evaluation drift. Azure evaluates inbound traffic against the subnet NSG first, then the NIC NSG. A deny at priority 100 in the subnet NSG overrides every NIC NSG allow rule, regardless of priority number. Querying each NSG in isolation cannot surface this interaction. The combined evaluation result — the effective security rules — only appears via az network nic list-effective-nsg.

Both are computed states. They are derived on demand from the Azure network stack. They are not ARM resources and they produce no Activity Log entries. This is the gap Azure Change Analysis cannot close.


How Teams Handle It Today

P1 escalation: TCP 5432 to destination VM is broken
  → On-call: az network nsg rule list → "port 5432 is allowed"
      → Investigation stops. Control plane looks clean.
  → Actually: subnet NSG has a priority-100 DENY that fires before the NIC NSG allow.
      Visible only via az network nic list-effective-nsg.
      Never queried. Never part of the standard runbook.

Maintenance window closes
  → Engineer queries az network nsg rule list, az network route-table route list
      → "Same as before"
  → Traffic degraded 30 minutes later
      → No baseline was taken before the window.
          Diff is impossible. Investigation becomes "find the change."
  → VPN gateway had "propagate gateway routes" disabled during the window.
      BGP routes disappeared from every NIC on the subnet.
      No ARM resource changed. Change Analysis: nothing to report.

The structural failure is the same in both cases: teams query configured state and assume it matches effective state.


What Effective Network Inspector Does

Effective Network Inspector (ENI) snapshots and diffs two Azure control-plane computed states per NIC: effective routes (az network nic show-effective-route-table) and the combined effective NSG evaluation (az network nic list-effective-nsg). It takes a baseline before a change window, captures current state after, and produces a structured diff — SHA-256 verified, machine-readable, with drift_detected: false written explicitly as a negative confirmation.

┌─────────────────────────────────────────────────────────────────────────┐
│  effective_network_inspector.py                                         │
│                                                                         │
│  [1] Discover NICs     az vm nic list  /  az network vnet subnet list   │
│         │                                                               │
│  [2] Query per NIC     az network nic show-effective-route-table        │
│         │              az network nic list-effective-nsg                │
│         │              (ThreadPoolExecutor, configurable workers)       │
│         │                                                               │
│  [3] Save snapshot     audit/eni_{session_id}_snapshot.json             │
│         │              audit/eni_{session_id}_snapshot.json.sha256      │
│         │                                                               │
│  [4] Diff (optional)   verify SHA-256 on baseline                       │
│                        diff.py (pure function, no I/O)                  │
│                        audit/eni_{baseline}_vs_eni_{compare}_diff.json  │
│                        print summary to stdout                          │
└─────────────────────────────────────────────────────────────────────────┘

Three files, one responsibility each. effective_network_inspector.py orchestrates the run. providers.py is the only file that calls the Azure CLI. diff.py exposes a pure diff function: deterministic for identical inputs, no I/O, no side effects. All structured output goes to the artifact file; stdout is for human progress output only.
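
The snapshot-plus-checksum step can be sketched as follows. This is a minimal illustration, assuming the .sha256 sidecar holds the hex digest of the snapshot bytes; the function names save_snapshot and verify_snapshot are hypothetical and not taken from the ENI source.

```python
import hashlib
import json
from pathlib import Path


def save_snapshot(path: Path, snapshot: dict) -> str:
    """Write the snapshot as JSON plus a .sha256 sidecar; return the digest."""
    payload = json.dumps(snapshot, sort_keys=True, indent=2).encode("utf-8")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)
    digest = hashlib.sha256(payload).hexdigest()
    # Assumed sidecar format: the hex digest on a single line.
    Path(str(path) + ".sha256").write_text(digest + "\n", encoding="utf-8")
    return digest


def verify_snapshot(path: Path) -> bool:
    """Return True only if the file on disk still matches its sidecar digest."""
    expected = Path(str(path) + ".sha256").read_text(encoding="utf-8").strip()
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    return actual == expected
```

Verifying the baseline before diffing means a truncated or hand-edited baseline is rejected up front rather than producing a misleading diff.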

A P1 Investigation — Step by Step

TCP 5432 is unreachable. On-call confirmed the NIC NSG allows port 5432. A last-known-good baseline was taken at the start of shift: eni_pre_escalation_Q.

python effective_network_inspector.py \
  --scope vm --vm-name tf-dest-vm \
  --resource-group nw-forensics-rg \
  --compare-baseline pre_escalation_Q
[1/4] Discovering NICs (scope: vm / tf-dest-vm) ...
      Found 1 NIC(s): tf-dest-vm-nic
[2/4] Querying effective network state ...
      (az effective-route-table can take 30–60 s per NIC; running up to 4 NIC queries in parallel) ...
      Snapshotting NIC 1/1: tf-dest-vm-nic (routes: 8, nsg_rules: 23)
[3/4] Saving results ...
[4/4] Comparing against baseline: eni_pre_escalation_Q ...
      drift_detected: true — 1 change(s)
        security_rule_change: 1
        [tf-dest-vm-nic] ADDED security_rule_change: ghost-demo-subnet-block-5432 (Inbound)
      Diff report: audit/eni_pre_escalation_Q_vs_eni_20260407_143022_diff.json

ghost-demo-subnet-block-5432 is the rule name from the tested scenario. In production, it would be whatever the engineer named the rule that was added to the subnet NSG. The finding is the same: a rule appeared in the combined effective evaluation that wasn’t there at baseline — a rule the NIC NSG query never surfaced.

Why the NIC NSG query was correct but insufficient: Azure evaluates inbound traffic against the subnet NSG first. Priorities are compared only within a single NSG, so the priority-100 DENY in the subnet NSG drops the packet before any NIC NSG rule is consulted, whatever that rule's priority number. az network nsg rule list on the NIC NSG returns an accurate list of NIC NSG rules, but it says nothing about the subnet NSG evaluation that precedes it. Only the combined effective result tells the full story.
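
The evaluation order can be modeled in a few lines. This is a simplified sketch, not Azure's implementation: rules match on destination port only, Azure's built-in default rules are not modeled, and the Rule class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int  # lower number is evaluated first within one NSG
    access: str    # "Allow" or "Deny"
    port: int


def evaluate_nsg(rules: list[Rule], port: int) -> tuple[str, str]:
    """Return (access, rule_name) of the lowest-priority-number match."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.port == port:
            return rule.access, rule.name
    # Simplification: default rules are not modeled; unmatched traffic is denied.
    return "Deny", "default-deny"


def evaluate_inbound(
    subnet_nsg: list[Rule], nic_nsg: list[Rule], port: int
) -> tuple[str, str]:
    """Inbound traffic must pass the subnet NSG before the NIC NSG is consulted."""
    access, name = evaluate_nsg(subnet_nsg, port)
    if access == "Deny":
        return "Deny", f"subnet:{name}"
    access, name = evaluate_nsg(nic_nsg, port)
    return access, f"nic:{name}"


subnet_nsg = [Rule("ghost-demo-subnet-block-5432", 100, "Deny", 5432)]
nic_nsg = [Rule("allow-postgres", 200, "Allow", 5432)]

# The NIC NSG in isolation says "allowed"; the combined evaluation says "denied".
print(evaluate_nsg(nic_nsg, 5432))                  # ('Allow', 'allow-postgres')
print(evaluate_inbound(subnet_nsg, nic_nsg, 5432))  # ('Deny', 'subnet:ghost-demo-subnet-block-5432')
```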

The diff is what Ghost Agent reads and routes to its reasoning loop. It does not re-query Azure — it reads the artifact and returns the structured finding directly.


Where ENI Fits in Ghost Agent

Ghost Agent is an autonomous network forensics investigator. It forms hypotheses, runs diagnostics, and escalates through an evidence hierarchy. ENI fills the computed state layer — the specific gap between “what Azure is configured to do” and “what Azure is actually doing at the NIC.”

Ghost Agent — Azure Investigation Escalation
══════════════════════════════════════════════════════════════

  Symptom described in plain English
        │
        ▼
  ┌─────────────────────────────────────────────────────────┐
  │  Azure configured state reads (auto-approved)           │
  │  az network nsg rule list,                              │
  │  az network route-table route list,                     │
  │  az network vnet show, az network dns zone list         │
  └──────┬─────────────────────────────────────────────────┘
         │ configured state is clean
         │ effective state needed
         ▼
  ┌──────────────────────────────────────────────────────────┐
  │  Azure computed effective state        ← ENI lives here  │
  │  detect_effective_network_drift                          │
  │  → BGP route withdrawal                                  │
  │  → Combined NSG evaluation result                        │
  │  → UDR effective routing at the NIC                      │
  └──────┬───────────────────────────────────────────────────┘
         │ wire-level evidence needed
         ▼
  ┌─────────────────────────────────────────────────────────┐
  │  Packet capture (approval required)                     │
  │  capture_traffic → Cloud Orchestrator → PCAP Forensic   │
  └─────────────────────────────────────────────────────────┘

  Parallel track — OS layer inside the VM:
  ┌─────────────────────────────────────────────────────────┐
  │  detect_config_drift → firewall_inspector.py            │
  │  → iptables / nftables rules inside the guest OS        │
  └─────────────────────────────────────────────────────────┘

The OS layer and the Azure layer are separate state spaces. A clean effective NSG does not imply a clean OS firewall — and vice versa. In the P1 scenario above, Ghost Agent ran ENI to identify the subnet NSG block, then ran pipe meter to identify an unrelated tc netem latency injection on the source VM. Two root causes, two separate remediations, found in one investigation. The artifact prefix enforces the boundary: eni_* artifacts are ENI session data, fw_* are firewall inspector session data. Mixing them across tools fails SHA-256 verification.

Before ENI, Ghost Agent could query configured state only. The class of failure that requires computed effective state — BGP withdrawal, combined NSG evaluation — required manual az CLI calls and manual interpretation. ENI closes that gap and makes the finding machine-readable for the reasoning loop.


Technical Challenges That Made This Hard

1. Computed State Is Not a REST Resource

Azure’s ARM API is built around resources — objects with IDs, properties, and Activity Log entries. Effective route tables and effective NSG rules are not resources. They are computed on-demand via action-type API verbs (effectiveRouteTable/action, effectiveNetworkSecurityGroups/action), not standard GET requests. This has two consequences.

First, neither permission is included in the built-in Reader role — a silent gap that surfaces as an authorization error buried in the az CLI response, not a clean 403. ENI detects AuthorizationFailed in CLI output and raises a typed error distinguishing RBAC failures from generic errors.
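
One way to make that distinction is sketched below, assuming the provider layer already has the exit code and stderr text of the az call in hand. The exception classes, the function name, and the sample message are hypothetical, not ENI's actual API.

```python
class AzCliError(RuntimeError):
    """Generic failure of an az CLI call."""


class AzAuthorizationError(AzCliError):
    """The caller lacks an RBAC permission for the requested action."""


def raise_for_cli_failure(returncode: int, stderr: str) -> None:
    """Raise a typed error for a failed az call; do nothing on success."""
    if returncode == 0:
        return
    # The error code appears inside the CLI's error text, so match on the string.
    if "AuthorizationFailed" in stderr:
        raise AzAuthorizationError(stderr.strip())
    raise AzCliError(stderr.strip() or f"az exited with code {returncode}")


# Illustrative stderr text only; real messages carry more detail.
sample = "(AuthorizationFailed) The client does not have authorization to perform action"
try:
    raise_for_cli_failure(1, sample)
except AzAuthorizationError as exc:
    print(f"RBAC gap: {exc}")
```

Because AzAuthorizationError subclasses AzCliError, callers that only care about "the query failed" can catch the base class, while the snapshot writer can record the RBAC case separately.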

Second, BGP-propagated route state is only observable by diffing the effective route table at the NIC — before and after. The route whose source is VirtualNetworkGateway that disappears from the NIC effective table leaves no ARM event behind. Diffing configured route table resources produces no finding. This is the capability no other Azure tool provides.
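
A sketch of that before-and-after comparison follows. It assumes each route entry is a dict with a source string and an addressPrefix list, as in the value array returned by az network nic show-effective-route-table; the function names are illustrative.

```python
def gateway_prefixes(effective_routes: list[dict]) -> set[str]:
    """Collect the address prefixes of routes learned from a virtual network gateway."""
    prefixes: set[str] = set()
    for route in effective_routes:
        if route.get("source") == "VirtualNetworkGateway":
            prefixes.update(route.get("addressPrefix", []))
    return prefixes


def withdrawn_gateway_routes(baseline: list[dict], current: list[dict]) -> list[str]:
    """Prefixes that were gateway-learned at baseline and are gone now."""
    return sorted(gateway_prefixes(baseline) - gateway_prefixes(current))


baseline = [
    {"source": "Default", "addressPrefix": ["10.0.0.0/16"], "nextHopType": "VnetLocal"},
    {"source": "VirtualNetworkGateway", "addressPrefix": ["192.168.0.0/16"],
     "nextHopType": "VirtualNetworkGateway"},
]
current = [
    {"source": "Default", "addressPrefix": ["10.0.0.0/16"], "nextHopType": "VnetLocal"},
]

print(withdrawn_gateway_routes(baseline, current))  # ['192.168.0.0/16']
```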

2. Partial Snapshots Must Be Valid Artifacts

At scale, NIC queries fail for legitimate reasons: a stopped VM, an RBAC gap on one NIC, Azure throttling during a large fleet snapshot. The wrong response is to abort and write nothing.

ENI treats a snapshot with per-NIC errors as a valid artifact. Each NIC records either its effective state or a typed error:

{ "nic_name": "prod-vm-nic-2", "error": "AuthorizationFailed: ..." }

The diff engine skips errored NICs on both sides and records them in skipped_nics. Exit codes are explicit: 0 = all NICs succeeded, 1 = partial (some NIC errors), 2 = fatal (no artifact written). Ghost Agent reads the exit code alongside the artifact — partial visibility is more useful than a failed run.
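
A pure diff that tolerates per-NIC errors might look like the sketch below. It assumes a hypothetical snapshot shape that maps each NIC name to either its state or an error record, and it compares only security rule names to stay short. Field names other than drift_detected, skipped_nics, and error are illustrative.

```python
def diff_snapshots(baseline: dict, current: dict) -> dict:
    """Compare per-NIC state, skipping NICs that errored on either side."""
    changes, skipped = [], []
    for nic in sorted(set(baseline) | set(current)):
        before = baseline.get(nic, {})
        after = current.get(nic, {})
        if "error" in before or "error" in after:
            skipped.append(nic)
            continue
        old_rules = set(before.get("nsg_rules", []))
        new_rules = set(after.get("nsg_rules", []))
        for name in sorted(new_rules - old_rules):
            changes.append({"nic": nic, "change": "ADDED", "rule": name})
        for name in sorted(old_rules - new_rules):
            changes.append({"nic": nic, "change": "REMOVED", "rule": name})
    return {
        "drift_detected": bool(changes),  # present even when False
        "changes": changes,
        "skipped_nics": skipped,
    }


def exit_code(snapshot: dict) -> int:
    """0 = every NIC succeeded, 1 = at least one NIC recorded an error.

    Exit code 2 (fatal, no artifact written) is outside this sketch.
    """
    return 1 if any("error" in state for state in snapshot.values()) else 0


baseline = {
    "tf-dest-vm-nic": {"nsg_rules": ["allow-postgres"]},
    "prod-vm-nic-2": {"error": "AuthorizationFailed: ..."},
}
current = {
    "tf-dest-vm-nic": {"nsg_rules": ["allow-postgres", "ghost-demo-subnet-block-5432"]},
    "prod-vm-nic-2": {"nsg_rules": []},
}
report = diff_snapshots(baseline, current)
print(report["drift_detected"], report["skipped_nics"], exit_code(baseline))
```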


What It Takes to Run

Prerequisites:

  • Python 3.12+
  • Azure CLI authenticated: az login
  • Custom RBAC role (or Network Contributor) with:
    • Microsoft.Network/networkInterfaces/effectiveNetworkSecurityGroups/action
    • Microsoft.Network/networkInterfaces/effectiveRouteTable/action
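
A custom role carrying just these two actions could be generated as below. The role name, description, and subscription placeholder are assumptions, and the sketch assumes the role is assigned alongside Reader so that NIC discovery still works.

```python
import json

ENI_ACTIONS = [
    "Microsoft.Network/networkInterfaces/effectiveNetworkSecurityGroups/action",
    "Microsoft.Network/networkInterfaces/effectiveRouteTable/action",
]


def build_role_definition(subscription_id: str) -> dict:
    """Build a custom role definition limited to the two effective-state actions."""
    return {
        "Name": "ENI Effective State Reader",  # hypothetical role name
        "Description": "Query effective routes and effective NSG rules on NICs.",
        "Actions": ENI_ACTIONS,
        "NotActions": [],
        "AssignableScopes": [f"/subscriptions/{subscription_id}"],
    }


# Placeholder subscription ID; substitute a real one before use.
role = build_role_definition("00000000-0000-0000-0000-000000000000")
print(json.dumps(role, indent=2))
```

The printed JSON can be saved to a file such as eni_role.json and registered with az role definition create --role-definition @eni_role.json.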

Two commands:

# Baseline — before the change window, or at investigation start
python effective_network_inspector.py \
  --scope vm --vm-name VMNAME \
  --resource-group RG_NAME \
  --is-baseline --session-id pre_CRNNNN

# Compare — after the window
python effective_network_inspector.py \
  --scope vm --vm-name VMNAME \
  --resource-group RG_NAME \
  --compare-baseline pre_CRNNNN

For VNet-wide coverage: --scope vnet --vnet-id VNET_RESOURCE_ID. ENI discovers all NICs across all subnets, including subnets without an associated route table.

Ghost Agent: Set ENI_VM_NAME in config.env. The detect_effective_network_drift tool invokes ENI as a subprocess, reads the diff artifact, and returns the structured findings to the reasoning loop.


Conclusion

ENI makes one specific gap observable: the difference between configured Azure network state and effective Azure network state, expressed as a timestamped, SHA-256 verified, machine-readable diff. It does not replace Azure Change Analysis. It covers what Change Analysis cannot — computed state that leaves no ARM event.

The drift_detected: false result is operationally useful. An explicit, verified negative tells a CAB the window was clean without a manual NSG and route table audit. It turns that determination into a two-command operation that takes minutes (the effective route query alone can take 30–60 s per NIC) rather than a 30-minute process that depends on who is on call.


GitHub: github.com/ranga-sampath/agentic-network-tools

Clone the repo. Take a baseline before your next change window. Run the compare after.

It would be great to hear which of these topics you would like covered in more depth.