External Monitoring with the Metrics and Status APIs

The Aviatrix Metrics and Status APIs allow you to integrate CoPilot with third-party monitoring platforms such as Datadog, Splunk, Grafana, or Prometheus. This guide covers which APIs to use, what data is available, how to configure alerting that stays consistent with CoPilot’s built-in monitoring, and how to avoid common pitfalls like double-counted traffic or noisy metrics.

CoPilot exposes two complementary APIs for external consumption:

Metrics API

Endpoint: /metrics-api/v1/gateways
Performance metrics including CPU, memory, throughput, and packet drops.
Scrape interval: Every 5 minutes

Status API

Endpoint: /status-api/v1/
Availability status for gateways, tunnels, and BGP peerings.
Scrape interval: Every 1 minute

Both APIs support Prometheus text format and JSON output. All data transmissions are encrypted using industry-standard protocols.

Authentication

Both APIs use the same API key, passed as a Bearer token:

Authorization: Bearer <your-api-key>

The API key is configured in CoPilot under Settings > API Keys. A single key grants access to both endpoints.

Enabling the API and Downloading Specifications

To use the APIs, you need to enable API access in CoPilot and create an authentication key.

The Aviatrix API uses port 443, the same port as the CoPilot UI. Ensure that port 443 is accessible and not restricted by any security groups.

The API key created during this procedure will not be accessible again. Save it in a secure place. If you lose the key, you must reset it.

In CoPilot, navigate to Settings > Configuration > General.
Scroll down to Features and select Metrics API or Status API.
Click Download to download the associated OpenAPI .yaml specification.

The OpenAPI specification files provide complete endpoint documentation, request/response schemas, and examples. Download the latest versions from CoPilot rather than relying on external copies, as the specifications are updated with each CoPilot release.

Testing API Access

Verify access to your CoPilot instance with a curl command:

curl -k -X 'GET' \
  'https://<copilot-address>/metrics-api/v1/gateways?format=json' \
  -H 'Authorization: Bearer <your-api-key>'

A 200 response with metric data confirms the API is working. A 403 response indicates an incorrect or expired key.
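The same check can be scripted. The following is a minimal Python sketch, assuming only the endpoint path, the `format=json` parameter, and the Bearer header described above; the function names (`build_request`, `fetch`, `describe_status`) and the host `copilot.example.com` are hypothetical and not part of any Aviatrix client.

```python
import urllib.error
import urllib.request
from typing import Tuple


def build_request(copilot_address: str, api_key: str, path: str,
                  as_json: bool = False) -> urllib.request.Request:
    """Build a GET request carrying the API key as a Bearer token."""
    url = f"https://{copilot_address}{path}"
    if as_json:
        url += "?format=json"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}, method="GET"
    )


def fetch(request: urllib.request.Request, timeout_s: float = 30.0) -> Tuple[int, bytes]:
    """Send the request and return (HTTP status, body), including error statuses."""
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        return err.code, err.read()


def describe_status(code: int) -> str:
    """Interpret the two status codes this guide documents."""
    if code == 200:
        return "ok"
    if code == 403:
        return "incorrect or expired key"
    return f"unexpected status {code}"
```

Unlike `curl -k`, this sketch leaves TLS certificate verification enabled, so it will fail against a CoPilot instance that presents a certificate the client does not trust.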

Metrics API

Endpoints

GET https://<copilot-address>/metrics-api/v1/gateways
GET https://<copilot-address>/metrics-api/v1/gateways?format=json

The default format is Prometheus text. Append ?format=json for structured JSON output.

How Data Is Collected

CoPilot collects performance data from every managed gateway once per minute via the Aviatrix Controller. The Metrics API serves a snapshot of the most recent collection, rounded to the nearest 5-minute boundary with a 15-minute lookback window. Values represent point-in-time samples, not averages or aggregations.

Recommended scrape interval: 5 minutes. Polling more frequently returns the same cached data.
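On the consumer side, the 5-minute boundary and the 15-minute lookback can be used to sanity-check sample timestamps. The sketch below is illustrative only: it assumes timestamps in epoch seconds, and the rounding shown is one plausible reading of "nearest 5-minute boundary", not a description of CoPilot internals. The function names are hypothetical.

```python
BOUNDARY_S = 300   # 5-minute boundary
LOOKBACK_S = 900   # 15-minute lookback window


def nearest_boundary(epoch_s: int, step_s: int = BOUNDARY_S) -> int:
    """Round an epoch timestamp to the nearest multiple of step_s."""
    return (epoch_s + step_s // 2) // step_s * step_s


def is_within_lookback(sample_epoch_s: int, now_epoch_s: int,
                       lookback_s: int = LOOKBACK_S) -> bool:
    """True if the sample is not from the future and not older than the lookback."""
    age = now_epoch_s - sample_epoch_s
    return 0 <= age <= lookback_s
```

A sample that falls outside the lookback window is a hint that collection for that gateway has stalled, which is worth alerting on separately from the metric value itself.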

Gateway-Level Metrics

These metrics are reported once per gateway, with a gateway label.

Metric	Description	Unit
`cpu_idle`	CPU idle time	Percent (0-100)
`cpu_used_per`	CPU utilization	Percent (0-100)
`cpu_us`	CPU time spent in user space	Percent
`cpu_ks`	CPU time spent in kernel space	Percent
`cpu_wait`	CPU time waiting on I/O	Percent
`memory_available`	Memory available to applications	Bytes
`memory_free`	Completely unused memory	Bytes
`memory_cached`	Memory used for disk cache	Bytes
`memory_buf`	Memory used for kernel buffers	Bytes
`memory_swpd`	Memory written to swap	Bytes
`memory_used_per`	Memory utilization	Percent (0-100)

cpu_used_per and memory_used_per are available in CoPilot 4.32+. On older versions, derive CPU utilization as 100 - cpu_idle. For memory percentage on older versions, you must know the instance type’s total memory from your cloud provider.
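For older versions, the derivations can be sketched in Python as follows. The CPU formula is the one given above; the memory formula assumes that "used" means total memory minus `memory_available`, and `total_memory_bytes` is a value you supply from your cloud provider's instance specifications. Function names are hypothetical.

```python
def cpu_used_percent(cpu_idle: float) -> float:
    """Derive CPU utilization from cpu_idle (both in percent)."""
    return 100.0 - cpu_idle


def memory_used_percent(memory_available_bytes: float,
                        total_memory_bytes: float) -> float:
    """Derive memory utilization, assuming used = total - available.

    total_memory_bytes is not reported by the API and must come from
    the instance type's published specifications.
    """
    if total_memory_bytes <= 0:
        raise ValueError("total_memory_bytes must be positive")
    return 100.0 * (1.0 - memory_available_bytes / total_memory_bytes)
```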

Per-vCPU Metrics

On gateways with multiple virtual CPUs, the API reports per-core utilization with gateway and vcpu_name labels.

Metric	Description	Unit
`vcpu_avg_usage`	Average CPU usage for this vCPU	Percent (0-100)
`vcpu_min_usage`	Minimum CPU usage for this vCPU	Percent (0-100)
`vcpu_max_usage`	Maximum CPU usage for this vCPU	Percent (0-100)

These metrics are useful for identifying core imbalance, for example one vCPU pegged at 100% while others are idle, which may indicate a single-threaded bottleneck.
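A simple imbalance check might look like the sketch below. The thresholds (`hot` and `idle`) are arbitrary illustrative values, not Aviatrix recommendations, and the input is assumed to be a mapping from `vcpu_name` to `vcpu_avg_usage` for one gateway.

```python
from typing import Dict


def core_imbalance(vcpu_usage: Dict[str, float],
                   hot: float = 95.0, idle: float = 10.0) -> bool:
    """Flag a gateway where one vCPU is saturated while another is nearly idle.

    vcpu_usage maps vcpu_name -> vcpu_avg_usage (percent).
    hot and idle are illustrative thresholds; tune them for your fleet.
    """
    if len(vcpu_usage) < 2:
        return False  # imbalance is undefined for a single core
    values = list(vcpu_usage.values())
    return max(values) >= hot and min(values) <= idle
```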

Interface-Level Metrics

These metrics are reported per gateway and per network interface, with gateway and interface labels.

Metric	Description	Unit
`rate_received`	Inbound throughput	Bits/sec
`rate_sent`	Outbound throughput	Bits/sec
`rate_total`	Combined throughput	Bits/sec
`rx_drop`	Inbound packet drops (cumulative)	Count
`tx_drop`	Outbound packet drops (cumulative)	Count
`rate_rx_drop`	Inbound drop rate	Drops/sec
`rate_tx_drop`	Outbound drop rate	Drops/sec
`rate_pkt_drop`	Combined drop rate	Drops/sec
`bandwidth_ingress_limit_exceeded`	Times ingress bandwidth limit was exceeded (cumulative)	Count
`pps_limit_exceeded`	Times packets-per-second limit was exceeded (cumulative)	Count

Throughput metrics (rate_sent, rate_received, rate_total) are reported in bits per second, not bytes. The pps_limit_exceeded and bandwidth_ingress_limit_exceeded counters are cumulative counts of packets throttled by the cloud provider’s instance-type network limits (AWS ENA driver). These counters are only present on AWS instances.
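When consuming these values outside Prometheus, two small conversions come up often: turning bits per second into bytes per second, and turning a cumulative counter into a per-interval increase. The sketch below treats a decreasing counter as a reset to zero, which mirrors how Prometheus handles counter resets; whether and when these counters reset on a gateway is an assumption, not something this guide specifies.

```python
def bits_to_bytes_per_sec(bits_per_sec: float) -> float:
    """Convert a throughput value from bits/sec to bytes/sec."""
    return bits_per_sec / 8.0


def counter_increase(previous: float, current: float) -> float:
    """Increase of a cumulative counter between two consecutive scrapes.

    If the counter went backwards, assume it restarted from zero and
    count everything observed since the restart.
    """
    if current < previous:
        return current
    return current - previous
```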

Example Prometheus Output

cpu_idle{gateway="spoke-gw-us-east-1"} 74.3 1734006896789
cpu_used_per{gateway="spoke-gw-us-east-1"} 25.7 1734006896789
memory_available{gateway="spoke-gw-us-east-1"} 3221225472 1734006896789
memory_used_per{gateway="spoke-gw-us-east-1"} 30.2 1734006896789
memory_swpd{gateway="spoke-gw-us-east-1"} 0 1734006896789
rate_received{gateway="spoke-gw-us-east-1", interface="eth0"} 125400.5 1734006896789
rate_pkt_drop{gateway="spoke-gw-us-east-1", interface="eth0"} 0 1734006896789
pps_limit_exceeded{gateway="spoke-gw-us-east-1", interface="eth0"} 0 1734006896789
vcpu_avg_usage{gateway="spoke-gw-us-east-1", vcpu_name="0"} 30.5 1734006896789
vcpu_avg_usage{gateway="spoke-gw-us-east-1", vcpu_name="1"} 20.9 1734006896789
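For quick scripts that need to read this output without a Prometheus server, a line such as those above can be split into name, labels, value, and timestamp. This is a minimal sketch that handles the shapes shown in this guide; it does not unescape label values or handle every corner of the exposition format, so a maintained parser library is preferable in production. `Sample` and `parse_line` are hypothetical names.

```python
import re
from typing import Dict, NamedTuple, Optional


class Sample(NamedTuple):
    name: str
    labels: Dict[str, str]
    value: float
    timestamp_ms: Optional[int]


# metric_name{labels} value [timestamp]
_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?\s*$'
)
# key="value" pairs; escaped characters inside the value are kept verbatim
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_line(line: str) -> Optional[Sample]:
    """Parse one exposition line; return None for blanks and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = _LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    labels = dict(_LABEL.findall(match.group("labels") or ""))
    ts = match.group("ts")
    return Sample(match.group("name"), labels,
                  float(match.group("value")), int(ts) if ts else None)
```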

Status API

Endpoints

GET https://<copilot-address>/status-api/v1/
GET https://<copilot-address>/status-api/v1/?format=json

How Data Is Collected

CoPilot polls gateway, tunnel, and BGP status every minute. The Status API serves the latest cached state with no windowing or aggregation.

Recommended scrape interval: 1 minute. The Status API refreshes at this rate, and availability events are time-sensitive.

Gateway Status

Reported as status with a gateway label.

Value	Meaning
`1`	Up — gateway is operating normally
`0`	Down — gateway is not responding
`-1`	Degraded — keepalive failure, configuration failure, upgrade failure, or transitional state

Examples in Prometheus format, then JSON:

status{gateway="transit-gw-1"} 1 1734006896789
status{gateway="spoke-gw-1"} 0 1734006896789

{
  "gateways": {
    "transit-gw-1": {
      "status": "up",
      "lastUpdatedTimestamp": "2024-12-12T19:47:01.000Z",
      "lastModifiedTimestamp": "2024-12-10T08:30:00.000Z"
    }
  }
}

The lastModifiedTimestamp indicates when the status last changed, while lastUpdatedTimestamp indicates when the status was last polled.
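The two timestamps together give the time a gateway has spent in its current state as of the last poll. A minimal Python sketch, assuming the timestamp format shown in the JSON example above (ISO 8601 with a trailing `Z`); the function names are hypothetical.

```python
from datetime import datetime
from typing import Dict


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z' as UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def seconds_in_current_state(gateway: Dict[str, str]) -> float:
    """Seconds between the last status change and the last poll."""
    updated = parse_ts(gateway["lastUpdatedTimestamp"])
    modified = parse_ts(gateway["lastModifiedTimestamp"])
    return (updated - modified).total_seconds()
```

For the `transit-gw-1` example above, this yields 213421 seconds, or roughly two and a half days in the `up` state.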

Tunnel Status

Includes both gateway-to-gateway peering tunnels and Site-to-Cloud (S2C) tunnels, reported as status with a tunnel label.

Value	Meaning
`1`	Up
`0`	Down

status{tunnel="spoke-gw-1.transit-gw-1"} 1 1734006896789
status{tunnel="s2c-branch-office"} 0 1734006896789

BGP Peering Status

Reported as bgp_status with gateway and bgp_neighbor labels.

Value	Meaning
`1`	Established
`0`	Not established

bgp_status{gateway="transit-gw-1",bgp_neighbor="vgw(169.254.254.185)"} 1 1734006896789
bgp_status{gateway="transit-gw-1",bgp_neighbor="peer(10.0.0.1)"} 0 1734006896789

Network Interface Guidance

Each gateway reports interface-level metrics for every network interface on the instance. Understanding which interfaces to monitor is essential for accurate dashboards and alerts.

Interfaces to Monitor

Interface	Description	Recommendation
`eth0`	Primary management and data interface	Always monitor — carries production traffic
`eth1`, `eth2`, etc.	Additional data-plane interfaces (if present)	Monitor — carries production traffic

Interfaces to Exclude

Interface	Description	Recommendation
`lo`	Loopback (127.0.0.1)	Exclude — internal-only, no operational value
`tun-*`	IPsec/GRE tunnel interfaces	Exclude in most cases — see note below

Why Exclude Tunnel Interfaces?

Traffic on tun-* interfaces is a subset of traffic already counted on the underlying eth interface. Including both leads to double-counted bandwidth in dashboards and inflated throughput numbers. For example, a packet traversing an IPsec tunnel from spoke-gw to transit-gw is counted once on eth0 (encrypted) and once on tun-abc123 (decrypted). Summing both overstates actual bandwidth consumption.

If you need per-tunnel visibility (for example, to identify which specific S2C tunnel is experiencing packet drops), collect tun-* metrics separately and do not sum them with eth interface metrics.

Recommended Filter

In Prometheus-based systems, apply a relabel or query filter:

# Throughput on production interfaces only
rate_total{interface=~"eth.*"}

# Drop loopback and tunnel metrics at ingestion
metric_relabel_configs:
  - source_labels: [interface]
    regex: '(lo|tun-.*)'
    action: drop
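For collectors that consume the JSON or parsed text output directly rather than through Prometheus, the same filter can be applied in code. The sketch below assumes each sample carries a labels mapping with an optional `interface` key; it uses full-string matching because Prometheus anchors relabel regexes at both ends. `keep_sample` is a hypothetical name.

```python
import re
from typing import Dict

# Same pattern as the relabel rule above, matched against the whole label value
EXCLUDED_IFACE = re.compile(r"(lo|tun-.*)")


def keep_sample(labels: Dict[str, str]) -> bool:
    """Return False for loopback and tunnel interface samples."""
    iface = labels.get("interface")
    if iface is None:
        return True  # gateway-level or vCPU metric, no interface label
    return EXCLUDED_IFACE.fullmatch(iface) is None
```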

Matching CoPilot’s Built-in Alerts

CoPilot ships with three default alert definitions. This section maps each to the equivalent external alert you can configure in your monitoring platform.

Global Control Plane Health

CoPilot monitors Controller and CoPilot CPU, memory, and disk health. Any condition true for 15 minutes triggers the alert.

CoPilot Condition	External Equivalent	Notes
CPU Usage > 90%	`cpu_used_per > 90` for 15m	CoPilot 4.32+. On older versions: `(100 - cpu_idle) > 90`
Memory Usage > 90%	`memory_used_per > 90` for 15m	CoPilot 4.32+. On older versions: `memory_available < (total_memory * 0.10)`
Disk Free < 5%	Not available via API	Use CoPilot webhook

Global Network Health

CoPilot monitors all gateways and their network interfaces. Any condition true for 15 minutes triggers the alert.

CoPilot Condition	External Equivalent	API
Gateway Status: DOWN, KEEPALIVE_FAIL, CONFIG_FAIL, UPGRADE_FAIL	`status{gateway=~".+"} != 1` for 5m	Status API
PPS Limit Exceeded > 1%	`increase(pps_limit_exceeded[5m]) > 0`	Metrics API
Packet Failure > 5%	Not available via API	Use CoPilot webhook
Memory Usage > 90%	`memory_used_per > 90` for 15m	Metrics API

Global Memory Swap Surge

CoPilot monitors all gateways for unexpected swap usage on instances with sufficient RAM. All conditions must be true for 15 minutes.

CoPilot Condition	External Equivalent	API
Swap > 0 bytes	`memory_swpd > 0` for 15m	Metrics API
Total Memory > 1 GB	Filter by instance type (known RAM)	External knowledge
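Outside a Prometheus rule engine, the "true for 15 minutes" semantics have to be implemented by hand. The sketch below is one way to approximate it: it looks at the trailing run of samples that satisfy the condition and checks that the run spans the window. The sample layout (time-ordered `(epoch_seconds, memory_swpd)` pairs), the function names, and the interpretation of 1 GB as 1,000,000,000 bytes are all assumptions.

```python
from typing import Callable, Sequence, Tuple

Samples = Sequence[Tuple[int, float]]  # (epoch seconds, value), oldest first


def sustained(samples: Samples, predicate: Callable[[float], bool],
              window_s: int) -> bool:
    """True if the most recent consecutive matching samples span window_s."""
    if not samples:
        return False
    latest_ts = samples[-1][0]
    run_start = None
    for ts, value in reversed(samples):
        if not predicate(value):
            break
        run_start = ts
    return run_start is not None and latest_ts - run_start >= window_s


def swap_surge(samples: Samples, total_memory_bytes: int,
               min_total_bytes: int = 1_000_000_000,
               window_s: int = 900) -> bool:
    """Swap in use for the whole window on an instance with more than ~1 GB RAM."""
    if total_memory_bytes <= min_total_bytes:
        return False
    return sustained(samples, lambda swpd: swpd > 0, window_s)
```

With a 5-minute scrape interval, a 15-minute window is covered by four consecutive samples, so a single scrape with zero swap resets the condition.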

Recommended Alert Configuration

Tier 1 — Critical Alerts

Configure these in every deployment. They cover the most impactful failure conditions.

Alert Name	Source	Expression	Duration	Severity
Gateway Down	Status API	`status{gateway=~".+"} != 1`	5 min	Critical
Tunnel Down	Status API	`status{tunnel=~".+"} == 0`	5 min	Critical
BGP Peer Down	Status API	`bgp_status != 1`	5 min	Critical
High CPU	Metrics API	`cpu_used_per > 90`	15 min	Critical
Memory Exhaustion	Metrics API	`memory_used_per > 90`	15 min	Critical
Swap Active	Metrics API	`memory_swpd > 0`	15 min	Warning

Adjust the memory threshold based on your gateway instance sizes. A gateway with 2 GB of RAM should alert at a different absolute threshold than one with 16 GB.

Tier 2 — Operational Alerts

These provide early warning for capacity issues and performance degradation.

Alert Name	Source	Expression	Duration	Severity
Cloud PPS Limit Hit	Metrics API	`increase(pps_limit_exceeded{interface=~"eth.*"}[5m]) > 0`	15 min	Warning
Cloud BW Limit Hit	Metrics API	`increase(bandwidth_ingress_limit_exceeded{interface=~"eth.*"}[5m]) > 0`	15 min	Warning
Packet Drops	Metrics API	`rate_pkt_drop{interface=~"eth.*"} > 0`	5 min	Warning
I/O Wait	Metrics API	`cpu_wait > 20`	15 min	Warning
High Throughput	Metrics API	`rate_total{interface="eth0"} > <80% of instance limit>`	15 min	Info

Tier 3 — CoPilot Webhook Alerts

The following conditions cannot be monitored via the external APIs. Configure these as CoPilot alert definitions with a webhook notification channel that forwards events to your monitoring platform.

Condition	Why It Requires a Webhook
Disk Free < 5%	Disk metrics are not exposed in the Metrics API
Packet Failure Rate > 5%	The `per_pkt_fail` metric is not exposed in the Metrics API
Underlay Connection Down	DPD status is tracked internally but not exposed in the Status API

To configure a webhook channel in CoPilot, navigate to Notifications > Alert Configuration > Channels and create a webhook channel pointing to your monitoring platform’s ingest URL.

Scrape Configuration

The examples below cover Prometheus, Datadog, and generic HTTP collectors (Splunk, custom), in that order.

scrape_configs:
  # Performance metrics - 5-minute resolution
  - job_name: 'aviatrix-metrics'
    scrape_interval: 5m
    scrape_timeout: 30s
    metrics_path: /metrics-api/v1/gateways
    scheme: https
    authorization:
      type: Bearer
      credentials: '<your-api-key>'
    static_configs:
      - targets: ['<copilot-address>']
    # Exclude loopback and tunnel interfaces
    metric_relabel_configs:
      - source_labels: [interface]
        regex: '(lo|tun-.*)'
        action: drop

  # Availability status - 1-minute resolution
  - job_name: 'aviatrix-status'
    scrape_interval: 1m
    scrape_timeout: 15s
    metrics_path: /status-api/v1/
    scheme: https
    authorization:
      type: Bearer
      credentials: '<your-api-key>'
    static_configs:
      - targets: ['<copilot-address>']

Use the Datadog Agent’s OpenMetrics integration to scrape both endpoints:

# /etc/datadog-agent/conf.d/openmetrics.d/conf.yaml
instances:
  - openmetrics_endpoint: https://<copilot-address>/metrics-api/v1/gateways
    namespace: aviatrix.metrics
    min_collection_interval: 300  # 5 minutes
    headers:
      Authorization: Bearer <your-api-key>
    exclude_metrics_by_labels:
      interface:
        - lo
        - tun-.*

  - openmetrics_endpoint: https://<copilot-address>/status-api/v1/
    namespace: aviatrix.status
    min_collection_interval: 60  # 1 minute
    headers:
      Authorization: Bearer <your-api-key>

For platforms that don’t natively support Prometheus scraping, use the JSON format:

# Metrics
GET https://<copilot-address>/metrics-api/v1/gateways?format=json
Authorization: Bearer <your-api-key>

# Status
GET https://<copilot-address>/status-api/v1/?format=json
Authorization: Bearer <your-api-key>

Parse the JSON response and map fields to your platform’s metric and event models.
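Because the full JSON schema lives in the OpenAPI specification rather than in this guide, a schema-agnostic first step is to flatten the response into path/value pairs and map those to your platform's fields. The sketch below is a generic helper, not an Aviatrix-specific parser; paths are kept as tuples because resource names such as tunnel names can themselves contain dots.

```python
from typing import Any, Dict, Tuple

Path = Tuple[str, ...]


def flatten(obj: Any, prefix: Path = ()) -> Dict[Path, Any]:
    """Flatten nested dicts and lists into {path tuple: leaf value}."""
    items: Dict[Path, Any] = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, prefix + (str(key),)))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items.update(flatten(value, prefix + (str(index),)))
    else:
        items[prefix] = obj
    return items
```

Applied to the gateway status example earlier in this guide, the path `("gateways", "transit-gw-1", "status")` maps to `"up"`.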

Data Freshness and Timing

Aspect	Metrics API	Status API
Collection frequency	Every 1 minute (internal)	Every 1 minute
API cache resolution	5-minute intervals	Real-time (latest cache)
Recommended scrape	Every 5 minutes	Every 1 minute
Lag vs. CoPilot alerts	~4-5 minutes	~1 minute

CoPilot’s built-in alert engine evaluates metrics every 60 seconds against a real-time internal cache. External monitoring introduces a small lag:

Status alerts (gateway/tunnel/BGP down): Expect approximately 1-2 minutes of lag compared to CoPilot’s built-in alerting.
Performance alerts (CPU, memory, drops): Expect approximately 5-10 minutes of lag due to the Metrics API’s 5-minute caching window.

This lag is by design. CoPilot’s built-in alerts default to 15-minute evaluation windows to avoid alert fatigue from transient issues. If you need more granular data, use CoPilot as a drill-down tool, configure CoPilot webhook-based alerts, or forward Syslog via the Aviatrix SIEM connector.

CoPilot Performance Page vs. Metrics API

Values shown on CoPilot’s Monitor > Performance page may not exactly match Metrics API output. This is expected — the two serve different purposes.

Aspect	CoPilot Performance Page	Metrics API
Purpose	Processed insights and trend analysis	Raw data for external processing
Data points	Time-series of aggregated values over a selected range	Single latest raw sample per gateway
Aggregation	Configurable: Average (default), Min, or Max	None — returns the most recent raw data point
Time range	User-selectable (last hour to 60+ days)	Fixed: latest sample from last 15 minutes
Time resolution	Dynamic buckets (1 min for last hour, 30 min for last 24 hours, etc.)	Single snapshot rounded to 5-minute boundary
Metrics available	~90 metrics including derived fields	~24 curated metrics (4.32+)

Aggregation is the primary cause of differences. The Performance page displays the average of all 1-minute samples within each time bucket. The Metrics API returns a single raw sample. Dashboards built from the Metrics API will appear noisier than CoPilot’s Performance charts. Your external monitoring platform should apply its own aggregation and smoothing functions.
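As an illustration of the kind of smoothing meant here, the sketch below averages raw samples into fixed-width time buckets, similar in spirit to the Performance page's Average aggregation. The bucket alignment (buckets start at multiples of the bucket width) and the function name are assumptions; CoPilot's exact bucketing is not described in this guide.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def bucket_average(samples: Iterable[Tuple[int, float]],
                   bucket_s: int) -> Dict[int, float]:
    """Average (epoch seconds, value) samples per bucket of width bucket_s.

    Returns {bucket start time: mean value}, ordered by bucket start.
    """
    totals = defaultdict(lambda: [0.0, 0])  # bucket start -> [sum, count]
    for ts, value in samples:
        start = ts - ts % bucket_s
        totals[start][0] += value
        totals[start][1] += 1
    return {start: total / count
            for start, (total, count) in sorted(totals.items())}
```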

Trends should align. If CoPilot’s Performance page shows CPU usage climbing over time, your external dashboards should show the same trend. The individual data points may differ, but the direction and magnitude of changes will be consistent.

API Coverage Reference

Data	Metrics API	Status API	Alternative
Gateway up/down	No	Yes	—
Tunnel up/down	No	Yes	—
BGP peer state	No	Yes	—
CPU utilization	Yes	No	—
Memory utilization	Yes	No	—
Swap usage	Yes	No	—
Throughput	Yes	No	—
Packet drops	Yes	No	—
Cloud instance limits	Yes	No	—
Disk utilization	No	No	CoPilot webhook
Packet failure rate	No	No	CoPilot webhook
Underlay connection state	No	No	CoPilot webhook
Memory total	No	No	Cloud provider API or known instance specs

Resetting the API Key

You can reset the API key from CoPilot. The Network Insights API card displays on the Configuration page only if the feature has been enabled.

If you reset the authorization key, the old key is purged from the system and cannot be retrieved. You must generate a new key and update any scripts that use the old key.

Navigate to Settings > Configuration > General.
Scroll to the Features section.
Under Network Insights API Key, click Reset API Key.
Select the checkbox for “I understand the implications,” and then click Reset.
Copy the key and close the confirmation window.

Related Resources

Metrics Monitored for Aviatrix Resources — complete list of all metrics CoPilot tracks
Notifications (Alerts) About Network Events — configuring CoPilot’s built-in alerts
Aviatrix Integration with Prometheus and Grafana — community walkthrough
