External Monitoring with the Metrics and Status APIs
The Aviatrix Metrics and Status APIs allow you to integrate CoPilot with third-party monitoring platforms such as Datadog, Splunk, Grafana, or Prometheus. This guide covers which APIs to use, what data is available, how to configure alerting that stays consistent with CoPilot’s built-in monitoring, and how to avoid common pitfalls like double-counted traffic or noisy metrics.
CoPilot exposes two complementary APIs for external consumption:
Metrics API
Endpoint: /metrics-api/v1/gateways
Performance metrics including CPU, memory, throughput, and packet drops.
Scrape interval: Every 5 minutes
Status API
Endpoint: /status-api/v1/
Availability status for gateways, tunnels, and BGP peerings.
Scrape interval: Every 1 minute
Both APIs support Prometheus text format and JSON output. All data transmissions are encrypted using industry-standard protocols.
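If you are not scraping with a Prometheus server, the text format can be parsed line by line. The following is a minimal sketch rather than a complete exposition-format parser: it ignores timestamps and escaped characters in label values, and the function name is hypothetical.

```python
import re

# Matches sample lines such as: cpu_used_per{gateway="spoke-gw"} 42.5
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'            # optional {label="value",...}
    r'\s+(?P<value>\S+)'                     # sample value
)
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_prometheus_text(text):
    """Return a list of (metric_name, labels_dict, value) tuples.

    Simplified: comment lines (# HELP, # TYPE) and unparseable lines are skipped.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples
```

Each tuple carries the labels as a dictionary, so later processing can group by gateway or interface without re-parsing the text.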
To use the APIs, you need to enable API access in CoPilot and create an authentication key.
The Aviatrix API uses port 443, the same port as the CoPilot UI. Ensure that port 443 is accessible and not restricted by any security groups.
The API key created during this procedure will not be accessible again. Save it in a secure place. If you lose the key, you must reset it.
In CoPilot, navigate to Settings > Configuration > General.
Scroll down to Features and select Metrics API or Status API.
Click Download to download the associated OpenAPI .yaml specification.
The OpenAPI specification files provide complete endpoint documentation, request/response schemas, and examples. Download the latest versions from CoPilot rather than relying on external copies, as the specifications are updated with each CoPilot release.
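Once you have a key, a scraper sends an HTTPS request to the endpoint on port 443. The sketch below only builds the request object. The `Authorization: Bearer` header scheme is an assumption made for illustration and is not confirmed by this guide; check the downloaded OpenAPI specification for the actual authentication scheme. The function name and host value are hypothetical.

```python
import urllib.request


def build_metrics_request(copilot_host, api_key):
    """Build (but do not send) a GET request for the Metrics API endpoint."""
    url = f"https://{copilot_host}/metrics-api/v1/gateways"
    headers = {
        # ASSUMPTION: bearer-token auth. Confirm against the OpenAPI spec.
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(url, headers=headers, method="GET")
```

To send the request, pass the result to `urllib.request.urlopen(request, timeout=10)`. Read the key from a secret store or an environment variable rather than hard-coding it, since it cannot be retrieved again after creation.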
CoPilot collects performance data from every managed gateway once per minute via the Aviatrix Controller. The Metrics API serves a snapshot of the most recent collection, rounded to the nearest 5-minute boundary with a 15-minute lookback window. Values represent point-in-time samples, not averages or aggregations.
Recommended scrape interval: 5 minutes. Polling more frequently returns the same cached data.
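To see why faster polling adds nothing, consider how a timestamp maps to a 5-minute boundary. This is an illustrative sketch only: the exact rounding rule CoPilot applies internally (here, round half up) is an assumption.

```python
SNAPSHOT_STEP_SECONDS = 5 * 60


def nearest_boundary(epoch_seconds, step=SNAPSHOT_STEP_SECONDS):
    """Round a Unix timestamp to the nearest multiple of `step` seconds.

    ASSUMPTION: ties round up; the server-side rule may differ.
    """
    return ((int(epoch_seconds) + step // 2) // step) * step
```

Two polls taken 60 seconds apart, for example at 610 and 670 seconds, both map to the boundary at 600 and would therefore see the same snapshot.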
These metrics are reported once per gateway, with a gateway label.
| Metric | Description | Unit |
| --- | --- | --- |
| cpu_idle | CPU idle time | Percent (0-100) |
| cpu_used_per | CPU utilization | Percent (0-100) |
| cpu_us | CPU time spent in user space | Percent |
| cpu_ks | CPU time spent in kernel space | Percent |
| cpu_wait | CPU time waiting on I/O | Percent |
| memory_available | Memory available to applications | Bytes |
| memory_free | Completely unused memory | Bytes |
| memory_cached | Memory used for disk cache | Bytes |
| memory_buf | Memory used for kernel buffers | Bytes |
| memory_swpd | Memory written to swap | Bytes |
| memory_used_per | Memory utilization | Percent (0-100) |
cpu_used_per and memory_used_per are available in CoPilot 4.32+. On older versions, derive CPU utilization as 100 - cpu_idle. For memory percentage on older versions, you must know the instance type’s total memory from your cloud provider.
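On older versions, the derived values can be computed on the client side. In this sketch the CPU formula comes directly from the text above. The memory formula (total minus memory_available) is an assumption and may not match how CoPilot computes memory_used_per. Both function names are hypothetical.

```python
def cpu_used_from_idle(cpu_idle):
    """Derive CPU utilization (percent) from the cpu_idle metric."""
    return 100.0 - cpu_idle


def memory_used_percent(memory_available_bytes, total_memory_bytes):
    """Estimate memory utilization (percent).

    total_memory_bytes must come from your cloud provider's instance-type
    documentation. ASSUMPTION: used = total - memory_available.
    """
    if total_memory_bytes <= 0:
        raise ValueError("total_memory_bytes must be positive")
    used = total_memory_bytes - memory_available_bytes
    return 100.0 * used / total_memory_bytes
```

For a gateway with 4 GiB of total memory and 1 GiB reported as available, this yields 75 percent.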
On gateways with multiple virtual CPUs, the API reports per-core utilization with gateway and vcpu_name labels.
| Metric | Description | Unit |
| --- | --- | --- |
| vcpu_avg_usage | Average CPU usage for this vCPU | Percent (0-100) |
| vcpu_min_usage | Minimum CPU usage for this vCPU | Percent (0-100) |
| vcpu_max_usage | Maximum CPU usage for this vCPU | Percent (0-100) |
These metrics are useful for identifying core imbalance, for example one vCPU pegged at 100% while others are idle, which may indicate a single-threaded bottleneck.
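A simple imbalance check can flag gateways where some cores are saturated while others sit idle. This is a sketch under stated assumptions: the 90 and 20 percent thresholds are arbitrary starting points, and the vCPU names in the usage note are hypothetical.

```python
def find_core_imbalance(vcpu_avg_usage, hot=90.0, idle=20.0):
    """Return the names of saturated vCPUs if at least one other vCPU is idle.

    vcpu_avg_usage maps vcpu_name -> vcpu_avg_usage (percent) for ONE gateway.
    Returns an empty list when load is evenly spread, whether high or low.
    """
    hot_cores = sorted(n for n, u in vcpu_avg_usage.items() if u >= hot)
    idle_cores = [n for n, u in vcpu_avg_usage.items() if u <= idle]
    if hot_cores and idle_cores:
        return hot_cores
    return []
```

For example, `find_core_imbalance({"cpu0": 99.0, "cpu1": 5.0})` returns `["cpu0"]`, whereas a gateway with every core above 90 percent returns an empty list because that is overall saturation, not imbalance.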
These metrics are reported per gateway and per network interface, with gateway and interface labels.
| Metric | Description | Unit |
| --- | --- | --- |
| rate_received | Inbound throughput | Bits/sec |
| rate_sent | Outbound throughput | Bits/sec |
| rate_total | Combined throughput | Bits/sec |
| rx_drop | Inbound packet drops (cumulative) | Count |
| tx_drop | Outbound packet drops (cumulative) | Count |
| rate_rx_drop | Inbound drop rate | Drops/sec |
| rate_tx_drop | Outbound drop rate | Drops/sec |
| rate_pkt_drop | Combined drop rate | Drops/sec |
| bandwidth_ingress_limit_exceeded | Times ingress bandwidth limit was exceeded (cumulative) | Count |
| pps_limit_exceeded | Times packets-per-second limit was exceeded (cumulative) | Count |
Throughput metrics (rate_sent, rate_received, rate_total) are reported in bits per second, not bytes. The pps_limit_exceeded and bandwidth_ingress_limit_exceeded counters are cumulative counts of packets throttled by the cloud provider’s instance-type network limits (AWS ENA driver). These counters are only present on AWS instances.
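Two conversions come up often when charting these metrics: bits to bytes for throughput, and cumulative counters to per-interval increases. The sketch below shows both. The reset handling assumes a counter restarts from zero (for example, after a gateway restart), which this guide does not confirm, and the function names are hypothetical.

```python
def bits_to_megabytes_per_sec(bits_per_sec):
    """Convert a rate_* throughput value from bits/sec to decimal MB/sec."""
    return bits_per_sec / 8 / 1_000_000


def counter_increase(previous, current):
    """Increase of a cumulative counter between two consecutive scrapes.

    ASSUMPTION: if the counter went backwards it was reset to zero, so the
    current value is taken as the increase since the reset.
    """
    if current < previous:
        return current
    return current - previous
```

Alerting on the increase of pps_limit_exceeded between scrapes, rather than on its absolute value, avoids an alert that stays active indefinitely after a single past event.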
Each gateway reports interface-level metrics for every network interface on the instance. Understanding which interfaces to monitor is essential for accurate dashboards and alerts.
Traffic on tun-* interfaces is a subset of traffic already counted on the underlying eth interface. Including both leads to double-counted bandwidth in dashboards and inflated throughput numbers.
For example, a packet traversing an IPsec tunnel from spoke-gw to transit-gw is counted once on eth0 (encrypted) and once on tun-abc123 (decrypted). Summing both overstates actual bandwidth consumption.
If you need per-tunnel visibility (for example, to identify which specific S2C tunnel is experiencing packet drops) you may collect tun-* metrics separately. In that case, do not sum them with eth interface metrics.
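When aggregating throughput per gateway, the exclusion of tunnel interfaces can be applied at summation time. This is a minimal sketch that assumes samples have already been parsed into (gateway, interface, value) tuples; the function name is hypothetical.

```python
def total_throughput_bps(samples):
    """Sum rate_total per gateway, skipping tun-* interfaces.

    samples: iterable of (gateway, interface, rate_total_bits_per_sec).
    Tunnel traffic is already counted on the underlying eth interface,
    so including it would double-count bandwidth.
    """
    totals = {}
    for gateway, interface, rate in samples:
        if interface.startswith("tun-"):
            continue
        totals[gateway] = totals.get(gateway, 0.0) + rate
    return totals
```

In a Prometheus-based setup, the same effect is usually achieved with a label matcher that excludes interfaces beginning with tun- before summing.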
CoPilot ships with three default alert definitions. This section maps each to the equivalent external alert you can configure in your monitoring platform.
Configure these in every deployment. They cover the most impactful failure conditions.
| Alert Name | Source | Expression | Duration | Severity |
| --- | --- | --- | --- | --- |
| Gateway Down | Status API | status{gateway=~".+"} != 1 | 5 min | Critical |
| Tunnel Down | Status API | status{tunnel=~".+"} == 0 | 5 min | Critical |
| BGP Peer Down | Status API | bgp_status != 1 | 5 min | Critical |
| High CPU | Metrics API | cpu_used_per > 90 | 15 min | Critical |
| Memory Exhaustion | Metrics API | memory_used_per > 90 | 15 min | Critical |
| Swap Active | Metrics API | memory_swpd > 0 | 15 min | Warning |
Adjust the memory threshold based on your gateway instance sizes. A gateway with 2 GB of RAM should alert at a different absolute threshold than one with 16 GB.
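The Duration column means the expression must hold continuously for that long before the alert fires. For platforms without a built-in equivalent, the logic can be sketched as follows. This is an approximation: it assumes evenly spaced samples with none missing, and the function name is hypothetical.

```python
def alert_fires(samples, threshold, duration_seconds, scrape_interval_seconds):
    """Return True if the newest samples all exceed threshold for the duration.

    samples: metric values ordered oldest to newest, one per scrape.
    The first and last of the required samples are duration_seconds apart,
    so a 15-minute duration at a 5-minute interval needs 4 samples.
    """
    needed = duration_seconds // scrape_interval_seconds + 1
    if len(samples) < needed:
        return False
    return all(value > threshold for value in samples[-needed:])
```

With the High CPU rule (cpu_used_per > 90 for 15 minutes) and a 5-minute scrape interval, four consecutive samples above 90 are required; a single dip below the threshold restarts the window.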
The following conditions cannot be monitored via the external APIs. Configure these as CoPilot alert definitions with a webhook notification channel that forwards events to your monitoring platform.
| Condition | Why It Requires a Webhook |
| --- | --- |
| Disk Free < 5% | Disk metrics are not exposed in the Metrics API |
| Packet Failure Rate > 5% | The per_pkt_fail metric is not exposed in the Metrics API |
| Underlay Connection Down | DPD status is tracked internally but not exposed in the Status API |
To configure a webhook channel in CoPilot, navigate to Notifications > Alert Configuration > Channels and create a webhook channel pointing to your monitoring platform’s ingest URL.
CoPilot’s built-in alert engine evaluates metrics every 60 seconds against a real-time internal cache. External monitoring introduces a small lag:
Status alerts (gateway/tunnel/BGP down): Expect approximately 1-2 minutes of lag compared to CoPilot’s built-in alerting.
Performance alerts (CPU, memory, drops): Expect approximately 5-10 minutes of lag due to the Metrics API’s 5-minute caching window.
This lag is by design. CoPilot’s built-in alerts default to 15-minute evaluation windows to avoid alert fatigue from transient issues. If you need more granular data, use CoPilot as a drill-down tool, configure CoPilot webhook-based alerts, or use Syslog via the Aviatrix SIEM connector.
Values shown on CoPilot’s Monitor > Performance page may not exactly match Metrics API output. This is expected because the two serve different purposes.
| Aspect | CoPilot Performance Page | Metrics API |
| --- | --- | --- |
| Purpose | Processed insights and trend analysis | Raw data for external processing |
| Data points | Time-series of aggregated values over a selected range | Single latest raw sample per gateway |
| Aggregation | Configurable: Average (default), Min, or Max | None: returns the most recent raw data point |
| Time range | User-selectable (last hour to 60+ days) | Fixed: latest sample from last 15 minutes |
| Time resolution | Dynamic buckets (1 min for last hour, 30 min for last 24 hours, etc.) | Single snapshot rounded to 5-minute boundary |
| Metrics available | ~90 metrics including derived fields | ~24 curated metrics (4.32+) |
Aggregation is the primary cause of differences. The Performance page displays the average of all 1-minute samples within each time bucket. The Metrics API returns a single raw sample. Dashboards built from the Metrics API will appear noisier than CoPilot’s Performance charts. Your external monitoring platform should apply its own aggregation and smoothing functions.
Trends should align. If CoPilot’s Performance page shows CPU usage climbing over time, your external dashboards should show the same trend. The individual data points may differ, but the direction and magnitude of changes will be consistent.
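To bring an external chart closer to what the Performance page shows, raw scraped samples can be averaged into fixed time buckets. The sketch below illustrates the idea. It will not reproduce CoPilot's values exactly, because CoPilot averages 1-minute samples while the Metrics API supplies one sample every 5 minutes. The function name is hypothetical.

```python
from statistics import mean


def bucket_averages(samples, bucket_seconds):
    """Average raw samples into fixed-width time buckets.

    samples: iterable of (epoch_seconds, value) pairs in any order.
    Returns a list of (bucket_start_epoch, average_value), sorted by time.
    """
    buckets = {}
    for timestamp, value in samples:
        start = (timestamp // bucket_seconds) * bucket_seconds
        buckets.setdefault(start, []).append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]
```

Using a 30-minute bucket (1800 seconds) mirrors the resolution the Performance page uses for a 24-hour view. Most monitoring platforms provide an equivalent built-in function, which is preferable to custom code where available.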
You can reset the API key from CoPilot. The Network Insights API card displays on the Configuration page only if the feature has been enabled.
If you reset the authorization key, the old key is purged from the system and cannot be retrieved. You must generate a new key and update any scripts that use the old key.
Navigate to Settings > Configuration > General.
Scroll to the Features section.
Under Network Insights API Key, click Reset API Key.
Select the checkbox for “I understand the implications,” and then click Reset.