The AppGate ZTNA system generates metrics for SNMP and Prometheus in a unified way internally. When a metric is collected, it is programmatically formatted for both types of output, so even though Prometheus is the primary method, there will (almost) always be an exactly matching metric available in SNMP.
Prometheus
Prometheus is an open-source project managed by the Cloud Native Computing Foundation (CNCF). It was designed to collect metrics about your application and infrastructure. A Prometheus library has been built into the AppGate ZTNA appliance. This exposes an HTTP endpoint from which Prometheus can scrape metrics. The scraped metrics are then stored, and rules are applied to aggregate the data or generate time series from it. Grafana is the default means of visualizing this data.
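To make the scrape step more concrete, the following is a minimal sketch of how the text returned by such an endpoint could be split into metric name, labels and value. It assumes the standard Prometheus text exposition format; the function name `parse_exposition` and the sample data are invented for illustration, and fetching the text is left out because the endpoint address and port depend on your deployment.

```python
import re

# Matches sample lines such as: metric_name{label="value",...} 12.5
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# Matches one label pair inside the braces (escapes are kept as-is, not decoded).
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/metadata lines
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything that does not look like a sample
        labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Invented sample data; real output carries more labels and many more metrics.
SAMPLE = """\
# sample data for illustration only
apn_memory{measure="used",appliance_name="gw-1"} 1024
apn_function_sessions{appliance_name="gw-1"} 7
"""

for name, labels, value in parse_exposition(SAMPLE):
    print(name, labels, value)
```

In practice a Prometheus server does this parsing for you; the sketch is only useful for ad hoc inspection of what an appliance exposes.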
Understanding Prometheus metrics
The first part of the name reflects the source of the metric:
apn = Appliance
ctr = Controller
gw = Gateway
ptl = Portal
The remainder of the name points to the daemon/feature which is being monitored.
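The naming convention above can be sketched as a small helper that splits a metric name into its source and the daemon/feature part. The function name `split_metric_name` is hypothetical, and names whose prefix is not in the list above are reported as "unknown" rather than guessed.

```python
# Prefixes as listed above.
SOURCES = {
    "apn": "Appliance",
    "ctr": "Controller",
    "gw": "Gateway",
    "ptl": "Portal",
}


def split_metric_name(name):
    """Split a metric name into (source, remainder) using its first segment."""
    prefix, _, remainder = name.partition("_")
    return SOURCES.get(prefix, "unknown"), remainder


print(split_metric_name("gw_vpn_sessions"))
print(split_metric_name("ctr_database_size"))
```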
The color indicates the relevance of the metric:
Green = metrics are likely to be used in most Collectives.
Orange = metrics will be relevant in specific circumstances.
Black = metrics are somewhat specialized and are unlikely to be of value in a production environment.
Grey = metrics that have been deprecated and should not be used.
Labels are used with most metrics to cover the possibility of there being multiple results available from one metric: name, measure, type, usage, etc. The most relevant labels for each metric are detailed in the table below.
There are seven standard labels included in all metrics. These are: appliance_id="", appliance_name="", collective_id="", collective_name="", site_id="", site_name="", appliance_version="".
These can be individually excluded in the Prometheus Exporter settings for an appliance or the Metrics Aggregator. This helps produce more consistent output (e.g. you can ignore the appliance version) and reduces the volume of the data (the _id labels can be long and bloat the metrics considerably).
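The effect of excluding standard labels can be illustrated with a short sketch. It does not show the appliance setting itself; it only mimics the outcome on invented sample data. The helper names `drop_labels` and `series_key` are hypothetical.

```python
# The seven standard labels listed above.
STANDARD_LABELS = (
    "appliance_id", "appliance_name", "collective_id", "collective_name",
    "site_id", "site_name", "appliance_version",
)


def drop_labels(labels, excluded):
    """Return a copy of labels without the excluded label names."""
    return {k: v for k, v in labels.items() if k not in excluded}


def series_key(name, labels):
    """Build the series identity string: name{label="value",...}."""
    inner = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{inner}}}"


# Invented values: the same appliance before and after an upgrade.
before = {"appliance_name": "gw-1", "appliance_id": "0000-aaaa", "appliance_version": "6.5.1"}
after = {"appliance_name": "gw-1", "appliance_id": "0000-aaaa", "appliance_version": "6.5.3"}
excluded = {"appliance_id", "appliance_version"}

# With all labels kept, the upgrade produces a new, longer series identity.
print(series_key("apn_memory", before))
print(series_key("apn_memory", after))
# With the two labels excluded, both samples belong to one shorter series.
print(series_key("apn_memory", drop_labels(before, excluded)))
```

Excluding appliance_version keeps a series continuous across upgrades, and excluding the _id labels shortens every exposed line.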
For certain metrics there are some suggested default thresholds which can be used to trigger alerts and take some corrective action.
| Metric | Introduced or updated | Threshold reference | Type | Comment | Labels of interest |
|---|---|---|---|---|---|
| apn_active_connections | | | number | The number of active connections the operating system is tracking. | proto: {tcp, udp, icmp} |
| apn_active_connections_max | | | number | Maximum number of active connections allowed. | ~ |
| apn_audit_events | | | counter | Quantity of local audit logs generated. Can be useful to show additional metrics not otherwise available. | type: of audit log |
| apn_audit_logs | | | counter | Monitors the flow of audit logs from the appliance. Logs are queued (from daemons), read (by an external party) and, after receipt is confirmed, removed. Queued, read, received and removed should all be the same. Dropped and queued on startup would normally be zero. Received or removed are very useful to show the appliance is properly connected and doing work. | type: {dropped, queued, queued on startup, read, received, removed} |
| apn_audit_queue_maxlimit | 6.5.3 | | gauge | The limit of the number of audits in the source database. | ~ |
| apn_audit_queue_total | 6.5.3 | | gauge | The number of audits on the source appliance not successfully forwarded. | ~ |
| apn_certificate_days_remaining | | | days | Days remaining before the given certificate expires. | type: {CA, Portal access, Portal https, URL access, admin https, appliance} |
| apn_cpu_usage_percent | | | bytes / % | Current CPU usage. Matters more for Gateways which will eventually throttle users. | measure: {available, free, percent, used, total} |
| apn_disk | | | bytes / % | Current disk usage. Important to make sure Controllers do not run out of disk. | measure: {free, percent, used, total} |
| apn_disk_partition_statistic | | | bytes / % | Current disk partition usage. Important to make sure Controllers do not run out of disk. | name: {data, overlay, state}, measure: {free, percent, used, total} |
| apn_dns_cache_entries | | | number | The number of cached records (where the request succeeded) and the number of denials (where the request was denied: NXDOMAIN, etc.). | purpose: the instance of CoreDNS, type: {success, denial} |
| apn_dns_requests_total | | | counter | The DNS requests being handled by CoreDNS. | purpose: the instance of CoreDNS, type: the type of request, e.g. AAAA |
| apn_dns_responses_total | | | counter | The number of responses, including standard DNS response codes such as NXDOMAIN. | purpose: the instance of CoreDNS, response_code: the DNS response code |
| apn_function_sessions | | | number | The number of current client sessions. Same as shown in the appliance health status screen. | ~ |
| apn_function_status | | | number | The reported health of each of the appliance functions. | name: of the function reporting its status, status: {healthy (0), busy (1), warning (2), error (3)} |
| apn_function_suspended | | | number | The reported health of each of the appliance functions. | name: of the function reporting its status, reason: {ram, queue, heap, sessions, manual}, status: {normal (0), suspended (1)} |
| apn_image_size | | | bytes | Size of the image partition (where the SDP code lives). | ~ |
| apn_memory | | | bytes | Current memory usage. Gateways need monitoring as usage is proportional to users and conditions. | measure: {available, free, percent, total, used} |
| apn_nat_traversal_relayed_data_throughput_bits | | | bps | Average NAT traversal relayed data throughput over the given timeframe. Default timeframe is ten minutes. | |
| apn_network_interface_statistic | | | counter | Network traffic measure which includes the bytes and packets. | ifname: of the interface, measure: the type (and direction) of traffic such as "rx_packets" |
| apn_network_interface_speed | | | Mb/s | Current network traffic throughput. | ifname: of the interface, measure: {tx_speed, rx_speed} |
| apn_proxy_protocol_messages | | | counter | Proxy protocol message counter. | version: {v1, v2} |
| apn_replicationd_job_entity_stat | | | counter | Replicated entity metric value. | |
| apn_replicationd_job_stat | | | gauge | Replication job metric value. | |
| apn_snat | | | number | Number of UDP/TCP ports (or ICMP types) used by a specific source_ip. Used to make sure enough IP aliases have been configured. | source_ip: generating the SNAT traffic |
| apn_spa_packet_authorization_time | | | ms | The time it takes to validate an SPA packet. | measure: {maximum, average, minimum}, proto: {tcp-tls, udp-dns, udp-dtls} |
| apn_spa_packets | | | counter | Detailed metrics for SPA (and non-SPA) packets received. These are mainly specialist metrics for development use. In normal use expect to see mostly type: authorized. For a view of unwanted connection attempts, use type: invalid for proto: tcp, and type: ignored for proto: udp. | type: of action taken, proto: {tcp, tcp-tls, udp, udp-dns, udp-dtls} |
| apn_spa_replay_attack_cache_entries | | | number | The cache keeps a record of all valid SPA packets to prevent replay attacks. This can consume up to 1 GB of RAM so can be interesting on smaller appliances. | ~ |
| apn_status | | | number | The summary health of the appliance as reported on the Appliance Health Page. This status differs from the Appliance status reported in the Admin UI, which reflects the Appliance from the point of view of a single Controller. This primarily affects the Admin UI Appliance status Offline. That will be reported as warning (if one Controller is unreachable), error (if no Controller is reachable) or not at all (if the Appliance is dead). | status: {healthy (0), busy (1), warning (2), error (3)} |
| apn_volume_number | | | 1 or 2 | The active volume. The disk is split into 2 volumes: one for the current (live) version and one for the previous version. | ~ |
| ctr_admin_authentication | | | counter / ms | Time and stats for admin authentications to Controllers (mainly LDAP, SAML, OATH, etc.) excluding MFA. | measure: {success, error, total, average_time} |
| ctr_admin_authorization | | | counter / ms | Time and stats for admin authorizations by Controllers (includes running assignment criteria scripts, signing tokens, etc.). | measure: {success, error, total, average_time} |
| ctr_admin_mfa | | | counter / ms | Time and stats for use of MFA at admin sign-in. | measure: {success, error, total, average_time} |
| ctr_client_authentication | | | counter / ms | Time and stats for client authentications to Controllers that exclude MFA at sign-in. | measure: {success, error, total, average_time} |
| ctr_client_authorization | | | counter / ms | Time and stats for client authorizations by Controllers (includes running assignment criteria scripts, ctr_client_evaluate_all_policies, collecting and grouping them to Sites, signing tokens, etc.). | measure: {success, error, total, average_time} |
| ctr_client_average_time | | | ms | All average times for API calls for ctrClient and ctrAdmin values. | |
| ctr_client_csr | | | counter / ms | Time and stats for the creation of VPN certificates for use by the client if required. | measure: {success, error, total, average_time} |
| ctr_client_enter_password | | | counter / ms | Time and stats for user interactions requiring a password. | measure: {success, error, total, average_time} |
| ctr_client_evaluate_all_policies | | | counter / ms | Time and stats for evaluating the policies looking for any matches (part of authorization). | measure: {success, error, total, average_time} |
| ctr_client_evaluate_auto_update_criteria_script | | | counter / ms | Time and stats for evaluating the auto-update criteria scripts looking for any matches if configured. | measure: {success, error, total, average_time} |
| ctr_client_mfa | | | counter / ms | Time and stats for user interactions requiring MFA. | measure: {success, error, total, average_time} |
| ctr_client_new_ip_allocation | | | counter / ms | Time and stats for the allocation of an IP address to a client at sign-in. | measure: {success, error, total, average_time} |
| ctr_client_risk_engine_response | | | ms | Time it takes for the ZTP risk engine to respond. | measure: {average time} |
| ctr_client_sign_in_with_mfa | | | counter / ms | Time and stats for client authentications to Controllers that include MFA at sign-in. | measure: {success, error, total, average_time} |
| ctr_database_conflicts | | | - | Specialist metrics for development use. | ~ |
| ctr_database_node_state | | | - | Specialist metrics for development use. | ~ |
| ctr_database_raft_state | | | - | Specialist metrics for development use. | ~ |
| ctr_database_replication | | | 1 / 0 | Indicates if all of the Controllers can replicate with each other (1: yes, 0: no). | ~ |
| ctr_database_size | | | bytes | Size of the Controller database. | ~ |
| ctr_database_replication_slot_replay_lag | | | number | Quantity of records waiting to be synchronized with each of the other Controllers. Should be minimal before attempting an upgrade. | name: the name(s) of the other Controller(s) |
| ctr_evaluate_user_claim_script | | | counter | Time and stats for executing the user claim script(s). | |
| ctr_ip_pool | | | number | IP pool statistics. IPv4 will be more useful than IPv6 (as the latter is unlikely to run out). | usage: {reserved, current, total} |
| ctr_license | | | number | Provides the currently installed license and usage information. | measure: {entitled, used} |
| ctr_license_days_remaining | | | days | Controllers report the days remaining before the nearest license expiry date. | ~ |
| ctr_memory_heap | | | bytes | This relates to the Java engine and shows the amount of reserved memory. | measure: {initial, used, committed, max} |
| ctr_policy_evaluator | | | bytes | This relates to the JavaScript engine and shows the size of the cache being used. | measure: {cacheSize} |
| ctr_threads | | | number | This relates to the Java engine and shows the number of concurrent threads. | ~ |
| gw_dns_forwarder_cache | | | counter | Details the activity of the DNS forwarder cache. | type: {hit, miss, insert, expired, count} |
| gw_dns_forwarder_domain | | | counter | Count of the top domains that the DNS Proxy is resolving. | name: the names of the top domains |
| gw_dns_forwarder_query | | | counter | The dns_forwarder uses cache so expect most results to be success and cache. | result: {success, cache, nodata, timeout, notfound}, type: {a, aaaa} |
| gw_dns_resolver_cache | | | counter | Details the activity of the DNS resolver cache. | result: {hit, miss, nodata} |
| gw_event_queue_size | | | number | This relates to the sessiond event queue. The length of the queue varies; see Instance sizing. On lightly loaded Gateways, current may not yield very useful data as it is hit-and-miss as to whether there is anything in the event queue at the time of the sample. | measure: {current, length, max_used} |
| gw_fallback_site_usage | | | number | The number of users on a given Site that are there because of the fallback feature. | name: name of the Site that caused the user to fall back |
| gw_ha_interface | | | counter | This monitors the (inside) network HA related traffic in non-SNAT environments. | type: {arp, garp-replies, neighbor, unsolicited-neighbor-advertisements} |
| gw_http_action | | | counter | Actions performed within the URL access function. | type: {allowed, blocked, reported} |
| gw_http_connection | | | counter | The number of HTTP connections the Gateway handles. | type: {accepts, handled} |
| gw_http_open_connection | | | number | Current number of connections. | type: {reading, writing, waiting, active} |
| gw_http_requests | | | counter | Number of HTTP requests. | ~ |
| gw_name_resolver | | | number | This monitors the current settings/behavior of the name resolver. | kind: aws, Azure, etc., measure: {update_interval, effective_update_interval, unstable_names} |
| gw_name_resolver_cache_count | | | counter | The usage of the resolver's cache. | kind: aws, Azure, etc., measure: {evictions, insertions, misses} |
| gw_name_resolver_cache_ttl | | | number | Current configuration of the TTL for the resolver. | ~ |
| gw_name_resolver_cache_value | | | number | The cache's time to live setting. | kind: aws, Azure, etc., measure: {ttl} |
| gw_name_resolver_label | | | text | Name of the configured resolver. | ~ |
| gw_name_resolver_names_missing_resolver | | | number | The number of resource names configured on Sites where the required resolver has not been configured. | ~ |
| gw_name_resolver_value | | | number | The number of resources discovered. | measure: the types of resource, e.g. instances, kind: the type of resolver, e.g. gcp |
| gw_policy_evaluator | | | bytes | This relates to the JavaScript engine and shows the size of the cache being used. | measure: {cacheSize} |
| gw_session_dropped_signin | | | counter | The number of sign-in attempts that were dropped because of resource shortages. | reason: {high_java_heap_usage} |
| gw_session_event_count | | | counter | Number of events handled by the system. Can be used with _timing to design your own averages. | type: {domainResolveUpdate, handleRemedy, handleTunnelDisconnected, login, nameResolveUpdate, reEvaluateConditions, reGenerateExpressConnectorRules, updateTokens, validateTokens} |
| gw_event_queue_size_session | | | number | The size of the event handling queue. gw_event_queue_size_session{measure="max_used"} can be useful on lightly loaded systems where "current" might report 0 much of the time. | measure: {current, length, max_used} |
| gw_session_event_timing | | | number / counter | The amount of time it takes to process the different types of events. | measure: {average, accumulated}, type: {domainResolveUpdate, handleRemedy, handleTunnelDisconnected, login, nameResolveUpdate, reEvaluateConditions, reGenerateExpressConnectorRules, updateTokens, validateTokens} |
| gw_session_js_exectime | | | ms | Script execution time, which may include external API calls. Effectively this provides more detail when gw_session_event_timing shows a problem. | type: {Entitlement, Condition, appshortcutscript} |
| gw_sessiond_heap | | | bytes | Sessiond uses Java, which can consume a lot of RAM but is somewhat linked to the amount of RAM available. This metric allows the Java usage to be monitored. | measure: {initial, used, committed, max} |
| gw_sessiond_thread_count | | | number | All Java process threads. This is somewhat linked to the amount of RAM available and also reflects the number of events the Gateway is handling. | type: {current, peak} |
| gw_session_traffic_active | | | number | Number of active sessions for this traffic type. | type: {direct, directViaNatt, relay} |
| gw_session_traffic_allsitesup_averagetime | | | ms | Average uptime for all Sites for this traffic type. | type: {direct, directViaNatt, relay} |
| gw_session_traffic_allsitesup_maxtime | | | ms | Maximum uptime for all Sites for this traffic type. | type: {direct, directViaNatt, relay} |
| gw_session_traffic_allsitesup_mintime | | | ms | Minimum uptime for all Sites for this traffic type. | type: {direct, directViaNatt, relay} |
| gw_session_traffic_roundtrip_averagetime | | | ms | Average round-trip time for this traffic type. | type: {direct, directViaNatt, relay} |
| gw_session_traffic_roundtrip_maxtime | | | ms | Maximum round-trip time for this traffic type. | type: {direct, directViaNatt, relay} |
| gw_session_traffic_roundtrip_mintime | | | ms | Minimum round-trip time for this traffic type. | type: {direct, directViaNatt, relay} |
| gw_token_size | | | bytes | The size of the payload sent from the client to the Gateway. There is a limit of 16 MB for this. | measure: {average, min, max} |
| gw_vpn_client_metric | | | number | Client connectivity metrics relating to a given Site. | name: of the measure, measure: {min, max, avg}, client_site_name: of the reporting Site |
| gw_vpn_memory_usage | | | number | Overall memory usage of the firewall daemon(s). | ~ |
| gw_vpn_resolved_actions | | | number | Current number of resolved actions in use by the firewall. Monitored actions check the end-point. | measure: {monitored, unmonitored} |
| gw_vpn_resolved_actions_size | | | bytes | Memory used to store resolved actions. | ~ |
| gw_vpn_rtt_sample | | | ms | Unaggregated round-trip time per vpnd instance. | traffic mode: {direct, direct-via-natt, relay}, extreme: {highest, lowest}, dn: a given user's distinguished name |
| gw_vpn_rules | | | number | Number of active VPN rules in place. | proto: {ipv4, ipv6} |
| gw_vpn_rules_size | | | bytes | Memory used to store the current rules held in the firewall. | proto: {ipv4, ipv6} |
| gw_vpn_sessions | | | number | Number of (user) sessions being serviced by the Gateway. | |
| gw_vpn_states | | | number | Number of connection states being held in the firewall. | proto: {ipv4, ipv6} |
| gw_vpn_states_size | | | bytes | Memory used to store the current states held in the firewall. | proto: {ipv4, ipv6} |
| lf_audit_logs | 6.5.1 | | counter | Quantity of audit logs forwarded to the LogForwarder. | source: hostname of source appliance, type: audit event type |
| log_audit_logs | 6.5.1 | | counter | Quantity of audit logs forwarded to the LogServer. | source: hostname of source appliance, type: audit event type |
| ptl_client | | | number | The status of the Portal's clients. 'running' is the number of available clients, which is based on the size of the appliance. | type: {free, pending_removal, running, used} |
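
The table notes for apn_audit_logs that queued, read, received and removed should all be the same, and that dropped and queued on startup would normally be zero. A minimal sketch of that check is shown here. The function name `check_audit_flow` is hypothetical, the counts are invented, and the label values are written as they appear in the table, so the exact strings on an appliance may differ. Counters sampled mid-flow can differ briefly, so treat a single mismatch as a hint rather than an alarm.

```python
def check_audit_flow(counts):
    """Return a list of findings for one appliance's apn_audit_logs counters.

    counts maps the value of the 'type' label to the current counter value.
    """
    problems = []
    # These four counters are expected to track each other.
    flow = {k: counts.get(k, 0) for k in ("queued", "read", "received", "removed")}
    if len(set(flow.values())) > 1:
        problems.append(f"flow counters differ: {flow}")
    # These two counters are expected to stay at zero.
    for key in ("dropped", "queued on startup"):
        value = counts.get(key, 0)
        if value != 0:
            problems.append(f"{key} is {value}, expected 0")
    return problems


# Invented example values.
healthy = {"queued": 500, "read": 500, "received": 500, "removed": 500,
           "dropped": 0, "queued on startup": 0}
stalled = {"queued": 500, "read": 500, "received": 420, "removed": 420,
           "dropped": 3, "queued on startup": 0}

print(check_audit_flow(healthy))
print(check_audit_flow(stalled))
```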