Prometheus metrics details

The AppGate ZTNA system generates metrics for SNMP and Prometheus in a unified way internally. When a metric is collected, it is programmatically formatted for both types of output, so even though Prometheus is the primary method, there will (almost) always be an exactly matching metric available in SNMP.

Prometheus

Prometheus is an open-source project managed by the Cloud Native Computing Foundation (CNCF). It was designed to collect metrics about your application and infrastructure. A Prometheus library has been built into the AppGate ZTNA appliance. This exposes an HTTP endpoint from which Prometheus can scrape metrics. The scraped metrics are then stored, and rules are applied to aggregate them or generate time series from the data. Grafana is the default means of visualizing this data.
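
To make the scrape output concrete, the sketch below parses a few lines of the Prometheus text exposition format into name, labels and value. The sample payload is hypothetical (the values and the appliance name gw-1 are invented), and the parser is deliberately minimal: it skips comment lines, ignores timestamps and does not unescape label values.

```python
import re

# Hypothetical sample of what a scrape might return; the values are made up.
SAMPLE = '''\
# HELP apn_memory Current memory usage.
# TYPE apn_memory gauge
apn_memory{appliance_name="gw-1",measure="used"} 2147483648
apn_memory{appliance_name="gw-1",measure="total"} 8589934592
apn_function_sessions{appliance_name="gw-1"} 42
'''

# metric_name, optional {labels}, whitespace, value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# label_name="label value"
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse text exposition lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simple parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ''))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

In practice a Prometheus server does this parsing itself; the sketch is only meant to show how a metric name, its labels and its value relate to each other in the scraped text.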

Understanding Prometheus metrics

The first part of the name reflects the source of the metric:

apn = Appliance

ctr = Controller

gw = Gateway

ptl = Portal

The remainder of the name points to the daemon/feature which is being monitored.
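
The naming convention above can be sketched as a small lookup. The function name split_metric_name is hypothetical; the mapping covers only the four prefixes listed here, so any other prefix is reported as unknown rather than guessed.

```python
# Prefix-to-source mapping taken from the list above.
SOURCES = {
    "apn": "Appliance",
    "ctr": "Controller",
    "gw": "Gateway",
    "ptl": "Portal",
}


def split_metric_name(name):
    """Return (source, remainder), or (None, name) if the prefix is unknown."""
    prefix, sep, remainder = name.partition("_")
    if sep and prefix in SOURCES:
        return SOURCES[prefix], remainder
    return None, name


if __name__ == "__main__":
    print(split_metric_name("gw_vpn_sessions"))
    print(split_metric_name("ctr_ip_pool"))
```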

The color indicates the relevance of the metric:

Green = metrics are likely to be used in most Collectives.

Orange = metrics will be relevant in specific circumstances.

Black = metrics are somewhat specialized and are unlikely to be of value in a production environment.

Grey = metrics that have been deprecated and should not be used.

Most metrics use labels to distinguish the multiple results that a single metric can return: name, measure, type, usage, etc. The most interesting label(s) for each metric are listed in the table below.

There are seven standard labels included in all metrics. These are: appliance_id="", appliance_name="", collective_id="", collective_name="", site_id="", site_name="", appliance_version="".

These can be individually excluded in the Prometheus Exporter settings for an appliance or the Metrics Aggregator. This helps produce more consistent output (e.g. you can ignore the appliance version) and reduces the volume of data (the _id labels can be long and bloat the metrics considerably).
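
The effect of excluding a standard label can be illustrated on a sample's label set. This is only a sketch of the outcome, not of the exporter setting itself (its configuration format is not shown here); the function name drop_labels and the sample values are hypothetical.

```python
# The seven standard labels named in the text.
STANDARD_LABELS = (
    "appliance_id", "appliance_name", "collective_id", "collective_name",
    "site_id", "site_name", "appliance_version",
)


def drop_labels(labels, excluded):
    """Return a copy of a label dict without the excluded standard labels."""
    unknown = set(excluded) - set(STANDARD_LABELS)
    if unknown:
        raise ValueError(f"not a standard label: {sorted(unknown)}")
    return {k: v for k, v in labels.items() if k not in excluded}


if __name__ == "__main__":
    # Hypothetical label sets for the same appliance before and after an upgrade.
    before = {"appliance_name": "gw-1", "appliance_version": "6.5.1", "measure": "used"}
    after = {"appliance_name": "gw-1", "appliance_version": "6.5.3", "measure": "used"}
    # With appliance_version excluded, both scrapes carry the same label set.
    print(drop_labels(before, ["appliance_version"]) == drop_labels(after, ["appliance_version"]))
```

Prometheus treats each distinct label set as a separate time series, which is why dropping a label that changes over time (such as the version) gives more consistent output.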

For certain metrics there are some suggested default thresholds which can be used to trigger alerts and take some corrective action.
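
As a rough illustration of threshold-based alerting, the sketch below checks parsed samples against a small rule table. The limits (30 days, 80 percent) are hypothetical placeholders chosen for the example and are NOT the suggested defaults; the rule structure and the function name evaluate are also assumptions.

```python
import operator

# Hypothetical thresholds for illustration only; they are NOT the product's
# suggested defaults.
# (metric name, label filter, comparison, limit, message)
RULES = [
    ("apn_certificate_days_remaining", {}, operator.lt, 30, "certificate expires soon"),
    ("apn_disk", {"measure": "percent"}, operator.gt, 80, "disk usage high"),
]


def evaluate(samples, rules=RULES):
    """Return alert strings for (name, labels, value) samples that breach a rule."""
    alerts = []
    for name, labels, value in samples:
        for rule_name, wanted, compare, limit, message in rules:
            if name != rule_name:
                continue
            # Every label in the rule's filter must match the sample.
            if any(labels.get(k) != v for k, v in wanted.items()):
                continue
            if compare(value, limit):
                alerts.append(f"{message}: {name}{labels} = {value}")
    return alerts


if __name__ == "__main__":
    samples = [
        ("apn_certificate_days_remaining", {"type": "CA"}, 12.0),
        ("apn_disk", {"measure": "percent"}, 55.0),
    ]
    for alert in evaluate(samples):
        print(alert)
```

In a real deployment this logic would more likely live in Prometheus alerting rules; the Python form is used here only to show how a metric, a label filter and a limit combine.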

Metric	Introduced or updated	Threshold reference	Type	Comment	Labels of interest
apn_active_connections		A	number	The number of active connections the operating system is tracking.	proto: {tcp, udp, icmp}
apn_active_connections_max			number	Maximum number of active connections allowed.	~
apn_audit_events			counter	Quantity of local audit logs generated. Can be useful to show additional metrics not otherwise available.	type: of audit log
apn_audit_logs			counter	Monitors the flow of audit logs from the appliance. Logs are queued (from daemons), read (by an external party) and, after receipt is confirmed, removed. Queued, read, received and removed should all be the same. Dropped and queued on startup would normally be zero. Received or removed are very useful to show the appliance is properly connected and doing work.	type: {dropped, queued, queued on startup, read, received, removed}
apn_audit_queue_maxlimit	6.5.3		gauge	The limit of the number of audits in the source database.	~
apn_audit_queue_total	6.5.3		gauge	The number of audits on the source appliance not successfully forwarded.	~
apn_certificate_days_remaining		M	days	Days remaining before the given certificate expires	type: {CA, Portal access, Portal https, URL access, admin https, appliance}
apn_cpu_usage_percent		B	bytes | %	Current CPU usage. Matters more for Gateways which will eventually throttle users.	measure: {available, free, percent, used, total}
apn_disk		C	bytes | %	Current disk usage. Important to make sure Controllers do not run out of disk.	measure: {free, percent, used, total}
apn_disk_partition_statistic		D	bytes | %	Current disk partition usage. Important to make sure Controllers do not run out of disk.	name: {data, overlay, state}, measure: {free, percent, used, total}
apn_dns_cache_entries			number	The number of cached records (where the request was a success) and the number of denials (where the request was denied: NXDOMAIN, etc.).	purpose: the instance of CoreDNS, type: {success, denial}
apn_dns_requests_total			counter	The DNS requests being handled by CoreDNS.	purpose: the instance of CoreDNS, type: the type of request, e.g. AAAA
apn_dns_responses_total			counter	The number of responses, including standard DNS response codes such as NXDOMAIN.	purpose: the instance of CoreDNS, response_code: the DNS response code
apn_function_sessions			number	The number of current client sessions. Same as shown in the appliance health status screen.	~
apn_function_status			number	The reported health of each of the appliance functions.	name: of the function reporting its status, status: {healthy (0), busy (1), warning (2), error (3)}
apn_function_suspended			number	The reported suspended state of each of the appliance functions.	name: of the function reporting its status, reason: {ram, queue, heap, sessions, manual}, status: {normal (0), suspended (1)}
apn_image_size			bytes	Size of the image partition (where the SDP code lives).	~
apn_memory		E	bytes	Current memory usage. Gateways need monitoring as usage is proportional to users and conditions.	measure: {available, free, percent, total, used}
apn_nat_traversal_relayed_data_throughput_bits			bps	Average NAT traversal relayed data throughput over the given timeframe. Default timeframe is ten minutes.
apn_network_interface_statistic			counter	Network traffic measure which includes the bytes and packets	ifname: of the interface, measure: the type (and direction) of traffic such as "rx_packets".
apn_network_interface_speed			Mb/s	Current network traffic throughput.	ifname: of the interface, measure: {tx_speed, rx_speed}
apn_proxy_protocol_messages			counter	Proxy protocol message counter.	version: {v1, v2}
apn_replicationd_job_entity_stat			counter	Replicated entity metric value
apn_replicationd_job_stat			gauge	Replication job metric value
apn_snat			number	Number of UDP/TCP ports (or ICMP types) used by specific source_ip. Used to make sure enough IP aliases have been configured.	source_ip: generating the SNAT traffic
apn_spa_packet_authorization_time			mS	The time it takes to validate an SPA packet.	measure: {maximum, average, minimum}, proto: {tcp-tls, udp-dns, udp-dtls}
apn_spa_packets			counter	Detail metrics for SPA (and non-SPA) packets received. These are mainly specialist metrics for development use. In normal use expect to see mostly type: authorized. For a view of unwanted connection attempts, use type: invalid for proto: tcp, and type: ignored for proto: udp.	type: of action taken, proto: {tcp, tcp-tls, udp, udp-dns, udp-dtls}
apn_spa_replay_attack_cache_entries			number	The cache keeps a record of all valid SPA packets to prevent replay attacks. This can consume up to 1GB of RAM so can be interesting on smaller appliances.	~
apn_status			number	The summary health of the appliance as reported on the Appliance Health Page. This status differs from the Appliance status reported in the Admin UI, which reflects the Appliance from the point of view of a single Controller. This primarily affects the Admin UI Appliance status Offline. That will be reported as warning (if one Controller is unreachable), error (if no Controller is reachable) or not at all (if the Appliance is dead).	status: {healthy (0), busy (1), warning (2), error (3)}
apn_volume_number			1 or 2	The active volume - the disk is split into 2 volumes. One for the current (live) version and one for the previous version.	~
ctr_admin_authentication			counter | mS	Time and stats for admin authentications to Controllers (mainly LDAP, SAML, OATH, etc) excluding MFA.	measure: {success, error, total, average_time}
ctr_admin_authorization			counter | mS	Time and stats for admin authorizations by Controllers (includes running assignment criteria scripts, signing tokens etc).	measure: {success, error, total, average_time}
ctr_admin_mfa			counter | mS	Time and stats for use of MFA at admin sign-in	measure: {success, error, total, average_time}
ctr_client_authentication		K	counter | mS	Time and stats for client authentications to Controllers that exclude MFA at sign-in.	measure: {success, error, total, average_time}
ctr_client_authorization		K	counter | mS	Time and stats for client authorizations by Controllers (includes running assignment criteria scripts, ctr_client_evaluate_all_policies, collecting and grouping them to Sites, signing tokens etc).	measure: {success, error, total, average_time}
ctr_client_average_time			mS	Overall average time for API calls across the ctrClient and ctrAdmin values
ctr_client_csr			counter | mS	Time and stats for the creation of VPN certificates for use by the client if required.	measure: {success, error, total, average_time}
ctr_client_enter_password			counter | mS	Time and stats for use of user interactions requiring Password.	measure: {success, error, total, average_time}
ctr_client_evaluate_all_policies			counter | mS	Time and stats for evaluating the policies looking for any matches (part of authorization).	measure: {success, error, total, average_time}
ctr_client_evaluate_auto_update_criteria_script			counter | mS	Time and stats for evaluating the auto-update criteria scripts looking for any matches if configured.	measure: {success, error, total, average_time}
ctr_client_mfa			counter | mS	Time and stats for use of user interactions requiring MFA.	measure: {success, error, total, average_time}
ctr_client_new_ip_allocation			counter | mS	Time and stats for the allocation of an IP address to a client at sign-in	measure: {success, error, total, average_time}
ctr_client_risk_engine_response			mS	Time it takes for the ZTP risk engine to respond	measure: {average time}
ctr_client_sign_in_with_mfa			counter | mS	Time and stats for client authentications to Controllers that include MFA at sign-in.	measure: {success, error, total, average_time}
ctr_database_conflicts			-	Specialist metrics for development use	~
ctr_database_node_state			-	Specialist metrics for development use	~
ctr_database_raft_state			-	Specialist metrics for development use	~
ctr_database_replication			1 | 0	Indicates if all of the Controllers can replicate with each other (1: yes, 0: no).	~
ctr_database_size			bytes	Size of the Controller database.	~
ctr_database_replication_slot_replay_lag			number	Quantity of records waiting to be synchronized with each of the other Controllers. Should be minimal before attempting an upgrade.	name: the name(s) of the other Controller(s)
ctr_evaluate_user_claim_script			counter	Time and stats for executing the user claim script(s).
ctr_ip_pool		I	number	IP pool statistics. IPv4 will be more useful than IPv6 (as the latter is unlikely to run out).	usage: {reserved, current, total}
ctr_license		J	number	Provides the current installed license and usage information.	measure: {entitled, used}
ctr_license_days_remaining			days	Controllers report the days remaining before the nearest license expiry date.	~
ctr_memory_heap			bytes	This relates to the Java engine and shows the amount of reserved memory.	measure: {initial, used, committed, max}
ctr_policy_evaluator			bytes	This relates to the JavaScript engine and shows the size of the cache being used.	measure: {cacheSize}
ctr_threads			number	This relates to the Java engine and shows the number of concurrent threads.	~
gw_dns_forwarder_cache			counter	Details the activity of the DNS forwarder cache	type: {hit, miss, insert, expired, count}
gw_dns_forwarder_domain			counter	Count of the top domains that the DNS Proxy is resolving	name: the names of the top domains
gw_dns_forwarder_query			counter	The dns_forwarder uses cache so expect most results to be success and cache	result: {success, cache, nodata, timeout, notfound}. type: {a, aaaa}
gw_dns_resolver_cache			counter	Details the activity of the DNS resolver cache	result: {hit, miss, nodata}
gw_event_queue_size			number	This relates to the sessiond event queue. The length of the queue varies; see Instance sizing. On lightly loaded Gateways, current may not yield very useful data, as it is hit-and-miss whether there is anything in the event queue at the time of the sample.	measure: {current, length, max_used}
gw_fallback_site_usage			number	The number of users on a given Site that are there because of the fallback feature.	name: name of the Site that caused the user to fall back.
gw_ha_interface			counter	This monitors the (inside) network HA related traffic in non SNAT environments.	type: {arp, garp-replies, neighbor, unsolicited-neighbor-advertisements}
gw_http_action			counter	Actions performed within the URL access function.	type: {allowed, blocked, reported}
gw_http_connection			counter	The number of HTTP connections the Gateway handles.	type: {accepts, handled}
gw_http_open_connection			number	Current number of connections.	type: {reading, writing, waiting, active}
gw_http_requests			counter	Number of HTTP requests.	~
gw_name_resolver			number	This monitors the current settings/behavior of the name resolver.	kind: aws, Azure, etc. measure: {update_interval, effective_update_interval, unstable_names}
gw_name_resolver_cache_count			counter	The usage of the resolver's cache	kind: aws, Azure, etc. measure: {evictions, insertions, misses}
gw_name_resolver_cache_ttl			number	Current configuration of the TTL for the resolver	~
gw_name_resolver_cache_value			number	The cache's time to live setting	kind: aws, Azure, etc. measure: {ttl}
gw_name_resolver_label			text	Name of the configured resolver	~
gw_name_resolver_names_missing_resolver			number	The number of resource names configured on Sites where the required resolver has not been configured.	~
gw_name_resolver_value			number	The number of resources discovered.	measure: the type of resource, e.g. instances. kind: the type of resolver, e.g. gcp. name: the name given to the resolver on the Site
gw_policy_evaluator			bytes	This relates to the JavaScript engine and shows the size of the cache being used.	measure: {cacheSize}
gw_session_dropped_signin			counter	The number of sign-in attempts that were dropped because of resource shortages.	reason: {high_java_heap_usage}
gw_session_event_count		L	counter	Number of events handled by the system. Can be used with gw_session_event_timing to derive your own averages. domainResolveUpdate and nameResolveUpdate measure external hostname lookups. login and reEvaluateConditions may include external JavaScript execution time (see gw_session_js_exectime). The others reflect the Gateway's performance (internal processing such as crypto operations).	type: {domainResolveUpdate, handleRemedy, handleTunnelDisconnected, login, nameResolveUpdate, reEvaluateConditions, reGenerateExpressConnectorRules, updateTokens, validateTokens}
gw_event_queue_size_session		H	number	The size of the event handling queue. gw_event_queue_size_session{measure="max_used"} can be useful on lightly loaded systems where "current" might report 0 much of the time.	measure: {current, length, max_used}
gw_session_event_timing		L	number / counter	The amount of time it takes to process the different types of events.	measure: {average, accumulated}. type: {domainResolveUpdate, handleRemedy, handleTunnelDisconnected, login, nameResolveUpdate, reEvaluateConditions, reGenerateExpressConnectorRules, updateTokens, validateTokens}
gw_session_js_exectime			mS	Script execution time - which may include external API calls. Effectively this provides more detail when gw_session_event_timing shows a problem.	type: {Entitlement, Condition, appshortcutscript}
gw_sessiond_heap		F	bytes	Sessiond uses Java, which can consume a lot of RAM, although its usage is somewhat linked to the amount of RAM available. This metric allows the Java usage to be monitored.	measure: {initial, used, committed, max}
gw_sessiond_thread_count			number	All Java process threads. This is somewhat linked to the amount of RAM available and also reflects the number of events the Gateway is handling.	type: {current, peak}
gw_session_traffic_active			number	Number of active sessions for this traffic type.	type: {direct, directViaNatt, relay}
gw_session_traffic_allsitesup_averagetime			mS	Average uptime for all Sites for this traffic type.	type: {direct, directViaNatt, relay}
gw_session_traffic_allsitesup_maxtime			mS	Maximum uptime for all Sites for this traffic type.	type: {direct, directViaNatt, relay}
gw_session_traffic_allsitesup_mintime			mS	Minimum uptime for all Sites for this traffic type.	type: {direct, directViaNatt, relay}
gw_session_traffic_roundtrip_averagetime			mS	Average round-trip time for this traffic type.	type: {direct, directViaNatt, relay}
gw_session_traffic_roundtrip_maxtime			mS	Maximum round-trip time for this traffic type.	type: {direct, directViaNatt, relay}
gw_session_traffic_roundtrip_mintime			mS	Minimum round-trip time for this traffic type.	type: {direct, directViaNatt, relay}
gw_token_size			bytes	The size of the payload sent from the client to the Gateway. There is a limit of 16MB for this.	measure: {average, min, max}
gw_vpn_client_metric			number	Client connectivity metrics relating to a given Site.	name: of the measure, measure: {min, max, avg}, client_site_name: of the reporting Site
gw_vpn_memory_usage			number	Overall memory usage of the firewall daemon(s).	~
gw_vpn_resolved_actions			number	Current number of resolved actions in use by the firewall. Monitored actions check the end-point.	measure: {monitored, unmonitored}
gw_vpn_resolved_actions_size			bytes	Memory used to store resolved actions.	~
gw_vpn_rtt_sample			mS	Unaggregated round-trip time per vpnd instance.	traffic mode: {direct, direct-via-natt, relay}, extreme: {highest, lowest}, dn: a given user’s distinguished name
gw_vpn_rules			number	Number of active VPN rules in place	proto: {ipv4, ipv6}
gw_vpn_rules_size			bytes	Memory used to store the current rules held in the firewall	proto: {ipv4, ipv6}
gw_vpn_sessions		G	number	Number of (user) sessions being serviced by the Gateway
gw_vpn_states			number	Number of connection states being held in the firewall	proto: {ipv4, ipv6}
gw_vpn_states_size			bytes	Memory used to store the current states held in the firewall	proto: {ipv4, ipv6}
lf_audit_logs	6.5.1		counter	Quantity of audit logs forwarded to the LogForwarder	source: hostname of source appliance, type: audit event type
log_audit_logs	6.5.1		counter	Quantity of audit logs forwarded to the LogServer	source: hostname of source appliance, type: audit event type
ptl_client			number	The status of the Portal's clients. 'running' is the number of available clients, which depends on the size of the appliance.	type: {free, pending_removal, running, used}
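
The table notes that gw_session_event_count can be combined with gw_session_event_timing to build your own averages. A minimal sketch of that calculation between two scrapes follows. It assumes that the accumulated measure is a running total of processing time for the same event type as the count, and that both only ever increase; the function name average_event_time and the figures in the example are hypothetical.

```python
def average_event_time(count_before, count_after, accumulated_before, accumulated_after):
    """Average time per event between two scrapes, or None if it cannot be computed.

    Assumes the event count and the accumulated time are both monotonically
    increasing values for the same event type.
    """
    events = count_after - count_before
    elapsed = accumulated_after - accumulated_before
    if events <= 0 or elapsed < 0:
        return None  # no new events, or a counter reset between scrapes
    return elapsed / events


if __name__ == "__main__":
    # Hypothetical readings for type="login" taken at two scrape times.
    print(average_event_time(100, 150, 2000.0, 2600.0))
```

The result is in whatever unit the accumulated measure uses. Working from deltas rather than raw totals gives an average for the chosen window only, which is usually more useful than the lifetime average when looking for a recent slowdown.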
