Prometheus metrics details


The AppGate ZTNA system generates metrics for SNMP and Prometheus in a unified way internally. When a metric is collected, it is formatted programmatically for both types of output, so although Prometheus is the lead method, there is (almost) always an exactly matching metric available in SNMP.

Prometheus

Prometheus is an open-source project managed by the Cloud Native Computing Foundation (CNCF). It was designed to collect metrics about your application and infrastructure. A Prometheus library has been built into the AppGate ZTNA appliance. This exposes an HTTP endpoint from which Prometheus can scrape metrics. The scraped metrics are then stored, and rules are applied to aggregate or generate time series from the data. Grafana is the default means of visualizing this data.
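To make the scrape output more concrete, the following is a minimal Python sketch that parses a few lines of Prometheus text exposition format into tuples. It is a simplified illustration, not the parser Prometheus itself uses: it ignores timestamps and escaped quotes inside label values. The payload in EXAMPLE and the helper name parse_samples are assumptions made for this example; the values are invented.

```python
import re

# Matches: metric_name{label="value",...} number
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Parse simplified exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Illustrative payload only; the label values and numbers are made up.
EXAMPLE = '''
# TYPE apn_memory gauge
apn_memory{appliance_name="gw-1",measure="used"} 1048576
apn_function_sessions{appliance_name="gw-1"} 42
'''

samples = parse_samples(EXAMPLE)
for name, labels, value in samples:
    print(name, labels, value)
```

In practice a Prometheus server performs this parsing itself; a sketch like this is mainly useful for ad hoc inspection of what an appliance exposes.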

Understanding Prometheus metrics

The first part of the name reflects the source of the metric:

apn = appliance

ctr = Controller

gw = Gateway

ptl = Portal

The remainder of the name points to the daemon/feature which is being monitored.
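The naming convention above can be sketched in Python as a simple prefix lookup. The function name split_metric_name and the "unknown" fallback are assumptions for this example; the fallback is there because some metrics in the table below (for example lf_audit_logs) use prefixes that are not in this list.

```python
# Prefixes taken from the list above.
SOURCE_BY_PREFIX = {
    "apn": "appliance",
    "ctr": "Controller",
    "gw": "Gateway",
    "ptl": "Portal",
}


def split_metric_name(name):
    """Return (source, remainder) for a metric name such as 'gw_vpn_sessions'."""
    prefix, _, remainder = name.partition("_")
    # Prefixes not in the list above are reported as "unknown".
    source = SOURCE_BY_PREFIX.get(prefix, "unknown")
    return source, remainder


print(split_metric_name("gw_vpn_sessions"))
print(split_metric_name("ctr_database_size"))
```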

The color indicates the relevance of the metric:

Green = metrics are likely to be used in most Collectives.

Orange = metrics will be relevant in specific circumstances.

Black = metrics are somewhat specialized and are unlikely to be of value in a production environment.

Grey = metrics that have been deprecated and should not be used.

Labels are used with most metrics to cover the possibility of there being multiple results available from one metric: name, measure, type, usage, etc. The most interesting Label(s) in respect of each metric are detailed in the table below.

There are seven standard labels included in all metrics. These are: appliance_id="", appliance_name="", collective_id="", collective_name="", site_id="", site_name="", appliance_version="".

These can be individually excluded in the Prometheus Exporter settings for an appliance or the Metrics Aggregator. This helps produce more consistent output (e.g. you can ignore the appliance version) and reduces the volume of data (the _id values can be long and bloat the metrics considerably).
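The effect of excluding standard labels can be illustrated with a short Python sketch. This only mimics the result on an already scraped sample; the real exclusion is configured in the Prometheus Exporter settings, and the helper name drop_labels and the sample values are assumptions for this example.

```python
# The seven standard labels listed above.
STANDARD_LABELS = (
    "appliance_id", "appliance_name", "collective_id", "collective_name",
    "site_id", "site_name", "appliance_version",
)


def drop_labels(labels, excluded):
    """Return a copy of labels without the excluded standard labels."""
    unknown = set(excluded) - set(STANDARD_LABELS)
    if unknown:
        raise ValueError(f"not a standard label: {sorted(unknown)}")
    return {key: value for key, value in labels.items() if key not in excluded}


# Made-up label set for one sample.
labels = {
    "appliance_id": "0f4c2f0a-1111-2222-3333-444455556666",
    "appliance_name": "gw-1",
    "appliance_version": "6.5.3",
    "measure": "used",
}

trimmed = drop_labels(labels, ["appliance_id", "appliance_version"])
print(trimmed)
```

With appliance_version removed, the label set of a series stays the same across an upgrade, which is the consistency benefit described above.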

For certain metrics there are some suggested default thresholds which can be used to trigger alerts and take some corrective action.
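As a rough illustration of threshold-based alerting, the Python sketch below checks samples against a small rule list. The limits used here (30 days, 80 percent) are hypothetical placeholders chosen for the example, not the suggested defaults referenced in the table; the rule layout and the function name breaches are also assumptions.

```python
import operator

# HYPOTHETICAL thresholds for illustration only; they are not the suggested
# defaults referred to in the threshold reference column.
RULES = [
    # (metric, label filter, comparison, limit)
    ("apn_certificate_days_remaining", {}, operator.lt, 30),
    ("apn_disk", {"measure": "percent"}, operator.gt, 80),
]


def breaches(samples, rules=RULES):
    """Return the samples that violate any rule, with the limit that was hit."""
    found = []
    for name, labels, value in samples:
        for metric, wanted, compare, limit in rules:
            if name != metric:
                continue
            # Every label in the filter must match the sample's labels.
            if any(labels.get(key) != val for key, val in wanted.items()):
                continue
            if compare(value, limit):
                found.append((name, labels, value, limit))
    return found


# Made-up samples in (name, labels, value) form.
example_samples = [
    ("apn_certificate_days_remaining", {"type": "CA"}, 12.0),
    ("apn_disk", {"measure": "percent"}, 41.0),
    ("apn_disk", {"measure": "used"}, 9.0e9),
]

for hit in breaches(example_samples):
    print("threshold breached:", hit)
```

In a production setup this kind of check would more typically be expressed as Prometheus alerting rules; the sketch only shows the logic.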

Metric

Introduced or updated

Threshold reference

Type

Comment

Labels of interest

apn_active_connections

A

number

The number of active connections the operating system is tracking.

proto: {tcp, udp, icmp}

apn_active_connections_max

number

Maximum number of active connections allowed.

~

apn_audit_events

counter

Qty of local audit logs generated. Can be useful to show additional metrics not otherwise available.

type: of audit log

apn_audit_logs

counter

Monitors the flow of audit logs from the appliance. Logs are queued (by daemons), read (by an external party) and, once receipt is confirmed, removed. The queued, read, received and removed counts should all be the same. Dropped and queued on startup would normally be zero.

Received or removed are very useful to show the appliance is properly connected and doing work.

type: {dropped, queued, queued on startup, read, received, removed}

apn_audit_queue_maxlimit

6.5.3

gauge

The limit of the number of audits in the source database.

~

apn_audit_queue_total

6.5.3

gauge

The number of audits on the source appliance not successfully forwarded.

~

apn_certificate_days_remaining

M

days

Days remaining before the given certificate expires

type: {CA, Portal access, Portal https, URL access, admin https, appliance}

apn_cpu_usage_percent

B

bytes | %

Current CPU usage. Matters more for Gateways, which will eventually throttle users.

measure: {available, free, percent, used, total}

apn_disk

C

bytes | %

Current disk usage. Important to make sure Controllers do not run out of disk.

measure: {free, percent, used, total}

apn_disk_partition_statistic

D

bytes | %

Current disk partition usage. Important to make sure Controllers do not run out of disk.

name: {data, overlay, state}, measure: {free, percent, used, total}

apn_dns_cache_entries

number

The number of cached records (where the request was a success) and the number of denials (where the request was denied: NXDOMAIN, etc.).

purpose: the instance of CoreDNS, type: {success, denial}

apn_dns_requests_total

counter

The DNS requests being handled by CoreDNS.

purpose: the instance of CoreDNS, type: the type of request, e.g. AAAA

apn_dns_responses_total

counter

The number of responses, including standard DNS response codes such as NXDOMAIN.

purpose: the instance of CoreDNS, response_code: the DNS response code

apn_function_sessions

number

The number of current Client sessions. Same as shown in the appliance health status screen.

~

apn_function_status

number

The reported health of each of the appliance functions.

name: of the function reporting its status, status: {healthy (0), busy (1), warning (2), error (3)}

apn_function_suspended

number

The reported suspension state of each of the appliance functions.

name: of the function reporting its status, reason: {ram, queue, heap, sessions, manual}, status: {normal (0), suspended (1)}

apn_image_size

bytes

Size of the image partition (where the SDP code lives).

~

apn_memory

E

bytes

Current memory usage. Gateways need monitoring as usage is proportional to users and conditions.

measure: {available, free, percent, total, used}

apn_nat_traversal_relayed_data_throughput_bits

bps

Average NAT traversal relayed data throughput over the given timeframe. Default timeframe is ten minutes.

apn_network_interface_statistic

counter

Network traffic measure which includes the bytes and packets

ifname: of the interface, measure: the type (and direction) of traffic such as "rx_packets".

apn_network_interface_speed

Mb/s

Current network traffic throughput.

ifname: of the interface, measure: {tx_speed, rx_speed}

apn_proxy_protocol_messages

counter

Proxy protocol message counter.

version: {v1, v2}

apn_snat

number

Number of UDP/TCP ports (or ICMP types) used by a specific source_ip. Used to make sure enough IP aliases have been configured.

source_ip: generating the SNAT traffic

apn_spa_packet_authorization_time

ms

The time it takes to validate an SPA packet.

measure: {maximum, average, minimum}, proto: {tcp-tls, udp-dns, udp-dtls}

apn_spa_packets

counter

Detailed metrics for SPA (and non-SPA) packets received. These are mainly specialist metrics for development use. In normal use expect to see mostly type: authorized.

For a view of unwanted connection attempts, use type: invalid for proto: tcp, and type: ignored for proto: udp.

type: of action taken, proto: {tcp, tcp-tls, udp, udp-dns, udp-dtls}

apn_spa_replay_attack_cache_entries

number

The cache keeps a record of all valid SPA packets to prevent replay attacks. This can consume up to 1 GB of RAM, so it is worth watching on smaller appliances.

~

apn_status

number

The summary health of the appliance as reported on the Appliance Health Page.

This status differs from the Appliance status reported in the Admin UI, which reflects the appliance from the point of view of a single Controller. The difference primarily affects the Admin UI Appliance status Offline, which is reported here as warning (if one Controller is unreachable), error (if no Controller is reachable) or not at all (if the appliance is dead).

status: {healthy (0), busy (1), warning (2), error (3)}

apn_volume_number

1 or 2

The active volume. The disk is split into two volumes: one for the current (live) version and one for the previous version.

~

ctr_admin_authentication

counter | ms

Time and stats for admin authentications to Controllers (mainly LDAP, SAML, OATH, etc) excluding MFA.

measure: {success, error, total, average_time}

ctr_admin_authorization

counter | ms

Time and stats for admin authorizations by Controllers (includes running assignment criteria scripts, signing tokens etc).

measure: {success, error, total, average_time}

ctr_admin_mfa

counter | ms

Time and stats for use of MFA at admin sign-in

measure: {success, error, total, average_time}

ctr_client_authentication

K

counter | ms

Time and stats for Client authentications to Controllers that exclude MFA at sign-in.

measure: {success, error, total, average_time}

ctr_client_authorization

K

counter | ms

Time and stats for Client authorizations by Controllers (includes running assignment criteria scripts, ctr_client_evaluate_all_policies, collecting and grouping them to Sites, signing tokens etc).

measure: {success, error, total, average_time}

ctr_client_average_time

ms

Average time of the API calls for all ctrClient and ctrAdmin values.

ctr_client_csr

counter | ms

Time and stats for the creation of VPN certificates for use by the Client if required.

measure: {success, error, total, average_time}

ctr_client_enter_password

counter | ms

Time and stats for use of user interactions requiring Password.

measure: {success, error, total, average_time}

ctr_client_evaluate_all_policies

counter | ms

Time and stats for evaluating the Policies looking for any matches (part of authorization).

measure: {success, error, total, average_time}

ctr_client_evaluate_auto_update_criteria_script

counter | ms

Time and stats for evaluating the auto-update criteria scripts looking for any matches if configured.

measure: {success, error, total, average_time}

ctr_client_mfa

counter | ms

Time and stats for use of user interactions requiring MFA.

measure: {success, error, total, average_time}

ctr_client_new_ip_allocation

counter | ms

Time and stats for the allocation of an IP address to a Client at sign-in

measure: {success, error, total, average_time}

ctr_client_risk_engine_response

ms

Time it takes for the ZTP risk engine to respond

measure: {average time}

ctr_client_sign_in_with_mfa

counter | ms

Time and stats for Client authentications to Controllers that include MFA at sign-in.

measure: {success, error, total, average_time}

ctr_database_conflicts

-

Specialist metrics for development use

~

ctr_database_node_state

-

Specialist metrics for development use

~

ctr_database_raft_state

-

Specialist metrics for development use

~

ctr_database_replication

1 | 0

Indicates if all of the Controllers can replicate with each other (1:yes 0:no).

~

ctr_database_size

bytes

Size of the Controller database.

~

ctr_database_replication_slot_replay_lag

number

Quantity of records waiting to be synchronized with each of the other Controllers. Should be minimal before attempting an upgrade.

name: the name(s) of the other Controller(s)

ctr_evaluate_user_claim_script

counter

Time and stats for executing the user claim script(s).

ctr_ip_pool

I

number

IP pool statistics. IPv4 will be more useful than IPv6 (as the latter is unlikely to run out).

usage: {reserved, current, total}

ctr_license

J

number

Provides the current installed license and usage information.

measure: {entitled, used}

ctr_license_days_remaining

days

Controllers report the days remaining before the nearest license expiry date.

~

ctr_memory_heap

bytes

This relates to the Java engine and shows the amount of reserved memory.

measure: {initial, used, committed, max}

ctr_policy_evaluator

bytes

This relates to the JavaScript engine and shows the size of the cache being used.

measure: {cacheSize}

ctr_threads

number

This relates to the Java engine and shows the number of concurrent threads.

~

gw_dns_forwarder_cache

counter

Details the activity of the DNS forwarder cache

type: {hit, miss, insert, expired, count}

gw_dns_forwarder_domain

counter

Count of the top domains that the DNS Proxy is resolving

name: the names of the top domains

gw_dns_forwarder_query

counter

The dns_forwarder uses cache so expect most results to be success and cache

result: {success, cache, nodata, timeout, notfound}. type: {a, aaaa}

gw_dns_resolver_cache

counter

Details the activity of the DNS resolver cache

result: {hit, miss, nodata}

gw_event_queue_size

number

This relates to the sessiond event queue. The length of the queue varies; see Instance sizing. On lightly loaded Gateways, current may not yield very useful data, as it is hit-and-miss whether there is anything in the event queue at the time of the sample.

measure: {current, length, max_used}

gw_fallback_site_usage

number

The number of users on a given Site that are there because of the fallback feature.

name: name of the Site that caused the user to fall back.

gw_ha_interface

counter

This monitors the (inside) network HA-related traffic in non-SNAT environments.

type: {arp, garp-replies, neighbor, unsolicited-neighbor-advertisements}

gw_http_action

counter

Actions performed within the URL access function.

type: {allowed, blocked, reported}

gw_http_connection

counter

The number of HTTP connections the Gateway handles.

type: {accepts, handled}

gw_http_open_connection

number

Current number of connections.

type: {reading, writing, waiting, active}

gw_http_requests

counter

Number of HTTP requests.

~

gw_name_resolver

number

This monitors the current settings/behavior of the name resolver.

kind: aws, Azure, etc. measure: {update_interval, effective_update_interval, unstable_names}

gw_name_resolver_cache_count

counter

The usage of the resolver's cache

kind: aws, Azure, etc. measure: {evictions, insertions, misses}

gw_name_resolver_cache_ttl

number

Current configuration of the TTL for the resolver

~

gw_name_resolver_cache_value

number

The cache's time to live setting

kind: aws, Azure, etc. measure: {ttl}

gw_name_resolver_label

text

Name of the configured resolver

~

gw_name_resolver_names_missing_resolver

number

The number of resource names configured on Sites where the required resolver has not been configured.

~

gw_name_resolver_value

number

The number of resources discovered.

measure: the type of resource, e.g. instances. kind: the type of resolver, e.g. gcp.
name: the name given to the resolver on the Site

gw_policy_evaluator

bytes

This relates to the JavaScript engine and shows the size of the cache being used.

measure: {cacheSize}

gw_session_dropped_signin

counter

The number of sign-in attempts that were dropped because of resource shortages.

reason: {high_java_heap_usage}

gw_session_event_count

L

counter

Number of events handled by the system. Can be used with _timing to design your own averages.

  • domainResolveUpdate and nameResolveUpdate measure external hostname lookups.

  • login and reEvaluateConditions may include external JavaScript execution time (see gw_session_js_exectime).

  • The others reflect Gateway performance (internal processing such as crypto operations).

type: {domainResolveUpdate, handleRemedy, handleTunnelDisconnected, login, nameResolveUpdate, reEvaluateConditions, reGenerateExpressConnectorRules, updateTokens, validateTokens}

gw_event_queue_size_session

H

number

The size of the event handling queue. gw_event_queue_size_session{measure="max_used"} can be useful on lightly loaded systems where "current" might report 0 much of the time.

measure: {current, length, max_used}

gw_session_event_timing

L

number / counter

The amount of time it takes to process the different types of events.

measure: {average, accumulated}. type: {domainResolveUpdate, handleRemedy, handleTunnelDisconnected, login, nameResolveUpdate, reEvaluateConditions, reGenerateExpressConnectorRules, updateTokens, validateTokens}

gw_session_js_exectime

ms

Script execution time, which may include external API calls. Effectively, this provides more detail when gw_session_event_timing shows a problem.

type: {Entitlement, Condition, appshortcutscript}

gw_sessiond_heap

F

bytes

Sessiond uses Java - and this can consume a lot of RAM but is somewhat linked to the amount of RAM available. This metric allows the Java usage to be monitored.

measure: {initial, used, committed, max}

gw_sessiond_thread_count

number

All Java process threads. This is somewhat linked to the amount of RAM available and also reflects the number of events the Gateway is handling.

type: {current, peak}

gw_session_traffic_active

number

Number of active sessions for this traffic type.

type: {direct, directViaNatt, relay}

gw_session_traffic_allsitesup_averagetime

ms

Average uptime for all Sites for this traffic type.

type: {direct, directViaNatt, relay}

gw_session_traffic_allsitesup_maxtime

ms

Maximum uptime for all Sites for this traffic type.

type: {direct, directViaNatt, relay}

gw_session_traffic_allsitesup_mintime

ms

Minimum uptime for all Sites for this traffic type.

type: {direct, directViaNatt, relay}

gw_session_traffic_roundtrip_averagetime

ms

Average round-trip time for this traffic type.

type: {direct, directViaNatt, relay}

gw_session_traffic_roundtrip_maxtime

ms

Maximum round-trip time for this traffic type.

type: {direct, directViaNatt, relay}

gw_session_traffic_roundtrip_mintime

ms

Minimum round-trip time for this traffic type.

type: {direct, directViaNatt, relay}

gw_token_size

bytes

The size of the payload sent from the Client to the Gateway. There is a limit of 16 MB for this.

measure: {average, min, max}

gw_vpn_client_metric

number

Client connectivity metrics relating to a given Site.

name: of the measure, measure: {min, max, avg}, client_site_name: of the reporting Site

gw_vpn_memory_usage

number

Overall memory usage of the firewall daemon(s).

~

gw_vpn_resolved_actions

number

Current number of resolved Actions in use by the firewall. Monitored actions check the end-point.

measure: {monitored, unmonitored}

gw_vpn_resolved_actions_size

bytes

Memory used to store resolved Actions.

~

gw_vpn_rtt_sample

ms

Unaggregated round-trip time per vpnd instance.

traffic mode: {direct, direct-via-natt, relay}, extreme: {highest, lowest}, dn: a given user’s distinguished name

gw_vpn_rules

number

Number of active VPN rules in place

proto: {ipv4, ipv6}

gw_vpn_rules_size

bytes

Memory used to store the current rules held in the firewall

proto: {ipv4, ipv6}

gw_vpn_sessions

G

number

Number of (user) sessions being serviced by the Gateway

gw_vpn_states

number

Number of connection states being held in the firewall

proto: {ipv4, ipv6}

gw_vpn_states_size

bytes

Memory used to store the current states held in the firewall

proto: {ipv4, ipv6}

lf_audit_logs

6.5.1

counter

Quantity of audit logs forwarded to the LogForwarder

source: hostname of source appliance, type: audit event type

log_audit_logs

6.5.1

counter

Quantity of audit logs forwarded to the LogServer

source: hostname of source appliance, type: audit event type

ptl_client

number

The status of the Portal's Clients. 'running' is the number of available Clients, which depends on the size of the appliance.

type: {free, pending_removal, running, used}
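As a worked example of using one of the metrics in the table, the Python sketch below applies the consistency rule described for apn_audit_logs: queued, read, received and removed should match, while dropped and queued on startup should normally be zero. The function name audit_flow_problems and the sample counts are assumptions for this example. Counters sampled while logs are still in flight may differ briefly, so a real check would probably tolerate small, short-lived gaps.

```python
def audit_flow_problems(counts):
    """Check apn_audit_logs counters, keyed by their 'type' label value."""
    problems = []
    # These four counters should all report the same value.
    flow = {key: counts.get(key, 0)
            for key in ("queued", "read", "received", "removed")}
    if len(set(flow.values())) > 1:
        problems.append(f"flow counters differ: {flow}")
    # These two counters would normally be zero.
    for key in ("dropped", "queued on startup"):
        if counts.get(key, 0) != 0:
            problems.append(f"{key} is {counts[key]}, expected 0")
    return problems


# Made-up counter values for two appliances.
healthy = {"queued": 500, "read": 500, "received": 500, "removed": 500,
           "dropped": 0, "queued on startup": 0}
stalled = {"queued": 500, "read": 500, "received": 420, "removed": 420,
           "dropped": 3, "queued on startup": 0}

print(audit_flow_problems(healthy))
print(audit_flow_problems(stalled))
```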