Prometheus metrics thresholds

Default thresholds that should suit most use cases can be set for the following metrics.

Reference	Metric	Note	Measure	Trigger Level: Warning	Trigger Level: Error	Limit Effect	Cause	Possible Action
A	apn_active_connections / apn_active_connections_max*100	This is a Linux limit	%	80%	90%	User connections will be dropped	Too many TCP connections (users and hosts)	Increase the number of allowed connections or reduce the number of protected hosts
B	avg_over_time (apn_cpu_usage_percent [10m])	The metric is already a 15-second average	% (averaged over 10 minutes)	85%	95%	Machine will slow down	Multiple	Diagnose the cause and fix it, or increase the number of vCPUs
C	apn_disk	This is the whole disk	%	80%	90%	Various	Database size, Logs, Image files	Add a bigger disk
D	apn_disk_partition_statistic {measure="percent", func=~".*ctr.*", path="/mnt/data"}	Database is in /data so this mainly affects Controllers	%	80%	90%	Controllers can't sync	Database size too large	Add a bigger disk
E	avg_over_time (apn_memory {measure="percent"} [5m])	High user concurrency mainly affects Gateways	% (averaged over 5 minutes)	75%	90%	Auto suspend GW, or system failure if reaching 100%	Too many events or rules in Gateways	Add more RAM
F	gw_sessiond_heap {measure="used"} / ignoring (measure) gw_sessiond_heap {measure="max"}*100	Used by sessiond for event handling; mainly affects Gateways	%	70%	85%	Auto suspend GW	Too many events to handle	Add another Gateway
G	gw_vpn_sessions	Strongly affected by the number of rules assigned to each user	number	4000	8000	Auto suspend GW	Too many users to handle	Add another Gateway
H	gw_event_queue_size {measure="current"} / gw_event_queue_size {measure="length"}*100	{measure="length"} has the value 15k, 30k, 50k, or 100k depending on the appliance specification.	%	50%	70%	Auto suspend GW	Too many events to handle	Add another Gateway
I	(ctr_ip_pool {usage="current",name=~".*v4"}+ignoring (usage) ctr_ip_pool {usage="reserved",name=~".*v4"}) / ignoring (usage) ctr_ip_pool {usage="total",name=~".*v4"}*100	This calculates the % of IPv4 addresses used.	%	75%	90%	Health: Failed to allocate IP from pool	IP pool too small	Add another IP pool
J	ctr_license {type="users", measure="used"} / ignoring (measure) ctr_license {type="users", measure="entitled"}*100	This calculates the % of "users" licenses used.	%	85%	95%	Health: There are more X than your license allows.	Insufficient licenses	Purchase additional licenses
K	ctr_client_authentication {measure="average_time"} + ctr_client_authorization {measure="average_time"}	The average time for a standard user to sign in to the Controller	ms	2000	5000	User sign-in times out	Multiple	Diagnose the cause and fix it; focus on external calls/scripts
L	gw_session_event_timing {measure="accumulated"} / ignoring(measure) gw_session_event_count	The long-term average execution time per event type	ms	2000	5000	Users experience slow response times	Multiple	Diagnose the cause and fix it; focus on external calls/scripts
M	apn_certificate_days_remaining	Days remaining until certificate expiration, per certificate type	days	30	0	Lost connection (differs depending on certificate type)	Certificate expiring	Renew the certificate during off-peak hours to avoid interruptions (the Appliance certificate is renewed automatically 1 day before expiry)
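
As an illustration of how these thresholds might be applied, the following is a minimal sketch of a Prometheus alerting rules file covering references A and M. The metric expressions and threshold values come from the table. The group name, alert names, `for` durations, `severity` label values, and the choice of strict comparison operators are assumptions made for this example and should be adapted to your own monitoring setup.

```yaml
# Sketch only: alerting rules for references A and M in the table above.
# Assumed (not from the table): group name, alert names, "for" durations,
# severity label values, and strict (> / <) comparisons.
groups:
  - name: appliance-thresholds
    rules:
      # Reference A: active connections as a percentage of the maximum.
      - alert: ActiveConnectionsWarning
        expr: apn_active_connections / apn_active_connections_max * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Active connections above 80% of the allowed maximum"
      - alert: ActiveConnectionsError
        expr: apn_active_connections / apn_active_connections_max * 100 > 90
        for: 5m
        labels:
          severity: error
        annotations:
          summary: "Active connections above 90% of the allowed maximum"
      # Reference M: lower values are worse, so the comparison is inverted.
      - alert: CertificateExpiringWarning
        expr: apn_certificate_days_remaining < 30
        labels:
          severity: warning
        annotations:
          summary: "Certificate expires in fewer than 30 days"
```

If you use a file like this, it can be validated with `promtool check rules <file>` before it is loaded. Note that with these expressions both the warning and the error alert fire once the error level is exceeded.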
