Prometheus metrics thresholds

A few metrics have default thresholds that should suit most use cases.

Reference

Metric

Note

Measure

External Trigger Level

AppGate ZTNA Trigger Level

Limit Effect

Cause

Possible Action

Warning

Error

A

apn_active_connections / apn_active_connections_max*100

This is a Linux limit

%

80%

90%

User connections will be dropped

Too many TCP connections (users and hosts)

Increase the number of allowed connections or reduce the number of protected hosts

B

avg_over_time (apn_cpu_usage_percent [10m])

This is already a 15-second average

%, averaged over 10 minutes

85%

95%

Machine will slow down

Multiple

Diagnose the cause and fix it, or increase the number of vCPUs

C

apn_disk

This is the whole disk

%

80%

90%

Various

Database size, Logs, Image files

Add a bigger disk

D

apn_disk_partition_statistic {measure="percent", func=~".*ctr.*", path="/mnt/data"}

The database is in /data, so this mainly affects Controllers

%

80%

90%

Controllers can't sync

Database size too large

Add a bigger disk

E

avg_over_time (apn_memory {measure="percent"} [5m])

High concurrency of users mainly affects Gateways

%, averaged over 5 minutes

75%

90%

Auto suspend GW or system failure (if reaching 100%)

Too many events or rules in Gateways

Add more RAM

F

gw_sessiond_heap {measure="used"} / ignoring (measure) gw_sessiond_heap {measure="max"}*100

Used by sessiond for event handling; mainly affects Gateways

%

70%

85%

Auto suspend GW

Too many events to handle

Add another Gateway

G

gw_vpn_sessions

Strongly affected by the number of rules assigned to each user

number

4000

8000

Auto suspend GW

Too many users to handle

Add another Gateway

H

gw_event_queue_size {measure="current"} / ignoring (measure) gw_event_queue_size {measure="length"}*100

{measure="length"} has the value 15k, 30k, 50k, or 100k, depending on the appliance specification.

%

50%

70%

Auto suspend GW

Too many events to handle

Add another Gateway

I

(ctr_ip_pool {usage="current",name=~".*v4"}+ignoring (usage) ctr_ip_pool {usage="reserved",name=~".*v4"}) / ignoring (usage) ctr_ip_pool {usage="total",name=~".*v4"}*100

This calculates the % of IPv4 addresses used.

%

75%

90%

Health: Failed to allocate IP from pool

IP pool too small

Add another IP pool

J

ctr_license {type="users", measure="used"} / ignoring (measure) ctr_license {type="users", measure="entitled"}*100

This calculates the % of "users" licenses used.

%

85%

95%

Health: There are more X than your license allows.

Insufficient licenses

Purchase additional licenses

K

ctr_client_authentication {measure="average_time"} + ctr_client_authorization {measure="average_time"}

The average time for a standard user to sign in to the Controller

ms

2000

5000

User sign in times out

Multiple

Diagnose the cause and fix it; focus on external calls and scripts

L

gw_session_event_timing {measure="accumulated"} / ignoring (measure) gw_session_event_count

The long term average execution time per event type

ms

2000

5000

Users experience slow response times

Multiple

Diagnose the cause and fix it; focus on external calls and scripts

M

apn_certificate_days_remaining

Days remaining until certificate expiration per certificate type

days

30

0

Lost connection (different depending on certificate type)

Certificate expiring

Renew the certificate during off-peak hours to avoid interruptions (the Appliance certificate is renewed automatically 1 day before expiry)
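
As a rough illustration of how the Warning and Error levels in rows A to M might be applied, the Python sketch below classifies a sampled value and builds a PromQL comparison that could serve as an alert condition. The names Threshold, THRESHOLDS, classify and alert_condition are hypothetical and not part of the product or of Prometheus. Two things are assumed rather than stated in the table: that a level is triggered as soon as it is reached (an inclusive comparison), and that row M is the only row where a lower value is worse.

```python
from typing import NamedTuple


class Threshold(NamedTuple):
    warning: float
    error: float
    lower_is_worse: bool = False  # assumption: True only for row M


# Warning/Error levels copied from rows A-M of the table above.
THRESHOLDS = {
    "A": Threshold(80, 90),      # active connections, %
    "B": Threshold(85, 95),      # CPU, % averaged over 10 minutes
    "C": Threshold(80, 90),      # whole disk, %
    "D": Threshold(80, 90),      # data partition, %
    "E": Threshold(75, 90),      # memory, % averaged over 5 minutes
    "F": Threshold(70, 85),      # sessiond heap, %
    "G": Threshold(4000, 8000),  # VPN sessions, count
    "H": Threshold(50, 70),      # event queue fill level
    "I": Threshold(75, 90),      # IPv4 pool usage, %
    "J": Threshold(85, 95),      # "users" licenses, %
    "K": Threshold(2000, 5000),  # sign-in time, ms
    "L": Threshold(2000, 5000),  # execution time per event type, ms
    "M": Threshold(30, 0, lower_is_worse=True),  # certificate days remaining
}


def classify(ref: str, value: float) -> str:
    """Return "ok", "warning" or "error" for a sampled value of row `ref`."""
    t = THRESHOLDS[ref]
    if t.lower_is_worse:
        if value <= t.error:
            return "error"
        return "warning" if value <= t.warning else "ok"
    # Assumption: reaching the level counts as triggering it (>=).
    if value >= t.error:
        return "error"
    return "warning" if value >= t.warning else "ok"


def alert_condition(ref: str, expr: str, level: str) -> str:
    """Wrap a PromQL expression in a comparison against the chosen level.

    Any `level` other than "warning" is treated as "error".
    """
    t = THRESHOLDS[ref]
    limit = t.warning if level == "warning" else t.error
    op = "<=" if t.lower_is_worse else ">="
    return f"({expr}) {op} {limit:g}"


if __name__ == "__main__":
    print(classify("A", 84.0))
    print(classify("M", 12))
    print(alert_condition("G", "gw_vpn_sessions", "warning"))
```

The string returned by alert_condition is only the comparison itself; how long the condition must hold before alerting, and how the alert is labelled and routed, is left to the monitoring setup.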