For a few metrics it is possible to set default thresholds that should suit most use cases.
Reference | Metric | Note | Measure | External Trigger Level: Warning | External Trigger Level: Error | Appgate SDP Trigger Level: Limit Effect | Cause | Possible Action |
|---|---|---|---|---|---|---|---|---|
A | apn_active_connections / apn_active_connections_max*100 | This is a Linux limit | % | 80% | 90% | User connections will be dropped | Too many TCP connections (users and hosts) | Increase the number of allowed connections or reduce the number of protected hosts |
B | avg_over_time (apn_cpu_usage_percent [10m]) | This is already a 15-second average | % time averaged over 10 minutes | 85% | 95% | Machine will slow down | Multiple | Diagnose and fix the cause, or increase the number of vCPUs |
C | apn_disk | This is the whole disk | % | 80% | 90% | Various | Database size, Logs, Image files | Add a bigger disk |
D | apn_disk_partition_statistic {measure="percent", func=~".*ctr.*", path="/mnt/data"} | Database is in /data so this mainly affects Controllers | % | 80% | 90% | Controllers can't sync | Database size too large | Add a bigger disk |
E | avg_over_time (apn_memory {measure="percent"} [5m]) | High concurrency of users mainly affects Gateways | % time averaged over 5 minutes | 75% | 90% | Auto suspend GW or system failure (if reaching 100%) | Too many events or rules in Gateways | Add more RAM |
F | gw_sessiond_heap {measure="used"} / ignoring (measure) gw_sessiond_heap {measure="max"}*100 | Used by sessiond for event handling; mainly affects Gateways | % | 70% | 85% | Auto suspend GW | Too many events to handle | Add another Gateway |
G | gw_vpn_sessions | Strongly affected by the number of rules assigned to each user | number | 4000 | 8000 | Auto suspend GW | Too many users to handle | Add another Gateway |
H | gw_event_queue_size {measure="current"} / ignoring (measure) gw_event_queue_size {measure="length"}*100 | {measure="length"} has the value 15k, 30k, 50k, or 100k depending on the Appliance specification. | % | 50% | 70% | Auto suspend GW | Too many events to handle | Add another Gateway |
I | (ctr_ip_pool {usage="current",name=~".*v4"}+ignoring (usage) ctr_ip_pool {usage="reserved",name=~".*v4"}) / ignoring (usage) ctr_ip_pool {usage="total",name=~".*v4"}*100 | This calculates the % of IPv4 addresses used. | % | 75% | 90% | Health: Failed to allocate IP from pool | IP pool too small | Add another IP pool |
J | ctr_license {type="users", measure="used"} / ignoring (measure) ctr_license {type="users", measure="entitled"}*100 | This calculates the % of "users" licenses used. | % | 85% | 95% | Health: There are more X than your license allows. | Insufficient licenses | Purchase additional licenses |
K | ctr_client_authentication {measure="average_time"} + ctr_client_authorization {measure="average_time"} | The average time for a standard user to sign in to the Controller | ms | 2000 | 5000 | User sign in times out | Multiple | Diagnose and fix the cause; focus on external calls/scripts |
L | gw_session_event_timing {measure="accumulated"} / ignoring(measure) gw_session_event_count | The long-term average execution time per event type | ms | 2000 | 5000 | Users experience slow response times | Multiple | Diagnose and fix the cause; focus on external calls/scripts |
M | apn_certificate_days_remaining | Days remaining until certificate expiration per certificate type | days | 30 | 0 | Lost connection (different depending on certificate type) | Certificate expiring | Renew the certificate during calm hours to avoid interruptions (Appliance certificate is renewed automatically 1 day before expiry) |
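
The Warning and Error columns above can be applied mechanically once a metric value has been evaluated. The following is a minimal sketch of that comparison, not part of any Appgate SDP tooling: the `Threshold` class, the `THRESHOLDS` dictionary, and the `classify` function are hypothetical names, and treating a value exactly equal to a threshold as having crossed it is an assumption. Only a few rows are included; the numbers are copied from the table. Row M is handled separately because fewer days remaining is worse, so the comparison is inverted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    error: float
    # True when a smaller value is worse (e.g. days remaining on a certificate).
    lower_is_worse: bool = False


# Hypothetical lookup keyed by the table's Reference letter.
# Values are copied from the Warning and Error columns.
THRESHOLDS = {
    "A": Threshold(80, 90),        # active connections, %
    "B": Threshold(85, 95),        # CPU usage averaged over 10 minutes, %
    "E": Threshold(75, 90),        # memory averaged over 5 minutes, %
    "G": Threshold(4000, 8000),    # VPN sessions, count
    "M": Threshold(30, 0, lower_is_worse=True),  # certificate days remaining
}


def classify(ref: str, value: float) -> str:
    """Return "ok", "warning", or "error" for one evaluated metric value.

    Assumption: a value equal to a threshold counts as crossing it.
    """
    t = THRESHOLDS[ref]
    if t.lower_is_worse:
        if value <= t.error:
            return "error"
        if value <= t.warning:
            return "warning"
        return "ok"
    if value >= t.error:
        return "error"
    if value >= t.warning:
        return "warning"
    return "ok"
```

For example, `classify("A", 85)` returns `"warning"` because 85% lies between the 80% warning and 90% error levels, while `classify("M", 45)` returns `"ok"` because more than 30 days remain.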