Gateways Are Showing a Warning After Upgrading to Version 6.4.x: Gateway Is Suspended Due to High Resource Usage

AppGate SDP version 6.4 introduces new capabilities that automatically Suspend a Gateway when it is under heavy load. A suspended Gateway keeps all existing connections, so connected users are not affected, but it does not accept new connections. This maintains the availability of the Gateway under very heavy load by preventing the resources on the Appliance from being overconsumed. Once the load on the Gateway returns to normal values, the Gateway is automatically Resumed and users can establish new connections to the Appliance again.

The automatic Suspend and Resume of Gateways is based on resource utilization of the following monitors:

  • eventQueueUsage - The % of the event queue used (events awaiting processing by sessiond)

  • javaHeapUsage - The % memory used by sessiond vs the heap's maximum

  • numberOfSessions - The number of users' sessions on the Gateway

  • systemMemoryUsage - The % total memory usage in the Gateway

Each trigger has a default value that is defined in our Admin Guide for High Availability configuration, in the "Gateway Failover When it is Suspended" section. When a High Trigger is met, the Appliance goes into a Warning state and displays a message like the following:

Gateway is not accepting new connections due to high resource usage ({reason_str}). Users with active sessions are not affected. Sign-ins will resume automatically when resource usage allows.

How it Works

In an appropriately sized Site with two Gateways, users can establish new connections to either Gateway as needed under normal load. Resource utilization varies with the number and types of connections as users connect and disconnect. If a spike in connections causes, for example, RAM utilization to reach 85% of the total RAM available (the High Trigger for memory usage), the Gateway is automatically Suspended. Once Suspended, the Appliance goes into a Warning state with the following warning:

Gateway is suspended due to high resource usage (ram). It will resume automatically when resource usage allows.

When a Gateway is automatically Suspended, all existing users keep their existing connections, but the Suspended Gateway does not accept new connections. Any new users needing a connection to the Site are diverted to an alternate Gateway that is not Suspended.

Once a Gateway is automatically Suspended, it remains in this state until existing connections drop and, continuing the same example, memory utilization decreases to 70% of the total RAM available (the Low Trigger for memory usage). When this Low Trigger is met, the Gateway is Resumed and new users can again establish connections.
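The High/Low Trigger behavior described above is a form of hysteresis, and can be illustrated with a small sketch. This is not AppGate's implementation: the function names, the whole-number readings, and the choice of "at or above" and "at or below" comparisons are assumptions made for illustration. The 85 and 70 values are taken from the example above.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: models the suspend/resume hysteresis for one monitor.
HIGH=85   # High Trigger from the example (assumed: suspend at or above)
LOW=70    # Low Trigger from the example (assumed: resume at or below)

# next_state CURRENT_STATE USAGE_PERCENT -> prints "active" or "suspended"
next_state() {
  local state=$1 usage=$2
  if [ "$state" = "active" ] && [ "$usage" -ge "$HIGH" ]; then
    echo "suspended"
  elif [ "$state" = "suspended" ] && [ "$usage" -le "$LOW" ]; then
    echo "active"
  else
    echo "$state"   # between the two triggers: keep the current state
  fi
}

# Walk through a hypothetical series of whole-number memory readings.
simulate() {
  local state="active" usage
  for usage in "$@"; do
    state=$(next_state "$state" "$usage")
    echo "${usage}% -> ${state}"
  done
}

simulate 60 80 86 78 72 69
```

In this sketch the Gateway stays suspended at 78% and 72% even though both readings are below the High Trigger. The gap between the two triggers is what prevents a Gateway from rapidly flipping between Suspended and Resumed.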

How to Check if Resources are Being Overutilized on an Appliance

AppGate recommends that companies implement continuous monitoring of the AppGate SDP system and provides recommendations in the System Monitoring and Logs section of the Admin Guide. As part of a monitoring plan, we recommend that customers track the following related Prometheus metrics and set alert thresholds on them, so that appropriate personnel are notified of a potential resource issue before a Gateway is automatically Suspended:

gw_sessiond_heap
gw_vpn_sessions
gw_event_queue_size
apn_memory

Beyond setting up automated and continuous monitoring, resource utilization on an Appliance can be checked manually in one of two ways:

Check the Gateway Logs

The cz-configd log file periodically records appliance details, including memory utilization. Search for entries with an event_type of 'appliance_function_suspended' to see the memory utilization associated with a suspension.

Use the cz-metrics command

The cz-metrics command can be run manually on any Appliance to get the current metrics value where appropriate. As an example, the memory values of a Gateway can be obtained by running the following command from a command line:

cz@gw1:~$ sudo cz-metrics get apn_memory | grep percent

apn_memory{collective_id="32201451-1add-4260-b1ae-a2bd8fded8eb",collective_name="briangwt",appliance_id="8ffcaf3b-283c-4d11-8731-377de2f07907",appliance_name="gateway1",site_id="8a4add9e-0e99-4bb1-949c-c9faf9a49ad4",site_name="Default Site",appliance_version="6.4.3-40372-release",func="apn,gw",measure="percent"} 16.9521595744466
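For scripted checks, the percentage can be pulled out of output shaped like the line above and compared against an alert level. The following is a minimal sketch: the helper names metric_percent and exceeds_threshold, the shortened sample line, and the 75% alert level are all hypothetical and not part of the product.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: parse a Prometheus-style metric line and compare it
# with an alert level. Helper names and WARN_AT are hypothetical.

# Print the value (last whitespace-separated field) of lines labeled measure="percent".
metric_percent() {
  awk '/measure="percent"/ { print $NF }'
}

# exceeds_threshold VALUE LIMIT -> exit status 0 when VALUE >= LIMIT (floats allowed)
exceeds_threshold() {
  awk -v v="$1" -v limit="$2" 'BEGIN { exit !(v + 0 >= limit + 0) }'
}

# Sample line shaped like the output above, with the labels shortened.
sample='apn_memory{appliance_name="gateway1",site_name="Default Site",func="apn,gw",measure="percent"} 16.9521595744466'

value=$(printf '%s\n' "$sample" | metric_percent)
WARN_AT=75   # hypothetical alert level, chosen below the 85% High Trigger example

if exceeds_threshold "$value" "$WARN_AT"; then
  echo "WARNING: memory at ${value}% (alert level ${WARN_AT}%)"
else
  echo "OK: memory at ${value}%"
fi
```

On an Appliance, the sample line could be replaced by piping the real command into the helper, for example sudo cz-metrics get apn_memory | metric_percent.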

Actions to Take When Gateways are Automatically Suspended Often

Although automatically suspending a Gateway helps ensure the stability of an Appliance, if the resources of an Appliance or Site are not correctly sized for the number and types of user connections, Gateways may be suspended too frequently or may not come out of a suspended state. If all Gateways at a Site are Suspended, the Site may reach its maximum number of connections, preventing new users from connecting to it at all. In this situation, there are several actions that can be taken so that users are able to connect to the Site while its stability is maintained.

Increase the resources of the Appliance(s)

Often, the Appliance resources can be increased to enable more concurrent connections without going past the High Trigger for an auto-suspend event. Increasing the RAM, as an example, increases the memory available for systemMemoryUsage and javaHeapUsage. Please refer to our Instance Sizing guide for appropriately sizing your Gateways at a site.

Add Additional Gateways to a Site

Adding Appliances to a Site adds redundancy and allows user connections to be distributed across a larger pool of Appliances. If the Site is cloud based, customers can also dynamically add or remove Gateways using auto-scaling based on load.

Change the Auto-Suspend High and Low Triggers

Although AppGate recommends the default values to ensure the stability of an Appliance under heavy load, these thresholds can be altered using the cz-config command with the get and set actions on gateway/suspendWatermarks. The thresholds must be set by establishing an SSH session with the Appliance and running the commands as root.

As an example, to change the High and Low watermarks for memory usage, run the following commands from a command line on the Appliance:

sudo cz-config set -j gateway/suspendWatermarks/systemMemoryUsage/high 90.0
sudo cz-config set -j gateway/suspendWatermarks/systemMemoryUsage/low 85.0

To obtain the current High and Low values for all auto-suspend settings, run the following command as root from an ssh session:

sudo cz-config get gateway/suspendWatermarks

Note that resource usage may still fluctuate after a Gateway is suspended. Existing connections may cause memory usage to increase, so any changes to the default watermarks should allow for these fluctuations.
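Before changing watermarks, it can help to sanity-check the pair of values. The sketch below validates a High/Low pair and prints the cz-config commands that would apply it, without running them. The function name watermark_commands is hypothetical, the rule that Low must be below High is an assumption based on the suspend and resume behavior described earlier, and using the same path pattern for monitors other than systemMemoryUsage is also an assumption.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: dry run that validates a watermark pair and prints
# the cz-config commands. Nothing is executed against the Appliance.
# watermark_commands MONITOR HIGH LOW
watermark_commands() {
  local monitor=$1 high=$2 low=$3
  # Assumed rule: Low must be non-negative and strictly below High.
  if ! awk -v h="$high" -v l="$low" 'BEGIN { exit !(l + 0 >= 0 && l + 0 < h + 0) }'; then
    echo "invalid watermarks for ${monitor}: low=${low} high=${high}" >&2
    return 1
  fi
  echo "sudo cz-config set -j gateway/suspendWatermarks/${monitor}/high ${high}"
  echo "sudo cz-config set -j gateway/suspendWatermarks/${monitor}/low ${low}"
}

# Same values as the example above.
watermark_commands systemMemoryUsage 90.0 85.0
```

Printing the commands rather than executing them leaves the final review, and the decision to run them as root, with the administrator.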