This section provides information about the internal HA mechanisms that are built in to AppGate ZTNA at different levels. Before exploring these HA mechanisms, it is worth touching on a couple of areas that most people would understand as relating to HA working.
The first is the use of (external) load balancers. These can be used with AppGate ZTNA; however, this is not recommended.
The second is the use of external gateways (VPNs) that support stateful HA, which is not required with AppGate ZTNA.
Use of external load balancers
External load balancers offer some advanced HA features, similar to those built into the system, and they can be used in front of both the Controller and Gateway. However, two of the system's core security features, SPA and mTLS, mean there are some limitations in the use of a number of load balancer features, and other more advanced HA features are not available at all.
When an external load balancer is used, UDP-TCP SPA mode cannot be enabled as the load balancer won't be able to handle UDP and TCP traffic. As load balancers do not have any equivalent cloaking feature, they will also now be visible on the internet. This defeats one of the core principles defined in the Software Defined Perimeter model.
There is also an impact on the use of external load balancers when using TCP SPA mode in conjunction with mTLS. These two core security features are designed to protect you from man-in-the-middle attacks, which makes any application-layer load balancer unusable. This means the load balancer must work exclusively on DNS or on the TCP/IP layer.
Using round-robin DNS is a possibility, in which you can edit the Client Hostname/IP (Gateway) or the Profile DNS name (Controller) - maybe issuing different ones to two groups of users in two different locations. The DNS service will choose one of the IP addresses based on round-robin or specifically predefined criteria such as geolocation. The round-robin DNS service needs to know the appliance is up and running, otherwise the client can be directed to an unavailable appliance. Configure the load balancer to use the Healthcheck Server to probe each appliance and ensure it is available. In this scenario, the client has no knowledge of the next IP address (it has no list), so if it receives a timeout message from the connected Controller, the user would be unable to authenticate.
A TCP/IP layer load balancer is the other possibility, where you should set the DNS record for Client Hostname/IP (Gateway) or the Profile DNS name (Controller) to point to the load balancer. This will be the only hostname/IP address the client connects to. Traffic is forwarded to one of the appliances based on predefined criteria which ideally includes knowing the appliance is up and running, otherwise the client will be directed to an unavailable appliance. The load balancer should use the Healthcheck Server to probe each appliance and ensure it is available. In this scenario, the client has no knowledge of the next IP address (it has no list), so if it receives a timeout message from the connected Controller then the user would be unable to authenticate.
Care also needs to be taken if the load balancer is being used with Controllers and is set up to direct all traffic to the primary Controller in the first instance. In this case, IP pool sizing will need to be significantly increased because of how they are split between all the Controllers.
Instead of an external load balancer, AppGate recommends the use of the internal HA mechanisms for both Controller and Gateway which are designed to optimize scalability, security, and performance.
Gateway stateful failover/Roaming
Part of any HA mechanism includes giving users a seamless experience when the system performs a real-time failover. Unlike normal firewalls which have undefined actors sitting on one or both sides, AppGate ZTNA has only authenticated users on the one side and protected hosts on the other side. AppGate ZTNA also has separate up and down rules to further prevent reverse connections from the protected hosts side. This means that the firewall does not need to be stateful as state is not used as part of the access controls.
Once a secure tunnel has been established with a Gateway, the Client will push the user’s Entitlement and Claims tokens to the Gateway. The Gateway starts a firewall service on a separate thread and uses the Entitlement token information to define an initial set of firewall rules. In the event of connection interruption, the Client will try to establish a new tunnel either with the same Gateway (if there's only one serving the Site) or with another Gateway (if more than one has been assigned to the Site). Once the new connection has been established, the Client will again push the Entitlement and Claims tokens to start a new firewall service. The Gateway will recreate the firewall rules based on the same Entitlement token information used previously. Communication can then continue without loss of any pending network connections established over the previous tunnel, thereby achieving a continuous service which should be transparent to the user and to the applications on the user's device. As there is no Gateway-to-Gateway communication in AppGate ZTNA, the original Gateway has no information about any new Gateway the user may now be using, so it will continue to maintain that user’s session for five minutes, after which the session is killed.
For the best user experience, the protected host must see any failed-over connection coming from the same IP address, so it is not possible to achieve stateful failover at the TCP/IP level with SNAT enabled. However if you only have one (SNAT enabled) Gateway on a given Site, then in the event of connection interruption or a roaming event, the Client will resume its communication with the Site seamlessly.
The remainder of this section looks in more detail at some of the internal HA mechanisms built in to AppGate ZTNA.
AppGate ZTNA HA
A key feature of the AppGate ZTNA system is the use of linearly scalable stateless appliances. For AppGate ZTNA, this means that the Client has to manage the Controller/Gateway HA mechanisms. The Client profiles (for Controllers) and subsequent Entitlement tokens (one for each Site) provide all the information the Client needs for full HA working in support of a robust resilient architecture.
The Client uses a built-in, round-robin, load-balancing algorithm to pick Controllers and Gateways. The chosen appliance will either accept the new connection or, if it is busy or suspended, reject it, in which case the Client has to try again.
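The accept-or-reject loop described above can be sketched roughly as follows. This is only an illustration, not the Client's actual code: `connect_round_robin`, the `try_connect` callback, and the `max_attempts` limit are hypothetical names introduced here.

```python
from itertools import cycle
from typing import Callable, Optional, Sequence


def connect_round_robin(
    appliances: Sequence[str],
    try_connect: Callable[[str], bool],
    max_attempts: int,
) -> Optional[str]:
    """Walk the appliances in round-robin order until one accepts.

    try_connect is a hypothetical callback that returns False when the
    appliance is busy, suspended, or unreachable.
    """
    if not appliances:
        return None
    order = cycle(appliances)
    for _ in range(max_attempts):
        candidate = next(order)
        if try_connect(candidate):
            return candidate
    # Every attempt was rejected.
    return None
```

For example, with `gw-a` suspended, `connect_round_robin(["gw-a", "gw-b"], accepts, 4)` would move on and return `gw-b`.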
In the event of a Gateway disconnection, a new connection will automatically be established with only the briefest of interruptions. Users should not have to sign in to the application again. This new connection could be accepted by the same Gateway or a different Gateway.
There are even some advanced mechanisms, such as cross-collective HA, which allow Clients to fail over to a different Collective when no Controllers are available.
HA functionality extends beyond Controllers and Gateways. Connector HA is configured by setting up Connector HA pairs and is discussed in Connectors; multi-LogForwarder HA operation is discussed in Audit logs.
Controller/Gateway HA mechanisms
The HA support explained below goes beyond what can be achieved using external load balancers. This should be read in conjunction with the Controllers section.
Internal HA mechanisms operate in the following situations:
If the Client is unable to establish a connection to a Controller or Gateway in the first instance. This might be caused by UDP-TCP SPA. Configuring SPA override (on one alternative appliance) might help in this case.
If the Controller tells the Client to fail over to a different Controller. The Controller might be under high load, have run out of IP pool space, or be seeing time-outs while contacting, for example, an LDAP server (requires multiple Controllers per Collective).
Cross-Collective HA where Clients try a different Collective when no Controllers are available (requires configuration of a profile group and a device Policy that includes Client Profile Settings).
Failover to another Gateway in the load-balance group in the event a Gateway does not accept a new connection from a Client.
Failover to another Gateway in the load-balance group in the event of a Gateway becoming unresponsive (requires multiple Gateways per Site).
Failover to Gateways in a different load-balance group (still in the same Site) in the event of a Gateway becoming unresponsive (requires the use of special Gateway weightings).
Failover to a more healthy Gateway when a Gateway reports that a monitored Action is in an unhealthy state (requires configuration of an Action).
The fallback Site will be used if no Gateways are available on a given Site after a minimum period of 30 seconds (requires configuration in Site and in Policy).
Controller failover using internal load balancing
Use Global Settings to change the profile DNS name
The internal load balancing relies on the Profile DNS name, which is created when the first Controller is created. This name is automatically shared across all Controllers in a Collective. For this (and any other profile DNS names you use), you should create a DNS entry with multiple A records (one for each Controller in the Collective). This profile DNS name will subsequently appear in Global Settings as the Global DNS name, which you can edit if required. The AppGate ZTNA Client will look up and use the profile DNS name from the Client profile. When set up correctly, the DNS query will return multiple IP addresses, which the Client will try one by one until it finds one that replies.
If a new DNS name is used (Global or Custom) in a new Client profile, the Certificate of each Controller must be renewed manually before the profile can be used.
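As a rough illustration of the multiple A record lookup, the sketch below resolves a name to all of its addresses and tries them in turn. It is not the Client's implementation: `resolve_all`, `pick_controller`, and the `probe` callback are hypothetical, and `probe` merely stands in for the Client's real SPA and mTLS connection attempt.

```python
import socket
from typing import Callable, List, Optional


def resolve_all(hostname: str, port: int = 443) -> List[str]:
    """Return every distinct IP address the DNS name resolves to."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    addresses: List[str] = []
    for info in infos:
        ip = info[4][0]  # sockaddr tuple starts with the address string
        if ip not in addresses:
            addresses.append(ip)
    return addresses


def pick_controller(
    hostname: str,
    probe: Callable[[str], bool],
    resolver: Callable[[str], List[str]] = resolve_all,
) -> Optional[str]:
    """Try each resolved address in order; return the first that replies."""
    for ip in resolver(hostname):
        if probe(ip):
            return ip
    return None
```

Passing the resolver in as a parameter keeps the sketch testable without a live DNS zone.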
Cross-collective HA using profile groups
Use Client Profiles to set up a profile Group
A Profile Group contains multiple Client profiles that can originate from any Collective. Each Client profile added to the group is assigned a weighting (which behaves similarly to the Gateway weightings). When none of the Controllers for a given Client profile are available, then the Client checks if there is another Client profile in the same profile group. If there are other profiles, the next one is picked according to the set weighting and the Controllers for that Client profile are all tried.
Client profiles can have the same weighting. When this is done, the users will be distributed across these Collectives equally.
Cross-Collective HA could also be used to provide an alternative way to authenticate to just one Collective. The highest weighted Client profile in the group could use OIDC; the lower weighted one LDAP. In the event of OIDC failing, the Controllers would eventually appear to be unavailable, and the Clients would then try the LDAP Client Profile where the users would be able to sign in.
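The profile group behavior can be pictured as two nested loops: all Controllers of one Client profile are tried before the next profile is considered. The sketch below assumes the caller has already ordered the profiles according to their weighting; the function name and data shapes are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def sign_in_with_profile_group(
    profiles_in_order: Sequence[Tuple[str, List[str]]],
    controller_available: Callable[[str], bool],
) -> Optional[Tuple[str, str]]:
    """Return (profile, controller) for the first Controller that answers.

    profiles_in_order holds (profile name, Controller addresses) pairs,
    already sorted by whatever weighting rule applies (an assumption here).
    """
    for profile_name, controllers in profiles_in_order:
        for controller in controllers:
            if controller_available(controller):
                return profile_name, controller
        # No Controller for this profile answered; move to the next profile.
    return None
```

In the OIDC/LDAP scenario above, the OIDC profile would be first in the list and the LDAP profile second.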
Gateway Failover when it is suspended
Use cz-config to alter the suspend thresholds
When the Client tries to connect to a Gateway within the load balance group, the Gateway may or may not accept the new connection. If it has been suspended manually it will reject new connections and the Client will have to try again. Gateways can also be suspended automatically for a number of reasons. This is designed to maintain the availability of the Gateway when working under heavy loads.
Auto-suspend
| Trigger | Detail | High (suspends) | Low (resumes) |
|---|---|---|---|
| eventQueueUsage | The % of the event queue used (events awaiting processing by sessiond) | 70 | 50 |
| javaHeapUsage | The % memory used by sessiond vs the heap's maximum | 90 | 75 |
| numberOfSessions | The number of users' sessions on the Gateway | 8000 | 7500 |
| systemMemoryUsage | The % total memory usage in the Gateway | 85 | 70 |
These values are the built-in defaults; they can be adjusted if required using cz-config.
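The separate high and low values form a hysteresis band: a trigger that has tripped stays tripped until its metric falls back to the low value. A minimal sketch of that idea follows, using the default values from the table. The exact comparisons (`>=` and `<=`) and the per-trigger bookkeeping are assumptions, not documented behavior.

```python
from typing import Dict, Set, Tuple

# (high, low) defaults copied from the table above.
THRESHOLDS: Dict[str, Tuple[int, int]] = {
    "eventQueueUsage": (70, 50),
    "javaHeapUsage": (90, 75),
    "numberOfSessions": (8000, 7500),
    "systemMemoryUsage": (85, 70),
}


def update_tripped(tripped: Set[str], metrics: Dict[str, float]) -> Set[str]:
    """Return the triggers that are tripped after seeing new metrics.

    In this sketch the Gateway counts as suspended while the set is non-empty.
    """
    result = set(tripped)
    for name, (high, low) in THRESHOLDS.items():
        value = metrics[name]
        if value >= high:
            result.add(name)       # reached the suspend level
        elif value <= low:
            result.discard(name)   # dropped back to the resume level
        # Between low and high the previous state is kept.
    return result
```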
Gateway Failover redundancy
The number and size of Gateways needed to achieve HA can be calculated to minimize the risk of downtime. If you want all clients to connect even if one of the Gateways goes offline, the following usage can be accepted during normal operation so Gateways do not go into auto-suspend:
| Number of Gateways | Event queue usage | Heap usage | Number of sessions | Memory usage |
|---|---|---|---|---|
| 2 | 35% | 45% | 4000 | 43% |
| 3 | 47% | 60% | 5350 | 57% |
| 4 | 53% | 68% | 6000 | 64% |
| 5 | 56% | 72% | 6400 | 68% |
If there are only two Gateways in the load balancing group, the second Gateway needs to be able to handle all the connections from the first Gateway if that Gateway goes offline. As seen in the table, going from two to three or four Gateways greatly improves the possible utilization of each Gateway during normal operation. The acceptable utilization during normal operation can be calculated with the following formula: x * (n - 1) / n, where x is the suspend limit (e.g. 90%) and n is the number of Gateways during normal operation. For example, with x = 90% and n = 3, each Gateway can run at up to 60% heap usage.
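The formula is easy to evaluate for other suspend limits or Gateway counts. The helper below is a small sketch (the function name is hypothetical); note that the figures in the table above appear to be rounded.

```python
def acceptable_utilization(suspend_limit: float, gateways: int) -> float:
    """Evaluate x * (n - 1) / n for suspend limit x and n Gateways."""
    if gateways < 2:
        raise ValueError("at least two Gateways are needed to survive a failure")
    return suspend_limit * (gateways - 1) / gateways


if __name__ == "__main__":
    # Reproduce the heap-usage column (suspend limit 90%).
    for n in (2, 3, 4, 5):
        print(n, acceptable_utilization(90, n))
```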
Gateway Failover within the load balance group
Use Appliances > Functions > Secure Tunnel Settings to set the Client Tunneling - Load Balance Weighting Factor
When the Client connects to a Site, it will try to connect to one of the Gateways based on the weighting factor, which biases the outcome of the random selection process. The weights sent by the Controllers can vary depending on the location from which the Client is connecting. Once connected, the Client monitors the traffic going over each Gateway connection and, when there is no application traffic, a keep-alive message is sent every five (5) seconds. As soon as the Client detects a connection interruption, it will try to establish a connection to an alternative Gateway for that Site.
Weighting
Each Gateway is given a three-digit Client Tunneling - Load Balance Weighting Factor. This sets the weighting, which affects the likelihood of a given Gateway being tried by the Client. This can end up affecting the distribution of Client connections within the Site. If the Gateways are all weighted the same (100, 100, 100), then the outcomes will be equal. If one has a weighting of 100 and another 50, then the outcomes will be in a ratio of 2:1. The absolute number assigned as the weighting factor is not important as it does not represent an actual percentage or probability; what is important is the ratio between them.
Example
Three Gateways have weight factors of 10, 10, and 5; each of the first two is twice as likely to be used as the third:
| Gateway | Weight | % Probability of being tried first by the Client |
|---|---|---|
| A | 10 | 10/25*100 = 40% |
| B | 10 | 10/25*100 = 40% |
| C | 5 | 5/25*100 = 20% |
The actual distribution will depend on the availability of the Gateways. If Gateway B was offline then the actual distribution would end up looking like this:
| Gateway | Weight | Availability | % Probability of being tried first by the Client |
|---|---|---|---|
| A | 10 | 100% | 10/15*100 = 66.6% |
| B | 10 | 0% | 0% |
| C | 5 | 100% | 5/15*100 = 33.3% |
In this situation Gateways A and C need to handle the additional load imposed while B is offline.
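The probabilities in both tables follow from dividing each weight by the sum of the weights of the Gateways that are currently available. A minimal sketch (the function and parameter names are hypothetical):

```python
from typing import Dict, Optional, Set


def first_try_probabilities(
    weights: Dict[str, int],
    available: Optional[Set[str]] = None,
) -> Dict[str, float]:
    """Probability of each Gateway being tried first.

    Gateways missing from `available` contribute no weight; when
    `available` is None every Gateway is treated as reachable.
    """
    usable = {
        name: weight
        for name, weight in weights.items()
        if available is None or name in available
    }
    total = sum(usable.values())
    return {
        name: (usable.get(name, 0) / total if total else 0.0)
        for name in weights
    }
```

Calling it with `{"A": 10, "B": 10, "C": 5}` reproduces the first table; adding `available={"A", "C"}` reproduces the second.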
Gateway Failover to another load balance group (in the same Site)
Use the Appliances > Functions > Secure Tunnel Settings to set the Client Tunneling - Load Balance Weighting Factor
When the Client connects to a Site, special weighting factors can be used to change the order in which the Gateways will be tried to improve the failover behavior and user experience. This is useful when the Client detects an interruption and is trying to find an alternative Gateway for that Site.
Zero weighting
It is possible to assign a zero weight to a Gateway. In this situation the Client will always try to connect to all other Gateways (with a higher weight) first; however, when none are available the zero weight ones will be used. This may be useful for configuring a reserve appliance that is only used if the Client believes all other Gateways are unreachable.
NOTE
A zero weighted Gateway will almost always have some Clients connected some of the time. This might be because of temporary connection issues to higher weighted Gateways or because a Gateway (or two) might have been down during an upgrade.
Example
Two Gateways (A & B) use the primary ISP and two Gateways (C & D) use the reserve ISP.
| Gateway | Weight | % Probability of being tried first by the Client |
|---|---|---|
| A | 100 | 100/200*100 = 50% |
| B | 100 | 100/200*100 = 50% |
| C | 0 | 0/200*100 = 0% |
| D | 0 | 0/200*100 = 0% |
In this example, A and B are the primary Gateways. However, the company has reserve Gateways C and D, which will only be fully used in the event that the primary ISP fails.
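One way to picture zero weighting is as an ordered list of attempts: Gateways with a positive weight are drawn first (biased by weight), and the zero-weighted reserves come last. The sketch below assumes weighted sampling without replacement and a random order among the reserves; neither detail is confirmed by this section.

```python
import random
from typing import Dict, List


def try_order(weights: Dict[str, int], rng: random.Random) -> List[str]:
    """Return one possible order in which a Client might try the Gateways."""
    remaining = {name: w for name, w in weights.items() if w > 0}
    order: List[str] = []
    while remaining:
        names = list(remaining)
        pick = rng.choices(names, weights=[remaining[n] for n in names], k=1)[0]
        order.append(pick)
        del remaining[pick]
    # Zero-weighted Gateways are only reached after all others (assumed shuffled).
    reserves = [name for name, w in weights.items() if w == 0]
    rng.shuffle(reserves)
    return order + reserves
```

With the weights from the example, A and B always occupy the first two positions and C and D the last two, whatever the random seed.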
Priority Groups
In the event of a failure such as an ISP failure at a given location, the Client will try all the higher weighted Gateways in the Site before moving on to try the zero weighted Gateways (which might use a different ISP). This may result in a lengthy loss of connectivity for the Client if there are a number of Gateways on the Site. To reduce this downtime, it is possible to specify priority groups, where just two sequential failures within a priority group trigger the Client to move to the next priority group.
Priority groups come into existence automatically when every Gateway uses at least a four digit (>999) Client Tunneling - Load Balance Weighting Factor for a given Site. The last three digits still represent the weighting factor and work exactly as before. The preceding digits represent the priority group number. The groups use a numerical priority group order when trying to establish a connection to the Gateways.
Example
The three Gateways with weighting factor of 10100 would be considered as equal members of group 10. The three Gateways with weighting factor of 1100 would be considered as equal members of group 1. Because the group number also represents the priority, then the higher group number will always be tried first. In this example that would be the three Gateways in group 10. After unsuccessfully trying two of the three Gateways in group 10, then the next Gateway tried will be any from group 1.
| Gateway | Weight | Group | % Probability of being tried first by the Client |
|---|---|---|---|
| A | 10100 | 10 | 100/300*100 = 33% |
| B | 10100 | 10 | 100/300*100 = 33% |
| C | 10100 | 10 | 100/300*100 = 33% |
| D | 1100 | 1 | 100/300*100 = 33% only in event of 2 of A, B or C being down |
| E | 1100 | 1 | 100/300*100 = 33% only in event of 2 of A, B or C being down |
| F | 1100 | 1 | 100/300*100 = 33% only in event of 2 of A, B or C being down |
NOTE
If the three Gateways on a Site have weighting factors of 500, 2000, 2500 then priority groups would NOT be used; instead the 3/4 digit numbers would be considered as weighting factors resulting in a 10%, 40%, 50% split.
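The rule for reading a weighting factor can be sketched as follows: priority groups apply only when every factor on the Site is above 999, in which case the last three digits are the weight and the preceding digits are the group number. The function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def split_weighting(factors: Dict[str, int]) -> Dict[str, Tuple[Optional[int], int]]:
    """Map each Gateway to (priority group, weight).

    The group is None when priority groups are not in use for the Site.
    """
    use_groups = all(factor > 999 for factor in factors.values())
    result: Dict[str, Tuple[Optional[int], int]] = {}
    for name, factor in factors.items():
        if use_groups:
            result[name] = (factor // 1000, factor % 1000)
        else:
            result[name] = (None, factor)
    return result


def groups_in_try_order(factors: Dict[str, int]) -> List[int]:
    """Priority groups, highest group number first."""
    groups = {group for group, _ in split_weighting(factors).values() if group is not None}
    return sorted(groups, reverse=True)
```

For the Site in the note above (500, 2000, 2500), `split_weighting` returns no groups and the raw numbers are kept as weights.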
Gateway Failover triggered by a Monitored Action
Use the Entitlement Action to enable monitoring
When configuring an Entitlement, there is an option to Enable Monitoring. This can be done for HTTP and TCP up Action types. Monitored applications are checked for return traffic and, in the event no return traffic is received within the response timeout period, the application is marked as unhealthy. This can also be useful for detecting a more general network failure if a commonly used application is monitored. The application would report as unhealthy when, for instance, the network switch handling the inside connection from a Gateway fails.
Once connected, the Client may receive an unhealthy application report from a Gateway. The Client will first check its unhealthy report history from any recent connections to other Gateways. The Client remembers unhealthy reports for 10 minutes, after which the Gateway is assumed to be healthy again. It then chooses the most healthy Gateway to fail over to; if picking between equally healthy Gateways, then this will use the normal weighting process. A zero weighted Gateway will be used if it is the most healthy Gateway that the Client knows about.
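The 10-minute memory of unhealthy reports can be sketched as a filter over timestamps. Here "most healthy" is taken to mean "fewest unexpired unhealthy reports", which is an assumption made for illustration; the function name and the shape of `reports` are hypothetical too.

```python
from typing import Dict, List, Sequence

REPORT_TTL_SECONDS = 10 * 60  # unhealthy reports are remembered for 10 minutes


def healthiest(
    gateways: Sequence[str],
    reports: Dict[str, List[float]],
    now: float,
) -> List[str]:
    """Return the Gateways with the fewest unexpired unhealthy reports.

    reports maps a Gateway to the timestamps (in seconds) of its reports.
    """
    if not gateways:
        return []

    def recent(gateway: str) -> int:
        return sum(
            1 for stamp in reports.get(gateway, [])
            if now - stamp < REPORT_TTL_SECONDS
        )

    fewest = min(recent(gateway) for gateway in gateways)
    return [gateway for gateway in gateways if recent(gateway) == fewest]
```

If more than one Gateway is returned, the normal weighting process would pick between them.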
Site Failover to the fallback Site
Use Sites to select the fallback site
Use Policies to enable the use of the fallback Site
The Client will try to find a working Gateway on a fallback-enabled Site for a minimum period of 30 seconds. After this time the Client will switch to use the fallback Site instead. Because the fallback process is controlled by the Client, it would not matter if the affected Site was also hosting the Controllers as they have no involvement in the process.
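The Client-side timing can be sketched as a retry loop with a minimum waiting period. Everything below is illustrative: `choose_site`, the `find_gateway` callback, and the one-second `retry_interval` are hypothetical, and only the 30-second minimum comes from the text above.

```python
from typing import Callable, Optional, Tuple


def choose_site(
    site: str,
    fallback_site: str,
    find_gateway: Callable[[str], Optional[str]],
    clock: Callable[[], float],
    sleep: Callable[[float], None],
    min_wait: float = 30.0,
    retry_interval: float = 1.0,
) -> Tuple[str, Optional[str]]:
    """Keep trying the Site; switch to the fallback Site after min_wait seconds."""
    start = clock()
    while True:
        gateway = find_gateway(site)
        if gateway is not None:
            return site, gateway
        if clock() - start >= min_wait:
            # No Gateway answered for the whole minimum period.
            return fallback_site, find_gateway(fallback_site)
        sleep(retry_interval)
```

In real use `clock` and `sleep` could be `time.monotonic` and `time.sleep`; injecting them keeps the loop testable without waiting.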
This fallback mechanism is best suited to organizations still operating with a WAN linking several locations, with external access to the WAN through multiple Sites at different locations. The fallback Site could provide an alternative for all users of a given Site or it might provide a back door for system administrators to access the WAN to resolve the outage. To cover the various use cases, specific users must also have Use fallback Site enabled in their Policy. This Policy-based model ensures that the fallback Site can be sized appropriately in terms of RAM and CPU.
RAM is especially important to size appropriately for the fallback Site as in normal operation (no fallback) sessiond will still set up all the Entitlements for the users of the Site. VPNd only applies these Entitlements when the user connects. This means the fallback Site will use about 50% of any additional RAM required for the extra Entitlements, and this will jump to 100% when they are applied in the event of fallback. See Instance Sizing for more details. Failure to plan in extra RAM (and CPU) resources required will mean the fallback Site can quickly become overloaded with all the users from the failed Site(s). Also ensure that any required routes are in place on the fallback Site.