Controllers


Appgate SDP has built-in support for HA Controllers, enabling highly available, multi-master Controllers to be deployed. This section explains how Controllers should be configured and managed.

The built-in HA mechanisms use a round-robin, load-balancing algorithm running on the Client, with additional traffic redirection mechanisms on the Appliance. These redirect mechanisms allow the Controller to tell the Client to fail over to a different Controller in specific operational situations.

The Controller's internal load balancing relies on the Profile DNS name created when the first Controller is configured. This name is automatically shared across all Controllers in a Collective. For this and any other profile DNS names you use, you should create a DNS entry with multiple A records, one for each Controller.
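The multiple A record requirement can be sanity-checked from any machine with a DNS lookup tool. The following is a minimal sketch, not an Appgate tool: the function names are made up for illustration, sdp.example.com is a hypothetical Profile DNS name, and the input is assumed to be one address per line, as printed by dig +short.

```shell
#!/bin/sh
# Count the distinct IPv4 addresses in resolver output read from stdin
# (one address per line, as printed by `dig +short A <name>`).
count_a_records() {
    grep -E '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$' | sort -u | wc -l | tr -d ' '
}

# Succeed when the number of distinct A records equals the number of
# Controllers. $1 = expected number of Controllers; resolver output on stdin.
profile_dns_matches() {
    [ "$(count_a_records)" -eq "$1" ]
}

# Example with canned resolver output for a three-Controller Collective.
# Against a live (hypothetical) name you would run:
#   dig +short A sdp.example.com | profile_dns_matches 3
printf '192.0.2.10\n192.0.2.11\n192.0.2.12\n' | profile_dns_matches 3 \
    && echo "profile DNS name lists all 3 Controllers"
```

The check only counts addresses. It does not confirm that each address belongs to a Controller, so treat a pass as a necessary condition rather than proof of a correct setup.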

Determining the number of Controllers

Between one and six Controllers may be specified, but HA deployments require a minimum of two. Two Controllers provide database replication and two alternative locations for Clients to sign in, and this arrangement might work if only one of the Controllers is used for administrative access. However, the database algorithm has been optimized for three or more nodes. With just two nodes it can end up in a split-brain situation where both nodes think they are the leader. Conflicts are then hard to manage and the system is slow to reach consensus, which can have an adverse effect on DB migrations and upgrades. The preferred option is therefore to have three Controllers. Three Controllers also provide more stability during changes and when performing maintenance or upgrades: with one Controller down, there is still a working HA mechanism. The allocation of IP addresses to new users is also less degraded (66% vs 50%), allowing for longer single outages before users are affected. Five Controllers is also a good number when more are needed for capacity or geo-location reasons.
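The 66% and 50% figures follow from simple arithmetic if the IP pool is assumed to be split equally between Controllers. The sketch below uses a hypothetical helper name and integer division, so two of three Controllers rounds down to 66.

```shell
#!/bin/sh
# Percentage of the IP pool still available for new allocations when some
# Controllers are down, assuming the pool is split equally between them.
# $1 = total Controllers, $2 = Controllers that are down.
pool_available_percent() {
    echo $(( ($1 - $2) * 100 / $1 ))
}

for total in 2 3 5; do
    echo "$total Controllers, 1 down: $(pool_available_percent "$total" 1)% of the pool available"
done
```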

HA Controllers are not designed to operate in an active-passive mode, so they should not be set up with one as a backup server that is kept off-line except in an emergency. By the time it is needed, its database will be so far out of date that it is unlikely to be able to replace the failed Controller. You should therefore ensure all Controllers remain online and able to communicate with each other at all times.

Use the Appliances > Functions UI to set up Controllers

Before you start

Ensure the Client-to-appliance communications have been set up appropriately.

Ensure the appliance-to-appliance communications have been set up appropriately.

Additionally, every Controller must sit in a network that allows them to communicate out:

  • to configured Identity Providers

  • to any configured MFA Providers

  • using any scripts that have been configured

Configuring Controllers

The first appliance in a new Collective will always be the Controller. At creation, the Appliance Hostname (a FQDN should be used) and the Profile DNS name are set.

Setting or changing a controller's appliance hostname

Choose a unique hostname (FQDN) for the Appliance Hostname/IP in System Settings. This hostname may be longer than 39 bytes but must be unique within the first 39 bytes (39 characters). This hostname will be used by the other appliances within the Collective including any other Controllers.
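The 39-byte uniqueness rule is easy to check before committing to a naming scheme. This is a minimal sketch under two assumptions: the hostnames shown are hypothetical, and names are compared case-insensitively (the usual DNS convention, not something this guide states).

```shell
#!/bin/sh
# Print every 39-byte prefix shared by more than one hostname.
# Hostnames are read from stdin, one per line; no output means no clash.
clashing_prefixes() {
    tr '[:upper:]' '[:lower:]' | cut -b 1-39 | sort | uniq -d
}

# Hypothetical hostnames: the first two are identical for their first 39 bytes.
printf '%s\n' \
    'controller-emea-datacenter-frankfurt-01-a.example.com' \
    'controller-emea-datacenter-frankfurt-01-b.example.com' \
    'controller-apac.example.com' | clashing_prefixes
```

Any line printed by clashing_prefixes identifies a group of hostnames that would need renaming.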

Use a hostname rather than an IP address, for the following reasons:

  • For the correct HA operation (DNS round robin)

  • To allow the underlying IP address to be changed, as you cannot easily change the hostname

  • To avoid issues in which the Controller can't talk to itself when behind a NATing firewall

Finalize the hostnames and update all relevant DNS records. If you do not want to publish a public DNS record for these hostnames then add them to the hosts file on all the other appliances in the Collective. Only then try to add an additional Controller; remember it is not possible to change the hostname of a working HA Controller.

Making changes to the hostname on a Controller

When changing a Controller's hostname you should always consider whether that connection will be trusted, because the Certificate needs to be valid for that connection. If you have only one Controller it is only going to have to connect to itself and required changes, including the Certificate update, are handled automatically. So just change the Appliance Hostname/IP in System Settings.

When you need to change a hostname while using HA Controllers:

  • remove the Controller function from the appliance you want to change (Gateway, LogForwarder, and LogServer roles can remain).

  • ensure the appliance is reporting healthy.

  • ensure that both the old hostname and the new hostname can be resolved by all other members of the Collective (as well as by the appliance itself). If DNS cannot resolve any of them, add them to the hosts files.

  • any change you make to the Appliance Hostname/IP in System Settings will trigger automatic certificate renewal.

    • If your changes have been made to another field, such as Extra Hostnames in Certificate, then you must renew the appliance certificate manually after saving the changes.

  • once the appliance is reporting healthy (and the new certificate has been created/distributed), re-enable the Controller function on that appliance.

Adding Controllers

To add an additional Controller you must first add an appliance (or re-configure an existing appliance). The candidate Controller appliance must have already been registered with the Controller correctly. All Controllers and the candidate Controller must have a status of Healthy. If any are showing as Busy, then you should wait. If their status is Warning or Error then some remedial action should be undertaken before any changes or updates to the system are made.

Always perform a Controller backup before making configuration changes related to HA Controllers.

WARNING

Only ever add or remove one Controller at a time. Never try to add or remove Controllers until all Controllers have a status of Healthy.
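The status rules above amount to a simple gate that can be applied before any HA change. The sketch below assumes you have already collected the status of every Controller and of the candidate appliance (how you obtain them is up to you). The function name is hypothetical, and any status other than Healthy or Busy is treated as needing remediation.

```shell
#!/bin/sh
# Decide whether it is safe to add or remove a Controller.
# Arguments: the status of every Controller and of the candidate appliance.
# Prints "proceed", "wait", or "remediate".
ha_change_decision() {
    decision=proceed
    for status in "$@"; do
        case "$status" in
            Healthy)
                ;;
            Busy)
                # Busy only downgrades "proceed"; it never hides a problem.
                if [ "$decision" = proceed ]; then
                    decision=wait
                fi
                ;;
            *)
                # Warning, Error, or anything unrecognised.
                decision=remediate
                ;;
        esac
    done
    echo "$decision"
}

ha_change_decision Healthy Healthy Busy
```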

When an additional Controller is enabled from the UI of an existing Controller, it will synchronize with the existing Controller's bi-directional replication (BDR) database. The BDR database used in Appgate SDP is configured to utilize a multi-master model for replication. Multi-master requires that every Controller can see every other Controller on both ports.

Because the appliance is already activated, the first thing that happens when you enable the Controller function is a connectivity check. This checks port 443 to ensure full bidirectional connectivity exists between all existing Controllers and the candidate Controller. If the checks fail, a warning is shown and the connectivity issue must be resolved before trying again. If the checks pass, the form can be saved and the task of adding the new Controller begins.

The Controllers will then have to reach consensus over what should be written to the new Controller, at which point synchronization of the database takes place. The next check performed is for any database insert conflicts. These might already exist in the database but do not harm the day-to-day operation of the Controllers. Even so, these conflicts need to be resolved before the new Controller can be added.

Synchronization can now take place, but it should be noted that this process may take a considerable amount of time based on the size of the database, available bandwidth, and the connection latency. For large databases spanning different continents it might take up to an hour. The system should be left to complete this task without any further admin interventions.

Once complete, all Controllers will be “as equals” and share the task of being the Controller. Appgate SDP does not store status information in the database; almost everything in the system is handled in real time or by tokens. This means most of the information in the database is configuration data or static information such as users' allocated IP addresses.

Once running in HA, all Controllers perform an SPA-TCP connectivity check on port 443 once every 15 minutes with all other Controllers, and report a (Controller) error on the dashboard if a connection fails.

Using HA Controllers

IP pools

Controllers act as a type of DHCP server for the multi-tunnel adapter used in the Client. When using multiple Controllers, the defined IP Pool is distributed equally across all Controllers. When a Controller is unavailable, the IP Pool will be degraded for a period of time. This is not normally an issue, as existing IP addresses can be renewed by any of the other Controllers. However, you might want to follow the IP Pools size recommendations in order to make an allowance for this.
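To get a feel for the allowance involved, the sketch below computes a deliberately conservative pool size: it assumes equal shares per Controller and that every user must be served from the shares of the Controllers that are still up. This is an illustration only, not the documented IP Pools size recommendation, and the function name is hypothetical.

```shell
#!/bin/sh
# Smallest pool that still covers all users when some Controllers are down,
# assuming equal shares per Controller.
# $1 = users, $2 = total Controllers, $3 = Controllers that are down.
# Computes ceil(users * controllers / (controllers - down)).
min_pool_size() {
    users=$1
    controllers=$2
    down=$3
    up=$(( $controllers - $down ))
    if [ "$up" -le 0 ]; then
        echo "no Controllers left to allocate addresses" >&2
        return 1
    fi
    echo $(( ($users * $controllers + $up - 1) / $up ))
}

echo "1000 users, 3 Controllers, 1 down: $(min_pool_size 1000 3 1) addresses"
echo "1000 users, 2 Controllers, 1 down: $(min_pool_size 1000 2 1) addresses"
```

Because existing addresses can be renewed by the remaining Controllers, the real requirement is normally lower than this worst case.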

HA Administration

Managing HA Controllers is no different from managing a single Controller. Once running with HA Controllers, an admin can manage using any node in the system and the changes will be replicated automatically.

However, just one Controller should be used for administration. This is vital when using scripts that make repeated REST API calls. It also avoids any possible database conflicts when the same form is being edited at the same time on different Controllers. In the case of multiple admins committing changes, the last update will always win.

If you are using a load-balancer in front of the HA Controllers, never use the load-balancer's hostname for administration.
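Automation scripts can enforce this rule with a small guard that runs before any admin call is made. The sketch below is one possible approach. All hostnames are hypothetical, and the list of shared names (Profile DNS name, load-balancer hostname) is something you would maintain yourself.

```shell
#!/bin/sh
# Succeed only when the target host is NOT one of the shared names.
# $1 = host the script is about to use for administration;
# remaining arguments = names shared by several Controllers.
is_dedicated_admin_host() {
    target=$1
    shift
    for shared in "$@"; do
        if [ "$target" = "$shared" ]; then
            return 1
        fi
    done
    return 0
}

ADMIN_HOST=controller-a.example.com   # hypothetical: one specific Controller
if is_dedicated_admin_host "$ADMIN_HOST" sdp.example.com lb.example.com; then
    echo "using $ADMIN_HOST for all admin API calls"
else
    echo "refusing to use shared name $ADMIN_HOST" >&2
fi
```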

Removing a Controller

To remove a Controller, uncheck the Controller function and wait for the appliance to return to a Healthy state. After removing a Controller from the HA Controllers, all its data relating to the database is wiped and the p12 key is deleted.

NOTE

A functioning Controller cannot be deactivated. First you must disable the Controller function on the appliance and then wait for it to become healthy.

Maintenance and troubleshooting procedures

When adding or removing Controllers, they must have a status of Healthy. If any are showing as Busy, then you should wait. If Warning or Error is displayed, then some remedial action should be undertaken first.

Figure: The Appliances page displaying appliance statuses, highlighting one as warning and another as healthy.

Maintenance scenarios

Controllers can sometimes end up in an unhealthy state because of disk or network problems when adding a new Controller. Separate guidance on upgrading HA Controllers covers how to roll back if problems with the network or power have caused the upgrade to fail.

If you have Controllers that are experiencing issues, there is a useful command available in sdpctl that can be used to force-disable a Controller:

sdpctl appliance force-disable-controller [hostname|ID...] [flags]

This is the recommended first step in most situations. This command will disable the Controller function on the target Controller(s), and notify the remaining Controllers of this change.
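Because force-disabling a Controller is disruptive, it can help to review the exact command before running it. The wrapper below is a hypothetical convenience, not part of sdpctl: it prints the command by default and only executes it when RUN=1 is set. The hostname is an example, and no flags are shown because none are described here.

```shell
#!/bin/sh
# Print (or, with RUN=1, execute) the force-disable command for one or more
# Controllers, given as hostnames or IDs.
force_disable_controllers() {
    if [ "$#" -eq 0 ]; then
        echo "usage: force_disable_controllers <hostname|ID>..." >&2
        return 2
    fi
    if [ "${RUN:-0}" = "1" ]; then
        # Requires sdpctl to be installed and configured for the Collective.
        sdpctl appliance force-disable-controller "$@"
    else
        echo "sdpctl appliance force-disable-controller $*"
    fi
}

# Dry run against a hypothetical Controller hostname.
force_disable_controllers controller-c.example.com
```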

If the above command does not resolve the issue, there are detailed steps below for several scenarios. The scenarios outlined below have three HA Controllers: A, B, and C.

Scenario 1: A Controller fails to join

A and B are already Controllers, and you try to add Controller C from A's admin UI.

Symptoms

One of the 'add Controller' checks fails.

Remediation

  • Resolve connectivity issues and/or database insert conflicts (see Scenario 7: A warning ‘BDR conflict detected’)

  • Enable the Controller function for C and try again.

Symptoms

The Controller may fail to join later on. If the other Controllers remain stuck on Busy, wait for the BDR join to time out (15 minutes). The following errors will appear:

Controller database controller: BDR node missing: C

Controller database controller: New IP allocations are disabled

Remediation

Disable the Controller function for C using A's admin UI, then run the following command on A:

sudo cz-config bdr update-bdr-group

  • Identify and fix the reason for the failure using the logs gathered from all the Controllers, including C (the one that could not be added).

  • Enable the Controller function for C and try again.

Scenario 2: A Controller fails to leave (but other nodes think it has left)

This can happen in the unlikely scenario that the Controllers have some network latency/issues. A and B have agreed on the new state but C failed to agree.

NOTE

This situation will not create IP allocation conflicts because the Appgate SDP system knows C is not a Controller, so will not be sending any Clients there to sign in.

Symptoms

Controllers A and B are healthy on the dashboard, while Controller C shows an error on the Controller role, reporting that it is still a Controller even though the rest of the Collective no longer treats it as one.

Remediation

  • Follow the instructions in Scenario 4, "Removing a Controller while that Controller is inaccessible".

Scenario 3: A Controller is gone forever

You try to remove Controller C using Controller A’s admin UI but C is gone forever (such as when someone deletes a VM by mistake).

Symptoms

Controllers A and B will eventually time out while waiting for C. The default timeout for this is six minutes.

The dashboard will show the following Controller error:

BDR node not replicating: C

Remediation

  • Disable the Controller function on Controller C, then run the following commands on Controller A:

    sudo cz-config bdr remove-node-record C

    sudo cz-config bdr update-bdr-group

  • Deactivate or delete the appliance record for the now disabled Controller C.

  • Create a new spare appliance (to become the replacement Controller) and ensure it has registered correctly.

  • Enable the Controller function on the new spare appliance and allow time for it to synchronize.

Scenario 4: Removing a Controller while that Controller is inaccessible

You try to remove Controller C using Controller A’s admin UI but C is not accessible (maybe someone changed a firewall rule by mistake).

Symptoms

After waiting for six minutes the barrier will time out and the following errors will appear:

Controller database controller: Unexpected BDR node: C

Controller database controller: New IP allocations are disabled

Remediation

  • To make sure that Controller C is no longer servicing Clients, on Controller C run:

    sudo cz-config bdr force-appliance-ready

  • To remove the extraneous BDR node record, on Controller A run the following command:

    sudo cz-config bdr remove-node-record C

Now, only this warning should remain:

Controller database controller: New IP allocations are disabled

  • Run the following command on the remaining Controllers (A and B):

    sudo cz-config bdr repartition-ip-allocations

The node should now have been removed successfully.
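The order of the Scenario 4 commands, and the Controller each one runs on, matters. As a checklist aid, the sketch below prints the steps in sequence. The function is hypothetical and only prints text. The argument given to remove-node-record is shown exactly as in the steps above (C), so substitute whatever node identifier your Collective uses.

```shell
#!/bin/sh
# Print the Scenario 4 steps in order, labelled with the Controller to run
# each one on.
# $1 = inaccessible Controller being removed;
# remaining arguments = Controllers that stay (the first runs remove-node-record).
scenario4_plan() {
    removed=$1
    shift
    first=$1
    echo "on $removed: sudo cz-config bdr force-appliance-ready"
    echo "on $first: sudo cz-config bdr remove-node-record $removed"
    for keep in "$@"; do
        echo "on $keep: sudo cz-config bdr repartition-ip-allocations"
    done
}

scenario4_plan C A B
```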

Scenario 5: A warning ‘New IP allocations are disabled’

In some situations HA Controllers may not recover full operation after maintenance. If a Controller is completely non-operational, refer instead to "Removing a Controller while that Controller is inaccessible" or "A Controller is gone forever".

Symptoms

Controller C is showing "New IP allocations are disabled"

Remediation

  • Run the following command on Controllers A, B, and C to safely re-enable IP allocations:

    sudo cz-config bdr repartition-ip-allocations

Scenario 6: A Controller remains in 'maintenance mode' after an upgrade

If a Controller is left in maintenance mode after a Collective has been upgraded, there may have been a connectivity check failure between the Controllers at some point during the upgrade. If the other Controllers have upgraded successfully, then in most situations it is safe to manually take the remaining Controller out of maintenance mode.

Symptoms

Controller C is showing "Controller is running in maintenance mode".

Remediation

  • Run the following command on Controller C to re-enable the Controller:

    cz-config controller disable-maintenance

Scenario 7: A warning ‘BDR conflict detected’

BDR conflicts become more likely when administrators use the admin UI on multiple Controllers. We therefore recommend using only one Controller as the point of administration.

Symptoms

Error message indicating a BDR conflict in the controller database authorization.

Remediation

  • Run the following command on all Controllers showing the warning and follow any instructions:

    sudo cz-config bdr resolve-conflicts

For example, if the conflict is about an Entitlement, you might need to delete the entries in the entitlement_condition and action tables that reference that entitlement_id.

For advanced usage only, the full set of cz-config bdr commands is detailed in Appliance Troubleshooting.
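As a quick index, the dashboard messages quoted in the scenarios above can be mapped back to the scenario that covers them. The sketch below is a hypothetical triage helper that matches on the message fragments quoted in this section. Scenario 2 is not included because no message text is quoted for it.

```shell
#!/bin/sh
# Map a dashboard message to the number of the scenario that covers it.
# Specific BDR messages are tested before the generic IP-allocation warning,
# which also accompanies Scenarios 1 and 4, so that pasted text containing
# several messages resolves to the more specific scenario.
scenario_for_message() {
    case "$1" in
        *"BDR node missing"*)                echo 1 ;;
        *"BDR node not replicating"*)        echo 3 ;;
        *"Unexpected BDR node"*)             echo 4 ;;
        *"maintenance mode"*)                echo 6 ;;
        *"BDR conflict detected"*)           echo 7 ;;
        *"New IP allocations are disabled"*) echo 5 ;;
        *)                                   echo unknown ;;
    esac
}

scenario_for_message "Controller database controller: Unexpected BDR node: C"
```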