Instance sizing

Introduction

Below are suggestions for the specifications/configurations of Appgate SDP appliances when running them in the cloud or on virtual hosts. Appliance functions can all be enabled independently, and specific combinations of functions are allowed to co-habit on a single appliance, so allow resources for all the functions that have been enabled.

Consideration should also be given to failover events. The most important requirement is for Gateways to have sufficient resources to handle the additional load that extra users may impose: additional vCPU for the packet shuffling and event handling, and additional RAM for the users and the event queue. As a minimum, the design should allow for a single Gateway outage on a Site, which means each Gateway should be able to handle its own load plus the load of one failed Gateway.

Memory size (RAM)

Each appliance has built-in checks that compare the available memory against the roles selected to see if there is enough. Where there is insufficient memory, a dashboard admin message will be displayed: an Info message if the available memory is >85% of the values below, or an Error message if it is <85%.

For test and non-production environments only

These are the MINIMUM memory requirements you should specify for the function to operate. Do not use these for production environments as they will need to be increased considerably to operate reliably.

Minimum memory requirements

| Function | Standalone | Add LogForwarder (+0GB) | Add Gateway (+1GB) | Add LogServer (+2GB) |
|---|---|---|---|---|
| Controller | 2GB | Y | Y | Y |
| Gateway | 2GB | Y | - | Y |
| LogForwarder | 1GB | - | Y | - |
| LogServer | 4GB | - | Y | - |
| Metrics Aggregator | 1GB | Y | Y | - |
| Connector | 1GB | Y | - | - |
| Portal | 4GB | - | - | - |

For production environments

Appliance

Appliances automatically allocate RAM for the appliance operation (daemons, OS, etc.) based on the size of the machine. A small machine (4vCPU/8GB) might allocate about 2GB for the appliance operation; a large machine (32vCPU/64GB) might allocate about 6GB. This increase is based on two parameters - installed RAM and the number of vCPU - and is most pronounced for Controllers and Gateways, which are designed to take advantage of multi-threading and use Java, which is memory-hungry. This base requirement is irrespective of the functions enabled on the appliance, which will normally require additional RAM to be allocated.

Controller

Once operational, the Controller is transactional, so it only uses memory while issuing tokens; since Controllers only handle tens of transactions simultaneously, memory usage will never grow significantly. For production systems allow 4GB-6GB for the Controller function depending on the size of the user base. So for a small Controller, a total of 6GB (2GB appliance + 4GB function) should be specified (the minimum), and for larger machines supporting a larger user base, up to 12GB (6GB appliance + 6GB function).

Gateway

This analysis for Gateways was performed on v6.3. Earlier versions (especially prior to v6.2.2) do not contain all the optimizations present in the latest version and can consume very considerably more RAM than the figures shown here.

Apart from the appliance's RAM requirement, a Gateway's RAM usage is made up of two additional elements: user rules (sessiond and vpnd) and event handling (sessiond).

User rules

RAM usage is mainly defined by:

  • The number of users. As has already been explained, the Appgate SDP Gateway uses one Micro-firewall service per user which uses some memory.

  • The number of rules, especially active ones, in each micro-firewall instance.

For smaller deployments

Gateways are very efficient users of RAM - for smaller deployments (below 1000 users and 1000 rules), as a rule of thumb allow 10KB for every user-rule added. So a 4vCPU 8GB machine, with 4GB allowed for rules, should be able to support 1000 users, each with 400 rules.

For larger deployments

For larger deployments there is increasing efficiency and with many users each with many rules the rule of thumb needs adapting; at 1000 users and 1000 rules, allow 5KB for every user-rule. As the numbers of users and rules increases this can be reduced further to as little as 2KB per user-rule. So a 16vCPU 32GB machine, with 24GB allowed for rules, should be able to support 6000 users, each with 2000 rules.
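As a sketch only, the rules of thumb above can be expressed as a small estimator. The exact thresholds (and the cut-over point from 5KB to 2KB) are assumptions interpolated from this guidance, not product-defined values:

```python
def kb_per_user_rule(users: int, rules_per_user: int) -> int:
    """Rule-of-thumb RAM allowance per user-rule, in KB.

    Thresholds are assumptions interpolated from the text: 10KB for small
    deployments, 5KB around 1000 users/1000 rules, 2KB for the largest.
    """
    if users <= 1000 and rules_per_user < 1000:
        return 10
    if users * rules_per_user >= 4_000_000:  # assumed cut-over point
        return 2
    return 5


def rule_ram_gb(users: int, rules_per_user: int) -> float:
    """Estimated RAM (GB) needed for user rules on a Gateway."""
    return users * rules_per_user * kb_per_user_rule(users, rules_per_user) / 1e6
```

Both worked examples above check out: 1000 users x 400 rules at 10KB comes to 4GB, and 6000 users x 2000 rules at 2KB comes to 24GB.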

Summary table:

Per-user-rule allowance by deployment size (users from 500 to 8000 against Entitlements from 250 to 8000): 10KB per user-rule for the smallest combinations, 5KB per user-rule around 1000 users/1000 rules, and 2KB per user-rule for the largest combinations.

Sessiond uses RAM for both user rules and event handling, and its overall RAM usage is capped by the maximum Java heap size allowed (37.5% of the RAM). This means there is a relationship between rules and events: having a very large number of rules will limit the Gateway's event-handling capability, and vice versa.

Event handling (and the event queue)

The Appgate SDP system is specifically designed to respond to changes in resolved names (such as when an auto-scaled instance is added). However, it is not designed to handle large numbers of frequently changing resolved names (such as when a hostname (wrongly) points to a round-robin DNS load balancer and returns a different result every time). The Gateway runs a dynamic queuing system to handle all these changes on a per-user basis (because they cannot all be done at once). If the system has 500 users, each with 20 actions that change, then the queue is immediately 10,000 entries long. Larger Gateways will have larger queues and faster Gateways will process the queue in less time, but care still needs to be taken to ensure the host definitions you choose resolve to (relatively) consistent values that are appropriate for the scale of your use case.

The table below shows how the maximum event queue adapts based on the total memory installed. The number of worker threads (processing the event queue) also increases with the amount of memory (to handle the bigger event queue). Having spare vCPU available for the worker threads will help keep the queue size down and will therefore limit memory usage.

| Memory (GB) | Max event queue | Min queue worker threads | Max queue worker threads |
|---|---|---|---|
| 0-3 | 15k | 50 | 250 |
| 4-7 | 30k | 100 | 500 |
| 8-15 | 50k | 250 | 1000 |
| 16-31 | 100k | 500 | 1000 |
| >32 | 100k | 1000 | 2000 |
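The table can be transcribed into a lookup for capacity-planning scripts. This is a sketch; the reading of ">32" as "32GB and above" is our assumption about the boundary:

```python
# Event-queue limits by installed RAM, transcribed from the table above.
EVENT_QUEUE_LIMITS = [
    # (min GB, max GB, max event queue, min workers, max workers)
    (0,  3,    15_000,  50,   250),
    (4,  7,    30_000,  100,  500),
    (8,  15,   50_000,  250,  1000),
    (16, 31,   100_000, 500,  1000),
    (32, None, 100_000, 1000, 2000),   # ">32" treated as 32GB and above
]

def queue_limits(ram_gb: int):
    """Return (max_queue, min_workers, max_workers) for a given RAM size."""
    for lo, hi, queue, wmin, wmax in EVENT_QUEUE_LIMITS:
        if ram_gb >= lo and (hi is None or ram_gb <= hi):
            return queue, wmin, wmax
```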

It is very hard to estimate the event handling requirements of a Gateway. There will be very few events in a system where users just sign in and access static (protected) hosts with no additional Conditions/evaluations. However there can be very many events if Conditions are re-evaluated frequently or if there are dynamic (protected) hosts.

Significant amounts of additional RAM may be consumed by sessiond for event handling, so you should allow between 0.5MB and 2MB per user depending on the nature of the environment. Then carefully monitor the event queue and Java heap usage metrics once the system is in production use. To protect the Gateways from overload, if the Java heap memory reaches 85% of the maximum, new sign-in attempts to the Gateway will be blocked.
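The two fixed percentages in this section - the 37.5% heap cap and the 85% sign-in guard - combine into a simple back-of-envelope check (a sketch of the stated figures, not product code):

```python
def sessiond_max_heap_gb(installed_ram_gb: float) -> float:
    # sessiond's maximum Java heap is 37.5% of installed RAM (per the text)
    return 0.375 * installed_ram_gb

def signin_block_threshold_gb(installed_ram_gb: float) -> float:
    # New sign-ins are blocked once heap usage reaches 85% of the max heap
    return 0.85 * sessiond_max_heap_gb(installed_ram_gb)
```

For example, a 32GB Gateway gives sessiond a 12GB maximum heap, with sign-ins blocked once heap usage passes about 10.2GB.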

Summary table:

In this summary table we have made the assumption that systems with more users and rules will probably also have more event complexity. So whilst we suggest an initial value of 1MB per user, this table might provide some more guidance.

Per-user event-handling allowance by deployment size (users from 500 to 8000 against Entitlements from 250 to 8000): 0.5MB per user for the smallest combinations, 1MB per user for mid-sized combinations, and 2MB per user for the largest.

Overall RAM requirement

Given the number of variables at play here, it is not possible to produce 'the answer'. For instance, measuring the RAM requirements for a very small deployment on a very big appliance is pointless as the appliance usage will dwarf the Gateway's RAM usage. And the RAM usage of a system with a large static ruleset will be very different from one which has a very dynamic ruleset.

 

For the Ax-G5 appliance (16GB RAM) the rules of thumb might translate to these user - rule combinations:

 

Suitability grid (OK / Risky / Not OK) for combinations of users (500-8000) and Entitlements (250-8000) on the 16GB appliance.

If you look at the 1000 user 2000 Entitlement scenario then the appliance might use 3GB, the rules 10GB, and events 1GB. Since the appliance has 16GB then this scenario should be OK.
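The worked example can be reproduced with a simple estimator that sums the three RAM components (appliance, rules, events). The default per-user-rule and per-user figures are the mid-range rules of thumb from the earlier sections, not fixed values:

```python
def gateway_ram_estimate_gb(appliance_gb: float, users: int, rules_per_user: int,
                            kb_per_user_rule: float = 5.0,
                            mb_per_user: float = 1.0) -> float:
    """Sum the three RAM components from the worked example above.

    kb_per_user_rule and mb_per_user follow the earlier rules of thumb;
    pick values appropriate to your deployment size.
    """
    rules_gb = users * rules_per_user * kb_per_user_rule / 1e6
    events_gb = users * mb_per_user / 1e3
    return appliance_gb + rules_gb + events_gb

# Worked example: 1000 users x 2000 Entitlements on an Ax-G5 (16GB)
needed = gateway_ram_estimate_gb(appliance_gb=3, users=1000, rules_per_user=2000)
fits = needed <= 16  # appliance 3GB + rules 10GB + events 1GB = 14GB -> OK
```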

 

For the Ax-H4 appliance (64GB RAM) the rules of thumb might translate to these user - rule combinations:

 

Suitability grid (OK / Not OK) for combinations of users (1000-8000) and Entitlements (1000-16000) on the 64GB appliance.

If you are specifying your own VM for a Gateway, then for deployments with 1000+ rules per user the Gateways can start to use significant amounts of RAM. Use the chart below to get an idea of how many rules / users can be supported on each of the standard increments of RAM size available.

Bar graph showing suggested RAM specifications based on user and rule counts.

For example: If you plan on having 2,000 users per Gateway, each with about 4,000 rules (individual Entitlements) then you should specify a 32GB instance.

Fallback Site

The fallback process is controlled by the Client, which after a minimum period of 30s will decide a Site is unresponsive and switch to the fallback Site. To make this as seamless as possible, sessiond on the fallback Site already holds a copy of the user rules. This should be taken into account when sizing - perhaps by adding half the amount allowed for user rules.

Failover

In the event of a failover (such as during upgrades) there may be additional load applied to a Gateway, so an allowance should be made for this. Again, this is entirely dependent on the design of the Collective and the number of Gateways in the Site. See the introduction above.

Connector

The Connector will use more RAM as the number of Resource group Clients configured increases. One Client is assumed in the pre-prod table above. Allow 2GB for the appliance + 25MB for every Client configured (an extra 1GB would allow 40 Clients). A maximum of 60 Clients can be used on one Connector.
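As a sketch, the Connector sizing rule translates to:

```python
def connector_ram_gb(clients: int) -> float:
    """Connector RAM: 2GB for the appliance + 25MB per configured Client."""
    if clients > 60:
        raise ValueError("A maximum of 60 Clients can be used on one Connector")
    return 2 + clients * 25 / 1000
```

So 40 Clients need 3GB in total, matching the "extra 1GB" example above.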

LogForwarder

The LogForwarder will use more memory if configured in Performance mode, because logs are saved to RAM rather than to disk. Allow 2GB for the appliance + 250MB for Performance mode.

Metrics Aggregator

The Metrics Aggregator only runs core services (which run on all appliance types). Allow 2GB for the appliance.

LogServer

The LogServer can use all the memory available to it. For production use it should have at least 16GB in total. If it is deployed in medium-scale environments, or if you plan to use it extensively to run complex reports, then 32GB in total should be specified. LogServers must also stay within the production-use operational parameters.

Portal

The Portal can use a lot of memory. Allow 2GB for the appliance + 2GB per 25 users for production use.
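As a sketch of the Portal rule (rounding up to the next block of 25 users is our assumption, not stated in the text):

```python
import math

def portal_ram_gb(users: int) -> int:
    """Portal RAM: 2GB for the appliance + 2GB per block of 25 users."""
    return 2 + 2 * math.ceil(users / 25)
```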

NOTE

Always ensure you specify sufficient memory. The system does NOT use disk caching when short of memory, so when you run out of memory the system will behave unpredictably.

Disk size and type

Size

The base appliance itself does not require a very large disk. The image & state sectors occupy 5GB for the primary partition and the same again for the secondary one. A maximum of 4GB is used for local logs.

  • The Controller requires additional space for its database - typically this is not that large, although larger systems might push towards 10GB. During Controller upgrades there may be one or more image downloads, 2 partitions, and 2 or more running instances of the Controller (database), so the base numbers can get multiplied up quite quickly. We therefore recommend 100GB of disk space for production systems so that, in the event of problems, there is some headroom before running out of disk space.

  • The Gateway, LogForwarder, Metrics Aggregator, Portal and Connector require no additional space as they have nothing to save except some temporary log records, so 20GB-30GB will suffice.

  • The LogServer may require significant space for the OpenSearch database. The minimum recommended disk size is 128GB. LogServers must also stay within the production-use operational parameters.

Increasing disk size

For all supported cloud platforms there is a process (lvmresizer) that will automatically increase the size of the data volume. Simply shut down the appliance, re-size its disk, and re-start it. Any new un-partitioned space at the end of the existing volume will be added to the data volume. It is also possible to add more disk space to an existing appliance from the command line; refer to Increasing appliance disk size.

Type

The appliances make frequent small writes to disk; however, the total amount of data being written at any one time is small. A SATA interface offers perfectly adequate performance for an Appgate SDP appliance, so specifying a more expensive SAS interface will offer no benefit. The IOPS numbers for a large and busy Appgate SDP appliance far exceed the performance of a conventional disk. If Guaranteed Logging is enabled on a large & busy appliance, a significant system slow-down will be encountered once the disk IOPS limit is reached.

Controllers and LogServers can get very busy in large Collectives, especially when scripting and API calls are in use and/or during upgrades, so SSD-type disks should always be specified. For AWS, EBS-optimized instances should be used. Caching has also been shown to be very beneficial, so it should be specified and enabled where possible. A faster bus speed has also been shown to help considerably.

I/O

Controllers only handle tokens (at a modest rate), so a 1Gb/s NIC is all you need. For all appliance functions it is worth considering the audit log traffic, as this can be generated in quite some volume. Consider what happens when things are not working correctly: you might have users failing and re-trying, increasing the number of log messages considerably.

I/O matters to Gateways more than to any other appliance function. With sufficient vCPU a Gateway will be constrained by its I/O, so make sure you specify what you need. When running virtualized Gateways, always configure the host to use SR-IOV - this reduces CPU load and increases throughput.

Instance sizing (CPU)

Virtualized Controllers only begin to achieve acceptable performance once they have sufficient CPU to prevent frequent context switching. 2-4vCPU instances can be overwhelmed by inbound connections in busy Collectives; 8-12vCPU is the sweet spot; 16vCPU instances will not suffer context switching but are then limited by the speed/latency of inbound connections and the architecture of the Controller itself.

Complex Policy/Entitlement set-ups are not helped by adding more vCPU as they are limited by other factors such as disk performance. In fact, for very complex set-ups (10k Policies) there is little difference between 8vCPU and 16vCPU. In situations where there is high concurrency (of users), then specifying larger machines in the range 16-20vCPU will provide some benefit.

Analysis has shown that the typical performance of a correctly specified cloud Controller is about 4-6 users/vCPU/sec, a low-latency virtualized Controller about 8-10 users/vCPU/sec, and a well-specified appliance about 12-14 users/vCPU/sec. This does not take into account other external factors such as communicating with IdPs and running assignment criteria and user claim scripts.

Virtualized Gateways use vCPUs much less efficiently (about half as efficiently) than hardware appliances to achieve a given throughput. Allow about 1vCPU for every 1Gb/s of throughput; 10Gb/s is the maximum you can expect from one virtualized Gateway (using 8+ vCPU at near 100%). For >10Gb/s use multiple Gateways, but never more than 4 Gateways on one virtual host, as the mono-culture starves the host of certain key resources. For event handling, assume a virtualized Gateway can manage about 6 events/vCPU/sec - allow about 1-2vCPU per 1000 users for this.
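These rules of thumb can be sketched as a vCPU estimator for a virtualized Gateway; the 1.5 vCPU-per-1000-users default is an assumed midpoint of the 1-2 range given above:

```python
def gateway_vcpu_estimate(throughput_gbps: float, users: int,
                          vcpu_per_1000_users: float = 1.5) -> float:
    """~1 vCPU per 1Gb/s of throughput, plus 1-2 vCPU per 1000 users
    for event handling (1.5 is an assumed midpoint)."""
    if throughput_gbps > 10:
        raise ValueError("10Gb/s is the maximum for one virtualized Gateway")
    return throughput_gbps + users / 1000 * vcpu_per_1000_users
```

For instance, 4Gb/s and 4000 users works out to about 10 vCPU.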

Graph comparing Ax-G5 and KVM performance across varying active vCPUs and throughput.

The LogServers must also stay within the Production use operational parameters.

The Portal is not included in the table below since it is assumed it will not be run in the cloud. If specifying it on a virtual host, start with 2vCPU and add 1vCPU per 25 users.

The LogForwarder/Metrics Aggregator is not included in the table below since it is a very light application and can be added to any appliance without changing the specification.

The Connector is not included in the table below since it is assumed it will not be run in the cloud. If specifying it on a virtual host, start with 2vCPU. If more throughput is required (>1Gb/s) or if >10 Clients have been specified, add another 2vCPU.

The table below is provided more for CPU sizing than for RAM sizing - the table assumes each user has less than 100 rules (individual Entitlements). See above for more information about RAM usage.

| REF | Controller (sign-in/min ~) | Gateway (MAX users@speed) | Controller + Gateway (sign-in/min ~ / max users@speed) | LogServer (usage) | Portal (users) | Azure type [core/mem] | AWS type [core/mem] | GCP type [core/mem] |
|---|---|---|---|---|---|---|---|---|
| A | Testing | Testing | - | - | - | Standard_A1_v2 [1/2] | t2.small [1/2] | e2-small [2/2] |
| B | 200 | 500 | Testing | Testing | Testing | Standard_B2s / Standard_A2_v2 [2/4] | t2/t3(a).medium [2/4] | e2-medium [2/4] |
| C | 400 | 500@1Gb/s^ | 200 / 250@1Gb/s^ | - | - | Standard_F2s_v2 [2/4] | c5/c6i.large [2/4] | c2(d)-highcpu-2 [2/4] * |
| D | 1000 | 2000@1-2Gb/s^ | 500 / 1000@1-2Gb/s^ | - | 75 | Standard_F4s_v2 [4/8] | c6i.xlarge [4/8] | c2(d)-highcpu-4 [4/8] * |
| E | - | - | - | Production | - | Standard_D4s_v3/v4 [4/16] | m5/m6i.xlarge [4/16] | n2(d)-standard-4 [4/16] |
| F | >2000 | 4000@2-4Gb/s^ | 1000 / 2000@2-4Gb/s^ | Production | 150 | Standard_F8s_v2 [8/16] | c5/c6in.2xlarge [8/16] | c2(d)-highcpu-8 [8/16] * |
| G | >4000 | 6000@6-8Gb/s^ | 2000 / 3000@6-8Gb/s^ | Production | 300 | Standard_F16s_v2 [16/32] | c5/c6in.4xlarge [16/32] | c2(d)-highcpu-16 [16/32] * |

This table provides a guide to typical use cases and assumes AES-NI is in use. To find the CPU type, run: cat /proc/cpuinfo. Intel CPU family 3 and newer have AES-NI. The Dashboard will show a warning when a CPU is detected that does not support AES-NI.

Example: the Azure F8s_v2 has 8vCPU - 4 to allow for the 4000 users and 4 for the 4Gb/s throughput requirement. This would allow for something like 20 events/sec (1 event per user roughly every 3 minutes), so if an IP address change affected all 4000 users, it could take about 3 minutes to process.
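The arithmetic behind this example (one event per user, drained at the Gateway's event rate) can be checked as:

```python
def backlog_drain_minutes(users: int, events_per_sec: float) -> float:
    """Time to process one event per user at a given event-handling rate."""
    return users / events_per_sec / 60

# The F8s_v2 example: 4000 users at ~20 events/sec -> roughly 3.3 minutes
minutes = backlog_drain_minutes(4000, 20)
```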

* if using 'standard' it will have more RAM.
^ the CPU should be able to sustain these speeds however the actual network speed could be more or less than that stated on some instance types.
~ includes Token renewals as well as initial sign-in

For details of how to install Appgate SDP into these environments please refer to the documentation in the respective marketplaces.