System monitoring and logs

As with most other enterprise access solutions, Appgate SDP is likely to form a critical part of a company’s network infrastructure. As such, monitoring tools and audit logs should play a vital part in any deployment.

Example Grafana dashboard for Controllers

What to monitor

System monitoring (and logs) should not just be about the three headlines of CPU, RAM and disk. They should be used to confirm three much broader points:

Infrastructure is operating as intended

The Appgate SDP Collective can be a complex beast spread across many locations and networks. It can include multiple redundant instances which, because of the core design principle, can operate with a high degree of autonomy. A Site with only three operational Gateways out of five will look fine until you run out of available capacity.

Key here is to ensure that the whole system is operational in the way you expected. If you have multiple Controllers, are they all active? If you have a primary and a backup system – are they both available? If you are exporting audit logs – are they actually being exported or just filling the disk?
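The "three out of five Gateways" situation can be caught by comparing the number of operational appliances against the number each Site was designed with. The following is a minimal sketch of that comparison. The SiteRedundancy type, the Site names and the counts are hypothetical, and how the operational count is obtained (metrics, API or inventory) is left open.

```python
from dataclasses import dataclass


@dataclass
class SiteRedundancy:
    """Designed versus observed appliance count for one Site (hypothetical type)."""
    site: str
    expected: int      # appliances designed into the Site
    operational: int   # appliances currently reporting as operational

    @property
    def missing(self) -> int:
        # Never report a negative shortfall if more appliances than expected are up.
        return max(self.expected - self.operational, 0)


def degraded_sites(sites: list[SiteRedundancy]) -> list[SiteRedundancy]:
    """Return the Sites running with fewer appliances than designed."""
    return [s for s in sites if s.operational < s.expected]


if __name__ == "__main__":
    # Illustrative values only.
    sites = [
        SiteRedundancy("site-a", expected=5, operational=3),
        SiteRedundancy("site-b", expected=2, operational=2),
    ]
    for s in degraded_sites(sites):
        print(f"{s.site}: {s.operational}/{s.expected} operational, {s.missing} missing")
```

A Site reported here is still serving users, which is exactly why it is worth alerting on before capacity runs out.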

It may not be possible to answer all your infrastructure questions from Appgate SDP logs/metrics alone, but there is a surprising amount you can do. For the rest you may need to gather additional information from other related systems to get a complete picture of the infrastructure.

Functions are performing to the required specification

Functions refer to the seven different functions that you can enable on an Appliance: Controller, Gateway, Connector, Portal, LogForwarder, LogServer, and Metrics Aggregator.

It is important to show some metrics that confirm the function is doing what is intended. If something is meant to be forwarding logs – then how many is it forwarding? You might have designed for a maximum of 1000 log records per minute – but what is the commissioned system actually doing?
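As a rough illustration of comparing actual throughput with the design figure, the sketch below turns two readings of a cumulative counter into a per-minute rate and expresses it as a fraction of the designed maximum. The function names are hypothetical, the readings are made up, and the restart handling is an assumption about how such a counter behaves.

```python
def records_per_minute(count_start: float, count_end: float, seconds: float) -> float:
    """Average records per minute between two readings of a cumulative counter."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    if count_end < count_start:
        # Assumption: the counter restarted from zero, so the end value is the increase.
        delta = count_end
    else:
        delta = count_end - count_start
    return delta * 60.0 / seconds


def utilisation(rate_per_minute: float, design_max_per_minute: float = 1000.0) -> float:
    """Fraction of the designed maximum in use (1.0 means at the design limit)."""
    return rate_per_minute / design_max_per_minute


if __name__ == "__main__":
    rate = records_per_minute(10_000, 10_450, seconds=60)
    print(f"{rate:.0f} records/min, {utilisation(rate):.0%} of design maximum")
```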

As well as getting a sense of well-being from seeing the function doing its job, the metrics can inform us about how well the job is being done. You might have two Controllers and see that the Ax-H3 appliance can sign in a user in an average of 200 ms, whereas the reserve machine (which is a VM) takes 600 ms for the same thing. This sort of (historical) information can be very useful for planning capacity or making hardware investments.

Given that most systems will be operating in an HA configuration, clear graphical metrics which show that appliances are in good balance with each other are another important consideration.
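Balance between appliances is usually judged by eye on a graph, but the same idea can be expressed numerically. The sketch below flags appliances whose value for some metric deviates from the median of their peers by more than a tolerance. The function name, the 25% tolerance and the sample figures are hypothetical, and the metric is assumed to be non-negative (for example a session count).

```python
from statistics import median


def imbalance(values: dict[str, float], tolerance: float = 0.25) -> dict[str, float]:
    """Return appliances that deviate from the peer median by more than `tolerance`.

    The result maps appliance name to its relative deviation
    (-0.5 means 50% below the median of the group).
    """
    if not values:
        return {}
    mid = median(values.values())
    if mid <= 0:
        # No meaningful baseline to compare against.
        return {}
    return {
        name: (v - mid) / mid
        for name, v in values.items()
        if abs(v - mid) / mid > tolerance
    }


if __name__ == "__main__":
    # Illustrative session counts for three Gateways in one Site.
    print(imbalance({"gw1": 400, "gw2": 410, "gw3": 90}))
```

The median is used rather than the mean so that a single badly unbalanced appliance does not drag the baseline towards itself. With only two appliances the check can say that they differ, but not which one is the outlier.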

When thinking about a function’s ‘ability to perform’ think about things which might prevent this happening. DoS is an obvious one to consider – certain metrics can be used to suggest a DoS attack is beginning on a specific appliance.

Also, many of the metrics are provided so that it is easy to spot sudden changes, differences between appliances or trends over time. The absolute values, although interesting, are there more for completeness than as an exact measure. For example, there are a number of metrics relating to sign-in times on the Controller, several of which may be running concurrently, so adding them up in the hope of getting a total is somewhat futile!
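Because relative change matters more than the absolute value, a simple check is to compare the latest reading with a baseline taken from recent history. The sketch below is one way to do that, assuming a positive-valued metric such as a sign-in time. The function name and the factor of 2 are hypothetical choices, not product defaults.

```python
from statistics import median


def sudden_change(history: list[float], latest: float, factor: float = 2.0) -> bool:
    """True when `latest` is more than `factor` times the recent median,
    or less than 1/factor of it."""
    if not history:
        return False  # nothing to compare against yet
    baseline = median(history)
    if baseline <= 0:
        # Baseline of zero: any activity at all counts as a change.
        return latest > 0
    ratio = latest / baseline
    return ratio > factor or ratio < 1.0 / factor


if __name__ == "__main__":
    recent_ms = [200.0, 210.0, 190.0, 205.0]  # illustrative sign-in times
    print(sudden_change(recent_ms, 600.0))
    print(sudden_change(recent_ms, 220.0))
```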

Operation does not impinge on any physical limits

Without a clear understanding of the inner workings of the Appgate SDP system it can be hard to know what to monitor and which audit logs matter. Many customers have opted for the obvious metrics they always use on all their systems, such as CPU, RAM and disk. It does no harm to measure this low-hanging fruit, but the likely outcome is that it reveals either design features, such as the use of Java, which allocates (a lot of) heap space but does not usually give it back because heap resizing is an expensive operation; or design faults, such as a file which is not being cleaned up correctly and consumes an increasing amount of disk space.

Controllers, for instance, should not normally consume any significant RAM to perform their function. If one does, then there is likely to be some backlog forming, such as long-running scripts. A Controller can operate at near 100% CPU but will get quite unhappy when it runs short of disk space!

Portal supports only a stated maximum number of users (Client sessions), so monitoring the number of active users is very important; but measuring RAM might be misleading as the product is designed to operate based on the amount of RAM available and will always use a minimum of 25%.

Do not forget about limits imposed by others. The system relies on things like DNS. The DNS servers, or any firewalls between the Appgate SDP appliances and the DNS servers, might have some sort of per-second limit. Exceeding this might result in requests being dropped and a user’s access being denied. So, monitoring DNS requests/sec might be very relevant in some deployments.

Monitoring

It is important to make plans to monitor the Appgate SDP system continuously and, where appropriate, trigger alerts and take any corrective actions before user access is impacted. Additional monitoring does not need to be there from the outset, as the Collective performs about 50 Site, appliance, and functional healthchecks anyway. When the system is unhealthy, this is shown in the dashboard in either the Sites widget or the appliances widget, from where you can get to the appliance health details:

The Appliance Health Details window of the admin UI.

This provides only a very short-term snapshot of the Collective's health, so as you scale up the number of users it is important to have longer-term monitoring in place to get a clear understanding of how the Collective is performing.

The Appgate SDP system has two different, but related, means of providing metrics that can be used for monitoring the system.

  • SNMP is available via the Appgate MIB.

  • Prometheus metrics, which are equivalent to those in the Appgate MIB. This is what we recommend for enterprise usage.

The built-in healthcheck status is also available in metrics, so this makes a great starting point. The apn_function_status and apn_status metrics echo the status you see on the dashboard (0=healthy, 1=busy, etc).
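As a starting point, the two status metrics can be picked out of a Prometheus text exposition and anything that is not 0 (healthy) reported. The sketch below uses a deliberately simple line parser rather than a full implementation of the exposition format. The label names (appliance, function) and the sample text are made up for illustration and may not match what an appliance actually emits; only the metric names and the meaning of 0 come from the text above.

```python
import re

# Simplified patterns for "name{labels} value [timestamp]" sample lines.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

STATUS_METRICS = {"apn_status", "apn_function_status"}


def non_healthy(exposition: str) -> list[tuple[str, dict[str, str], float]]:
    """Return (metric, labels, value) for every status sample whose value is not 0."""
    findings = []
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = _SAMPLE.match(line)
        if not m or m.group(1) not in STATUS_METRICS:
            continue
        labels = dict(_LABEL.findall(m.group(2) or ""))
        value = float(m.group(3))
        if value != 0:
            findings.append((m.group(1), labels, value))
    return findings


# Hypothetical sample text; label names are assumptions.
SAMPLE_TEXT = """\
# TYPE apn_status gauge
apn_status{appliance="controller-1"} 0
apn_status{appliance="gateway-2"} 1
apn_function_status{appliance="gateway-2",function="gateway"} 1
some_other_metric 5
"""

if __name__ == "__main__":
    for name, labels, value in non_healthy(SAMPLE_TEXT):
        print(name, labels, value)
```

Only the value 0 is interpreted here. The meaning of the other values should be taken from the product documentation rather than guessed.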

Prometheus Exporter

The Prometheus Exporter runs an HTTP server listening on a given port serving the appliance's Prometheus metrics at GET /metrics.
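In normal use a Prometheus server does the scraping, but when commissioning it can be handy to confirm by hand that the endpoint answers. The sketch below issues the GET /metrics request with the Python standard library and counts the samples returned. The host name and port are placeholders for whatever was configured on the appliance, and any access restrictions or authentication in front of the endpoint are not covered.

```python
from urllib.request import urlopen


def metrics_url(host: str, port: int) -> str:
    """Build the exporter URL; host and port are whatever was configured."""
    return f"http://{host}:{port}/metrics"


def fetch_metrics(host: str, port: int, timeout: float = 5.0) -> str:
    """Issue GET /metrics and return the response body as text."""
    with urlopen(metrics_url(host, port), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def count_samples(exposition: str) -> int:
    """Count sample lines, ignoring blank lines and '#' comment lines."""
    return sum(
        1
        for line in exposition.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )


# Example (placeholder host and port, not product defaults):
#   text = fetch_metrics("appliance.example.com", 8080)
#   print(count_samples(text), "samples received")
```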

The Metrics Aggregator

The Metrics Aggregator provides a means of collecting, grouping by Site and then exporting Prometheus metrics from the Collective. This avoids the need to configure the Prometheus exporter on each individual appliance and the need to configure firewalls to allow inbound access to each appliance in the Collective.

The Aggregator looks the same as the Exporter from a connectivity point of view, so it is just another metrics endpoint that requires a Prometheus server to connect to it and scrape the metrics at a defined interval.

The main difference is that the Metrics Aggregator presents the metrics from multiple appliances (selected by Site) in one place. This simplifies the configuration and reduces the number of inbound firewall rules required to just one.

Multiple Aggregators can be configured, so you could for instance deploy a separate one in each geographic region.

There is a white paper titled monitoring-the-appgate-sdp-system which goes into more detail about what to monitor. This is based around three example Grafana dashboards we have posted in a GitHub repository.

System logs

The Appgate SDP system records every type of system, user and admin action continuously. These records can be exported and, where appropriate, used to trigger alerts and take any corrective actions when specific events are detected.

Appgate SDP has two different types of log records: daemon logs and audit logs. Daemon logs are typically used to examine the workings of the Appgate SDP system itself and audit logs are used to record the actions performed by the system.

These two log types are processed independently of one another by the Appgate SDP system. The Audit Logs can be handled by one of two different appliance functions:  

LogServer

Provides a local log server for use within the Collective. Once enabled, you will need to sign-out and sign-in before the Audit Logs tab appears in the admin UI.

This is not intended for enterprise usage but can be used to commission a new system or in smaller deployments that fall within these Audit Logs.

LogForwarder

Provides a means of collecting, grouping and securely distributing audit logs within an enterprise environment.

The LogForwarder appliance function allows the Appgate SDP system to be configured to export these log records to an external system designed to retain persistent logs. The LogForwarder includes connectors for many of the leading SIEMs such as Splunk, Kibana, Azure Monitor, etc.

Multiple LogForwarders might be used because there may be more than one destination in use for logs from different appliances. You should also consider using multiple LogForwarders where the volume of logs is high, for example when collecting logs from multiple busy Gateways. Several LogForwarders can be configured to send logs to the same destination.