Appliance Troubleshooting

This section provides information to help you use logs and daemon information and provides useful commands for Appliance troubleshooting. The Appgate SDP appliance runs a customized version of Ubuntu 22.04. The information below is based on standard Linux but does include some information that is more Appgate SDP centric where relevant. The User/device troubleshooting section has more information, particularly in respect to Gateways.

Health warnings and errors

The Collective performs about 50 Site, appliance, and functional healthchecks. The results of these healthchecks are shown in the dashboard in the Sites widget or the appliances widget. From there you can get to the Appliance Health Details, where any warnings or errors will be shown. It is important to take any corrective actions before user access is impacted. Table below suggests some actions to perform when your Collective is reporting that it is unhealthy:

Source	Error Level	Urgency	Message	Action to be taken
Appliance	Offline	High	Current Controller cannot reach the appliance on appliance-hostname:443	Verify the Appliance is running. Verify the hostname of the Controller(s) are resolvable on the appliance and that port 433 is open to the Controller(s). Verify the time on the appliance. SSH to the appliance and run: "nc -zv controller-hostname 443". That should return succeeded if the Collective is in TCP-SPA mode. If not, verify the network firewall for 'Man In the Middle' interference.
Appliance	Error	Medium	Certificate with subject <name> for <appliance name> has expired. You must replace it now if it's in use. <Cert name> has expired, you must renew it now.	When an appliance certificate has expired, this warning will appear and the appliance will stop accepting a connection. The certificates automatically renew since version 6.1. So this message would appear only if the appliance was offline.
Appliance	Error	High	I/O Stalled	This error indicates an underlying storage issue. The disk I/O stalled for several seconds is an indication of either an availability issue or capacity problem of the underlying storage system. Please verify the hardware diagnostics or check your hypervisor stack for storage issues.
Appliance	Error	Low	I/O Error	This error is caused by the underlying storage system of the hardware or hypervisor. Verify that the storage is working correctly, run storage diagnostics, and verify potential read or write issues on the underlying disk system. I/O errors could lead to data corruption.
Appliance	Error	Medium	Geoip database missing	The appliance cannot download geoip data from https://bin.appgate-sdp.com nor https://updates.maxmind.com. Make sure your appliance has access to the internet and DNS is working correctly. If no geoip data is required or no external server connection is allowed, it can be disabled on the Settings > Global Settings page.
Appliance	Error	Medium	Failed to read ntp status	NTP status cannot be verified. Go to the appliance and verify if the command `sudo ntpq -np` returns any errors. Most likely the appliance has a DNS or connectivity issue, as it cannot receive the current time from the configured NTP servers.
Appliance	Error	High	Failed to perform Healthcheck for <appliance name>	The healthcheck service is not running on this appliance. Check cz-configd logs for more info.
Appliance	Error	Medium	Not connected to any Controller	The appliance is not able to reach any of the Controllers in the Collective. Make sure the appliance can reach Controller TCP port <default 443>. If UDP-SPA is enabled, make sure it can connect to UDP port 53 and 443 of the Controllers. Also, ensure that the time is set the same on the appliances.
Appliance	Error	High	Customization error	The appliance has a broken customization script. Download the appliance logs and verify the logs_by_daemon/cz-customization.log file.
Appliance	Error	High	Stuck initializing cloud instance	The appliance is expecting cloudinit information and is not receiving it. Verify your network settings and check with your cloud provider that it is sending the cloudinit information. Additionally, make sure DHCP settings for DNS and the default Gateway are enabled, as they are typically required in most platforms to receive cloudinit information.
Appliance	Error	Medium	The following services are not running:	This error is generated when certain daemons are not started when they should. Sign in to the appliance and verify the status of the daemon: `sudo systemctl status <daemon name>`
Appliance	Error	Medium	High volume usage <name> [X%]	The specific volume on the disk is >90% full. Check what is taking up space and remove files that are not required, such as old core dumps.
Controller	Error	High	Unable to connect dbd instance	The Controller cannot reach the database daemon. Check cz-dbs for status and contact support.
Controller	Error	High	Unexpected state for running Controller	Please contact support.
Controller	Error	High	IP Pool <name> has X Ips allocated out of Y (Error on 90+% usage)	This Controller is running out of IPs from the IP pool. Check the IP pool page in the admin UI under Identity > IP Pools. Check the currently used IPs vs total size of that pool. If the used IPs is almost equal to the total size, you can either change the lease time to a lower number (to clear out some IPs), or you can add additional ranges. When adding additional ranges, make sure those ranges are routed properly for each Site that is not using SNAT. If the currently used IPs is at least 50% lower than the total size, the reason is probably that your users can only reach one of the Controllers. Make sure you fix the connectivity issue and the new Controller will be able to assign the unused IPs.
Gateway	Error	High	Failed to query cz-sessiond for status	The cz-sessiond daemon does not seem to be working correctly. Try restarting the daemon with sudo systemctl restart cz-sessiond.
Gateway	Error	High	Very high number of active connections	The connection tracking table has reached 95%. Once it reaches 100% your user might experience dropped application sessions. Verify the conntrack settings on the appliance with the following command: sudo sysctl -a \| grep nf_conntrack and verify _max with _count value. If needed, the max value can be bumped with sudo sysctl -w net.nf_conntrack_max=<New Value>. In 6.1 and 6.2 it requires a customization. In later versions, check the cz-config command to change conntrack limits.
Portal	Error	High	Failed to query cz-nginx@urlaccess for status	Run sudo systemctl status cz-nginx@urlaccess to check that it is started correctly.
Portal	Error	High	cz-nginx@urlaccess: Shared memory size is not enough to save all the HTTP up Action objects + the auxiliary data	Add more memory, or reduce the amount of http UP actions.
LogServer	Error	Medium	Opensearch is down or starting up.	Run sudo systemctl status cz-opensearch to check that it is started correctly.
LogServer/LogForwarder	Error	Medium	cz-logd: Unable to connect to elasticsearch	LogServer or LogForwarder is unable to communicate with Elasticsearch. Check to see if Elasticsearch is down.
LogServer/LogForwarder	Error	Medium	cz-logd: Unable to prepare POST request for inserting data into elasticsearch	Unable to post inserts into Elasticsearch. Check connectivity to the http port of Elasticsearch. Check that the configuration of Elasticsearch matches the configuration of the LogForwarder.
LogServer/LogForwarder	Error	Medium	cz-logd: Unable to create health request for elasticsearch	LogForwarder or LogServer is unable to query health information from Elasticsearch. Check the connectivity and configuration of Elasticsearch, or in the case of a LogServer, if the Elasticsearch service is running at all.
LogServer/LogForwarder	Error	Medium	cz-logd: Unable to create index into elasticsearch, got status: X	LogServer/LogForwarder is unable to create indexes in Elasticsearch. Check the Elasticsearch configuration.
LogServer/LogForwarder	Error	Medium	cz-logd: Elasticsearch status is not green/yellow	LogServer/LogForwarder status of Elasticsearch is not green. Check Elasticsearch server status and fix the status to green/yellow.
LogForwarder	Error	Medium	cz-logd: Unable to get stream, does it exist?, streamname: X, details: Y	LogForwarder is configured with a stream that does not exist in AWS. Check configured name or create it in AWS.
LogForwarder	Error	Medium	cz-logd: Unable to get delivery stream, does it exist?, streamname: %v, details: %v	LogForwarder is unable to get delivery stream. Check IAM roles for the Kinesis output.
LogForwarder	Error	Medium	cz-logd: Could not compile filters, X	LogForwarder has been configured with a filter but is not compiling Check the configuration of filters for the LogForwarder.
LogForwarder	Error	Medium	cz-logd: No credials provided	LogForwarder, Kinesis output: no credentials or faulty credentials provided. Check the Kinesis LogForwarder configuration.
LogForwarder	Error	Medium	cz-logd: Unable to get AWS region, details: X	LogForwarder is unable to get info about an AWS region. Ensure that the correct region is configured for the Kinesis output.
LogForwarder	Error	Medium	cz-logd: Unable to create AWS session, details: X	LogForwarder is unable to create an AWS session using the AWS SDK.
Appliance	Warning	Low	Geoip database was last updated X days ago	The appliance missed receiving the latest geoip data from https://bin.appgate-sdp.com or https://updates.maxmind.com. Make sure your appliance has access to the internet and DNS is working correctly. You can manually force an update using sudo /etc/cron.daily/geoIpDbUpdate --force
Appliance	Warning	Medium	High volume usage <name> [X%]	The specific volume on the disk is >75% full. Check what is taking up space and remove files that are not required, such as old core dumps.
Appliance	Warning	Medium	Certificate with subject <name> for <appliance name> is expiring. You must replace it before <date>. <Cert name> is expiring. You must renew it before <expiration date>.	There is a 30 day warning when an appliance certificate is about to expire. Press the appliance renew certificate option in the System > Appliances menu. Renewing the certificate will restart all services on this appliance.
Appliance	Warning	Medium	The following services have debug logs enabled:	Running with debug logs enabled may harm performance. Switch back to normal logs as soon as possible.
Appliance	Warning	Low	Configuration from Controller is incompatible with this appliance	The configuration from the Controller does not match the configuration format of this appliance. This might be because of a version incompatibility.
Appliance	Warning	Low	cz-ffwd: Unable to connect to X@X (X)	Appliance is unable to connect a websocket connection from the appliance to the LogServer/LogForwarder. Check connectivity between appliances (default TCP port 443, UDP ports 53 and 443).
Appliance	Warning	Low	This system has a CD drive attached	The appliance is running on VMWare and has still a CD drive attached. Go to the VMWare console and remove the attached CD Drive from the virtual machine.
Controller	Warning	High	There are more X than your license allows.	You went over the amount of users. Access to only the first licensed users will be granted.
Controller	Warning	High	IP Pool <name> is too small to be utilized by this Controller.	This error occurs if you assign an IP pool that has less IPs then the amount of Controllers. For example, a /30 with 6 Controllers will give this error.
Controller	Warning	High	Controller appliance certificate does not include the client profile DNS name <name>. You must renew it to allow client connections to this Controller.	The client profile DNS name included in the client profile is not present as a SAN in the appliance's certificate. The certificate can be renewed from the Appliances page in the admin UI.
Controller	Warning	Medium	X of the user licenses are in use.	You are almost running out of user licenses. Contact support or sales to update your license count.
Controller	Warning	Medium	X of the Portal licenses are in use.	You are almost out of Portal licenses. Contact support or sales to update your license count.
Controller	Warning	Medium	X of the service licenses are in use.	You are almost out of service licenses. Contact support or sales to update your license count.
Controller	Warning	Medium	Controller is running in maintenance mode	The Controller is running in maintenance mode due to an ongoing upgrade. If the upgrade failed and your node is still in maintenance mode, you can take it out of maintenance with the following command `sudo cz-config set -j Controller/maintenance false`. Be careful to use only when the upgrade has been cancelled.
Controller	Warning	Medium	Database node not replicating	This Controller is unable to replicate the database with another Controller. Check the connectivity between the two controllers. Bi-directional connectivity is required.
Controller	Warning	Medium	IP Pool <name> has X Ips allocated out of Y (warning between 75-90% usage)	This Controller is running out of IPs from the IP pool. First, check the IP pool page in the admin UI under Identity > IP Pools. Check the Currently used IPs vs total size of that pool. If used IPs is almost equal as the total size, you can either change the lease time to a lower number (to clear out IPs), or you can add additional ranges. When adding additional ranges, make sure those ranges are routed properly for each Site that is not using SNAT. If the currently used IPs is at least 50% lower than the total size, the reason is probably that your users can only reach one of the Controllers. Fix the connectivity issue and the new Controller will be able to assign the unused IPs.
Controller	Warning	Low	BDR conflict	This error would occur if different Controllers have conflicting versions of the data. This would occur when there was a temporary network connectivity issue between different Controllers. Most of these conflicts will be automatically resolved by accepting the latest update of the record. You can run the following command to resolve the remaining conflicts: `sudo cz-config bdr resolve-conflicts` if it keeps appearing please contact support.
Controller	Warning	Low	The following entitlements are using the deprecated Risk Based Access feature. Please migrate them to condition-based access. <entitlement name>, <entitlement name>	The same functionality can be achieved using conditions and checking the risk score criteria.
LogForwarder	Warning	Medium	cz-logd: Not connected to X	A output from the LogForwarder named X in the LogForwarder configuration is not connected. Check connectivity from LogForwarder to log destination.
LogForwarder	Warrning	Medium	cz-logd: Unable to perform log-retention, check access or elasticsearch status. Details: X	LogForwarder has Elasticsearch output configured. But the LogForwarder is unable to perform remove of indexes in the Elasticsearch. Check connectivity or configuration of the Elasticsearch.
LogForwarder	Warning	Medium	cz-logd: Failed to send logs to X	Connectivity for a http based output is not working. Check connectivity from LogForwarder to output destination X.
LogForwarder	Warning	Medium	cz-logd: Kinisis was unable to handle all records, not enough shards?	LogForwarder has Kinesis configured for output. Kinesis output is getting throttled, so might require additional shards configured.
LogForwarder	Warning	Medium	cz-logd: Kinesis error codes: %v	LogForwarder has Kinesis configured for output, but AWS is returning an error code. In many cases these are IAM based errors that need to be fixed in the AWS configuration.
LogForwarder	Warning	Medium	cz-logd: Firehose was unable to handle all records, throughput exceeded?	LogForwarder has Kinesis-firehose configured and is getting throttled. More resources needed to be configured on the AWS side to handle the load.
LogForwarder	Warning	Medium	cz-logd: unable to connect to LogForwarding destination %s (%s) (TLS) %v	LogForwarder has TCP based output configured. Check connectivity towards configured output.
LogForwarder	Warning	Medium	cz-logd: tcp output (%s) is slow, incoming amount of logs exceeds outgoing amout	The amount of generated logs is larger than the amount that is being sent. This might indicate a slow destination SIEM, a large amount of logs are being generated by the appliance, or a very slow connection to the destination SIEM.
Gateway	Warning	High	Gateway appliance certificate does not include the profile hostname <name>. You must renew it to allow client connections to this gateway.	The client hostname name included in the configuration is not present as a SAN in the appliance's certificate. The certificate can be renewed from the Appliances page in the admin UI.
Gateway	Warning	High	cz-sessiond: High watermark event queue	Gateway is struggling to keep up with fw-rules generation and sessions. Make sure you are running version 6.1.x or later, add more Gateways to handle the load, refactor entitlements to be less demanding, change dynamic rules to be more static.
Gateway	Warning	High	cz-sessiond: No revocation has been received for X secs	Gateway to Controller communication is not working. The Controller is unable to push revocation list to the Gateway, and the Gateway is unable to pull them from the Controller
Gateway	Warning	Medium	The following DNS names are unstable: <DNS names>	The DNS names used in the entitlements are not always returning the same answer. This causes the FW rules to be updated all the time, but also could lead to different IPs being resolved on the client vs the Gateways. To solve this, create a DNS policy for the DNA name or domains and add it to the DNS forwarder configuration. Then replace the dns://<name> in the entitlement with a *.domain.com<Domain name>. This will make sure the DNS result sent to the client is the same as the one used by the Gateway. This is needed to address public DNS names.
Gateway	Warning	Medium	cz-sessiond: The following applications are reported as unhealthy: X, Y, Z	Gateway has flagged the applications X, Y, Z as unhealthy. See manual about App Monitoring feature.
Connector	Warning	Low	Connector client X: Waiting for configuration Connection failed.	Client can't connect to the Controller. Check the global client profile DNS name and make sure the connector can resolve it.

Run Commands (admin UI)

Run Commands will open the Remote Commands window.

There are eight limited remote commands available which can be run on this appliance, thus avoiding any immediate requirement to SSH to remote machines to perform basic diagnostics.

•addressshow

•dig

•ip route show

•netcat

•nptq

•ping

•tcpdump

•traceroute

Most of the commands have a Timeout field that accepts a value in seconds.

NOTE
The max number of concurrently running commands allowed is five.

Daemon Log commands (SSH)

journalctl

To see live logs:

journalctl -f

To show a specific service:

journalctl -u cz-configd

journalctl -u cz-vpnd@0 (vpnd instance number 0)

journalctl -u "cz-vpnd@*" (vpnd all instances)

journalctl -u appgatedriver@<client name> (Connector Client tun driver)

journalctl -u appgateservice@<client name> (Connector Client service)

journalctl -u cz-nginx@admin (admin/API interface)

journalctl -u cz-nginx@urlaccess (HTTP up Action type)

journalctl -u cz-nginx@portal (Portal)

journalctl -u cz-nginx@main

To only show logs from a certain importance

journalctl -p warning

To show logs in reverse order :

journalctl -r

To show logs since last boot:

journalctl -b

To show logs in a specific time range:

journalctl --since="2017-06-01 12:17:16" --until="2017-06-02"

All the above flags can be combined. For more information, see: http://manpages.ubuntu.com/manpages/xenial/man1/journalctl.1.html

SYSLOG

journalctl should normally be sufficient for looking at the logs.

However, in the case that the binary logs are corrupt, you can fall-back to /var/log/syslog to get logs (/var/log/upstart was used in earlier versions of the product)

Saving logs

Appgate SDP has two types of log records: daemon logs and audit logs. Daemon logs are used to examine the workings of the Appgate SDP system and audit logs are used to record the actions performed by the system.

Logs are automatically saved on the appliance. Logs can be downloaded from System>Appliances for examination locally or copied to another machine using the secure copy command: scp or sftp.

Debug log level

The types of system events that are stored in the appliance's Debug Log depends on the appliance Debug log level setting.
For troubleshooting purposes, you may wish to run the appliance in 'DEBUG' mode. If you run the appliance in DEBUG mode, remember to reset the Debug Log Level to a lower mode once the appliance is running satisfactorily to ensure optimal performance of the system.

For details of changing Debug log levels, refer to: System Logs > Debug Logs

Troubleshooting commands (SSH)

You can use the SSH command line to run the following troubleshooting commands:

NOTE
There is a list of more cz-config commands in the cz-setup and cz-config commands section.

Current version	`cat /usr/share/cz-image/version`
Current state of the appliance	`cat /mnt/state/last-state`
Up time and system load	`uptime`
Reboot the appliance (Some Clients may need to reconnect)	`sudo reboot`
Restart appliance services in lieu of 'reboot' command	sudo service cz-configd restart
Restart an arbitrary daemon	`sudo service <daemon> restart`
Bypass root (requires access to GRUB menu)	Reboot the appliance. Press 'e' to edit the GRUB menu. Append `cz-login=root` to the kernel line as shown below. Then issue 'Ctrl-x' as listed in the grub menu instructions.
Collect appliance diagnostics	`sudo cz-system-info [--full]` This command will collect a full set of appliance diagnostic information, including license usage, and save it to /tmp/cz-system-info.txt.gz. The option --full dumps additional information about the state of certain processes. The file will be owned by root. Alternatively, running the command(s) below will make the file be owned by the cz user which may make it easier to scp or sftp it from the appliance. `sudo cz-system-info; sudo chmod a+r /tmp/system-info.txt.gz` It can be downloaded from there using SCP; example command: `scp -i ~/keys/mykey cz@controller.myco.com:/tmp/cz-system-info.txt.gz ~/Desktop/cz-system-info.txt.gz`
Remove a core dump (and warning in dashboard)	Core dumps are under /mnt/data/core so sudo rm (remove) any files from there.
Update the geoIP database now	`sudo /etc/cron.daily/geoIpDbUpdate`
View the memory available and used on the appliance	`free -mh`
View the processes, for example those using the most CPU	`htop`
View running processes	`ps -aux or ps -aux \| grep <pid>` Alternatively, issue: `ps -ef`
View the current system firewall rules	`iptables -L -v` For IPv6: `ip6tables -L -v`
View network interface addresses	`ip addr show` `ip link show (link information)` For IPv6: `ip -6 addr show` `ip -6 link show`
View routes	`ip route show` For IPv6: `ip -6 route show`
View configuration files	Daemons: The configuration file for each Appgate SDP daemon is stored under `/etc/cz-<daemon-name>`. For example: sessiond: configuration is available under /etc/cz-sessiond Nginx: configuration is stored under the directory: /etc/nginx Rsyslog: configuration is stored under /etc/rsyslog.conf System: the current combined system configuration is stored under `/mnt/state/config/current` and contains the following files: local.JSON: The appliance local configuration. remote.JSON: The appliance remote configuration. To view these files use the commands jql (local) and jqr (remote) from anywhere in the file system. The previous combined system configuration is stored under `/mnt/state/config/previous`
These advanced tools are detailed here to help recover from situations when add/remove of a Controller has failed. Please contact support before trying to use any of these tools.	`sudo cz-config <option>` `bdr status [--show-parted-nodes --exclude-raft --JSON --table-fmt]` Display human readable BDR status for all controller databases `--show-parted-nodes` Show also the nodes already parted when showing the status `--exclude-raft` Exclude RAFT status `--JSON` Output JSON `--table-fmt` Table format when not outputting JSON. Use psql for compatibility with terminals like putty (VALUES: fancy_grid, psql)--show-parted-nodes `bdr force-single-controller-ready` Force appliance to single_controller_ready state, use with caution `bdr force-appliance-ready` Force appliance to appliance_ready state, use with caution `bdr clear-barrier` Clear all BDR barriers, use with caution `bdr remove-node-record REMOVE_NODE_RECORD` Forcefully remove a node from BDR on the current node `bdr update-bdr-group` Take the current BDR leave barrier in the name of a dead node `bdr enable-ip-allocation` Enable IP allocation for current node `bdr disable-ip-allocation` Disable IP allocation for current node `bdr repartition-ip-allocations` Re-partition IP allocations to match current controllers `bdr resolve-conflicts [--run-local-node --keep-conflict-history]` `--run-local-node` Run the query clean-up query on this node only, by default it runs on every node `--keep-conflict-history` Don't delete the conflict history `bdr --help` Access help

System internals (excluding LogServer) showing Daemons

Diagram illustrating appliance architecture with components like Controller, Gateway, and Client connections.