There are three things to check if your users are having difficulty connecting to the Appgate SDP Controllers or Gateways.
In the case of the Controller, the Client logs will include: Info : REST connection could not be established. 'Connection failed.'. Will try the next IP address.
The first area to check is the SPA traffic. If you are not familiar with SPA there is a separate page explaining the operation and benefits of SPA.
SPA issues
SPA can be the cause of initial connectivity problems, but once these are identified and fixed, SPA is handled like all other TLS-based network traffic. To diagnose problems you will need to open cz-proxyd.log, as this is the front door to all Appgate SDP appliances and handles all the SPA packets. If you know the IP address of the connecting party, search for it.
TLS connections (using SPA) will always send 2 UDP packets and then 1 TLS packet as part of the SPA process (irrespective of the SPA mode). Here are some issues you should be looking for in the logs:
UDP packet loss
If UDP-TCP SPA is being used you should ALWAYS see both of these lines (unless you are using a DTLS tunnel in which case you will only see the DNS one):
INFO [spa] SPA-DTLS from 14.137.17.12:51114 has been authorized by using the key with name "keyname"
INFO [spa] SPA-DNS from 14.137.17.12:52423 has been authorized by using the key with name "keyname"
Only one needs to get through, but if one is missing then it is likely that some firewall somewhere will need a rule change to allow both through.
If a valid SPA key was used, the system will update iptables to allow TCP traffic from this address, thus allowing the subsequent SPA-TLS packet through to proxyd.
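The check for both UDP lines can be scripted. The following is a minimal sketch, assuming the log lines contain the same text as the samples above (real lines may carry extra prefixes such as timestamps, so the pattern searches rather than matches). The function names are hypothetical and are not part of any Appgate tooling.

```python
import re

# Pattern derived from the sample log lines above (an assumption about
# the real log layout).
SPA_AUTH = re.compile(
    r"\[spa\] (SPA-(?:DTLS|DNS|TLS)) from "
    r"(\d{1,3}(?:\.\d{1,3}){3}):\d+ has been authorized"
)


def spa_kinds_seen(lines, client_ip):
    """Return the set of SPA packet kinds authorized for client_ip."""
    seen = set()
    for line in lines:
        m = SPA_AUTH.search(line)
        if m and m.group(2) == client_ip:
            seen.add(m.group(1))
    return seen


def missing_udp_spa(lines, client_ip):
    """Return the UDP SPA kinds (SPA-DTLS, SPA-DNS) not seen for client_ip."""
    return {"SPA-DTLS", "SPA-DNS"} - spa_kinds_seen(lines, client_ip)


sample = [
    'INFO [spa] SPA-DNS from 14.137.17.12:52423 has been authorized '
    'by using the key with name "keyname"',
]
print(sorted(missing_udp_spa(sample, "14.137.17.12")))  # ['SPA-DTLS']
```

To use it against a real file, pass the open file object as `lines`. An empty result means both UDP packets were logged for that address. Remember that a missing SPA-DTLS line is expected when a DTLS tunnel is in use.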
TCP packet loss
After the two (UDP) packets above (or if TCP-SPA is being used) - a new (TCP) connection log record will look like:
INFO [server] New connection from 14.137.17.12:49538
INFO [spa] SPA-TLS from 14.137.17.12:49538 has been authorized by using the key with name "keyname"
INFO [server] Connection established with /run/czd/cz-nginx-main-client.socket
If you see the UDP packets but these records are missing, then you are likely to be encountering one of these:
UDP differential packet routing
A problem has been seen, mainly with certain mobile ISPs, where valid UDP SPA packets have arrived but are from a different IP address than the one the TCP connection uses.
In this case the system will update iptables to allow TCP, but only from the address the UDP packets came from, so the SPA-TLS packet will not be allowed to get through to proxyd.
Firewall filtering of TLS
Some application-aware firewalls have strict filtering set for TLS. This filtering detects the (SPA) extensions we use in the TLS Client Hello packets and drops them.
The firewall will require a rule change; if allow all TLS does not work, then try allow TCP 443; also check for any specific application filters and remove them.
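To find the addresses affected by either of these two causes, you can list source IPs that had a UDP SPA packet authorized but never produced a "New connection" record. This is a rough sketch under the assumption that the log text matches the samples in this section. The function name is hypothetical, and the 64.22.1.39 UDP line in the sample data is made up for illustration.

```python
import re

IP = r"(\d{1,3}(?:\.\d{1,3}){3})"
# Patterns derived from the sample log lines in this section.
UDP_AUTH = re.compile(
    r"\[spa\] SPA-(?:DTLS|DNS) from " + IP + r":\d+ has been authorized"
)
NEW_CONN = re.compile(r"\[server\] New connection from " + IP + r":\d+")


def udp_without_tcp(lines):
    """Return IPs with an authorized UDP SPA packet but no TCP connection."""
    udp_ips, tcp_ips = set(), set()
    for line in lines:
        m = UDP_AUTH.search(line)
        if m:
            udp_ips.add(m.group(1))
            continue
        m = NEW_CONN.search(line)
        if m:
            tcp_ips.add(m.group(1))
    return udp_ips - tcp_ips


sample = [
    'INFO [spa] SPA-DNS from 14.137.17.12:52423 has been authorized '
    'by using the key with name "keyname"',
    "INFO [server] New connection from 14.137.17.12:49538",
    # Illustrative line: UDP arrived, but no TCP connection followed.
    'INFO [spa] SPA-DNS from 64.22.1.39:40000 has been authorized '
    'by using the key with name "keyname"',
]
print(udp_without_tcp(sample))  # {'64.22.1.39'}
```

An address in the result is consistent with both differential routing and TLS filtering. The script does not tell the two apart, so the firewall and the ISP path still need to be checked by hand.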
Unauthorized SPA packet
A new connection log record might also look something like:
INFO [server] New connection from 64.22.1.39:43631
ERROR [spa] SPA message has expired! Current time: 2021-10-21 13:04:56 +0000 UTC SPA message timestamp: 2021-10-20 21:17:24 +0000 UTC Difference: 15h:47m:32s Max allowed difference: 3600 seconds
ERROR [spa] SPA-TLS from 64.22.1.39:43631 failed to get authorized. Dropping!
SPA packets include a temporal element designed to prevent replay attacks. If the connecting party has a badly configured NTP server then the time on that machine might be off sufficiently for the packet not to be authorized. Check the current time and NTP settings on the connecting device.
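The size of the clock error can be read straight from the log line. The sketch below assumes the error message has the layout shown in the sample above, with both timestamps in UTC (+0000). The function name is hypothetical.

```python
import re
from datetime import datetime

TS = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \+0000 UTC"
# Layout taken from the sample "SPA message has expired!" line above.
EXPIRED = re.compile(
    r"Current time: " + TS + r" SPA message timestamp: " + TS
)


def spa_clock_skew(line):
    """Return appliance time minus SPA timestamp, or None if no match."""
    m = EXPIRED.search(line)
    if not m:
        return None
    fmt = "%Y-%m-%d %H:%M:%S"
    appliance_time = datetime.strptime(m.group(1), fmt)
    spa_time = datetime.strptime(m.group(2), fmt)
    return appliance_time - spa_time


line = (
    "ERROR [spa] SPA message has expired! "
    "Current time: 2021-10-21 13:04:56 +0000 UTC "
    "SPA message timestamp: 2021-10-20 21:17:24 +0000 UTC "
    "Difference: 15h:47m:32s Max allowed difference: 3600 seconds"
)
skew = spa_clock_skew(line)
print(skew)                         # 15:47:32
print(skew.total_seconds() > 3600)  # True
```

A positive result means the connecting device's clock is behind the appliance, and a negative result means it is ahead.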
Invalid SPA packet
When there is an attempt to connect to the open TCP port (when UDP-TCP SPA is not in use) and no SPA packet is present you will see:
INFO [server] New connection from 80.216.53.180:44462
WARNING [server] 80.216.53.180:44462 has closed the connection
Effectively this is seen as an invalid SPA packet, so the Appgate SDP system will not establish a TLS connection in this situation.
The second area to check is DNS resolution of the Controllers and Gateways (which may also cause issues after the tunnel is up).
DNS resolution issues
Here are some typical examples of DNS-related issues.
There is no DNS record for the Profile DNS Name used by the Clients to connect to the Controller
When the first Controller is created then a Profile DNS Name is chosen that is shared across all Controllers. A DNS record must exist for this which has multiple A records - one for each Controller in the Collective.
The Controller's hostname is part of some internal match domain
When the Client starts, it does a DNS lookup of the Controller appgate.b.com - which works fine as it uses local DNS. Subsequently it fails to connect:
The internal DNS server(s) has no entry for appgate.b.com
Client connects to Controller/Gateway, gets back a search domain b.com and the related internal DNS server(s) (i.e. what is set in the identity provider)
Client configures the OS so requests for xxx.b.com should first go to the internal DNS servers - which are routed through our VPN tunnel
Time passes and the Client needs to talk to Controller again for some reason such as Token renewal
Because the cached DNS local entry has expired, a new DNS lookup of appgate.b.com is done which will be to the internal DNS server(s).
Each DNS server is tried in turn, with a (typically) 2s timeout between attempts, and eventually you get: Warning: Lookup of 'appgate.b.com' failed: Timeout when looking up 'appgate.b.com'.
The internal DNS server(s) have an entry for the Controller appgate.b.com which provides an internal IP address.
Once connected the Client caches the external IPs for the Controllers and Gateways to try to avoid this issue. However if this fails for some reason and the device DNS is configured to use the internal DNS servers first, then because appgate.b.com is listed, an internal IP address will be returned. The externally located Client will not be able to route to this internal IP address so will not be able to talk to the Controller once connected through the Gateway.
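One way to spot this second case from the affected device is to resolve the Controller hostname and see whether any answer is a private address. The sketch below uses only the Python standard library. The function names are hypothetical, and `is_private` is only a heuristic because an internal network can also use publicly routable ranges.

```python
import ipaddress
import socket


def resolve(hostname):
    """Resolve hostname with the OS resolver and return its addresses."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    # info[4][0] is the address string; drop any IPv6 scope id.
    return sorted({info[4][0].split("%")[0] for info in infos})


def internal_answers(addresses):
    """Return the addresses that fall in private (non-routable) ranges."""
    return [a for a in addresses if ipaddress.ip_address(a).is_private]


# Offline example with fixed answers. To test a real device, call
# resolve("appgate.b.com") instead.
answers = ["14.137.17.12", "10.0.0.5"]
print(internal_answers(answers))  # ['10.0.0.5']
```

If the list is not empty when run on an externally located device, the device is receiving the internal record for the Controller.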
The third area that might cause issues is if the TLS connection cannot be established (even if the TCP connection is there).
TLS issues (man-in-the-middle)
There can be issues establishing the secure TLS connection.
Because the system uses mutual TLS it will not accept having a man-in-the-middle. This can happen when there is a firewall or web gateway trying to intercept the traffic between the Client and the appliances. When this happens the user is likely to see an Invalid Certificate warning.
The best way to identify the culprit is to download the Client logs and unzip them.
Open log.log and search for unknown CA. You should find ERR_CONNECT_CA_CERT_ERROR which confirms a certificate mismatch.
Info : Trying to connect 'server.company.com/1.2.3.4' with SPA key name '<name>'.
Info : SSL certificate of controller 'server.company.com' was not accepted. Trying the next CA certificate stored in the system.
Open driver.log and search for unknown CA. Several lines above this entry you should see the Server certificate with CN=whatever.com
appgatedriver[11164]: [Site] Server certificate DN: CN=whatever.com
appgatedriver[11164]: [Site] Server certificate issuer: CN=R3,O=Let's Encrypt,C=US
appgatedriver[11164]: [Site] Server certificate SAN.DNS: api.whatever.com
appgatedriver[11164]: [Site] Connection closed: end of file - Server reported: unknown CA
appgatedriver[11164]: [Site] [<Gateway>] Closing (Disconnected by gateway - Server reported: unknown CA)
This CN should help to identify the device which is intercepting the TLS traffic. Once this is identified, the configuration should be changed to allow all TCP 443 traffic (without interception).
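The search through driver.log can be automated. The following is a minimal sketch, assuming the log lines have the layout shown in the driver.log sample above. It collects the certificate fields logged before each "unknown CA" report. The function name is hypothetical.

```python
import re

# Field names taken from the driver.log sample above.
CERT_FIELD = re.compile(r"Server certificate (DN|issuer|SAN\.DNS): (.+)$")


def certs_before_unknown_ca(lines):
    """Return certificate fields seen before each 'unknown CA' report."""
    findings = []
    current = {}
    for line in lines:
        line = line.rstrip()
        m = CERT_FIELD.search(line)
        if m:
            if m.group(1) == "DN":
                current = {}  # a DN line starts a new certificate
            current[m.group(1)] = m.group(2)
        elif "Server reported: unknown CA" in line and current:
            findings.append(current)
            current = {}  # later 'unknown CA' lines repeat the same event
    return findings


sample = [
    "appgatedriver[11164]: [Site] Server certificate DN: CN=whatever.com",
    "appgatedriver[11164]: [Site] Server certificate issuer: "
    "CN=R3,O=Let's Encrypt,C=US",
    "appgatedriver[11164]: [Site] Server certificate SAN.DNS: api.whatever.com",
    "appgatedriver[11164]: [Site] Connection closed: end of file - "
    "Server reported: unknown CA",
    "appgatedriver[11164]: [Site] [<Gateway>] Closing "
    "(Disconnected by gateway - Server reported: unknown CA)",
]
for cert in certs_before_unknown_ca(sample):
    print(cert["DN"], "issued by", cert.get("issuer"))
```

Each entry in the result carries the DN and issuer of a certificate that was rejected, which is usually enough to recognise the intercepting device.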
