
Troubleshooting DNS Errors on Client Connections with tcpdump and strace

Recently we had an issue where a client wasn’t using the fully qualified domain name in its connect string, which was causing a number of errors to be logged on our DNS appliance. It wasn’t having any impact, as the connections were still working, but we wanted to clean up the lookup errors on the appliance.

The connections were coming in to a two-node RAC cluster running on Red Hat Enterprise Linux 5.5.

First thing to check was our DNS search path on the RAC servers, which can be seen in /etc/resolv.conf.

Please note names of servers and hosts have been masked : ) Our search path has “in.domain.com.au” followed by “domain.com.au”. The lookup errors we were seeing on the DNS appliance were for “rac-scan.in.in.domain.com.au”. This suggested to me that the client connect string was using “(HOST=rac-scan.in)”: when the resolver appends the first domain in the search path to that host, the result is a name that does not exist, hence the lookup error.
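To make that reasoning concrete, here is a minimal Python sketch of how a glibc-style resolver expands a partially qualified name against a search path. It is a simplified model, not the resolver's actual code: the function name `candidate_names` is made up for illustration, and it assumes the default `ndots` of 1 and the masked search path from above.

```python
def candidate_names(name, search, ndots=1):
    """Return the names a glibc-style resolver would try, in order (simplified model)."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute, so no search path is applied.
        return [name[:-1]]
    dots = name.count(".")
    candidates = []
    if dots >= ndots:
        # Enough dots: the name is tried as given before the search path.
        candidates.append(name)
    candidates.extend(f"{name}.{domain}" for domain in search)
    if dots < ndots:
        # Too few dots: the name as given is only tried last.
        candidates.append(name)
    return candidates


# Masked search path from /etc/resolv.conf, as described in the post.
search = ["in.domain.com.au", "domain.com.au"]

for candidate in candidate_names("rac-scan.in", search):
    print(candidate)
```

Under these assumptions the second candidate printed is “rac-scan.in.in.domain.com.au”, the name showing up in the appliance errors, and only the third, “rac-scan.in.domain.com.au”, is the real one. That is consistent with connections still working while errors pile up.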

As this was a shared environment with 20+ databases it was going to be tricky trying to pick which connect string had the issue.

First thing to try was tcpdump.

We ran tcpdump with -s0 to capture whole packets, -A to print packet contents as ASCII and -vvv to be as verbose as it can. Unfortunately this did not give us the answer and we could not capture the problematic connect string.

The capture had lots of good information but nothing to help us at this stage. Next point of attack was to strace the listener process. As this is a RAC environment, the connection points are best examined at the scan listener entry point. We can get the PID of the scan listener from the process list.
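As a rough illustration of that PID lookup, the Python sketch below picks the scan listener out of `ps -ef` style output. Everything in the sample is hypothetical: the listener name `LISTENER_SCAN1`, the Grid home path, the users and the PIDs are placeholders, so check what the process list on your own node actually shows.

```python
def listener_pids(ps_output, listener_name):
    """Return PIDs of tnslsnr processes for one listener, given `ps -ef` text."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD...
        if len(fields) < 8:
            continue
        command = fields[7:]
        if command[0].endswith("tnslsnr") and listener_name in command[1:]:
            pids.append(int(fields[1]))
    return pids


# Hypothetical process list; paths, users and PIDs are made up.
SAMPLE_PS = """\
UID        PID  PPID  C STIME TTY          TIME CMD
grid      4321     1  0 Jan01 ?        00:00:42 /u01/app/grid/bin/tnslsnr LISTENER_SCAN1 -inherit
grid      4400     1  0 Jan01 ?        00:01:10 /u01/app/grid/bin/tnslsnr LISTENER -inherit
oracle    9001  8000  0 10:15 pts/0    00:00:00 grep tnslsnr
"""

print(listener_pids(SAMPLE_PS, "LISTENER_SCAN1"))
```

On a live node the text could come from `subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout`. Matching on the executable name rather than the whole line keeps the `grep` process itself out of the result.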

Once we have the PID we can trace the listener process and see if anything useful pops up.

Initially I ran strace without the -s flag. By default it prints at most 32 bytes of each string, which was not enough to capture the connect string I was looking for, so I upped it to 256. As connections were coming in thick and fast I only needed to trace for a few minutes, so the output file wasn’t too large. I then grepped the output for the scan address and found the following:
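The effect of the string limit is easy to see with a small Python sketch. The helper `strace_argv` is a hypothetical name that just assembles a plausible command line from standard strace flags (-p to attach to a PID, -s for the maximum string size, -o for the output file), and the connect descriptor is an invented example with a placeholder service name. Real packets also carry binary header bytes ahead of the descriptor, which makes the truncation worse.

```python
def strace_argv(pid, out_path, strsize=256):
    """Build an strace command line that attaches to a PID and logs to a file."""
    return ["strace", "-p", str(pid), "-s", str(strsize), "-o", out_path]


# Hypothetical connect descriptor; the service name "appdb" is a placeholder.
descriptor = (
    "(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=rac-scan.in)(PORT=1521))"
    "(CONNECT_DATA=(SERVICE_NAME=appdb)))"
)

# With the default limit of 32 the HOST entry is cut off; 256 keeps it.
for limit in (32, 256):
    shown = descriptor[:limit]
    print(limit, "HOST=" in shown, shown)

print(" ".join(strace_argv(4321, "scan1.out")))
```

The PID 4321 is a placeholder. Attaching to a process owned by another user normally needs root, so in practice the command would be run with the appropriate privileges and stopped after a few minutes.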

We can see “(HOST=rac-scan.in)” in our scan1.out file with the associated SERVICE_NAME and pinpoint the database which is causing the lookup errors.
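With 20+ databases behind the same listener, it can help to tally which services use the short host name rather than eyeballing grep output. The following is a sketch only: the trace lines are invented to resemble strace read() output, and the service names `appdb` and `hrdb` are placeholders, not the real databases.

```python
import re
from collections import Counter

HOST_RE = re.compile(r"\(HOST=([^)]+)\)", re.IGNORECASE)
SERVICE_RE = re.compile(r"\(SERVICE_NAME=([^)]+)\)", re.IGNORECASE)


def services_using_host(lines, host):
    """Count SERVICE_NAME values on trace lines whose HOST is exactly `host`."""
    counts = Counter()
    for line in lines:
        hosts = [h.lower() for h in HOST_RE.findall(line)]
        if host.lower() not in hosts:
            continue
        for service in SERVICE_RE.findall(line):
            counts[service] += 1
    return counts


# Hypothetical trace lines; file descriptors, sizes and service names are made up.
SAMPLE_TRACE = [
    'read(14, "...(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=rac-scan.in)(PORT=1521))'
    '(CONNECT_DATA=(SERVICE_NAME=appdb)))", 8208) = 230',
    'read(15, "...(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=rac-scan.in.domain.com.au)(PORT=1521))'
    '(CONNECT_DATA=(SERVICE_NAME=hrdb)))", 8208) = 244',
    'read(16, "...(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=rac-scan.in)(PORT=1521))'
    '(CONNECT_DATA=(SERVICE_NAME=appdb)))", 8208) = 230',
]

for service, hits in services_using_host(SAMPLE_TRACE, "rac-scan.in").items():
    print(service, hits)
```

Against the real trace this could be called as `services_using_host(open("scan1.out"), "rac-scan.in")`. Comparing the whole HOST value, instead of grepping for a substring, avoids counting clients that already use the fully qualified “rac-scan.in.domain.com.au”.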

All we had to do now was contact the application owners and get them to change the connect string to use the FQDN.
