Intermittent database connection issue, related to DNS lookups

Just wondering if this is a known issue, or something other people have seen?

I am sometimes seeing intermittent DNS lookup failures, as follows (in indico.log):

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not translate host name "indico-db.internal.ourdomain.tld" to address: Name or service not known

Usually, when retrying the same operation, it works the second time.

Our service was running Indico 2.3.2, and we have just moved to 3.2.9 on a new application server. The database is hosted on AWS Aurora, so it hasn't changed in the migration (apart from the 'indico db upgrade' we ran as part of it).

I have no reason to believe this actually is a DNS issue. We use the AWS VPC DNS, which has been completely reliable. We did not see this issue with Indico 2.3.2.

Any ideas?

"Name or service not known" is a standard error coming from the OS. Typically it means there's something wrong with DNS or networking. It's certainly not something Indico has any influence on, and I'm quite certain that psycopg2 (the database client lib) is also just passing on the error it gets when trying to do the DNS lookup.

If your local DNS lookups are handled through systemd-resolved, maybe check its logs to see if there's anything useful in there?
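On a systemd-based machine that would be something along the lines of (exact unit name may vary by distro):

journalctl -u systemd-resolved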

Thanks for the reply. I agree with all your points: Indico itself and the DB client lib are unlikely to be at fault. Given that the AWS DNS has been utterly reliable for us, it doesn't really make sense! :slight_smile:

Good point about the systemd-resolved logs, but there's nothing of concern there.

We had never seen this on any server before, but now we are seeing it on both our new test Indico server and the new production Indico server. Hence my 'desperation' in asking whether anyone else has seen it.

thanks

I am now running:

resolvectl monitor

to see if I can get any clues.

Even after changing the systemd-resolved log level to debug, it steadfastly refuses to log queries that result in NXDOMAIN. Weird and unhelpful; thanks, systemd.
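As a fallback I've knocked together a rough loop that just hammers getaddrinfo on the DB hostname and counts failures, independent of whatever resolved decides to log. Just a sketch (Python, not heavily tested; the hostname is obviously ours):

#!/usr/bin/env python3
# Repeatedly resolve the DB hostname via getaddrinfo (roughly the same
# resolver path psycopg2/libpq goes through) and count intermittent failures.
import socket
import time

HOST = 'indico-db.internal.ourdomain.tld'
ATTEMPTS = 1000

failures = 0
for i in range(ATTEMPTS):
    try:
        socket.getaddrinfo(HOST, 5432)
    except socket.gaierror as exc:
        failures += 1
        print(f'{i}: {exc}')
    time.sleep(0.1)

print(f'{failures}/{ATTEMPTS} lookups failed')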

I'd like to fiddle about with the database connection pool settings, since the 'DNS' errors I see mostly coincide with a change in the number of connections the DB has open from the app server.

Any suggestions?

PS: the docs here (Settings — Indico 3.3.1 documentation) point me at the Flask-SQLAlchemy documentation, but the link is broken.

That sounds really strange; if you had too many connections you'd get a different error, not a DNS error. I will fix the broken link in the docs.

Thanks!

Sorry to mislead: I don't think we're seeing too many connections being opened. I just wanted (on the test system, to start with) to make some changes to force Flask-SQLAlchemy to get rid of unused connections much quicker, hoping to make this error happen more often and give myself a better chance of seeing what triggers it... or something. Just trying anything I can think of at the moment. :laughing:
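What I had in mind was something along these lines in indico.conf on the test system. Just a sketch: the values are guesses, and I'm assuming the SQLALCHEMY_POOL_* options listed on the Settings page are the right knobs here.

# indico.conf (test system only) - keep the pool small and recycle
# connections aggressively so they get torn down and re-established
# (and hence re-resolved) as often as possible
SQLALCHEMY_POOL_SIZE = 2
SQLALCHEMY_POOL_RECYCLE = 60   # seconds before a pooled connection is recycled
SQLALCHEMY_POOL_TIMEOUT = 10   # seconds to wait for a connection from the pool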

As a precursor to these DB/DNS errors, I am sometimes seeing this error:

Traceback (most recent call last):
  File "/opt/indico/.venv/lib/python3.9/site-packages/indico/core/settings/util.py", line 54, in get_setting
    value = cache[cache_key]
KeyError: (<class 'indico.core.settings.proxy.SettingsProxy'>, 'legal', 'tos_url', frozenset())

Mean anything useful?

Thanks

Where do you see this? This KeyError is caught and simply indicates a cache miss; it should never get logged...

Hi. I see it in indico.log. Have dumped the latest one here:

OK, now it makes more sense:

  • The KeyError is normal because itā€™s a cache miss
  • In the except case, the setting is loaded from the database
  • That database operation failed due to the DNS problem, so any exception that was leading to the final uncaught one got logged.

OK thanks. I am still trying to track down how/why these DNS errors are occurring.

BTW our Indico instance has been around since 2010. In fact, we have events dating back to at least 1999, but the event logs show that these older events were all added from 2010 onwards, as a way of recording things for posterity. :slight_smile: If you're interested, this is the SKAO project (skao.int), and we expect the project to continue for the next 50 years at least.

Ah, so you probably migrated from legacy v1.2 at some point. That likely explains why you had bad data for one of the API settings.

PS: Fun fact: some CERN colleagues and I met someone from SKAO recently, and while the meeting was focused mainly on other topics, Indico did come up briefly as well. :wink:

Our DNS issues are solved. I got help over on the Debian forums (see [1]). The issue is a known systemd-resolved problem [2]. This component isn't a default part of Debian 12 if you install from an ISO, but it is in the official Debian 12 AWS AMIs. Removing it:

apt purge --auto-remove systemd-resolved

makes DNS just work all the time.

[1] https://forums.debian.net/viewtopic.php?t=158784
[2] https://github.com/systemd/systemd/issues/29069

BTW the failures only seem to affect resolution of CNAMEs, and it's also somehow related to IPv6. A simple bash script that repeatedly uses netcat to open a connection to the AWS RDS DB via its main cluster endpoint (a CNAME) shows something like 0.5%-1.0% name resolution failures, but pointing the same script at the actual RDS DB instance endpoint (an A record) gives no failures. Telling netcat to use IPv4 ("nc -4", which clearly affects how it performs name resolution) also gives no failures. A bit mad. :sunglasses:
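In getaddrinfo terms, the "-4" flag basically restricts the address family, so I assume the same split would show up with something like this (illustrative only, not a rigorous test):

# What "nc -4" roughly changes: restrict getaddrinfo to IPv4 (AF_INET)
# instead of letting it ask for both address families (AF_UNSPEC).
import socket

host = 'indico-db.internal.ourdomain.tld'  # our cluster endpoint (a CNAME)

socket.getaddrinfo(host, 5432, family=socket.AF_UNSPEC)  # default: both families
socket.getaddrinfo(host, 5432, family=socket.AF_INET)    # IPv4 only, like "nc -4"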

IMPORTANT: my workaround above is incorrect! See the systemd bug [2] for the proper fix, which is removing the libnss-resolve package instead.
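On Debian that is presumably the equivalent of:

apt purge --auto-remove libnss-resolve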