Intermittent database connection issue, related to DNS lookups

Just wondering if this is a known issue, or something other people have seen?

I am sometimes seeing intermittent DNS lookup failures, as follows (in indico.log):

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not translate host name "indico-db.internal.ourdomain.tld" to address: Name or service not known

Usually, when retrying the same operation, it works the second time.

Our service was running Indico 2.3.2, and we have just moved to 3.2.9 on a new application server. The database is hosted on AWS Aurora, so it hasn't changed in the migration (apart from the 'indico db upgrade' we ran as part of it).

I have no reason to believe this actually is a DNS issue. We use the AWS VPC DNS, which has been completely reliable. We did not see this issue with Indico 2.3.2.

Any ideas?

"Name or service not known" is a standard error coming from the OS. Typically it means there's something wrong with DNS or networking. It's certainly not something Indico has any influence on, and I'm quite certain that psycopg2 (the database client lib) is also just passing on the error it gets when trying to do the DNS lookup.

If your local DNS lookups are handled through systemd-resolved, maybe check its logs to see if there's anything useful in there?
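On a systemd-based machine that would be something along the lines of (exact unit name may vary by distro):

journalctl -u systemd-resolved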

Thanks for the reply. I agree with all your points: Indico itself and the DB client lib are unlikely to be at fault. Given that the AWS DNS has been utterly reliable for us, it doesn't really make sense! :slight_smile:

Good point about the systemd-resolved logs, but there's nothing of concern there.

We had never seen this on any server before, but now we are seeing it on both our new test Indico server and the new production Indico server. Hence my 'desperation' in asking whether anyone else has seen it.

thanks

I am now running:

resolvectl monitor

to see if I can get any clues.

Even after changing the systemd-resolved log level to debug, it steadfastly refuses to log queries that result in NXDOMAIN. Weird and unhelpful; thanks, systemd.
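As a fallback I've knocked together a rough loop that just hammers getaddrinfo on the DB hostname and counts failures, independent of whatever resolved decides to log. Just a sketch (Python, not heavily tested; the hostname is obviously ours):

#!/usr/bin/env python3
# Repeatedly resolve the DB hostname via getaddrinfo (roughly the same
# resolver path psycopg2/libpq goes through) and count intermittent failures.
import socket
import time

HOST = 'indico-db.internal.ourdomain.tld'
ATTEMPTS = 1000

failures = 0
for i in range(ATTEMPTS):
    try:
        socket.getaddrinfo(HOST, 5432)
    except socket.gaierror as exc:
        failures += 1
        print(f'{i}: {exc}')
    time.sleep(0.1)

print(f'{failures}/{ATTEMPTS} lookups failed')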

I'd like to fiddle about with the database connection pool settings, since the 'DNS' errors I see mostly coincide with a change in the number of connections the DB has open from the app server.

Any suggestions?

PS: the docs here (Settings — Indico 3.3.1 documentation) point me at the Flask-SQLAlchemy documentation, but the link is broken.

That sounds really strange; if you had too many connections you'd get a different error, not a DNS error. I will fix the broken link in the docs.

Thanks!

Sorry to mislead: I don't think we're seeing too many connections being opened. I just wanted (on the test system, to start with) to make some changes to force Flask-SQLAlchemy to get rid of unused connections much quicker, hoping to make this error happen more often and give myself a better chance of seeing what triggers it... or something. Just trying anything I can think of at the moment. :laughing:
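What I had in mind was something along these lines in indico.conf on the test system. Just a sketch: the values are guesses, and I'm assuming the SQLALCHEMY_POOL_* options listed on the Settings page are the right knobs here.

# indico.conf (test system only) - keep the pool small and recycle
# connections aggressively so they get torn down and re-established
# (and hence re-resolved) as often as possible
SQLALCHEMY_POOL_SIZE = 2
SQLALCHEMY_POOL_RECYCLE = 60   # seconds before a pooled connection is recycled
SQLALCHEMY_POOL_TIMEOUT = 10   # seconds to wait for a connection from the pool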

As a precursor to these DB/DNS errors, I am sometimes seeing this error:

Traceback (most recent call last):
  File "/opt/indico/.venv/lib/python3.9/site-packages/indico/core/settings/util.py", line 54, in get_setting
    value = cache[cache_key]
KeyError: (<class 'indico.core.settings.proxy.SettingsProxy'>, 'legal', 'tos_url', frozenset())

Mean anything useful?

Thanks

Where do you see this? This KeyError is caught and simply indicates a cache miss; it should never get logged...

Hi. I see it in indico.log. Have dumped the latest one here:

OK, now it makes more sense:

  • The KeyError is normal because itā€™s a cache miss
  • In the except case, the setting is loaded from the database
  • That database operation failed due to the DNS problem, so any exception that was leading to the final uncaught one got logged.

OK thanks. I am still trying to track down how/why these DNS errors are occurring.

BTW our Indico instance has been around since 2010. In fact, we have events dating back to at least 1999, but the event logs show that these older events were all added from 2010 onwards, as a way of recording things for posterity. :slight_smile: If you're interested, this is the SKAO project (skao.int), and we expect the project to continue for the next 50 years at least.

Ah, so you probably migrated from legacy v1.2 at some point. That likely explains why you had bad data for one of the API settings.

PS: Fun fact: some CERN colleagues and I met someone from SKAO recently, and while the meeting was focused mainly on other topics, Indico did come up briefly as well. :wink:

Our DNS issues are solved. I got help over on the Debian forums (see [1]). The issue is a known systemd-resolved problem [2]. This component isn't a default part of Debian 12 if you install from an ISO, but it is in the official Debian 12 AWS AMIs. Removing it:

apt purge --auto-remove systemd-resolved

makes DNS just work all the time.

[1] https://forums.debian.net/viewtopic.php?t=158784
[2] https://github.com/systemd/systemd/issues/29069

BTW the failures only seem to affect resolution of CNAMEs, and it's also somehow related to IPv6. A simple bash script that repeatedly uses netcat to open a connection to the AWS RDS DB via its main cluster endpoint (a CNAME) shows something like 0.5%-1.0% name resolution failures, but pointing the same script at the actual RDS DB instance endpoint (an A record) gives no failures. Telling netcat to use IPv4 ("nc -4", which clearly affects how it performs name resolution) also gives no failures. A bit mad. :sunglasses:
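In getaddrinfo terms, the "-4" flag basically restricts the address family, so I assume the same split would show up with something like this (illustrative only, not a rigorous test):

# What "nc -4" roughly changes: restrict getaddrinfo to IPv4 (AF_INET)
# instead of letting it ask for both address families (AF_UNSPEC).
import socket

host = 'indico-db.internal.ourdomain.tld'  # our cluster endpoint (a CNAME)

socket.getaddrinfo(host, 5432, family=socket.AF_UNSPEC)  # default: both families
socket.getaddrinfo(host, 5432, family=socket.AF_INET)    # IPv4 only, like "nc -4"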

IMPORTANT: my workaround above is incorrect! See the systemd bug [2] for the proper fix, which is removing the libnss-resolve package instead.
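On Debian that is presumably the equivalent of:

apt purge --auto-remove libnss-resolve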