Just wondering if this is a known issue, or something other people have seen?
I am sometimes seeing intermittent DNS lookup failures, as follows (in indico.log):
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not translate host name "indico-db.internal.ourdomain.tld" to address: Name or service not known
Usually, when retrying the same operation, it works the second time.
Our service was running Indico 2.3.2, and we have just moved to 3.2.9, on a new application server. The database is hosted on AWS Aurora, so hasn't changed during the migration (apart from the "indico db upgrade" we ran during the upgrade).
I have no reason to believe this actually is a DNS issue. We use the AWS VPC DNS, which has been completely reliable. We did not see this issue with Indico 2.3.2.
"Name or service not known" is a standard error coming from the OS. Typically it means there's something wrong with DNS or networking. It's certainly not something Indico has any influence on, and I'm quite certain that psycopg2 (the database client lib) is also just passing on the error it gets when trying to do the DNS lookup.
If your local DNS lookups are handled through systemd-resolved, maybe check its logs to see if there's anything useful in there?
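For example, something like this (assuming the standard unit name on a systemd box):

```bash
# recent log output from systemd-resolved
journalctl -u systemd-resolved.service --since "1 hour ago"
```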
Thanks for the reply, and I agree with all your points: Indico itself and the DB client lib are not likely to be at fault. Given that the AWS DNS has been utterly reliable for us, it doesn't really make sense!
Good point about the systemd-resolved logs, but there's nothing of concern there.
We have never seen this before on any server until now, when we are seeing it on both our new test Indico server and the new production Indico server. Hence my "desperation" in asking whether anyone else has seen this?
Even after changing the resolved log-level to debug, it steadfastly refuses to log queries that result in NXDOMAIN. Weird, and unhelpful, thanks systemd.
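For reference, raising the log level was roughly this (assuming resolvectl is available on the box):

```bash
# raise systemd-resolved logging to debug (does not persist across a restart)
sudo resolvectl log-level debug
# then follow its journal output while reproducing the failure
journalctl -u systemd-resolved.service -f
```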
I'd like to fiddle about with the database connection pool settings, since the "DNS" errors I see mostly coincide with a change in the number of connections the DB has open from the app server.
That sounds really strange; if you have too many connections you'd get a different error, not a DNS error. I will fix the broken link in the docs.
Sorry to mislead: I don't think we're seeing too many connections being opened. I just wanted (on the test system, to start with) to make some changes that force Flask-SQLAlchemy to get rid of unused connections much more quickly, in the hope of making this error happen more often and giving me a better chance of seeing what triggers it … or something. Just trying anything I can think of at the moment.
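Concretely, the kind of thing I was planning to try in indico.conf, assuming the Flask-SQLAlchemy pool options are still honoured in 3.2 (the exact names may differ between versions, so treat this as a sketch):

```
# indico.conf (Python-syntax config): aggressive pool recycling, test system only
SQLALCHEMY_POOL_SIZE = 1       # keep the pool tiny
SQLALCHEMY_POOL_RECYCLE = 30   # recycle idle connections after 30 seconds
SQLALCHEMY_POOL_TIMEOUT = 10   # fail fast if no connection is free
```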
OK thanks. I am still trying to track down how/why these DNS errors are occurring.
BTW I can see our Indico instance has been around since 2010. In fact, we have events dating back to at least 1999, but I can see from the event logs that these older events were all added from 2010 onwards, as a way of recording things for posterity. If you're interested, this is the SKAO project (skao.int), and we expect the project to continue for the next 50 years at least.
Ah, so you probably migrated from legacy v1.2 at some point. That likely explains why you had bad data for one of the API settings.
PS: Fun fact: some CERN colleagues and I met someone from SKAO recently, and while the meeting was focused mainly on other topics, Indico did come up briefly as well.
Our DNS issues are solved. I was helped over on the Debian forums (see [1]). The issue is a known systemd-resolved problem [2]. This component isn't a default part of Debian 12 if you install from an ISO, but it is in the official Debian 12 AWS AMIs. Removing it fixed the problem.
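In case it helps anyone else, the removal is roughly this on Debian 12 (a sketch rather than the exact commands; the nameserver shown is the AWS VPC link-local resolver, so substitute whatever your VPC actually uses):

```bash
# systemd-resolved ships as a separate package on Debian 12, so it can just be purged
sudo apt purge systemd-resolved
# if /etc/resolv.conf was a symlink into /run/systemd/resolve, replace it
# with a static file pointing at the VPC resolver
sudo rm -f /etc/resolv.conf
echo "nameserver 169.254.169.253" | sudo tee /etc/resolv.conf
```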
BTW the failures only seem to affect resolution of CNAMEs, and it's somehow related to IPv6. A simple bash script that repeatedly uses netcat to open a connection to the AWS RDS DB via its main cluster endpoint (a CNAME) shows something like 0.5%-1.0% name resolution failures, but pointing the script at the actual RDS DB instance endpoint (an A record) gives no failures. Telling netcat to use IPv4 ("nc -4", which clearly affects how it performs name resolution) also gives no failures. A bit mad.
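For reference, the test loop was along these lines (a reconstruction rather than the exact script; the hostname is just a placeholder for our cluster endpoint):

```bash
#!/bin/bash
# Repeatedly ask netcat to connect to the DB endpoint and count how many
# attempts fail on name resolution specifically.
HOST="mycluster.cluster-abc123.eu-west-1.rds.amazonaws.com"  # placeholder CNAME
PORT=5432
TRIES=1000
FAILS=0
for i in $(seq 1 "$TRIES"); do
    # -z: just probe the port, -w1: 1 second timeout; adding -4 (force IPv4)
    # makes the failures disappear entirely
    err=$(nc -z -w1 "$HOST" "$PORT" 2>&1 >/dev/null)
    if printf '%s' "$err" | grep -qi 'not known\|could not resolve'; then
        FAILS=$((FAILS + 1))
    fi
done
echo "name resolution failures: $FAILS / $TRIES"
```

Pointing HOST at the instance endpoint (an A record) instead of the cluster CNAME makes the failure count drop to zero.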