I’m referencing these here as they describe a similar set of issues: “No disk space left after big zip file into /opt/indico/tmp”.
And midway through “Export large number of posters from indico - #10”, where they run into similar problems.
We’re currently running Indico 3.3.6.
Basically, twice in the last two weeks we’ve filled up the available disk space.
The first time was two weeks ago, and we assumed it was a big conference that had just ended, with hundreds of users rushing to download the materials, which included video. We cleared out our cache and tmp folders and added about 200 GB of disk, which we thought was more than enough headroom. However…
This morning, starting at ~4:00am, Indico got crushed by external traffic that triggered the “download material” function in many different events simultaneously, rapidly filling the storage on the server. At first we thought it was the same thing.
But my senior engineer pulled the web access logs: between 04:00 and 04:59 today, Indico was hit 26,213 times, but from only 3 IP addresses. It looks like some kind of scanner, or possibly a botnet/crawler, hit a bunch of “download material” buttons. We’re no longer certain the first event was legitimate traffic.
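For reference, this is roughly how we tallied the hits per source IP — a sketch assuming a standard combined nginx/Apache log format (client IP in the first field, timestamp like `[07/Oct/2025:04:12:33 +0000]` in the fourth); the log path and hour pattern are placeholders for our setup:

```shell
#!/bin/sh
# Count requests per source IP during the 04:00-04:59 window.
# LOG path and the hour in the pattern are assumptions; adjust for your host.
LOG=/var/log/nginx/access.log

awk '$4 ~ /:04:[0-9][0-9]:/ { hits[$1]++ }
     END { for (ip in hits) print hits[ip], ip }' "$LOG" | sort -rn
```

That made the three-address pattern obvious immediately.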
The traffic was mostly concentrated in about a dozen events, with only 4 events accounting for huge numbers of *.zip files from “download materials” in the attachment-packages folder. We’re asking our cyber folks about it now. Whatever it was, it basically filled both the archive and /opt (we don’t use disk quotas for Indico) overnight.
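In case it helps anyone else triage the same thing, this is roughly how we found the offending packages — a sketch assuming a GNU find and a storage path like ours (swap in your actual attachment-packages location):

```shell
#!/bin/sh
# List the 20 largest generated material packages, biggest first.
# PKG_DIR is an assumption based on our install layout; adjust for yours.
PKG_DIR=/opt/indico/archive/attachment-packages

find "$PKG_DIR" -name '*.zip' -type f -printf '%s %p\n' 2>/dev/null \
  | sort -rn | head -20 \
  | awk '{ printf "%.1f MB  %s\n", $1 / 1048576, $2 }'
```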
We’re looking for ways to throttle this (something like the LATEX_RATE_LIMIT config, but for material packages?), or, failing that, a way to run the cleanup process more proactively than every 24 hours. For now we’re going to throw a lot more headroom at the disk. But last time we had less than a hundred gigs of headroom left on the archive before it filled up, and this time we had more than 230 GB. We’ll probably add a TB, but that still doesn’t really solve the problem if the disk fills this aggressively within a 24-hour period.
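On the throttling side, as a stopgap we’re considering rate-limiting at the reverse proxy rather than inside Indico. A minimal nginx sketch, assuming package downloads go through a URL containing `package` (the location regex and upstream name are guesses; check your access logs for the real path):

```nginx
# Hypothetical sketch: limit each client IP to 2 package requests per
# minute with a small burst. The location regex is an assumption;
# verify the actual material-package URL in your access logs first.
limit_req_zone $binary_remote_addr zone=pkgdl:10m rate=2r/m;

server {
    # ... existing server config ...

    location ~ /attachments/package {
        limit_req zone=pkgdl burst=3 nodelay;
        proxy_pass http://indico_upstream;
    }
}
```

This wouldn’t stop a distributed crawler, but it would have capped what those 3 IPs could generate.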
Any ideas or features we missed while looking into this problem? Any specific logs you’d want to see for reference?