Sorry for the questions, but management is breathing down my neck: last week some lawyer bot found an image in a 12-year-old presentation that now belongs to an American company, and they have issued us a warning. So how do you deal with a) old data sets, b) the flood of bots and crawlers, some of which you actually want so they can feed search engines, and c) access rights to posts? Or do you delegate the responsibility for ensuring that conference posts contain no copyright-protected content to the conference managers? I look forward to your comments and to finding something I haven't already told the chiefs.
Our instance is not that long-running, but we also offer separate document repositories and encourage our users to put all non-trivial documents there (and motivate them to do so by explicitly not guaranteeing long-term file hosting in Indico). The repositories are well received, as they also add bibliographic information (such as DOIs), which makes citing easier. There it's also possible to embargo or even tombstone problematic submissions, and links will then redirect to an information page.
We do very little to battle bots: an Anubis instance for suspicious clients, plus a rate limit of 100 requests per minute per IP.
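For reference, a per-IP limit like that is a few lines in most reverse proxies. A minimal sketch in nginx, assuming nginx sits in front of the app (the server name, upstream address, and burst size are placeholders):

```nginx
# Hypothetical fragment: ~100 requests/minute per client IP,
# with a small burst allowance so normal page loads aren't throttled.
limit_req_zone $binary_remote_addr zone=perip:10m rate=100r/m;

server {
    listen 80;
    server_name indico.example.org;   # placeholder

    location / {
        limit_req zone=perip burst=20 nodelay;
        proxy_pass http://127.0.0.1:8000;   # assumed upstream
    }
}
```

Clients over the limit get HTTP 503 by default (configurable via `limit_req_status`).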
I would highly recommend splitting this topic in two: copyright violations and annoying (usually AI-related) crawlers are completely different things.
For copyright, you should probably have a ToS that puts the responsibility on your users (you just take down reported content; if the rights holder then wants to go after the author of the problematic presentation, that's up to them).
For crawlers, if they actually cause disruption for you, then putting something like Anubis in front of your instance would indeed work.
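"In front" here just means Anubis runs as a reverse proxy between the internet and your app, challenging suspicious clients before they reach it. A rough Compose sketch; the image name, ports, env variables, and upstream service are taken from memory of the Anubis docs and should be checked against the current documentation:

```yaml
# Sketch only: verify variable names against the Anubis documentation.
services:
  anubis:
    image: ghcr.io/techarohq/anubis:latest
    environment:
      BIND: ":8923"                  # port Anubis listens on
      TARGET: "http://app:8080"      # protected upstream (assumed name/port)
    ports:
      - "8923:8923"                  # expose only Anubis, not the app
  app:
    image: my-app:latest             # placeholder for your actual service
```

The key point is that only Anubis is published; the app itself is reachable solely through it.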