> Perhaps someone at their end screwed up a loop conditional, but you'd think some monitoring dashboard somewhere would have a warning pop up because of this.
If you've been in any big company you'll know things perpetually run in a degraded, somewhat broken mode. They've even made up the term "error budget" because they can't be bothered to fix the broken shit so now there's an acceptable level of brokenness.
Orgs aren't ruthless like that: anything below a certain % of org revenue isn't worth bothering with, unless it creates _more_ work for the person responsible than fixing it would.
Add some % if the person who gets more work from the problem is not the same as the person who needs to fix it. People will happily leave things in a broken state if no one calls them out on it.
Facebook just decided that instead of loading the robots.txt for every host they intend to crawl, they'll just ignore all the other robots.txt files and then access this one a million times to restore the average.
For some reason, Facebook has been requesting my Forgejo instance's robots.txt in a loop for the past few days, currently at a speed of 7700 requests per hour. The resource usage is negligible, but I'm wondering why it's happening in the first place and how many other robot files they're also requesting repeatedly. Perhaps someone at Meta broke a loop condition.
As facebookexternalhit is listed in the robots.txt, it does look like it's optimistically rechecking in the hope it's no longer disallowed. That rate of request is obscene though, and falls firmly into the category of Bad Bot.
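For comparison, this is what a well-behaved client's check looks like with Python's stdlib robotparser — a sketch, with the rules and user-agent string made up to mirror the situation described above, not taken from the actual robots.txt:

```python
from urllib.robotparser import RobotFileParser

# parse() accepts the file's lines directly, so this runs offline.
# The rules below are illustrative, not the instance's real file.
rp = RobotFileParser()
rp.parse([
    "User-agent: facebookexternalhit",
    "Disallow: /",
])

# A disallowed crawler should back off and retry rarely,
# not re-poll thousands of times per hour hoping the answer changed.
print(rp.can_fetch("facebookexternalhit", "/some/repo"))  # False
```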
While 7700 per hour sounds big, pretty much any dinky server can handle it. So I don't think it's a matter of DDoS. At this point it's just... odd behaviour.
Especially for a txt file. I don't really know anything about webdev, but I'm pretty sure serving 7700 ten-line plaintext files per hour isn't that demanding.
Do crawlers follow/cache 301 permanent redirects? I wonder if you could point the firehose back at Facebook, but it would mean they'd never fetch your robots.txt again (though I'd just blackhole that whole subnet anyway).
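Whether a crawler caches a 301 for robots.txt varies by implementation, so the redirect trick may or may not stick. The blackhole option is cheap either way — a sketch for nginx, assuming the bot identifies itself with a `facebookexternalhit` user-agent (return 444 closes the connection without sending a response):

```nginx
# Flag requests from the offending user-agent (assumed string).
map $http_user_agent $bad_bot {
    default               0;
    ~*facebookexternalhit 1;
}

server {
    # ... existing listen/server_name directives ...

    # Drop flagged requests without a response; costs almost nothing.
    if ($bad_bot) {
        return 444;
    }
}
```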
Has anyone done research on the topic of trying to block these bots by claiming to host illegal material or talking about certain topics? I mean having a few entries in your robots like "/kill-president", "/illegal-music-downloads", "/casino-lucky-tiger-777" etc.
I'm sure their crawler can handle a zip bomb. Plus it might interpret that as "this site doesn't have a robots.txt" and start scraping what OP is trying to prevent with their current robots.txt.
Even if they haven't added any cache-control headers, what kind of lazy Meta engineer designs a crawler to pull the same URL multiple times a second?
Is this where all that hardware for AI projects is going? To data centers that just uncritically hit the same URL over and over, without checking whether the content of a site or page has changed since the last visit and calculating a proper retry interval? Search engine crawlers 25-30 years ago could do this.
Hit the URL once per day; if it changes daily, try twice a day. If it hasn't changed in a week, maybe only retry twice per week.
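The policy above is a few lines of code. A sketch, with illustrative bounds (6 hours to 1 week) that are my choice rather than any standard:

```python
MIN_INTERVAL = 6 * 3600        # never poll more than ~4x/day
MAX_INTERVAL = 7 * 24 * 3600   # never wait longer than a week

def next_interval(current: int, changed: bool) -> int:
    """Halve the wait after a change, double it after a no-change."""
    if changed:
        return max(MIN_INTERVAL, current // 2)
    return min(MAX_INTERVAL, current * 2)

# A page that never changes quickly backs off to the weekly cap:
iv = 24 * 3600
for _ in range(5):
    iv = next_interval(iv, changed=False)
print(iv == MAX_INTERVAL)  # True
```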
Forgejo does set "cache-control: private, max-age=21600", which is considerably more than one second, but I grant it uses the "private" keyword for no reason here.
I recently started maintaining a MediaWiki instance for a niche hobbyist community, and we'd been struggling with poor server performance. I didn't set the server up, so I came into it assuming that the tiny amount of RAM the previous maintainer had given it was the problem.
Turns out all of the major AI slop companies had been hounding our wiki constantly for months, and this had resulted in Apache spawning hundreds of instances, bringing the whole machine to a halt.
Millions upon millions of requests, hundreds of GBs of bandwidth. Thankfully we're using Cloudflare, so we could block all of them except real search engine crawlers, and now we don't have any problems at all. I also made sure to constrain Apache's limits a bit.
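Constraining Apache so a crawler flood degrades gracefully instead of exhausting RAM is mostly a matter of capping workers. A sketch for the event MPM — the numbers are illustrative and should be sized to the machine, not copied:

```apache
<IfModule mpm_event_module>
    # ServerLimit * ThreadsPerChild bounds MaxRequestWorkers.
    ServerLimit              4
    ThreadsPerChild         25
    MaxRequestWorkers      100
    # Recycle processes periodically to contain slow leaks.
    MaxConnectionsPerChild 10000
</IfModule>
```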
From what I've read, forums, wikis, and git repos are the primary targets of harassment by these companies for some reason. The worst part is these bots could just download a git repo or a wiki dump and do whatever they want with it, but instead they are designed to push maximum load onto their victims.
Our wiki, in total, is a few gigabytes. They crawled it thousands of times over.
> If you've been in any big company you'll know things perpetually run in a degraded, somewhat broken mode. They've even made up the term "error budget" because they can't be bothered to fix the broken shit so now there's an acceptable level of brokenness.
Surely it's more likely that it's just cheaper to pay for the errors than to pay to fix the errors.
Why fix 10k worth of errors if the fix costs me 100k?
> Add some % if the person who gets more work from the problem is not the same as the person who needs to fix it. People will happily leave things in a broken state if no one calls them out on it.
How does one learn these skills? I can see them being useful in the future.
> Is this where all that hardware for AI projects is going? To data centers that just uncritically hit the same URL over and over, without checking whether the content of a site or page has changed since the last visit and calculating a proper retry interval? Search engine crawlers 25-30 years ago could do this.
> Hit the URL once per day; if it changes daily, try twice a day. If it hasn't changed in a week, maybe only retry twice per week.
And it's quite a trivial feature at that.