Wednesday, June 19, 2013

Update your robots.txt and nginx.conf

[update: not sure if robots.txt is respected]
We recently had an interesting afternoon when our servers were all but brought down by a crawler called 80legs. This crawler showed no mercy to our poor site and bore down with great vengeance and furious anger on articles so old that the cache didn't have them.
We get a lot of crawlers; if you've spent hours gazing at your instance-Z2.logs like I have (beats a washing machine!), you've seen plenty of them. What was surprising was the amount of traffic this one generated: it was more like a DDoS attack than a crawling service.
The thing with 80legs is: crawls are initiated by their users. Any user. There is no check to see whether the user is affiliated with the site to be crawled, and you can create an account with a fake email address. If they have, as they boast, 50,000 servers across the world fetching 10,000 pages (the maximum on a free account) in no time, I can imagine other sites having trouble with this too.
So after the initial quick fix (blocking the user agent '008' in nginx), I contacted the company to ask what measures they take to keep unthinking (or even malicious) users from bringing down sites. The answer was swift, simple and sucky: nothing. They apologized, and added our site to a list.
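For reference, the nginx side of that quick fix looks roughly like this. It's a sketch rather than our exact config: 80legs' crawler identifies itself with a user agent string containing the token '008' and a link to their webcrawler page, so a simple regex on that is enough. Adjust the match to your own comfort level.

    # inside the relevant server { } block in nginx.conf
    # 80legs sends a user agent along the lines of:
    #   Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/webcrawler.html) ...
    if ($http_user_agent ~ "008/|80legs") {
        return 403;   # or 444 to drop the connection without sending a response
    }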
80legs says they obey robots.txt, and I think it's wise to update yours. But the Wikipedia page on 80legs lists cases where robots.txt was not respected, and a Twitter search suggests there are still problems. So the best approach seems to be not to rely on robots.txt alone, but to set up your web server / cache to reject requests from this user agent.
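If you do add the robots.txt entry anyway, and assuming their crawler matches on the same '008' token it sends as a user agent, shutting it out completely would look something like:

    User-agent: 008
    Disallow: /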