I've done crawling at a small startup and I've done crawling at a big tech company. This is not crawling more politely than big tech.
There are a few things that stand out, like:
> I fetch all robots.txts for given URLs in parallel inside the queue's enqueue function.
Could this end up DoS'ing a site, or at least being "impolite", through robots.txt requests alone?
All of this logic is per-domain, but nothing is said about what constitutes a domain. If the definition is naive, it could easily end up overloading a server that uses wildcard subdomains to serve its content, like Substack, which puts each blog on a separate subdomain.
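To make the "what constitutes a domain" concern concrete, here is a minimal sketch of a grouping key that puts every subdomain of a site into one rate-limit bucket. The function name `rate_limit_key` and the tiny `MULTI_LABEL_SUFFIXES` set are assumptions made for illustration; a real crawler would more likely consult the full Public Suffix List than a hard-coded set.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny stand-in for the Public Suffix List.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}


def rate_limit_key(url: str) -> str:
    """Return the key under which requests to `url` share one rate limit."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")

    # IP literals have no subdomains, so use them unchanged.
    try:
        ipaddress.ip_address(host)
        return host
    except ValueError:
        pass

    labels = host.split(".")
    if len(labels) <= 2:
        return host

    # If the last two labels form a known multi-label suffix (e.g. co.uk),
    # keep three labels; otherwise keep two.
    last_two = ".".join(labels[-2:])
    keep = 3 if last_two in MULTI_LABEL_SUFFIXES else 2
    return ".".join(labels[-keep:])
```

With this key, `alice.substack.com` and `bob.substack.com` both count against `substack.com`. That is conservative for hosts that really are separate servers, which seems like the safer direction to be wrong in.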
When I was at a small startup doing crawling, the main thing our partners wanted from us was a maximum hit rate (varied by partner). We typically promised fewer than 1 request per second, which would never cause perceptible load, and was usually sufficient for our use-case.
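A per-partner maximum hit rate like this can be sketched as a minimum interval between requests for each key. The class below is an illustration only, not the system described above; `HostRateLimiter`, `set_interval`, and `reserve` are hypothetical names, and the clock is injectable so the logic can be checked without sleeping.

```python
import time


class HostRateLimiter:
    """Hands out request slots spaced at least `interval` seconds apart per key."""

    def __init__(self, default_interval=1.0, clock=time.monotonic):
        self.default_interval = default_interval
        self.intervals = {}     # per-key overrides, e.g. one per partner
        self.next_allowed = {}  # key -> earliest time the next request may start
        self.clock = clock

    def set_interval(self, key, seconds):
        """Override the minimum spacing for one key."""
        self.intervals[key] = seconds

    def reserve(self, key):
        """Reserve the next slot for `key` and return how long to wait first."""
        now = self.clock()
        interval = self.intervals.get(key, self.default_interval)
        start = max(now, self.next_allowed.get(key, now))
        self.next_allowed[key] = start + interval
        return start - now
```

A caller would do something like `time.sleep(limiter.reserve(key))` before each request, with `set_interval` holding whatever rate was promised to a given partner.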
Here at $BigTech, the systems for ensuring "polite" and policy-compliant crawling (robots.txt etc.) are more extensive than I could possibly have imagined before coming here.
It doesn't surprise me that OpenAI and Amazon don't have great systems for this, since both are new to the crawling world. But concluding that "Big Tech" doesn't do polite crawling is a bit of a stretch, given that search engines are most likely doing the best crawling available.
It’s probably a huge liability to not have very advanced and compliant crawlers.
Accidentally DDoSing several businesses seems like a recipe for an expensive lawsuit.
I think a default max of 1 request every 5 seconds is unnecessarily meek, especially for larger sites. I'd also argue that requests that browsers don't slow down for, like following redirects to the same domain or links with the prefetch attribute, don't really necessitate a delay at all.
If you can detect that a site has a CDN, that metrics like time-to-first-byte are low and stable, and/or that cache-control headers indicate you're mostly getting cached pages, I see no reason why one shouldn't speed up, at least for domains with millions of URLs.
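One way to express "speed up when the site is clearly coping, slow down when it complains" is a small delay-adjustment function. This is a sketch under stated assumptions: the function name `next_delay` and every threshold in it (`fast_ttfb`, `stable_ratio`, the 10% speed-up, the doubling on errors) are made-up illustrative values, not numbers from this thread.

```python
from statistics import mean, pstdev


def next_delay(current_delay, recent_ttfb, status,
               min_delay=0.5, max_delay=60.0,
               fast_ttfb=0.2, stable_ratio=0.25):
    """Return the delay (seconds) to use before the next request to a site.

    recent_ttfb: recent time-to-first-byte samples in seconds.
    status: HTTP status code of the most recent response.
    """
    # Back off hard when the server signals overload or errors.
    if status == 429 or status >= 500:
        return min(max_delay, current_delay * 2)

    # Speed up gently only with enough samples that are both fast and stable.
    if len(recent_ttfb) >= 5:
        avg = mean(recent_ttfb)
        if avg <= fast_ttfb and pstdev(recent_ttfb) <= stable_ratio * avg:
            return max(min_delay, current_delay * 0.9)

    return current_delay
```

The asymmetry (halving the rate on trouble, but only shaving 10% off the delay when things look healthy) is a design choice meant to keep mistakes cheap for the site being crawled.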
I disagree with using HEAD requests for refreshing. A HEAD request is rarely cheaper than a GET with If-Modified-Since/If-None-Match, and for some websites it is more expensive. Besides, you're going to fetch the page anyway if it changed, so why issue two requests when you could do one?
Having a single crawler per process/thread makes rate limiting easier, but it can lead to balancing and under-utilization issues in distributed crawling because of the massive variation in URLs per domain and in site speeds, especially if you use something like a hash to distribute them. For Common Crawl, I had something that monitored utilization and shut down crawler instances; URLs pending on the machines shutting down were redistributed to the machines that were left. (We were doing it on a shoestring budget using AWS spot instances, so it had to survive instances going down randomly anyway.)
I'd say one of the most polite things to do when crawling is to add a URL to the crawler's user agent pointing to a page that explains what it is, and maybe lets people opt out or explains how to update their robots.txt to opt out.
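For reference, a single conditional GET can be sketched with the standard library as below. The helper names `build_conditional_request` and `refresh` and the `example-crawler/0.1` user agent are placeholders invented for this example; the stored `etag` and `last_modified` values are assumed to come from a previous response for the same URL.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None,
                              user_agent="example-crawler/0.1"):
    """Build a GET that asks the server to reply 304 if nothing changed."""
    req = urllib.request.Request(url, method="GET")
    req.add_header("User-Agent", user_agent)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req


def refresh(url, etag=None, last_modified=None, timeout=30):
    """Return (changed, body, etag, last_modified) using one request."""
    req = build_conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return (True, resp.read(),
                    resp.headers.get("ETag"),
                    resp.headers.get("Last-Modified"))
    except urllib.error.HTTPError as err:
        # urllib reports 304 Not Modified as an HTTPError.
        if err.code == 304:
            return (False, None, etag, last_modified)
        raise
```

If the page is unchanged, the server can answer with a bodiless 304; if it changed, the same request already carries the new content, so no second fetch is needed.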
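As a small illustration, a user agent string carrying such a URL might be assembled like this. The `name/version (+url)` layout follows a convention several well-known crawlers use; the function name `build_user_agent`, the bot name, and the URL in the usage note are placeholders.

```python
def build_user_agent(name, version, info_url, contact=None):
    """Build a crawler user agent that points at a page describing the bot."""
    details = [f"+{info_url}"]
    if contact:
        details.append(contact)
    return f"{name}/{version} ({'; '.join(details)})"
```

For example, `build_user_agent("ExampleBot", "1.0", "https://example.com/bot")` gives `ExampleBot/1.0 (+https://example.com/bot)`, and the page at that URL is where the opt-out instructions would live.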
A few implementation details from building a hobby crawler
If you have cache headers, why use HEAD? Are servers more likely to handle HEAD correctly than including them on the GET?
> Are servers more likely to handle HEAD correctly than
In my experience there are a lot of servers that don't handle HEAD at all, let alone correctly.
Was all of our posting on the net on forums, HN, Reddit, digg, Slashdot, etc. just to train the AI of the future? I think about this a lot. AI has that "annoying forum poster" tone to everything and now I can't unsee it when I (rarely) use it. Maybe I'm just post-internet. I've been thinking about that a lot also. I'm tired of 99.75% of the internet.
> AI has that "annoying forum poster" tone to everything
Does it? What I've seen has been more like "annoying customer service representative" instead.
It would be pretty sad if AI stole all the content from a forum killing it with the added load only to regurgitate it.
If the author reads this, you have a misspelling of "diaspora" in the first sentence.
Also, `padding: 1em` would go a long way toward making the page readable.
Agreed. Reader View to the rescue (again /sigh).
This is timely as I’m just building out a crawler in scrapy. Thanks!
I'm just beginning to learn about curl and wget. Can anyone recommend similar resources to this one that emphasize politeness?
For example, I'd like to grab quite a few books from archive.org, but want to use their torrent option, when available. I don't like the idea of "slamming" their site because I'm trying to grab 400 books at once.
Some basic stuff on curl rate limiting at https://everything.curl.dev/usingcurl/transfers/rate-limitin...
Thank you kindly for the recommendation! It's much appreciated. I look forward to using that along with --rate to lower the requests themselves, since --limit-rate is for the average speed, not necessarily the max speed.
If you don't mind, I'd like to pick your brain for two other questions. :)
Question 1. Do you know of any sort of commonly used "normal" speed? I have been using --limit-rate 50k. I have the ability to go much faster, but I don't know how fast is too fast. 100k? 500k? 1m? 100g?! 1m is probably too much, but I'm not sure by how much.
I was thinking there might be a way to click around the site with DevTools -> Network and observe how quickly things are moving around, then stay under that threshold, but I don't know if there's a more obvious solution I'm not thinking of.
Question 2. Regarding `robots.txt`, the linked article mentions:
> If a site doesn't specify a crawl-delay in robots.txt, I default to one request every five seconds. If I get 429s, I slow down.
Is the author trying to say: "If `robots.txt` DOES specify a `crawl-delay` or `limit-rate` value, curl and wget will AUTOMATICALLY obey that specified value"?
Or, is it simply: "I MANUALLY check foo.bar/robots.txt and MANUALLY configure `crawl-delay` and/or `limit-rate` to the specified value. Otherwise, I set it to 5 (or higher, if I start getting 429'd)"?
I'm guessing the latter, but it'd be sweet if it's the former. It would make sense for an automatic tool to have an automatic configuration.
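Nothing in the quoted sentence says curl or wget apply the delay for you, so the sketch below assumes the second reading: a script of your own reads `crawl-delay` and falls back to five seconds when it is absent. It uses Python's standard `urllib.robotparser`; the function name `crawl_delay_for` and the `ExampleBot` user agent are placeholders.

```python
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY = 5.0  # seconds; the fallback mentioned in the quoted article


def crawl_delay_for(robots_txt, user_agent, default=DEFAULT_DELAY):
    """Return the Crawl-delay that applies to `user_agent`, or `default`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)  # None when no delay is specified
    return float(delay) if delay is not None else default
```

A download loop would then sleep for `crawl_delay_for(text, "ExampleBot/1.0")` seconds between requests, and lengthen that pause if it starts receiving 429 responses.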
I doubt big tech cares much about whether they are doing this to a website. They just want to fiercely battle the competition and make profits.