Big SEOs with Crawlers: Let’s See Your Stats
OK, I’m just ecstatic with my new crawler, I think nobody but Google has one better than me, and I’m ready for a good old-fashioned show-and-tell. Multi-threaded programming is a bear to deal with, and I’ve written several crawlers in different languages. For years I’ve been plagued by several complex problems:
* Complex code that is difficult to maintain and difficult to set up on a server
* Memory leaks
* Lack of configurability
So the latest design is just 192 lines of Python in a single file, has a single configuration file, and takes about 5 minutes to set up on a standard Linux machine (there’s a rough sketch of the shape of it after the stats below). I ran it last night and was delighted with the results:
Test Run
* Tested 139,740 URLs
* Completed in 2 hrs, 13 mins
* 3.6 GB of HTML
* Average file size: 25.05 KB

Averaging
* 18.2 URLs/second
* 1.572 million URLs/day
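For anyone checking the arithmetic, the daily figure is just the per-second average extrapolated over 24 hours:

```python
# Extrapolating the measured average rate over a full day
urls_per_second = 18.2
urls_per_day = urls_per_second * 60 * 60 * 24
print(urls_per_day)   # 1572480.0, i.e. roughly 1.572 million URLs/day
```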
Hardware and Environment
* 3-year-old Dell PowerEdge SC240
* Pentium 4
* 3.5 GB of RAM
* Average CPU load: 0.16
* Average physical RAM used: 950 MB
* OS: Ubuntu 7.10 (Gutsy Gibbon)
* Filesystem: ReiserFS 3
* Network connection: residential cable modem, 5 Mbps down (100% of which is consumed while it’s running, so it would likely go faster on a fatter pipe)
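I won’t paste the whole script, but to give a feel for the shape of the design described above (one file, a pool of worker threads pulling URLs off a queue, a single config file), here’s a stripped-down sketch. The config file name, keys, and URL-list format are placeholders I made up, not the real code:

```python
# crawler_sketch.py -- a rough sketch of the design, not the actual 192-line script.
import configparser
import os
import queue
import threading
import urllib.request

config = configparser.ConfigParser()
config.read("crawler.conf")                      # the single configuration file
num_threads = config.getint("crawler", "threads", fallback=10)
storage_dir = config.get("crawler", "storage_dir", fallback="pages")
os.makedirs(storage_dir, exist_ok=True)

url_queue = queue.Queue()

def worker():
    # Each thread pulls (id, url) pairs off the queue and writes the raw HTML to disk.
    while True:
        doc_id, url = url_queue.get()
        try:
            html = urllib.request.urlopen(url, timeout=10).read()
            with open(os.path.join(storage_dir, str(doc_id)), "wb") as f:
                f.write(html)
        except Exception:
            pass                                 # a real crawler would log and retry
        finally:
            url_queue.task_done()

for _ in range(num_threads):
    threading.Thread(target=worker, daemon=True).start()

with open("urls.txt") as urls:                   # one URL per line
    for doc_id, url in enumerate(urls):
        url_queue.put((doc_id, url.strip()))

url_queue.join()
```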
Even better, this design scales out: we’ll spread it across as many machines as necessary to download the entire internet.
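I haven’t worked out the exact split yet, but one obvious way to spread the work (just an idea at this point, not something already built) is to shard URLs across machines by hashing the hostname, so each site is only ever crawled from one box:

```python
import hashlib
from urllib.parse import urlparse

def machine_for(url: str, num_machines: int) -> int:
    """Pick which crawler machine owns a URL by hashing its hostname,
    so every page from a given site lands on the same machine."""
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_machines

# e.g. machine_for("http://example.com/some/page", 4) returns a stable index in 0..3
```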
Your per-day number is pretty ridiculous. Do you mind going into how you’re storing the pages?
So we get to hear you brag but not see the code?
At my last job one of the developers did a similar thing in Ruby, while the rest of us toiled in Perl to get something similar. I’d say that you’ve got quite a winner on your hands.
When you say you’re testing URLs, what exactly are you testing? Are you checking for keywords or regex matches, or are you verifying that a found URL exists and then pulling URLs from the (new) URL?
@Jordan
The pages are stored as flat files, all in the same directory, and are named by ID. Usually performance gets pretty lousy when you put too many files in one directory, but after reading this article on filesystem performance I tried formatting my storage drive with ReiserFS 3 and it was all good. Before that I used the hashed-subdirectory method that the blogger mentions, which works well, but if you don’t have to do all that checking within the crawler you can go that much faster.
http://ygingras.net/b/2007/12/too-many-files%3A-reiser-fs-vs-hashed-paths
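If you haven’t seen the hashed-path trick that article talks about, the contrast looks roughly like this (a sketch only; the two-level split depth is my own choice for illustration):

```python
import hashlib
import os

def flat_path(storage_dir: str, doc_id: int) -> str:
    """What I'm doing now: every page in one directory, named by ID.
    ReiserFS 3 kept this fast even at ~140k files."""
    return os.path.join(storage_dir, str(doc_id))

def hashed_path(storage_dir: str, doc_id: int) -> str:
    """The hashed-subdirectory alternative: nest files under directories
    derived from a hash of the ID so no single directory gets huge."""
    h = hashlib.md5(str(doc_id).encode("utf-8")).hexdigest()
    return os.path.join(storage_dir, h[:2], h[2:4], str(doc_id))

# flat_path("pages", 12345)   -> "pages/12345"
# hashed_path("pages", 12345) -> "pages/82/7c/12345"
```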
@Michael
I attempted the same thing in Ruby but could never achieve stellar results from it. I’m considering releasing the code. As for the checking, this code is purely retrieving content from the web. Other processes will come into play after retrieval for some of the things you’re speaking of, as well as for dumping the content into a Lucene index.
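To give a rough picture of that split: the crawler only writes raw HTML to disk, and a separate pass can walk the store later and hand each page to whatever does the checking and indexing. The process_page function below is just a hypothetical placeholder for those later steps, not real code:

```python
import os

STORAGE_DIR = "pages"   # assumed name for the flat-file store described above

def process_page(doc_id: str, html: bytes) -> None:
    """Hypothetical stand-in for the later steps: keyword/regex checks,
    link extraction, pushing parsed text into the Lucene index, etc."""
    print(f"would process {doc_id}: {len(html)} bytes")

for name in os.listdir(STORAGE_DIR):
    with open(os.path.join(STORAGE_DIR, name), "rb") as f:
        process_page(name, f.read())
```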
This is the coolest thing I’ve heard of in a while. I was wondering what it would take to write a crawler after a project that I was interested in pursuing took me in that direction. I never did anything with it but this is fascinating reading.
p.s. Love your theme…
You say “this code is purely retrieving content from the web”. So is this essentially a site downloader, i.e. functionally similar or identical to what HTTrack does (http://www.httrack.com/)?
@Ramon
It’s only downloading HTML for SEO analysis.
Hey Tony,
I’ve had Nutch throttled back and it still filled a 30-40 Mbit line. That’s off of one server.
I forget exactly how much it downloaded, but it was in the range of the number of URLs you’re talking about. And again, that was throttled back. Not sure what it would have done wide open.
@wheel
Nutch is indeed impressive, but damn is it difficult to get up and running. I spent some time exploring it to see if I could mould it to fit my needs and decided it would require too much hacking and would violate one of my requirements: a simple setup.
How can I get this code, whether free or paid? If you charge for it, I’m willing to pay.