Comments on: Big SEO's with Crawlers: Lets See Your Stats
http://www.tonyspencer.com/2008/01/03/big-seos-with-crawlers-lets-see-your-stats/

By: Paulo Ricardo (Fri, 31 Jul 2009 12:10:52)
How can I get this code, whether free or paid? If you charge for it, I'm willing to pay.

By: tony (Tue, 25 Mar 2008 16:02:56)
@wheel
Nutch is indeed impressive, but damn is it difficult to get up and running. I spent some time exploring it to see if I could mould it to fit my needs, and decided it would require too much hacking and would violate one of my requirements: a simple setup.

By: wheel (Thu, 31 Jan 2008 12:32:41)
Hey Tony,

I've had Nutch throttled back and it still filled a 30-40 megabit line. That's off of one server.

I forget exactly how much it downloaded, but it was in the range of the number of URLs you're talking about. And again, that was throttled back. Not sure what it would have done wide open.
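
A minimal sketch of the per-host throttling wheel is describing, in Python; Nutch itself controls this through config properties such as fetcher.server.delay, and the class name and delay value below are illustrative only:

    import time
    from collections import defaultdict
    from urllib.parse import urlparse

    class HostThrottle:
        """Enforce a minimum delay between requests to the same host
        (hypothetical helper; Nutch handles this internally)."""

        def __init__(self, delay_seconds=5.0):
            self.delay = delay_seconds
            self.last_hit = defaultdict(float)  # host -> time of last request

        def wait(self, url):
            host = urlparse(url).netloc
            remaining = self.delay - (time.time() - self.last_hit[host])
            if remaining > 0:
                time.sleep(remaining)  # back off until the polite delay passes
            self.last_hit[host] = time.time()

Calling wait(url) before every fetch caps the hit rate per host while still letting many hosts be fetched in parallel, which is how a "throttled back" crawler can still saturate a fat pipe.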

By: tony (Sat, 19 Jan 2008 21:12:45)
@Ramon
It's only downloading HTML for SEO analysis.
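
A minimal sketch of what HTML-only retrieval can look like, assuming the crawler inspects the Content-Type response header before keeping a body; the user-agent string is a made-up placeholder:

    import urllib.request

    def fetch_html(url, timeout=10.0):
        """Return page markup, or None when the response isn't HTML."""
        req = urllib.request.Request(url, headers={"User-Agent": "seo-crawler/0.1"})
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            if "text/html" not in resp.headers.get("Content-Type", ""):
                return None  # skip images, PDFs, and other non-HTML responses
            return resp.read().decode("utf-8", errors="replace")

Anything that isn't text/html gets dropped before the body is ever parsed, which keeps the crawl focused on pages worth analyzing.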

By: Ramon (Sat, 19 Jan 2008 19:45:14)
"this code is purely retrieving content from the web" -- so is this essentially a site downloader, i.e. functionally similar or identical to what HTTrack does (http://www.httrack.com/)?

By: Vivevtvivas (Sun, 13 Jan 2008 01:16:25)
This is the coolest thing I've heard of in a while. I was wondering what it would take to write a crawler after a project I was interested in pursuing took me in that direction. I never did anything with it, but this is fascinating reading.

p.s. Love your theme…

By: tony (Fri, 04 Jan 2008 20:40:16)
@Jordan
The pages are stored as flat files, all in the same directory, and are named by ID. Usually performance gets pretty lousy when you put too many files in one directory, but after reading this article on filesystem performance I tried formatting my storage drive as ReiserFS3 and it was all good. Before that I used the hashing-subdirectory method that blogger mentions, which works well, but if you don't have to do all that checking within the crawler you can go that much faster.
http://ygingras.net/b/2007/12/too-many-files%3A-reiser-fs-vs-hashed-paths
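
A minimal sketch of the hashing-subdirectory scheme that article contrasts with ReiserFS, assuming two directory levels derived from a hash of the page ID; the function name and layout are illustrative, not the actual crawler code:

    import hashlib
    import os

    def hashed_path(base_dir, page_id, levels=2, width=2):
        """Map a page ID to a nested path, e.g. base/ab/cd/12345.html."""
        digest = hashlib.md5(str(page_id).encode()).hexdigest()
        parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
        return os.path.join(base_dir, *parts, str(page_id) + ".html")

    # Example: spread millions of pages across 256 * 256 directories.
    path = hashed_path("/data/pages", 12345)
    os.makedirs(os.path.dirname(path), exist_ok=True)

The hash keeps any single directory from accumulating millions of entries, at the cost of the extra path computation and directory creation on every write.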

@Michael
I attempted the same thing in Ruby but could never achieve stellar results from it. I'm considering releasing the code. As for the checking, this code is purely retrieving content from the web. Other processes will come into play after retrieval for some of the things you're speaking of, as well as dumping the content into a Lucene index.
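
A minimal sketch of that post-retrieval indexing step; since the other sketches here are Python, it uses Whoosh, a Lucene-style pure-Python library, in place of Lucene itself, and the schema fields are assumptions:

    import os
    from whoosh.fields import ID, TEXT, Schema
    from whoosh.index import create_in

    # One stored field for the URL, one analyzed field for the page text.
    schema = Schema(url=ID(stored=True, unique=True), content=TEXT)
    os.makedirs("indexdir", exist_ok=True)
    ix = create_in("indexdir", schema)

    writer = ix.writer()
    # A real run would walk the flat-file store and add every fetched page.
    writer.add_document(url="http://example.com/", content="page text here")
    writer.commit()

Keeping indexing in a separate process like this is what lets the fetcher itself stay a tight download loop.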

By: Michael Thompson (Fri, 04 Jan 2008 18:18:19)
So we get to hear you brag but not see the code? ;)

At my last job one of the developers did a similar thing in Ruby, while the rest of us toiled in Perl to get something similar. I'd say you've got quite a winner on your hands.

When you say you’re testing URLs, what exactly are you testing? Are you checking for keywords or regex matches, or are you verifying that a found URL exists and then pulling URLs from the (new) URL?

By: Jordan Glasner (Fri, 04 Jan 2008 16:06:44)
Your per-day number is pretty ridiculous. Do you mind going into how you're storing the pages?
