Ruby Could Replace my Python Crawler Pretty Soon
One of my developers just sent me some truly incredible stats about Ruby 1.9 and its threading performance.
20 threads * 100,000 iterations
Ruby 1.9 = 1.54 s.
Ruby Enterprise = 3.01 s.
JRuby 1.1.2 = 5.82 s.
Jython 2.2.1 = 11.86 s.
Python 2.5.2 = 12.32 s.
Ruby 1.8.7 = 22.68 s.
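I don't have the actual benchmark code in front of me (it's in the Polish post linked below), but the shape of that kind of test is roughly this; the per-thread busy work here is just a placeholder, a minimal sketch rather than the original benchmark:

require 'benchmark'

# Rough sketch of a 20-thread x 100,000-iteration microbenchmark.
# Not the original code from the Polish post; the thread body is
# assumed trivial busy work just to exercise the scheduler.
THREADS    = 20
ITERATIONS = 100_000

seconds = Benchmark.realtime do
  workers = Array.new(THREADS) do
    Thread.new do
      count = 0
      ITERATIONS.times { count += 1 }
    end
  end
  workers.each(&:join)
end

puts "#{THREADS} threads * #{ITERATIONS} iterations: #{'%.2f' % seconds} s"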
Since our attempt at testing Ruby as a crawler really wasn't all that much slower than Python, it could be really interesting to see what happens with Ruby 1.9.
The blog post about the test (it's in Polish)
I’m convinced Twitter needs a complete rewrite
… from scratch. Yeah, the performance is incredibly horrible, and I really feel like I could take a small chunk of that $15 million and immediately make the performance rock, but I'm starting to feel like the developers who built it completely cocked it all up. The damn thing can't even store my replies right, and I've heard others complain about this. I take it it's bad AJAX code, and I think they should reconsider using AJAX for basic functions.
Example:
I replied to @benwills' tweet, but if you click on the "in reply to benwills" link in my reply, it goes to the wrong tweet.
Twitter is giving Rails a bad name
Ugh. Rebuild it already. It's only a few actions. Rails wasn't built for this kind of app.
Python, C, Perl, whatever.
Google Crawling HTML Forms IS Harmful to Your Rankings
A couple of months ago Google officially announced it would be "exploring some HTML forms to try to discover new web pages". I imagine more than a few SEOs were baffled by this decision, as was I, but probably weren't too concerned, since Google promised us all that "this change doesn't reduce PageRank for your other pages" and would only increase your exposure in the engines.
During the month of April I began to notice that a lot of our internal search pages were not only indexed but outranking the relevant pages for a user's query. For instance, if you Googled "SubConscious Subs", the first page to appear in the SERPs would be something like:
http://raleigh.ohsohandy.com/ads/search?q=tables
rather than the page for the establishment:
http://raleigh.ohsohandy.com/review/27571-sub-concious-subs
This wasn't just a random occurrence. It was happening a lot, and in addition to the landing pages being far less relevant for the user, they weren't optimized for good placement in the search engines, so they were appearing at something like position #20 instead of, say, position #6. These local search pages even had PageRank, usually between 2 and 3.
Hmm, Just How Bad Is This Problem?
Eventually I began to realize how often I was running into this in Google, noticed my recent, slow decline in traffic, and it occurred to me that this might be a real problem. I've never linked to any local search pages on OhSoHandy.com, and I couldn't see that anyone else had either. I queried to find out how many search pages Google had indexed:
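Something along these lines, a site: query restricted to the search path (illustrative; I'm not reproducing the exact query here):

site:ohsohandy.com inurl:ads/search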
Whoa. 5,000+ pages of junk in the index, with PageRank. I slept on it for a night, got up the next morning, and plugged in
Disallow: /ads/search?q=*
in robots.txt (and threw in a meta robots noindex on those pages for good measure). Within a week we saw a big improvement in rankings, with properly optimized pages trumping the crap, and traffic is up 25% since the change, back to trending upward weekly instead of a stagnant, slow decline.
Bit of Advice
The robots.txt disallow works, but it is slow to remove the URLs from Google's index. I added the meta noindex tag to the search pages a week later and saw much faster results.
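For reference, the combination looks roughly like this; the catch-all User-agent line is assumed, the Disallow rule is the one above, and the meta tag goes in the head of each search results page:

# robots.txt
User-agent: *
Disallow: /ads/search?q=*

<!-- in the <head> of each /ads/search page -->
<meta name="robots" content="noindex">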