OK, I’m just ecstatic with my new crawler. I think nobody but Google has a better one, and I’m ready for a good old-fashioned show-and-tell. Multi-threaded programming is a bear to deal with, and I’ve written several crawlers in different languages. For years I’ve been plagued by several complex problems:
* Complex code that is difficult to maintain and difficult to set up on a server
* Memory leakage
* Configurability
So the latest design is just 192 lines of Python in a single file, has a single configuration file, and takes about 5 minutes to set up on a standard Linux machine. I ran it last night and was delighted with the results:
Test Run
Tested 139,740 urls
Completed in 2 hrs, 13 mins
3.6 GB of html
Average filesize: 25.05 KB
Averaging
18.2 urls/second
1.572 million urls/day
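For anyone who wants to check the math, here is a minimal sketch of how the per-day figure follows from the per-second rate. The function and constant names are made up for illustration and are not taken from the crawler itself. Note that dividing the URL count by the rounded "2 hrs, 13 mins" gives a slightly lower rate (about 17.5 urls/second) than the reported 18.2, so the reported rate presumably comes from a more precise timer.

```python
# Figures copied from the test run above.
URLS_TESTED = 139_740
REPORTED_RATE = 18.2                    # urls/second, as reported
ELAPSED_SECONDS = (2 * 60 + 13) * 60    # "2 hrs, 13 mins", rounded to the minute

SECONDS_PER_DAY = 24 * 60 * 60


def urls_per_day(urls_per_second: float) -> float:
    """Project a sustained per-second rate over a full day."""
    return urls_per_second * SECONDS_PER_DAY


def rate_from_totals(urls: int, seconds: int) -> float:
    """Average rate implied by a URL count and an elapsed time."""
    return urls / seconds


if __name__ == "__main__":
    print(f"{urls_per_day(REPORTED_RATE) / 1e6:.3f} million urls/day")
    print(f"{rate_from_totals(URLS_TESTED, ELAPSED_SECONDS):.1f} urls/second from totals")
```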
Hardware and Environment
3-year-old Dell PowerEdge SC240
Pentium 4
3.5 GB of RAM
Average CPU load: 0.16
Average physical RAM used: 950 MB
OS: Ubuntu 7.10 (Gutsy Gibbon)
Filesystem: ReiserFS 3
Network connection:
Residential cable modem, 5 Mbps down (100% of which is consumed when it’s running, so it would likely be faster on a fatter pipe)
Even better, this code is infinitely extensible. We’ll spread it across as many machines as necessary to download the entire internet.
Big SEOs with crawlers… what are your stats?
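The post does not say how the work would be divided between machines. One common approach, sketched below purely as an assumption, is to hash each URL's host name so that every page from a given site lands on the same machine. The names `machine_for` and `partition` are hypothetical and do not come from the crawler.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def machine_for(url: str, machine_count: int) -> int:
    """Pick a machine index for a URL from a stable hash of its host name."""
    host = urlsplit(url).hostname or ""   # hostname is lowercased by urlsplit
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % machine_count


def partition(urls, machine_count):
    """Group URLs into per-machine lists, keyed by machine index."""
    shards = defaultdict(list)
    for url in urls:
        shards[machine_for(url, machine_count)].append(url)
    return dict(shards)


if __name__ == "__main__":
    sample = [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.org/",
    ]
    for index, shard in sorted(partition(sample, 3).items()):
        print(index, shard)
```

Keeping a whole host on one machine makes per-site politeness (rate limiting, robots.txt caching) a local concern, at the cost of uneven load when a few hosts dominate the URL list.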
Since rebranding some of our old classifieds sites and relaunching the system as OhSoHandy.com in a newly built Ruby on Rails app, we’ve received a handful of emails complaining about strange behavior that always involved links not appearing for the user.
“How do you read the rest of the postings or see any pictures that were uploaded?!?! There are no links on the classifieds to keep reading them. Please help since I am new to the website.”
At first I discounted this as user error. “These fools don’t know how to use the internets!” DELETE.
Sometimes you add a column to a table in a migration and then want to populate the new column with some data in the same migration. You run the migration and, while the column has been created in the database, your data does not populate. The problem is that ActiveRecord has cached the model’s column information from before the change, so the new column is not yet accessible through the model. You just need to tell it to reload itself:
add_column :users, :favorite_beer, :string # table name assumed to follow the Rails default (users)
User.reset_column_information # <<<< here ActiveRecord reloads its cached column list
tony = User.find_by_name "Tony Spencer"
tony.favorite_beer = "Terrapin Rye Pale Ale"
tony.save
We’ve been using Basecamp for some time now to manage multiple projects and I have really enjoyed it except for the lack of integrated issue/bug tracking. I’ve tried hacking to-do lists and categorizing messages but I just can’t make Basecamp work for our issue tracking even though I don’t need fancy features. I just want to rapidly log/assign issues to team members, change status, and reassign back to me when the issue is completed.
For years I’ve been using Mantis and it works, but it’s quirky and rather slow to work with as the interface isn’t designed all that well. There is also some stupid bug that makes it impossible for me to sort issues by different columns. I’ve just signed up for Lighthouse and here are a few pros and cons I’ve noticed immediately:
- As a technical manager I like to be able to enter bugs/issues quickly without using the mouse. Basecamp to-do lists are very nice this way, as I can quickly type, tab, and hit the space bar to enter an item and assign it to someone. The create ticket feature forces me to pick up the mouse and click in several places, which slows things down. It would also be very nice if tickets were created with AJAX, as to-do items in BC are, so I can very quickly fill up people’s queues. (Hey, my guys work fast so I have to enter bugs fast!)
- It’s not very apparent which project I’m currently managing. Only the small drop down on the right lets me know. I wish Lighthouse would make the current project name more prominent like in Basecamp. Also it would be quicker to bounce around between projects if they were a list of links rather than a select list.
- There is no issue tracking in Basecamp which is why I am giving this nice looking app a try. However, I would continue to use Basecamp for other aspects of the project. It would be great if they could drop in my URL to a project in Basecamp when I create the project in Lighthouse so it could provide me that link in the right nav so I could jump back there.
- I like the ability to add an avatar to users in Lighthouse. Helps to make it easier to see who did what and gives it a personal touch.
- The “feature updates” box is taking up too much of the real estate on every page and never goes away.
- The top header is a little too big and is wasting space above the fold, hindering me from seeing more without scrolling.
- I like the ability to pay with a PayPal subscription, which got me up and running very quickly.
- The ability to create a simple “Page” is nice. Currently we have a writeboard in one project in Basecamp where we keep all the info about our server setup, such as gems to install, cron jobs, where files exist, and how to deploy. The problem with that is I can’t share it with everyone without adding everyone to that project, and it really isn’t specific to that one project. Pages solves that in Lighthouse. I will now also add pages like coding best practices and Subversion how-tos.
I know I published a lot of negatives here but on the whole I’m liking this hosted app and would love to get away from stinking Mantis and managing my own bug tracking system. I’ll post more updates as we use it more.
Update to Lighthouse Issue Tracking
It looks like they removed the banner that was wasting space which is nice. However, one BIG problem I discovered:
I cannot use a “pre” tag to drop in HTML without it being rendered by the browser, which makes it very hard for me to show a designer or developer some HTML I want them to use.
Also I can now tab to the field where you select a user to assign a ticket to but I still cannot change that field without picking up the mouse and clicking on it.
Damn I wish there were a simple interface for entering bugs that looked something like this