Tenderlove Making

A Mechanize Spider

A friend of mine pointed me at the Fear Perl module today, and it gave me some ideas for Mechanize. I couldn’t believe how small a spider written with the Fear API is:

~~~ perl
url("google.com");
&$_ >> _self while $_;
~~~

That is really amazing! I also can’t read it... After looking at the Fear innards, I finally understood the code, so I tried to reproduce it with Mechanize. This is what I came up with:

~~~ ruby
agent = WWW::Mechanize.new
stack = agent.get(ARGV[0]).links
while l = stack.pop
  stack.push(*(agent.click(l).links)) unless agent.visited? l.href
end
~~~

To get this to work, I added the “visited?” method to the yet-to-be-released 0.4.6 version of Mechanize. Mine takes a few more lines, but it’s still pretty small. I still don’t like my spider, though, because it will visit any domain. I don’t really want it to try to read the entire internet, so I added the following line at the top of the while loop:

~~~ ruby
next unless l.uri.host == agent.history.first.uri.host
~~~

Can I make it shorter? I’m not sure yet. Do I want it to be shorter? Not sure about that either.
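For anyone who wants to poke at the traversal without hitting the network, here is a rough sketch of the same stack-based loop with the host check folded in. It does not use Mechanize at all: the `PAGES` hash and the `crawl` method are hypothetical stand-ins I made up for illustration, with `PAGES` playing the role of the web and a `Set` playing the role of the agent’s history.

~~~ ruby
require 'set'
require 'uri'

# Hypothetical in-memory "web": each URL maps to the links found on that page.
PAGES = {
  'http://example.com/'    => ['http://example.com/a', 'http://other.example/'],
  'http://example.com/a'   => ['http://example.com/', 'http://example.com/b'],
  'http://example.com/b'   => [],
  'http://other.example/'  => ['http://other.example/x'],
  'http://other.example/x' => []
}.freeze

# Depth-first crawl from start_url that never leaves the starting host.
# Returns the visited URLs in the order they were visited.
def crawl(start_url, pages = PAGES)
  host    = URI.parse(start_url).host
  visited = Set.new
  order   = []
  stack   = [start_url]

  while (url = stack.pop)
    next unless URI.parse(url).host == host # stay on the starting domain
    next unless visited.add?(url)           # add? returns nil if already seen
    order << url
    stack.push(*pages.fetch(url, []))       # queue every link on this page
  end

  order
end
~~~

Calling `crawl('http://example.com/')` walks the three example.com pages and skips other.example entirely, which is the behavior the extra `next unless` line is meant to give the Mechanize version.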
