Tenderlove Making

A Mechanize Spider

A friend of mine pointed me at the Fear Perl module today, and it has given me some inspiration for Mechanize. I couldn’t believe how small a spider written with the Fear API is:

url("google.com");
&$_ >> _self while $_;

That is really amazing! I also can’t read it... After digging through the Fear innards, I finally understood the code, so I tried to reproduce it with Mechanize. This is what I came up with:

agent = WWW::Mechanize.new
stack = agent.get(ARGV[0]).links
while l = stack.pop
  stack.push(*(agent.click(l).links)) unless agent.visited? l.href
end

To get this to work, I added the “visited?” method to the yet-to-be-released 0.4.6 version of Mechanize. Mine takes a few more lines, but it’s still pretty small. I still don’t like my spider, though, because it will visit any domain. I don’t really want it to try to read the entire internet, so I added the following line at the top of the while loop:

next unless l.uri.host == agent.history.first.uri.host
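For anyone who wants to poke at the loop without hitting the network, here is a rough sketch of the same idea (pop a link, skip other hosts, skip pages already seen, push the new links) run over a made-up in-memory link graph. The PAGES hash, the crawl method, and the example.com URLs are all invented for illustration. None of this is Mechanize's API; it only mirrors the shape of the loop above.

```ruby
require "set"
require "uri"

# Hypothetical link graph standing in for real pages: URL => links found on that page.
PAGES = {
  "http://example.com/"       => ["http://example.com/a", "http://other.example.org/"],
  "http://example.com/a"      => ["http://example.com/", "http://example.com/b"],
  "http://example.com/b"      => [],
  "http://other.example.org/" => ["http://other.example.org/x"]
}.freeze

# Walk the graph with an explicit stack, staying on the starting host
# and never processing the same URL twice.
def crawl(start, pages = PAGES)
  host    = URI(start).host
  visited = Set.new([start])
  stack   = pages.fetch(start, []).dup

  while (url = stack.pop)
    next unless URI(url).host == host  # plays the role of the host check above
    next if visited.include?(url)      # plays the role of agent.visited?
    visited << url
    stack.push(*pages.fetch(url, []))  # queue up the links on the "page" we just read
  end

  visited.to_a
end
```

Calling crawl("http://example.com/") on this toy graph stays on example.com and never follows the link to other.example.org. The explicit visited set is what keeps the cycle between the root page and /a from looping forever.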

Can I make it shorter? I’m not sure yet. Do I want it to be shorter? Not sure about that either.
