A Mechanize Spider
May 24, 2006 @ 7:30 pm
A friend of mine pointed me at the Fear perl module today, and it has given me some ideas for Mechanize. I couldn’t believe how small a spider is when written with the Fear API:
url("google.com");
&$_ >> _self while $_;
That is really amazing! I also can’t read it. After looking at the Fear innards, I finally understood the code, so I tried to reproduce it with Mechanize. This is what I came up with:
require 'mechanize'

agent = WWW::Mechanize.new
stack = agent.get(ARGV[0]).links
while l = stack.pop
  # Follow the link and queue up its links, unless the page was already visited
  stack.push(*(agent.click(l).links)) unless agent.visited? l.href
end
To get this to work, I added the “visited?” method to the yet-to-be-released 0.4.6 version of Mechanize. My version takes a few more lines than the Fear one, but it is still pretty small. I still don’t like my spider though, because it will visit any domain. I don’t really want it to try to read the entire internet, so I added the following line at the top of the while loop:
next unless l.uri.host == agent.history.first.uri.host
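If you want to poke at the loop's logic without hitting the network, here is a minimal sketch of the same idea using only the standard library. It is not Mechanize code: the PAGES hash is a made-up, in-memory stand-in for the web, and the crawl method and its visited set are my own hypothetical names standing in for the agent's history and “visited?” check.

```ruby
require 'set'
require 'uri'

# Hypothetical in-memory "web": each URL maps to the links found on that page.
PAGES = {
  'http://example.com/'       => ['http://example.com/a', 'http://other.example.org/'],
  'http://example.com/a'      => ['http://example.com/', 'http://example.com/b'],
  'http://example.com/b'      => [],
  'http://other.example.org/' => ['http://other.example.org/x']
}

# Stack-based crawl from start_url that never leaves the starting host.
# Returns the list of URLs visited, in the order they were first seen.
def crawl(start_url, pages)
  host    = URI.parse(start_url).host
  visited = Set.new([start_url])
  stack   = pages.fetch(start_url, []).dup
  while (url = stack.pop)
    next unless URI.parse(url).host == host  # stay on the starting host
    next if visited.include?(url)            # skip pages already seen
    visited << url
    stack.push(*pages.fetch(url, []))        # queue the links on this page
  end
  visited.to_a
end

puts crawl('http://example.com/', PAGES)
```

The host check and the visited check play the same roles as the two guards in the Mechanize version; the only real difference is that the links come from a hash instead of a fetched page.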
Can I make it shorter? I’m not sure yet. Do I want it to be shorter? Not sure about that either.