Category: mechanize

More Ruby Mechanize Dev

Posted by – August 8, 2006

I’ve been working away at mechanize lately. It seems like my creativity comes in bursts: as soon as I make a small bugfix release, I get inspired and write a whole bunch of stuff. Anyway, Eric pointed out that I didn’t have support for multi-select lists, so I added that for 0.5.2.

I’m excited to release 0.5.3 already because I updated it so that when you call a method on a WWW::Mechanize::List and the list doesn’t respond to it, the call is tried on the first element of the list instead. So, what does that mean? You can click on links without having to specify the first one all the time. You can do this:

agent.click page.links.text('Something')

Instead of doing this:

agent.click page.links.text('Something').first

But it is still an array, so if you have multiple links with the text ‘Something’, you can index into the array like so (indexes start at zero, so this clicks the third match):

agent.click page.links.text('Something')[2]
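The fallback described above can be sketched in plain Ruby with `method_missing`. This is only an illustration of the idea, not Mechanize's actual code: `FirstDelegatingList` and the `Link` struct are hypothetical names made up for the example.

```ruby
# Hypothetical stand-in for a list of links; not the real WWW::Mechanize::List.
class FirstDelegatingList < Array
  # Forward any call the list itself does not understand to its first element.
  def method_missing(name, *args, &block)
    if !empty? && first.respond_to?(name)
      first.send(name, *args, &block)
    else
      super
    end
  end

  # Keep respond_to? consistent with the forwarding above.
  def respond_to_missing?(name, include_private = false)
    (!empty? && first.respond_to?(name, include_private)) || super
  end
end

# Hypothetical link object with just the two fields the example needs.
Link = Struct.new(:text, :href)

links = FirstDelegatingList.new([
  Link.new('Something', '/a'),
  Link.new('Something', '/b')
])

puts links.href     # forwarded to the first element, prints /a
puts links[1].href  # still an ordinary array, prints /b
```

Because the class still inherits from Array, indexing and iteration behave as usual, and a call that neither the list nor its first element understands still raises NoMethodError.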

I’ve also been toying around with adding a “click” or “select” method to most objects. Then you can select the first radio button with either of the following lines:

agent.click form.radiobuttons.first
form.radiobuttons.first.click

I think this may help in the future if I want to get JavaScript support. I’ve currently got this working with select lists, so this will select the third option (index 2) from a dropdown:

form.selectlist.options[2].select
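To make the idea of a “select” method on the option itself concrete, here is a minimal sketch. The `SelectList` and `Option` classes below are hypothetical and written only for this example; they assume a single-choice dropdown where selecting one option clears the others.

```ruby
# Hypothetical classes for illustration only; not Mechanize's own.
class Option
  attr_reader :value

  def initialize(value, list)
    @value    = value
    @list     = list
    @selected = false
  end

  def selected?
    @selected
  end

  def deselect
    @selected = false
  end

  # Selecting an option clears its siblings first (single-choice dropdown).
  def select
    @list.options.each(&:deselect)
    @selected = true
    self
  end
end

class SelectList
  attr_reader :options

  def initialize(values)
    @options = values.map { |v| Option.new(v, self) }
  end

  # Value of the currently selected option, or nil if none is selected.
  def value
    chosen = @options.find(&:selected?)
    chosen && chosen.value
  end
end

list = SelectList.new(%w[red green blue])
list.options[2].select  # index 2 is the third entry
puts list.value         # prints blue
```

Giving each option a reference back to its list is what lets the option object do the work on its own, without going through the agent or the form.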

As for WWW::Mechanize 0.6.0, I’ve got Hpricot support in the trunk with all unit tests passing. This means that browsing the web and scraping those pages at the same time will be a breeze! I just hope that Why releases a gem for Hpricot on RubyForge soon! I also want to see if I can get Mechanize working with Selenium IDE so that scripts generated by Selenium IDE can be executed using Mechanize instead of through the browser.

New Stuff in Ruby Mechanize 0.5.0

Posted by – June 22, 2006

I’ve been working on the next fairly major release of Ruby WWW::Mechanize, 0.5.0. I’ve decided to break some interfaces with this version, but I think it will all be for the better.

The first major change I’ve made is to unify the namespace. There were a bunch of classes scattered around under WWW, like WWW::Link for instance. I’ve moved everything under WWW::Mechanize. The names will be a bit longer, but more consistent. This shouldn’t break too much code unless the code refers to a class by name.

One of the best new features, in my opinion, is the addition of Pluggable Parsers.

Mechanize One Liners

Posted by – May 26, 2006

I thought I’d try to come up with some useful one liners for Mechanize. Here goes:

Fetch a page and print to stdout:

puts WWW::Mechanize.new.get(ARGV[0]).body

List all links in a page:

WWW::Mechanize.new.get(ARGV[0]).links.each { |l| puts l.text }

Visit all links on a page:

(a = WWW::Mechanize.new).get(ARGV[0]).links.each { |l| puts a.click(l).body }

List all links that match a pattern:

WWW::Mechanize.new.get(ARGV[0]).links.text(/[a-z]/).each { |l| puts l.text }

Visit all links that match a pattern:

(a = WWW::Mechanize.new).get(ARGV[0]).links.text(/[a]/).each { |l| puts a.click(l).body }

Smaller Spider:

(mech = WWW::Mechanize.new).get(ARGV[0])
(a = lambda { |p|
  mech.page.links.each { |l| mech.click(l) && p.call(p) if ! mech.visited? l }
}).call(a)
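The trick in that last one-liner is a lambda that receives itself as an argument so it can recurse. Here is the same idiom on a small in-memory link graph, so it can be run without a network connection. The `graph` hash and its node names are made up for the example, and the visited check is done with a Set rather than Mechanize's history.

```ruby
require 'set'

# Hypothetical link graph standing in for real pages: page => links on it.
graph = {
  'a' => %w[b c],
  'b' => %w[a d],
  'c' => [],
  'd' => %w[c]
}

visited = Set.new
order   = []

# The lambda takes itself as its second argument so it can call itself.
(walk = lambda { |page, me|
  visited << page
  order << page
  graph.fetch(page, []).each { |l| me.call(l, me) unless visited.include?(l) }
}).call('a', walk)

puts order.join(' ')  # prints a b d c
```

Each page is recorded once, in depth-first order, which is the same shape of traversal the one-liner performs through `mech.click` and `mech.visited?`.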

A Mechanize Spider

Posted by – May 24, 2006

A friend of mine pointed me at the Fear Perl module today, and it gave me some inspiration for Mechanize. I couldn’t believe the size of a spider using the Fear API:

url("google.com");
&$_ >> _self while $_;

That is really amazing! I also can’t read it... After looking at the Fear innards, I finally understood the code, so I tried to reproduce it with Mechanize. This is what I came up with:

agent = WWW::Mechanize.new
stack = agent.get(ARGV[0]).links
while l = stack.pop
  stack.push(*(agent.click(l).links)) unless agent.visited? l.href
end

To get this to work, I added the “visited?” method to the yet-to-be-released 0.4.6 version of Mechanize. My version takes a few more lines, but it is still pretty small. I still don’t like my spider though, because it will visit any domain. I don’t really want it to try to read the entire internet, so I added the following line at the top of the while loop:

next unless l.uri.host == agent.history.first.uri.host

Can I make it shorter? I’m not sure yet. Do I want it to be shorter? Not sure about that either.
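For anyone who wants to see the stack loop and the same-host check working together without hitting the network, here is a rough, self-contained sketch. Everything in it is an assumption made for illustration: the `pages` hash plays the role of the web, the `crawl` method is a hypothetical helper, and a Set replaces Mechanize's history and “visited?”.

```ruby
require 'uri'
require 'set'

# Hypothetical in-memory "web": each URL maps to the links found on that page.
pages = {
  'http://example.com/'  => ['http://example.com/a', 'http://other.test/'],
  'http://example.com/a' => ['http://example.com/', 'http://example.com/b'],
  'http://example.com/b' => [],
  'http://other.test/'   => ['http://other.test/deep']
}

# Stack-based spider that never leaves the host of the starting URL.
def crawl(start, pages)
  host    = URI.parse(start).host
  visited = Set.new([start])
  stack   = pages.fetch(start, []).dup

  while (url = stack.pop)
    next unless URI.parse(url).host == host  # stay on the starting host
    next if visited.include?(url)            # skip pages already fetched
    visited << url
    stack.push(*pages.fetch(url, []))
  end

  visited
end

visited = crawl('http://example.com/', pages)
puts visited.to_a.sort
```

Run as written, this visits the three example.com pages and never follows the link to other.test, which is the effect the extra `next unless` line is meant to have.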