Nokogiri’s Slop Feature

Posted by Aaron Patterson on December 04, 2008

Oops! When I released nokogiri version 1.0.7, I totally forgot to talk about the Nokogiri::Slop() feature that was added. Why is it called "slop"? It lets you sloppily explore documents. Basically, it decorates your document with a method_missing() that lets you search your document via method calls.

Given this document:

doc = Nokogiri::Slop(<<-eohtml)
<html>
  <body>
    <p>hello</p>
    <p class="bold">bold hello</p>
  </body>
</html>
eohtml

You may look through the tree like so:

doc.html.body.p('.bold').text # => 'bold hello'

The way this works is that method_missing is implemented on every node in the document tree. That method_missing implementation creates an XPath or CSS query from the method name and method arguments. This means that a new search is executed for every method call. It's fun for playing around, but you definitely won't get the same performance as using one specific CSS search.
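To make that concrete, here is a toy sketch of the idea in plain Ruby, with no Nokogiri involved. ToyNode, ToySlop, and the search helper are all made up for illustration. The real decorator builds XPath or CSS queries rather than filtering an array, so treat this as a rough model only:

```ruby
# Toy stand-in for a document node (not Nokogiri's Node class).
ToyNode = Struct.new(:name, :css_class, :text, :children) do
  # Hypothetical "search": child nodes matching a tag name and, optionally,
  # a ".class" selector. It runs again on every call, just like slop does.
  def search(tag, selector = nil)
    children.select do |child|
      child.name == tag && (selector.nil? || selector == ".#{child.css_class}")
    end
  end
end

# Toy decorator: turns unknown method calls into searches.
module ToySlop
  def method_missing(method_name, *args)
    matches = search(method_name.to_s, args.first)
    return super if matches.empty?

    # Decorate the results so the next call in the chain works too.
    matches.each { |match| match.extend(ToySlop) }
    matches.length == 1 ? matches.first : matches
  end

  def respond_to_missing?(method_name, include_private = false)
    !search(method_name.to_s).empty? || super
  end
end

bold  = ToyNode.new('p', 'bold', 'bold hello', [])
plain = ToyNode.new('p', nil, 'hello', [])
body  = ToyNode.new('body', nil, nil, [plain, bold])
html  = ToyNode.new('html', nil, nil, [body])
doc   = ToyNode.new('document', nil, nil, [html]).extend(ToySlop)

puts doc.html.body.p('.bold').text # => 'bold hello'
```

Every link in that chain is a separate search, which is the performance cost in miniature.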

My favorite part is that method_missing actually lives in the slop decorator. When you use the Nokogiri::Slop() method, it adds the decorator to a list that gets mixed in to every node instance at runtime using Module#extend. That lets me have sweet method_missing action, without actually putting method_missing in my Node class.

Here is a simplified example:

module Decorator
  def method_a
    "method a"
  end

  def method_b
    "method b: #{super}"
  end
end

class Foo
  def method_b
    "inside foo"
  end
end

foo = Foo.new
foo.extend(Decorator)

puts foo.method_a # => 'method a'
puts foo.method_b # => 'method b: inside foo'

foo2 = Foo.new
puts foo2.method_b # => 'inside foo'
puts foo2.method_a # => NoMethodError

Module#extend is used to add functionality to the instance 'foo', but not 'foo2'. Both 'foo' and 'foo2' are instances of Foo, but using Module#extend, we can conditionally add functionality without monkey patching, while keeping a clean separation of concerns. You can even reach previous functionality by calling super.
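The "list of decorators that gets mixed in to every node" idea can be sketched the same way. This is only a guess at the shape of it, not Nokogiri's actual code: ToyDocument, ToyElement, decorate_with, and decorate are hypothetical names.

```ruby
# Hypothetical document that keeps a list of decorator modules.
class ToyDocument
  def initialize
    @decorators = []
  end

  # Register a module to mix in to every node this document hands out.
  def decorate_with(mod)
    @decorators << mod
    self
  end

  # Extend a node with each registered decorator, then return it.
  def decorate(node)
    @decorators.each { |mod| node.extend(mod) }
    node
  end
end

class ToyElement
  attr_reader :name

  def initialize(name)
    @name = name
  end
end

# A decorator that changes behavior only for nodes that were extended.
module Shouting
  def name
    super.upcase
  end
end

plain_doc = ToyDocument.new
loud_doc  = ToyDocument.new.decorate_with(Shouting)

puts plain_doc.decorate(ToyElement.new('body')).name # => 'body'
puts loud_doc.decorate(ToyElement.new('body')).name  # => 'BODY'
```

Nodes from the undecorated document are untouched, so ToyElement itself never learns about Shouting.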

But wait! There's more! You can stack up these decorators as much as you want. For example:

module AddAString
  def method
    "Added a string: #{super}"
  end
end

module UpperCaseResults
  def method
    super.upcase
  end
end

class Foo
  def method
    "foo"
  end
end

foo = Foo.new
foo.extend(AddAString)
foo.extend(UpperCaseResults)

puts foo.method # => 'ADDED A STRING: FOO'

Conditional functionality added to methods with no weird "alias method chain" involvement. Awesome!
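If you are curious why the stacking order comes out that way, you can ask the object's singleton class for its ancestors. The sketch below repeats the example with the method renamed to greeting (a made-up name, chosen so the example doesn't shadow Ruby's built-in Object#method):

```ruby
module AddAString
  def greeting
    "Added a string: #{super}"
  end
end

module UpperCaseResults
  def greeting
    super.upcase
  end
end

class Foo
  def greeting
    'foo'
  end
end

foo = Foo.new
foo.extend(AddAString)
foo.extend(UpperCaseResults)

# The most recently extended module is searched first, then the earlier
# one, then the class itself. Each super call walks one step down this list.
chain = foo.singleton_class.ancestors
puts chain.index(UpperCaseResults) < chain.index(AddAString) # => true
puts chain.index(AddAString) < chain.index(Foo)              # => true
puts foo.greeting # => 'ADDED A STRING: FOO'
```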

I love ruby!

Cross Compiling Ruby Gems for win32

Posted by Aaron Patterson on November 21, 2008

While I was developing nokogiri, I had to learn how to cross compile gems for win32. I don't have a compiler on windows, so I had to do this on OS X. I just want to dump a few notes here so that other people might benefit, and so that I won't forget in the future.

As far as I can tell, there are 4 major steps to getting your native gem cross compiled for windows:

  1. Get a cross compiler (mingw)
  2. Cross compile ruby
  3. Cross compile your gem
  4. Build your gemspec

Step 1, The Cross Compiler

This step is pretty easy. I used Mac Ports to install mingw32. I just did:

$ sudo port install i386-mingw32-binutils i386-mingw32-gcc i386-mingw32-runtime i386-mingw32-w32api

After a while, I could run i386-mingw32-gcc to compile stuff. Next up, cross compiling ruby.

Step 2, Cross Compile Ruby

This seemed like the hardest step to me. I was able to get ruby cross compiling to work after studying documentation at eigenclass, and reading Matt's excellent notes in Johnson.

First, you have to download ruby, so I wrote a rake task to do just that. This rake task downloads ruby into a "stash" directory:

namespace :build do
  file "stash/ruby-1.8.6-p287.tar.gz" do |t|
    puts "downloading ruby"
    FileUtils.mkdir_p('stash')
    Dir.chdir('stash') do
      url = ("ftp://ftp.ruby-lang.org/pub/ruby/1.8/ruby-1.8.6-p287.tar.gz")
      system("wget #{url} || curl -O #{url}")
    end
  end
end

Next you have to apply a patch to Makefile.in so that it will work with the cross compiler. Once that patch is applied, you can compile ruby with mingw32. Here is my rake task to do that, and unfortunately the strange Makefile.in patch is very necessary:

namespace :build do
  namespace :win32 do
    file 'cross/bin/ruby.exe' => ['cross/ruby-1.8.6-p287'] do
      Dir.chdir('cross/ruby-1.8.6-p287') do
        str = ''
        File.open('Makefile.in', 'rb') do |f|
          f.each_line do |line|
            if line =~ /^\s*ALT_SEPARATOR =/
              str += "\t\t    " + 'ALT_SEPARATOR = "\\\\\"; \\'
              str += "\n"
            else
              str += line
            end
          end
        end
        File.open('Makefile.in', 'wb') { |f| f.write str }
        buildopts = if File.exist?('/usr/bin/i586-mingw32msvc-gcc')
                      "--host=i586-mingw32msvc --target=i386-mingw32 --build=i686-linux"
                    else
                      "--host=i386-mingw32 --target=i386-mingw32"
                    end
        sh(<<-eocommand)
          env ac_cv_func_getpgrp_void=no \
            ac_cv_func_setpgrp_void=yes \
            rb_cv_negative_time_t=no \
            ac_cv_func_memcmp_working=yes \
            rb_cv_binary_elf=no \
            ./configure \
            #{buildopts} \
            --prefix=#{File.expand_path(File.join(Dir.pwd, '..'))}
        eocommand
        sh 'make'
        sh 'make install'
      end
    end

    desc 'build cross compiled ruby'
    task :ruby => 'cross/bin/ruby.exe'
  end
end

After executing that task (which will take a while), you should have a cross compiled ruby that you can link against.
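The Makefile.in rewrite is easier to follow in isolation. Here is a minimal sketch of the same line-by-line replacement, run against an in-memory string instead of the real file. The two-line makefile excerpt is made up, and the real ALT_SEPARATOR line may look different:

```ruby
# Hypothetical excerpt standing in for Makefile.in.
makefile = "\t\t    ALT_SEPARATOR = nil; \\\nCC = gcc\n"

patched = makefile.each_line.map do |line|
  if line =~ /^\s*ALT_SEPARATOR =/
    # Swap the whole line for one with an escaped backslash separator,
    # keeping the trailing line continuation.
    "\t\t    " + 'ALT_SEPARATOR = "\\\\\"; \\' + "\n"
  else
    # Every other line passes through untouched.
    line
  end
end.join

puts patched
```

Only the matching line changes. Everything else is copied through as-is, which is why the task can safely write the result back over the original file.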

Step 3, Cross compiling your extension

Next up is cross compiling the extension. Now that you have your cross compiled ruby, the only special thing you need to do is change the '-I' flag you send to ruby when executing 'extconf.rb'. Here is a slightly simplified version of my task to do that:

namespace :build do
  task :win32 do
    dash_i = File.expand_path(
      File.join(File.dirname(__FILE__), 'cross/lib/ruby/1.8/i386-mingw32/')
    )
    Dir.chdir('ext/nokogiri') do
      ruby " -I #{dash_i} extconf.rb"
      sh 'make'
    end
  end
end
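A note on that '-I' flag: it puts a directory at the front of ruby's load path, which presumably is what lets extconf.rb pick up the cross compiled configuration instead of the host's. Here is a small sketch of that lookup, done in-process with $LOAD_PATH and a made-up file name (toy_cross_config.rb) rather than anything from a real ruby install:

```ruby
require 'tmpdir'

target_os = Dir.mktmpdir do |dir|
  # Pretend this directory is cross/lib/ruby/1.8/i386-mingw32.
  File.write(File.join(dir, 'toy_cross_config.rb'), "TOY_TARGET_OS = 'mingw32'\n")

  # `ruby -I dir` does this at startup: the directory goes first, so a
  # file found there wins over a same-named file later in the path.
  $LOAD_PATH.unshift(dir)
  require 'toy_cross_config'
  $LOAD_PATH.delete(dir)

  TOY_TARGET_OS
end

puts target_os # => 'mingw32'
```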

Once that is completed, it is time to package the gem. In order to do that, you need to generate your gemspec.

Step 4, generating the gemspec

I typically use Hoe for packaging my gems. Hoe makes generating my gemspecs pretty easy. One little problem though is that hoe makes assumptions for your gemspec based on the system you are currently running. Since we're cross compiling, we need to muck with the gemspec in order to package our win32 gem.

To modify the gemspec, what I do is assign the new Hoe object to a constant like so:

HOE = Hoe.new('nokogiri', Nokogiri::VERSION) do |p|
  p.developer('Aaron Patterson', 'aaronp@rubyforge.org')
  p.developer('Mike Dalessio', 'mike.dalessio@gmail.com')
  p.clean_globs = [
    'ext/nokogiri/Makefile',
    'ext/nokogiri/*.{o,so,bundle,a,log,dll}',
    'ext/nokogiri/conftest.dSYM',
    GENERATED_PARSER,
    GENERATED_TOKENIZER,
    'cross',
  ]
  p.spec_extras = { :extensions => ["Rakefile"] }
end

Then when I'm building my win32 gemspec, I modify the gemspec with win32 specific bits and write out the gemspec. This task modifies the gemspec file list to include any binary files such as DLLs and .so files that I've built, assigns the platform to mswin32, and tells the gemspec that there are no extensions to be built:

namespace :gem do
  namespace :win32 do
    task :spec => ['build:win32'] do
      File.open("#{HOE.name}.gemspec", 'w') do |f|
        HOE.spec.files += Dir['ext/nokogiri/**.{dll,so}']
        HOE.spec.platform = 'x86-mswin32-60'
        HOE.spec.extensions = []
        f.write(HOE.spec.to_ruby)
      end
    end
  end
end

We have to modify the file list and remove any extension building tasks because the gem is going to be shipped with the pre-built windows binaries. Setting the platform to that hardcoded string is a total hack, but I couldn't figure out a different way. If you were building this spec on windows, you should use "Gem::Platform::CURRENT" instead of that string. After executing this task, you should end up with a file named "packagename.gemspec". Just run "gem build packagename.gemspec", and you'll have your win32 gem, completely windows free!
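If you don't use Hoe, the same three tweaks can be tried on a bare Gem::Specification. This is a sketch with a made-up gem (toy_native) and made-up file paths, not the Nokogiri gemspec:

```ruby
require 'rubygems'

# Hypothetical gem; the name and paths are for illustration only.
spec = Gem::Specification.new do |s|
  s.name       = 'toy_native'
  s.version    = '0.0.1'
  s.summary    = 'Toy native gem'
  s.files      = ['lib/toy_native.rb', 'ext/toy_native/extconf.rb']
  s.extensions = ['ext/toy_native/extconf.rb']
end

# Ship the pre-built binary, pin the platform, and skip compiling on install.
spec.files     += ['ext/toy_native/toy_native.so']
spec.platform   = 'x86-mswin32-60'
spec.extensions = []

# This string is what would be written out to toy_native.gemspec.
gemspec_source = spec.to_ruby

puts spec.platform          # => x86-mswin32-60
puts spec.extensions.empty? # => true
```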

Final Notes

Unfortunately, just because it compiled doesn't mean it will run. My workflow for testing was to package the gem, transfer it to a windows machine, and run "gem unpack" on the gem. After unpacking the gem, I could go into the directory and run my tests. Once I was satisfied that all tests passed, I would release the gem.

One final thing.... Nokogiri ships with the libxml and libxslt dll files. In order to get those files to be found with dlopen (or whatever it is that windows uses), they must be in your PATH. Yes. Your PATH. So Nokogiri changes the environment's PATH to include the directory where the DLL's are located. You can see the hot PATH manipulation code here.
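The PATH trick itself is only a couple of lines. Here is a rough sketch, where prepend_to_path is a hypothetical helper and the DLL directory is a made-up location, so don't read it as the actual Nokogiri code:

```ruby
# Hypothetical helper: move (or add) a directory to the front of PATH.
def prepend_to_path(dir)
  entries = ENV['PATH'].to_s.split(File::PATH_SEPARATOR)
  ENV['PATH'] = ([dir] + (entries - [dir])).join(File::PATH_SEPARATOR)
end

# Made-up location for the bundled DLLs.
dll_dir = File.expand_path(File.join('ext', 'nokogiri'))
prepend_to_path(dll_dir)

puts ENV['PATH'].split(File::PATH_SEPARATOR).first == dll_dir # => true
```

File::PATH_SEPARATOR is ';' on windows and ':' elsewhere, so the same code can be exercised on either system.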

If you want to see all of the uncensored nitty gritty of the cross compilation action, check out the Nokogiri Rakefile located here.

Good luck, and don't forget about those windows people.

Underpant-Free Excitement

Posted by Aaron Patterson on November 18, 2008

Underpant-Free Excitement, or, why Nokogiri works for me.

Nokogiri is an HTML/XML/XSLT parser that can be searched via XPath or CSS3 selectors. The library wraps libxml and libxslt, but the API is based off Hpricot. I wrote this library (with the help of super awesome ruby hacker Mike Dalessio) because I was unhappy with the API of libxml-ruby, and I was unhappy with the support, speed, and handling of broken HTML in Hpricot. I wanted something with an awesome API like Hpricot, but fast like libxml-ruby and with better XPath and CSS support.

I want to talk about the underpinnings, speed, and some interesting implementation details of nokogiri. But first, let's look at a quick example of parsing google just to whet your appetite.

require 'nokogiri'
require 'open-uri'

doc = Nokogiri.HTML(open('http://google.com/search?q=tenderlove').read)
doc.search('h3.r > a.l').each do |link|
  puts link.inner_text
end

This sample searches google for the string "tenderlove", then searches the document with the given CSS selector "h3.r > a.l", and prints out the inner text of each found node. It's as simple as that. You can get fancier, with sub searches, or using XPath, but you'll have to explore that on your own for now.

Underpinnings

Nokogiri is a wrapper around libxml2 and libxslt, but also includes a CSS selector parser. I chose libxml2 because it is very fast, it's available for many platforms, it corrects broken HTML, has built in XPath search support, it is popular, and the list goes on.

Given these reasons, I felt that there was no reason for me to write my own HTML parser and corrector when there is an existing library that has all of these good qualities. The best thing to do in this situation is leverage this existing library and expose a friendly ruby API. In fact, the only thing that libxml is missing is an API to search documents via CSS. Most of the API calls in Nokogiri are implemented inside libxml except for the CSS selector parser, and even that leverages the XPath API.

Since Nokogiri leverages libxml2, consumers get (among other things) fast parsing, i18n support, fast searching, standards based XPath support, namespace support, and mature HTML correction algorithms.

Re-using existing popular code like libxml2 also has some nice side benefits. More people are testing, and most importantly, bugs get squashed quickly.

Speed

People keep asking me about speed. Is Nokogiri fast? Yes. Is it faster than Hpricot? Yes. Faster than Hpricot 2? Yes. All tests in this benchmark show Nokogiri to be faster in all aspects. But you shouldn't believe me. I am just some (incredibly attractive) dude on the internet. Try it out for yourself. Clone this gist and run the benchmarks! Write your own benchmarks! I don't want you to believe me. I want you to find out for yourself.

If you write any benchmarks, send them back to me! I like adding to the list, even if they show Nokogiri to be slower. It helps me know where to improve!

Implementation Details

I've already touched on the underpinnings of Nokogiri: it wraps libxml2, which gives us parsing and XPath searching for free. One thing I'd like to talk about is the CSS selector implementation. I found this part of Nokogiri to be particularly challenging and fun!

The way the CSS selector search works is Nokogiri parses the selector, then converts it into XPath, then leverages the XPath search to return results. I was able to take the grammar and lexer from the W3C, and output a tokenizer and parser. I used RACC to generate the parser, and FREX (my fork of REX) to output a tokenizer. The generated parser outputs an AST. I implemented a visitor which walks the AST and turns it into an XPath query. That's it! Really no magic necessary.
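Here is a toy version of that pipeline, to show the shape of it. It is hand-written rather than generated by RACC, it only understands tag names, .class, and the descendant and child combinators, and every name in it (ElementNode, CombinatorNode, XPathVisitor, css_to_xpath) is made up for this sketch. The XPath it prints is not necessarily what Nokogiri itself would produce:

```ruby
# AST node types for a tiny subset of CSS.
ElementNode    = Struct.new(:tag, :css_class)
CombinatorNode = Struct.new(:kind, :left, :right)

# Tokenize and parse in one pass: "h3.r > a.l" becomes a small tree.
def parse_selector(selector)
  ast = nil
  combinator = :descendant
  selector.scan(/>|[^\s>]+/).each do |token|
    if token == '>'
      combinator = :child
      next
    end
    tag, css_class = token.split('.', 2)
    element = ElementNode.new(tag.to_s.empty? ? '*' : tag, css_class)
    ast = ast ? CombinatorNode.new(combinator, ast, element) : element
    combinator = :descendant
  end
  ast
end

# Visitor that walks the AST and emits an XPath fragment for each node.
class XPathVisitor
  def visit(node)
    case node
    when ElementNode    then visit_element(node)
    when CombinatorNode then visit_combinator(node)
    end
  end

  def visit_element(node)
    return node.tag unless node.css_class

    "#{node.tag}[contains(concat(' ', @class, ' '), ' #{node.css_class} ')]"
  end

  def visit_combinator(node)
    separator = node.kind == :child ? '/' : '//'
    visit(node.left) + separator + visit(node.right)
  end
end

def css_to_xpath(selector)
  '//' + XPathVisitor.new.visit(parse_selector(selector))
end

puts css_to_xpath('body p') # => //body//p
puts css_to_xpath('h3.r > a.l')
```

The contains(concat(...)) dance is a common way to match one class name out of a space separated class attribute, since XPath 1.0 has no class selector of its own.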

Conclusion

Nokogiri works for me because it re-uses a popular, fast, standards based, and well maintained library. But that is why it works for me. I encourage you to download it and try it out yourself. I think you'll be pleased!

I am so happy with this project that I will eventually be deprecating the use of Hpricot in Mechanize. If you are just using mechanize's public API, you should not have to change anything. If you dive into the parser and use hpricot selectors, you might need to change some things, but Nokogiri's API is so similar to Hpricot's that I doubt there will be any surprises.

In the meantime.....

If you find any problems, file a ticket! The source code is hosted on github. If you'd like to see more examples, check out the readme, and the wiki.

Thanks for reading!