Category: computadora

Easy Markup Validation

Posted by – June 12, 2009

I wanted a test helper that would assert that my XHTML was valid XHTML. So I wrote one and called it “markup_validity”. You can use it too, and I will show you how.

First, install the gem:

  $ sudo gem install markup_validity

Then, use it in your tests:

require 'test/unit'
require 'rubygems'
require 'markup_validity'

class ValidHTML < Test::Unit::TestCase
  def test_i_can_has_valid_xhtml
    assert_xhtml_transitional xhtml_document
  end
end

Oh. You use RSpec? It supports that too:

require 'rubygems'
require 'markup_validity'

describe "my XHTML document" do
  it "can has transitional xhtml" do
    xhtml_document.should be_xhtml_transitional
  end
end

Debugging invalid markup can be a pain. MarkupValidity tries to give you helpful errors to make your life easier. Say you have an invalid piece of XHTML like this:

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
  </head>
  <body>
    <p>
      <p>
        Hello
      </p>
    </p>
  </body>
</html>

The error output from MarkupValidity will be this:

.Error on line: 2:
Element 'head': Missing child element(s). Expected is one of ( script, style, meta, link, object, isindex, title, base ).

1: <html xmlns="http://www.w3.org/1999/xhtml">
2:   <head>
3:   </head>
4:   <body>
5:     <p>

Error on line: 6:
Element 'p': This element is not expected. Expected is one of ( a, br, span, bdo, object, applet, img, map, iframe, tt ).

5:     <p>
6:       <p>
7:         Hello
8:       </p>
9:     </p>

MarkupValidity provides a few assertions for test/unit:

  • assert_xhtml_transitional(xhtml) for asserting valid transitional XHTML
  • assert_xhtml_strict(xhtml) for asserting valid strict XHTML
  • assert_schema(schema, xml) for asserting that your xml validates against a schema
  • assert_xhtml which is an alias for assert_xhtml_transitional

The methods provided for RSpec are quite similar:

  • be_xhtml_transitional for asserting valid transitional XHTML
  • be_xhtml_strict for asserting valid strict XHTML
  • be_valid_with_schema(schema) for asserting that your xml validates against a schema
  • be_xhtml which is an alias for be_xhtml_transitional

MarkupValidity even works well with rails. Here is an example rails controller test:

require 'test_helper'
require 'markup_validity'

class AwesomeControllerTest < ActionController::TestCase
  test "valid markup" do
    get :new
    assert_xhtml_transitional @response.body
  end
end

Autotest and Vim integration

Posted by – May 18, 2009

Yay! I got vim and autotest integration working. When I run autotest, if there is an error, I can have Vim read the errors from autotest and jump me to the right place.

Here is a video of me using it:

Please note that I’m not copying and pasting anything. In vim, I hit a command and Vim automatically picks up errors from autotest and jumps me to the line where the error occurred.

You too can impress your friends with this trick! Here’s how:

  1. Make sure you have vim-ruby installed
  2. Use this as your .autotest file:
    require 'autotest/restart'
    
    Autotest.add_hook :initialize do |at|
      at.unit_diff = 'cat'
    end
    
    Autotest.add_hook :ran_command do |at|
      File.open('/tmp/autotest.txt', 'wb') { |f|
        f.write(at.results.join)
      }
    end
    
  3. Add this to your .vimrc:

    compiler rubyunit
    nmap <Leader>fd :cf /tmp/autotest.txt<cr> :compiler rubyunit<cr>

Now when you get an error in autotest, just type “\fd” in Vim to jump straight to your first error.

The contents of /tmp/autotest.txt will be used as your errorfile. In Vim do “:help quickfix” for more info on what you can do with your newfound power.

Caveat: You don’t get unit_diff. I’m working on that. Any help would be much appreciated (I suck at errorformat in Vim).

Fat binary gems make the rockin’ world go round

Posted by – May 7, 2009

Right now people who publish native gems targeting the windows platform have a problem. Our problem is supporting ruby 1.8 and 1.9 at the same time. Right now, we can’t build one gem targeting 1.8 and one gem targeting 1.9, and have rubygems differentiate the two. I have a solution: fat binary gems. We can build a gem that contains dynamic libraries that target ruby 1.8 and ruby 1.9 on windows, with no changes to rubygems whatsoever. I’ve put together a proof of concept that I want to share. I will walk through the steps for building a fat binary gem with the tools we have today. The steps I am going to present are not necessarily the best steps; they are just the steps I took to get this idea working.

The tools I will use are MinGW for cross compiling, hoe and rake-compiler for their packaging and compiling tasks, and multiruby for cross compiling against 1.8 and 1.9. The target gem to be built is nokogiri.

Here is the basic strategy for making dreams happen:

1. Gem entry point must be written in Ruby

When someone does “require ‘whatever’” on your library, that ‘whatever.rb’ file must be written in ruby and work with both 1.8 and 1.9. The reason is that we will:

2. Dynamically determine the correct SO file to load

We can determine at runtime the current ruby version, then load the appropriate SO file at runtime.

Let’s get down to business and see it in action.

Getting our hands dirty

The first thing we need to do is make sure that the so files from the ruby 1.8 build and the ruby 1.9 build end up in different places. The way I accomplished this was by customizing my Rake::ExtensionTask (from rake-compiler), and adding a prerequisite to the cross task:

RET = Rake::ExtensionTask.new("nokogiri", HOE.spec) do |ext|
  ext.lib_dir = "ext/nokogiri"
end

task :muck_with_lib_dir do
  RET.lib_dir += "/#{RUBY_VERSION.sub(/\.\d+$/, '')}"
  FileUtils.mkdir_p(RET.lib_dir)
end
if Rake::Task.task_defined?(:cross)
  Rake::Task[:cross].prerequisites << "muck_with_lib_dir"
end

This code will make sure the so file goes to “ext/nokogiri/1.8” when compiling with ruby 1.8, and “ext/nokogiri/1.9” when compiling with 1.9. Then, all you have to do is compile your extension twice:


$ ~/.multiruby/install/1.8.6-p114/bin/rake cross compile
$ rm -rf tmp
$ ~/.multiruby/install/1.9.1-rc2/bin/rake cross compile

WARNING! Watch out for that "rm -rf". That is removing the tmp directory that rake-compiler made. rake-compiler doesn't seem to know that I switched ruby versions. In order to get the two different compilations working, I had to manually remove the already compiled objects.

Dynamic loading

So we've got our compiled so files in two different locations. What about loading? This step is very easy. Since our entry point will be in ruby, we can just write this in our entry point file:

if RUBY_PLATFORM =~/(mswin|mingw)/i
  # Fat binary gems, you make the Rockin' world go round
  require "nokogiri/#{RUBY_VERSION.sub(/\.\d+$/, '')}/nokogiri"
else
  require 'nokogiri/nokogiri'
end

Basically all this code says is "if we're running windows, load the shared object from a path that contains the ruby version". When a windows user requires this file, the path to the shared object is determined by the version of ruby that they are using. If they're running 1.8, the path will be "nokogiri/1.8/nokogiri", if they're running 1.9, "nokogiri/1.9/nokogiri".
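To make the path selection concrete, here is a minimal sketch that pulls the same logic into a function so it can be tried with any version and platform string. The name `extension_require_path` and its two arguments are made up for this illustration; the real entry point reads the RUBY_VERSION and RUBY_PLATFORM constants directly.

```ruby
# Hypothetical helper, for illustration only. It mirrors the entry point
# logic but takes the version and platform as arguments.
def extension_require_path(ruby_version, ruby_platform)
  if ruby_platform =~ /(mswin|mingw)/i
    # Drop the last version segment: "1.8.6" becomes "1.8", "1.9.1" becomes "1.9".
    "nokogiri/#{ruby_version.sub(/\.\d+$/, '')}/nokogiri"
  else
    'nokogiri/nokogiri'
  end
end

puts extension_require_path('1.8.6', 'i386-mingw32')
puts extension_require_path('1.9.1', 'i386-mswin32')
puts extension_require_path('1.8.6', 'universal-darwin9.0')
```

Note that the pattern looks for "mswin" or "mingw" rather than just "win", so a platform string containing "darwin" still takes the non-windows branch.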

Packaging

We've got one more hurdle to overcome, and that is packaging. We need to make sure that when we're building the windows gem, our custom so files are added to the gem. To do this, I just added another task:

task :add_dll_to_manifest do
  HOE.spec.files += Dir['ext/nokogiri/**.{dll,so}']
  HOE.spec.files += Dir['ext/nokogiri/{1.8,1.9}/**.{dll,so}']
end

if Rake::Task.task_defined?(:cross)
  Rake::Task[:cross].prerequisites << :add_dll_to_manifest
end
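If you want to check what a glob like the ones in the task above will pick up, you can try it against a throwaway directory tree. This is only a sketch: the helper name `fat_binary_files` is hypothetical, the tree is created in a temp directory, and I use the plain `*` form of the pattern here.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: lists the versioned binaries under `root`,
# relative to that directory.
def fat_binary_files(root)
  Dir.chdir(root) do
    Dir['ext/nokogiri/{1.8,1.9}/*.{dll,so}'].sort
  end
end

found = Dir.mktmpdir do |root|
  %w[1.8 1.9].each do |version|
    dir = File.join(root, 'ext', 'nokogiri', version)
    FileUtils.mkdir_p(dir)
    FileUtils.touch(File.join(dir, 'nokogiri.so'))
    FileUtils.touch(File.join(dir, 'nokogiri.o')) # object file, should not match
  end
  fat_binary_files(root)
end

puts found
```

Only the two so files are returned; the object files left behind by the compiler are skipped because their extension is not in the brace list.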

This makes sure that any extra dll or so files in our ext directories are added to the gem. Now we can run our packaging task:

$ ~/.multiruby/install/1.8.6-p114/bin/rake cross native gem

If everything went well, we can examine the content of our packaged gem and find two different so files:

$ gem spec pkg/nokogiri-1.2.4-x86-mswin32.gem files | grep nokogiri.so
- ext/nokogiri/1.8/nokogiri.so
- ext/nokogiri/1.9/nokogiri.so
$

Conclusion

There we have it, a fat binary gem. This gem will work with Ruby 1.8 OR Ruby 1.9 on windows. If you're a windows user, and you'd like to try using this fat binary gem, I have it on my gem server. Just do:

$ gem install nokogiri -s http://tenderlovemaking.com/

The next full release of nokogiri will be using this technique for windows builds. Also, the rake tasks that I've presented were somewhat simplified. If you'd like to get very specific, check out the nokogiri source.

Next Steps

I would like to work with Luis on integrating this functionality into rake-compiler. I'm not sure of the best way to go about it, but I know that he and I can simplify these steps even further.

[ANN] nokogiri 1.3.0rc1 has been released!

Posted by – May 6, 2009

= nokogiri version 1.3.0rc1 has been released!

Thanks to herculean efforts by my nokogiri partner in crime, Mike Dalessio,
nokogiri now works on JRuby 1.3.0RC1 via FFI.

To install this prerelease gem do this:

$ jgem install nokogiri -s http://tenderlovemaking.com/

Then you should be able to do this:


$ jirb
irb(main):001:0> require 'open-uri'
=> true
irb(main):002:0> require 'rubygems'
=> true
irb(main):003:0> require 'nokogiri'
=> true
irb(main):004:0> doc = Nokogiri::HTML(open('http://www.google.com/search?q=tenderlove'))
=> #
irb(main):005:0> doc.css('h3.r a.l').length
=> 10
irb(main):006:0>

== CAVEATS!

* The JRuby FFI gem only works with JRuby 1.3.0RC1
* You MUST install it from my gem server
* The gem version will say 1.2.4, that is actually because I couldn't get
pre release gem versions working. Don't worry, it's actually the 1.3.0
release candidate.
* You can get an MRI version and the JRuby version from my gem server, no
windows support yet.

== ACCOLADES

* Mike made this FFI monster happen! I can't thank him enough.
* Thanks to the JRuby team for making FFI work!

== CHANGELOG

* hahahahahaha
* hahahahahahaha
* hahahaha
* hahahahahahahha
* You'll get to see the actual changes when this isn't a release candidate
* Or check out the git repository

== More information

* github
* rdoc

Namespaces in XML

Posted by – April 23, 2009

Shit. This is a boring topic. Just writing the title made me cry a little bit out of boredom. Unfortunately this topic is something I feel compelled to write about because I think that most Ruby developers dealing with XML know very little about the topic, and yet XML namespaces are crucial when dealing with XML documents. So, in order to curb the boredom, I will attempt to demonstrate why we need namespaces and how they affect you when dealing with XML in the shortest amount of time. I will also try to sprinkle in a few swear words and innuendos just to make sure you’re paying attention.

A tale of two companies

One day, long ago, when XML was written on punch cards, Alice’s Auto Supply decided that they would distribute their inventory as XML so that other people would know what they had in stock. They came up with an XML document that looked like this:

<?xml version="1.0"?>
<inventory>
  <tire name="super slick racing tire" />
  <tire name="all weather tire" />
</inventory>

Excellent! Programmers started consuming the inventory for Alice’s shop. They could pull a list of tires from the document like this:

doc.xpath('//tire')

Bob’s Bike Shop also wanted to get on the XML broadcast bandwagon. So they followed suit and produced an XML document as well:

<?xml version="1.0"?>
<inventory>
  <tire name="narrow street tire" />
  <tire name="mountain trail tire" />
</inventory>

Again, programmers were happy. They started consuming inventory for Bob’s Bike Shop and getting a list of bike tires like this:

doc.xpath('//tire')

Everything was going well until someone decided to consume inventory from both sources. Fuck. There was no way to tell the difference between a car tire from Alice’s Auto Supply, or a bike tire from Bob’s Bike Shop. The search criteria used for both documents was the same:

doc.xpath('//tire')

How was one supposed to tell the difference between a bike tire and a car tire? There is a naming conflict, and this code would return both! Systems began crashing, punch cards were lit on fire, massive power outages occurred. Needless to say, our society was at its lowest point.

Fortunately, a very smart group of people (much smarter than me) came along and said “let’s associate these things to something unique, then we can tell them apart”. That unique bit of information was a URL. They also said “let’s have the ability to name the URL so that we can easily reference it in our XML documents”. Fortunately, the name for a URL does not have to be unique since it is tied to the unique URL! Yay!

Alice got wind of these updates to the XML spec, and wanted to make sure that everyone could tell the difference between her car tires and some other tire. So she updated her inventory document, adding her URL with a name, and naming all of her inventory:

<?xml version="1.0"?>
<inventory xmlns:car="http://alicesautosupply.example.com/">
  <car:tire name="super slick racing tire" />
  <car:tire name="all weather tire" />
</inventory>

Alice’s new inventory document was pushed out. The developers had not yet updated their code. They thought they had a new bug: the code was only returning tires from Bob’s Bike Shop. But how to get the car tires? They had to inform the parser that they were looking for tires associated with Alice’s URL, and changed their code to look like this:

doc.xpath('//tire')
doc.xpath('//aliceAuto:tire',
  'aliceAuto' => 'http://alicesautosupply.example.com/'
)

The first query returned tires that have no namespace and the second query returned tires that belong to Alice’s shop.

Alice’s inventory grew and grew (owing it all to namespacing her document of course). Prefixing everything in her document with “car” was taking a toll on her fingers as well as her punchcard supply. The XML superheroes had a trick up their sleeves for Alice. They said that a URL could be declared as a default, so that every tag would be associated with that URL without explicitly declaring a name.

Armed with this knowledge, Alice was able to change her XML to this:

<?xml version="1.0"?>
<inventory xmlns="http://alicesautosupply.example.com/">
  <tire name="super slick racing tire" />
  <tire name="all weather tire" />
</inventory>

Her XML now stated that “inventory” and all tags inside “inventory” belonged to her URL, but did not need a prefix. Everything was still associated with her URL so that developers could tell the difference between her tires and all other tires, but she did not have to add prefixes.

As for the developers consuming her XML, they never noticed the change. Their code asked for all tires belonging to Alice’s URL, and Alice’s document still declared that those tires belonged to her URL. Punchcards were saved, carpal tunnel was cured, and the world rejoiced.

Conclusion

I hope this story hasn’t been too boring. Here are the key parts I wanted to explain:

  1. Namespaces prevent tag name collision.
    We would not be able to deal with colliding tag names without namespaces.
  2. When you don’t specify a namespace in your search, that means you want tags with no namespace.
  3. Namespaces are tied to URLs. Only the URL must be unique, which is why you must use the URL in your XPath queries.

Remember that there is a difference between asking for tags that belong to a namespace and ones that do not. Also remember that a default namespace in a document means that tags which do not explicitly have a namespace belong to the default. So even though those tags do not have a prefix you must use a namespace when querying for them.
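The three points above can be sketched with a toy model in plain Ruby. This is not how a real XML parser or XPath engine works, and `Tag`, `find_tags`, and the sample data are all invented for the illustration. The only idea it demonstrates is that a tag is identified by its URL plus its name, while the prefix is just a local nickname for the URL.

```ruby
# Toy model only: a tag is identified by its namespace URL plus its name.
Tag = Struct.new(:namespace_url, :name, :label)

ALICE = 'http://alicesautosupply.example.com/'
BOB   = 'http://bobsbikes.example.com/'

TAGS = [
  Tag.new(ALICE, 'tire', 'all weather tire'),
  Tag.new(BOB,   'tire', 'narrow street tire'),
  Tag.new(nil,   'tire', 'tire with no namespace')
]

# `query` is "prefix:name" or just "name"; `bindings` maps prefixes to URLs.
# A query without a prefix only matches tags that have no namespace at all.
def find_tags(tags, query, bindings = {})
  prefix, name = query.include?(':') ? query.split(':', 2) : [nil, query]
  url = prefix ? bindings.fetch(prefix) : nil
  tags.select { |tag| tag.name == name && tag.namespace_url == url }
end

puts find_tags(TAGS, 'tire').map(&:label).inspect
puts find_tags(TAGS, 'car:tire', { 'car' => ALICE }).map(&:label).inspect
puts find_tags(TAGS, 'whatever:tire', { 'whatever' => ALICE }).map(&:label).inspect
```

The last two queries return the same tire even though the prefixes differ, because both prefixes are bound to Alice's URL.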

Bonus Round

Even though using namespaces is essential when searching an XML document, Nokogiri tries to help out. If there are namespaces declared on the root node of a document, Nokogiri will automatically register those for you. You will still have to use the prefix when searching the document, but the URL registration is done for you.

Let’s modify Alice’s XML a little to demonstrate:

<?xml version="1.0"?>
<inventory xmlns="http://alicesautosupply.example.com/" xmlns:bike="http://bobsbikes.example.com/">
  <tire name="super slick racing tire" />
  <tire name="all weather tire" />
  <bike:tire name="skinny street" />
</inventory>

Using Nokogiri, these two statements are equivalent:

doc.xpath('//xmlns:tire',
  'xmlns' => 'http://alicesautosupply.example.com/'
)
doc.xpath('//xmlns:tire')

We can specify the namespace ourselves, or use the same name that Nokogiri picks for us.

Similarly, if we want to find bike tires, these two statements are equivalent:

doc.xpath('//bike:tire',
  'bike' => 'http://bobsbikes.example.com/'
)
doc.xpath('//bike:tire')

Testing JavaScript Outside the Browser

Posted by – April 5, 2009

The other day at LA RubyConf during the Johnson presentation, I showed a few slides which I don’t think were given the time that they deserve. Not that we didn’t have enough time, I just don’t think I made as big a deal about them as I should have. Those particular slides demonstrated HTML Document Object manipulation executed in JavaScript outside any web browser. Those particular slides, and that particular code, are the culmination of over a year’s worth of work (and Yak Shaving), and I would like to talk about them in a little more detail here.

Since I started doing any sort of non-trivial browser dependent JavaScript, I’ve wanted to be able to test the code which I wrote. Hitting refresh on a webpage seems like a hack. Setting up a special browser to refresh the page for me also seems like a hack. I want to run “rake test” and have my JavaScript DOM manipulations tested right along with everything else, no browser dependence required. As far as I could tell, we need three things to make that happen:

  1. A JavaScript runtime that can be used in Ruby
  2. A parser with browser-like HTML correction schemes
  3. A DOM interface that mirrors a browser’s DOM interface

Over the weekend, I think we’ve come a lot closer. John has finally released Johnson. Johnson solves problem number 1. Johnson provides a JavaScript runtime that is fully accessible in Ruby. Watch our RubyConf 2008 presentation about Johnson for more details about that project.

Number 2, I believe, has been solved by nokogiri. As far as I can tell, the tree generated inside libxml2 is very similar to one found in the browser. Nokogiri was partly a Yak Shave for number 3. Since I had writing a DOM interface in mind, nokogiri’s API lends itself well to writing a DOM API.

Number 3 was partly solved this weekend. I’ve been working on a DOM api called taka. Taka sits a DOM api on top of nokogiri. The goal of the project is to mirror a browser’s DOM api in Ruby.

With these three tools in place, I believe that we have a good start on a browserless JavaScript testing environment.

Codes

Enough talk. Let’s look at some codes. Take this HTML page for example:

<html>
  <head>
    <script>
      function populateDropDown() {
        var select = document.getElementById('colors');
        var options = ['red', 'green', 'blue', 'black'];
        var i;
        for(i = 0; i < options.length; i++) {
          var option = document.createElement('option');
          option.appendChild(document.createTextNode(options[i]));
          option.value = options[i];
          select.appendChild(option);
        }
      }
    </script>
  </head>
  <body onload="populateDropDown()">
    <h1>Behold the Johnson</h1>
    <form>
      <select id="colors">
      </select>
    </form>
  </body>
</html>

The JavaScript in this HTML will add a few option tags as children of the select tag, effectively populating the drop down for our user. It would be nice if we could write a test to assert that when this JavaScript executes, the option tags are actually added as children of the select tag.

With Johnson and Taka, it is possible to write such a test:

require 'rubygems'
require 'taka'
require 'johnson'
require 'test/unit'

class OptionTagsAppendedTest < Test::Unit::TestCase
  def setup
    # Create our DOM object
    @document = Taka::DOM::HTML(DATA.read)

    # Create a new JavaScript runtime
    @rt = Johnson::Runtime.new

    # Set the document in the runtime
    @rt['document'] = @document

    # Execute any script tags
    @document.getElementsByTagName('script').each do |script|
      @rt.evaluate(script.textContent)
    end
  end

  def test_options_populated_by_onload
    # 0 option tags before onload is executed
    assert_equal 0, @document.getElementsByTagName('option').length

    # Execute the onload body attribute
    @rt.evaluate(@document.getElementsByTagName('body')[0].onload)

    # 4 option tags after onload is executed
    assert_equal 4, @document.getElementsByTagName('option').length
  end
end

There. It’s done. This test executes the JavaScript and manipulates your HTML the same way the browser would. You can run this code today, just make sure to install the johnson and taka gems first.

Problems

There are at least a few problems. This HTML code is, admittedly, carefully crafted. So far, taka only implements the DOM 1 interface. That means taka is missing many methods that are available in browsers. The good news though is that Taka is pure ruby and open source. As soon as you find methods that are missing, fork the repo, add a test, and send a pull request. I will be sure to merge it.

Conclusion

We are making progress towards testing JavaScript without a browser. We have to do it a step at a time. The solution I have presented to you, while not complete, has promise. I think that the only thing standing in our way right now is time and man power. The methods that need to be implemented on Taka to make it mirror a browser are not hard (take a look at the taka source). These methods just need to be written.

We need a new version of rake

Posted by – February 26, 2009

We’ve all seen this warning:

/Library/Ruby/Gems/1.8/gems/rake-0.8.3/lib/rake/gempackagetask.rb:13:Warning: Gem::manage_gems is deprecated and will be removed on or after March 2009.

Rake is using a deprecated API from RubyGems. Jim knows about the problem, we just need to get him to release a new version of Rake.

In order to get Jim to release a new version of Rake, I have decided to start a letter writing campaign:

[Image: IMG_0150.JPG]

If you would like to help get a new version of Rake released, I encourage you to send a letter to Jim, thanking him for his hard work on Rake, and asking him kindly to release the new "warning free" version.

Send your letters here:

Rake
c/o Jim Weirich
EdgeCase
1130 Congress Ave
Cincinnati, OH 45246

Together we can get Jim to release a new version of Rake. Together we can build a "warning free" future for our children.

Nokogiri’s Slop Feature

Posted by – December 4, 2008

Oops! When I released nokogiri version 1.0.7, I totally forgot to talk about the Nokogiri::Slop() feature that was added. Why is it called “slop”? It lets you sloppily explore documents. Basically, it decorates your document with method_missing() that allows you to search your document via method calls.

Given this document:

doc = Nokogiri::Slop(<<-eohtml)
<html>
  <body>
    <p>hello</p>
    <p class="bold">bold hello</p>
  </body>
</html>
eohtml

You may look through the tree like so:

doc.html.body.p('.bold').text # => 'bold hello'

The way this works is that method missing is implemented on every node in the document tree. That method missing method creates an xpath or css query by using the method name and method arguments. This means that a new search is executed for every method call. It’s fun for playing around, but you definitely won’t get the same performance as using one specific CSS search.
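To show the general shape of the trick, here is a toy sketch in plain Ruby. It is not Nokogiri's implementation: `SloppyNode` is a made-up class, and instead of building an xpath or css query it just looks for the first child whose name matches the method that was called.

```ruby
# Toy stand-in for a document node; this is not Nokogiri's implementation.
class SloppyNode
  attr_reader :name, :text, :children

  def initialize(name, text = nil, children = [])
    @name = name
    @text = text
    @children = children
  end

  # Treat an unknown method name as "find my first child with this tag name".
  # Every call runs a fresh search, which is why chaining is slow.
  def method_missing(method_name, *args)
    child = children.find { |node| node.name == method_name.to_s }
    child || super
  end

  def respond_to_missing?(method_name, include_private = false)
    children.any? { |node| node.name == method_name.to_s } || super
  end
end

heading  = SloppyNode.new('h1', 'hello')
body     = SloppyNode.new('body', nil, [heading])
html     = SloppyNode.new('html', nil, [body])
document = SloppyNode.new('document', nil, [html])

puts document.html.body.h1.text
```

Calling a name that matches no child falls through to `super`, so you still get the usual NoMethodError.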

My favorite part is that method missing is actually in the slop decorator. When you use the Nokogiri::Slop() method, it adds the decorator to a list that gets mixed in to every node instance at runtime using Module#extend. That lets me have sweet method missing action, without actually putting method missing in my Node class.

Here is a simplified example:

module Decorator
  def method_a
    "method a"
  end

  def method_b
    "method b: #{super}"
  end
end

class Foo
  def method_b
    "inside foo"
  end
end

foo = Foo.new
foo.extend(Decorator)

puts foo.method_a # => 'method a'
puts foo.method_b # => 'method b: inside foo'

foo2 = Foo.new
puts foo2.method_b # => 'inside foo'
puts foo2.method_a # => NoMethodError

Module#extend is used to add functionality to the instance ‘foo’, but not ‘foo2’. Both ‘foo’ and ‘foo2’ are instances of Foo, but using Module#extend, we can conditionally add functionality without monkey patching, while keeping a clean separation of concerns. You can even reach previous functionality by calling super.

But wait! There’s more! You can stack up these decorators as much as you want. For example:

module AddAString
  def method
    "Added a string: #{super}"
  end
end

module UpperCaseResults
  def method
    super.upcase
  end
end

class Foo
  def method
    "foo"
  end
end

foo = Foo.new
foo.extend(AddAString)
foo.extend(UpperCaseResults)

puts foo.method # => 'ADDED A STRING: FOO'

Conditional functionality added to methods with no weird “alias method chain” involvement. Awesome!
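If you are wondering why the stacking order matters, you can ask the object for its method lookup order. The sketch below uses made-up names (`Exclaim`, `Shout`, `Greeter`) rather than the ones above, and inspects the singleton class of the extended instance.

```ruby
module Exclaim
  def greeting
    "#{super}!"
  end
end

module Shout
  def greeting
    super.upcase
  end
end

class Greeter
  def greeting
    'hello'
  end
end

greeter = Greeter.new
greeter.extend(Exclaim)
greeter.extend(Shout)

# Drop the singleton class itself so only named modules and classes remain.
lookup_order = greeter.singleton_class.ancestors - [greeter.singleton_class]

puts lookup_order.first(3).inspect
puts greeter.greeting
```

The module that was extended last sits first in the lookup order, so its method runs first and each call to super walks one step further down the list.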

I love ruby!

Cross Compiling Ruby Gems for win32

Posted by – November 21, 2008

While I was developing nokogiri, I had to learn how to cross compile gems for win32. I don’t have a compiler on windows, so I had to do this on OS X. I just want to dump a few notes here so that other people might benefit, and so that I won’t forget in the future.

As far as I can tell, there are 4 major steps to getting your native gem cross compiled for windows:

  1. Get a cross compiler (mingw)
  2. Cross compile ruby
  3. Cross compile your gem
  4. Build your gemspec

Step 1, The Cross Compiler

This step is pretty easy. I used MacPorts to install mingw32. I just did:

$ sudo port install i386-mingw32-binutils i386-mingw32-gcc i386-mingw32-runtime i386-mingw32-w32api

After a while, I could run i386-mingw32-gcc to compile stuff. Next up, cross compiling ruby.

Step 2, Cross Compile Ruby

This seemed like the hardest step to me. I was able to get ruby cross compiling to work after studying documentation at eigenclass, and reading Matt's excellent notes in Johnson.

First, you have to download ruby, so I wrote a rake task to do just that. This rake task downloads ruby in to a "stash" directory:

namespace :build do
  file "stash/ruby-1.8.6-p287.tar.gz" do |t|
    puts "downloading ruby"
    FileUtils.mkdir_p('stash')
    Dir.chdir('stash') do
      url = ("ftp://ftp.ruby-lang.org/pub/ruby/1.8/ruby-1.8.6-p287.tar.gz")
      system("wget #{url} || curl -O #{url}")
    end
  end
end

Next you have to apply a patch to Makefile.in so that it will work with the cross compiler. Once that patch is applied, you can compile ruby with mingw32. Here is my rake task to do that, and unfortunately the strange Makefile.in patch is very necessary:

namespace :build do
  namespace :win32 do
    file 'cross/bin/ruby.exe' => ['cross/ruby-1.8.6-p287'] do
      Dir.chdir('cross/ruby-1.8.6-p287') do
        str = ''
        File.open('Makefile.in', 'rb') do |f|
          f.each_line do |line|
            if line =~ /^\s*ALT_SEPARATOR =/
              str += "\t\t    " + 'ALT_SEPARATOR = "\\\\\"; \\'
              str += "\n"
            else
              str += line
            end
          end
        end
        File.open('Makefile.in', 'wb') { |f| f.write str }
        buildopts = if File.exist?('/usr/bin/i586-mingw32msvc-gcc')
                      "--host=i586-mingw32msvc --target=i386-mingw32 --build=i686-linux"
                    else
                      "--host=i386-mingw32 --target=i386-mingw32"
                    end
        sh(<<-eocommand)
          env ac_cv_func_getpgrp_void=no \
            ac_cv_func_setpgrp_void=yes \
            rb_cv_negative_time_t=no \
            ac_cv_func_memcmp_working=yes \
            rb_cv_binary_elf=no \
            ./configure \
            #{buildopts} \
            --prefix=#{File.expand_path(File.join(Dir.pwd, '..'))}
        eocommand
        sh 'make'
        sh 'make install'
      end
    end

    desc 'build cross compiled ruby'
    task :ruby => 'cross/bin/ruby.exe'
  end
end

After executing that task (which will take a while), you should have a cross compiled ruby that you can link against.

Step 3, Cross compiling your extension

The final part is cross compiling the extension. Now that you have your cross compiled ruby, you just need to cross compile your extension. The only thing special you need to do here is change the '-I' flag you send to ruby when executing 'extconf.rb'. Here is a slightly simplified version of my task to do that:

namespace :build do
  task :win32 do
    dash_i = File.expand_path(
      File.join(File.dirname(__FILE__), 'cross/lib/ruby/1.8/i386-mingw32/')
    )
    Dir.chdir('ext/nokogiri') do
      ruby " -I #{dash_i} extconf.rb"
      sh 'make'
    end
  end
end

Once that is completed, it is time to package the gem. In order to do that, you need to generate your gemspec.

Step 4, generating the gemspec

I typically use Hoe for packaging my gems. Hoe makes generating my gemspecs pretty easy. One little problem though is that hoe makes assumptions for your gemspec based on the system you are currently running. Since we're cross compiling, we need to muck with the gemspec in order to package our win32 gem.

To modify the gemspec, what I do is assign the new Hoe object to a constant like so:

HOE = Hoe.new('nokogiri', Nokogiri::VERSION) do |p|
  p.developer('Aaron Patterson', 'aaronp@rubyforge.org')
  p.developer('Mike Dalessio', 'mike.dalessio@gmail.com')
  p.clean_globs = [
    'ext/nokogiri/Makefile',
    'ext/nokogiri/*.{o,so,bundle,a,log,dll}',
    'ext/nokogiri/conftest.dSYM',
    GENERATED_PARSER,
    GENERATED_TOKENIZER,
    'cross',
  ]
  p.spec_extras = { :extensions => ["Rakefile"] }
end

Then when I'm building my win32 gemspec, I modify the gemspec with win32 specific bits and write out the gemspec. This task modifies the gemspec file list to include any binary files such as dll's and so files that I've built, assigns the platform to mswin32, and tells the gemspec that there are no extensions to be built:

namespace :gem do
  namespace :win32 do
    task :spec => ['build:win32'] do
      File.open("#{HOE.name}.gemspec", 'w') do |f|
        HOE.spec.files += Dir['ext/nokogiri/**.{dll,so}']
        HOE.spec.platform = 'x86-mswin32-60'
        HOE.spec.extensions = []
        f.write(HOE.spec.to_ruby)
      end
    end
  end
end

We have to modify the file list and remove any extension building tasks because the gem is going to be shipped with the pre-built windows binaries. Setting the platform to that hardcoded string is a total hack, but I couldn't figure out a different way. If you were building this spec on windows, you should use "Gem::Platform::CURRENT" instead of that string. After executing this task, you should end up with a file named "packagename.gemspec". Just run "gem build packagename.gemspec", and you'll have your win32 gem, completely windows free!
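The same three gemspec changes can be tried without Hoe, using a bare Gem::Specification. This is a sketch under stated assumptions: the gem name, version, author, and file names below are placeholders, and only the three modifications at the bottom correspond to what the task above does.

```ruby
require 'rubygems'

# A throwaway spec; every value in this block is a placeholder.
spec = Gem::Specification.new do |s|
  s.name       = 'example'
  s.version    = '0.0.1'
  s.summary    = 'Placeholder gem used to illustrate a binary gemspec'
  s.authors    = ['Example Author']
  s.files      = ['lib/example.rb']
  s.extensions = ['ext/example/extconf.rb']
end

# The three changes described above: ship the pre-built binary, set the
# platform, and tell RubyGems there is nothing left to compile.
spec.files     += ['ext/example/example.so']
spec.platform   = 'x86-mswin32-60'
spec.extensions = []

puts spec.platform
puts spec.extensions.inspect
```

Calling `spec.to_ruby` on the result gives you the text that the task writes out to the gemspec file.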

Final Notes

Unfortunately just because it compiled, doesn't mean it will run. My workflow for testing was to package the gem, transfer it to a windows machine, run "gem unpack" on the gem. After unpacking the gem, I could go in to the directory and run my tests. Once I was satisfied that all tests passed, I would release the gem.

One final thing.... Nokogiri ships with the libxml and libxslt dll files. In order to get those files to be found with dlopen (or whatever it is that windows uses), they must be in your PATH. Yes. Your PATH. So Nokogiri changes the environment's PATH to include the directory where the DLLs are located. You can see the hot PATH manipulation code here.
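For a rough idea of what that kind of PATH manipulation looks like, here is a minimal sketch. It is not Nokogiri's actual code: `prepend_to_path` is a hypothetical helper, and the directory name is made up. The example runs against a plain Hash so that it does not touch your real environment.

```ruby
# Hypothetical helper. `env` defaults to the real environment, but any
# Hash-like object works, which keeps the example safe to run.
def prepend_to_path(directory, env = ENV)
  entries = env['PATH'].to_s.split(File::PATH_SEPARATOR)
  # Put the directory first and avoid listing it twice.
  env['PATH'] = ([directory] + (entries - [directory])).join(File::PATH_SEPARATOR)
end

fake_env = { 'PATH' => ['/usr/bin', '/bin'].join(File::PATH_SEPARATOR) }
prepend_to_path('/gems/nokogiri/ext/nokogiri', fake_env)

puts fake_env['PATH']
```

File::PATH_SEPARATOR is ";" on windows and ":" elsewhere, so the same code works on both.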

If you want to see all of the uncensored nitty gritty of the cross compilation action, check out the Nokogiri Rakefile located here.

Good luck, and don't forget about those windows people.

Underpant-Free Excitement

Posted by – November 18, 2008

Underpant-Free Excitement, or, why Nokogiri works for me.

Nokogiri is an HTML/XML/XSLT parser that can be searched via XPath or CSS3 selectors. The library wraps libxml and libxslt, but the API is based off Hpricot. I wrote this library (with the help of super awesome ruby hacker Mike Dalessio) because I was unhappy with the API of libxml-ruby, and I was unhappy with the support, speed, and broken html support of Hpricot. I wanted something with an awesome API like Hpricot, but fast like libxml-ruby and with better XPath and CSS support.

I want to talk about the underpinnings, speed, and some interesting implementation details of Nokogiri. But first, let's look at a quick example of parsing Google just to whet your appetite.

require 'nokogiri'
require 'open-uri'

# URI.open comes from open-uri; a bare open() no longer accepts URLs on Ruby 3.0+.
doc = Nokogiri.HTML(URI.open('http://google.com/search?q=tenderlove').read)
doc.search('h3.r > a.l').each do |link|
  puts link.inner_text
end

This sample searches google for the string “tenderlove”, then searches the document with the given CSS selector “h3.r > a.l”, and prints out the inner text of each found node. It’s as simple as that. You can get fancier, with sub searches, or using XPath, but you’ll have to explore that on your own for now.

Underpinnings

Nokogiri is a wrapper around libxml2 and libxslt, but also includes a CSS selector parser. I chose libxml2 because it is very fast, it’s available for many platforms, it corrects broken HTML, has built in XPath search support, it is popular, and the list goes on.

Given these reasons, I felt that there was no reason for me to write my own HTML parser and corrector when there is an existing library that has all of these good qualities. The best thing to do in this situation is leverage this existing library and expose a friendly ruby API. In fact, the only thing that libxml is missing is an API to search documents via CSS. Most of the API calls in Nokogiri are implemented inside libxml except for the CSS selector parser, and even that leverages the XPath API.

Since Nokogiri leverages libxml2, consumers get (among other things) fast parsing, i18n support, fast searching, standards based XPath support, namespace support, and mature HTML correction algorithms.

Re-using existing popular code like libxml2 also has some nice side benefits. More people are testing it and, most importantly, bugs get squashed quickly.

Speed

People keep asking me about speed. Is Nokogiri fast? Yes. Is it faster than Hpricot? Yes. Faster than Hpricot 2? Yes. All tests in this benchmark show Nokogiri to be faster in all aspects. But you shouldn’t believe me. I am just some (incredibly attractive) dude on the internet. Try it out for yourself. Clone this gist and run the benchmarks! Write your own benchmarks! I don’t want you to believe me. I want you to find out for yourself.

If you write any benchmarks, send them back to me! I like adding to the list, even if they show Nokogiri to be slower. It helps me know where to improve!
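If you want a starting point for your own benchmark, here is a minimal sketch using Ruby's standard Benchmark module. It is not the benchmark from the gist: the time_parsers helper, the generated test document, and the "regex scan" baseline are all made up for illustration, and Nokogiri is only timed if the gem happens to be installed.

```ruby
require 'benchmark'

# Hypothetical harness: `parsers` maps a label to a callable that takes an
# HTML string. Returns a hash of label => elapsed seconds.
def time_parsers(parsers, html, iterations = 100)
  parsers.each_with_object({}) do |(label, parser), results|
    results[label] = Benchmark.realtime do
      iterations.times { parser.call(html) }
    end
  end
end

# Made-up test document; swap in real pages for a meaningful comparison.
html = '<html><body>' + '<p class="x">hello</p>' * 1_000 + '</body></html>'

parsers = {}
begin
  require 'nokogiri'
  parsers['nokogiri'] = ->(doc) { Nokogiri::HTML(doc) }
rescue LoadError
  warn 'nokogiri is not installed, skipping it'
end
# Stand-in baseline so the script still runs with no parser gems installed.
parsers['regex scan'] = ->(doc) { doc.scan(/<p\b/) }

time_parsers(parsers, html, 20).sort_by { |_, seconds| seconds }.each do |label, seconds|
  puts format('%-12s %.4fs', label, seconds)
end
```

Add another entry to the parsers hash for each library you want to compare, and time searching separately from parsing, since the two can behave very differently.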

Implementation Details

I’ve already touched on the underpinnings of Nokogiri, specifically that it wraps libxml2, which gives us parsing and XPath searching for free. One thing I’d like to talk about is the CSS selector implementation. I found this part of Nokogiri to be particularly challenging and fun!

The way the CSS selector search works is that Nokogiri parses the selector, converts it into XPath, then leverages the XPath search to return results. I was able to take the grammar and lexer from the W3C and generate a tokenizer and parser from them. I used RACC to generate the parser, and FREX (my fork of REX) to generate the tokenizer. The generated parser outputs an AST. I implemented a visitor which walks the AST and turns it into an XPath query. That’s it! Really no magic necessary.
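To make the tokenize, parse, visit pipeline concrete, here is a toy version in plain Ruby. It is not Nokogiri's implementation: it uses a hand-written tokenizer instead of RACC and FREX, it only understands tag names, .class, #id, whitespace, and ">", and every name in it (Step, tokenize, parse, XPathVisitor, css_to_xpath) is invented for this sketch.

```ruby
require 'strscan'

# Toy AST node: one simple selector plus the combinator linking it to the
# selector before it (:descendant for whitespace, :child for ">").
Step = Struct.new(:combinator, :tag, :classes, :id)

# Tokenizer: turn the selector string into [type, value] pairs.
def tokenize(selector)
  scanner = StringScanner.new(selector.strip)
  tokens = []
  until scanner.eos?
    if scanner.scan(/\s*>\s*/)
      tokens << [:child, '>']
    elsif scanner.scan(/\s+/)
      tokens << [:descendant, ' ']
    elsif scanner.scan(/\.([\w\-]+)/)
      tokens << [:class, scanner[1]]
    elsif scanner.scan(/\#([\w\-]+)/)
      tokens << [:id, scanner[1]]
    elsif scanner.scan(/[\w\-]+|\*/)
      tokens << [:tag, scanner.matched]
    else
      raise ArgumentError, "unexpected input at #{scanner.rest.inspect}"
    end
  end
  tokens
end

# Parser: group the tokens into a flat list of Step nodes (the "AST").
def parse(tokens)
  steps = [Step.new(:descendant, '*', [], nil)]
  tokens.each do |type, value|
    case type
    when :child, :descendant then steps << Step.new(type, '*', [], nil)
    when :tag   then steps.last.tag = value
    when :class then steps.last.classes << value
    when :id    then steps.last.id = value
    end
  end
  steps
end

# Visitor: walk the steps and emit one XPath location step for each.
class XPathVisitor
  def visit(steps)
    steps.map { |step| visit_step(step) }.join
  end

  def visit_step(step)
    axis = step.combinator == :child ? '/' : '//'
    predicates = step.classes.map do |name|
      "[contains(concat(' ', normalize-space(@class), ' '), ' #{name} ')]"
    end
    predicates << "[@id='#{step.id}']" if step.id
    "#{axis}#{step.tag}#{predicates.join}"
  end
end

def css_to_xpath(selector)
  XPathVisitor.new.visit(parse(tokenize(selector)))
end

puts css_to_xpath('div > p')     # prints //div/p
puts css_to_xpath('h3.r > a.l')
```

The class predicate pads @class with spaces so that ".r" matches class="r big" but not class="radio". A real implementation also has to deal with attribute selectors, pseudo-classes, and selector groups, which is where having a proper grammar pays off.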

Conclusion

Nokogiri works for me because it re-uses a popular, fast, standards based, and well maintained library. But that is why it works for me. I encourage you to download it and try it out yourself. I think you’ll be pleased!

I am so happy with this project that I will eventually be deprecating the use of Hpricot in Mechanize. Nokogiri’s API is so similar to Hpricot’s that I doubt there will be any surprises. If you are just using Mechanize’s public API, you should not have to change anything. If you dive in to the parser and use Hpricot selectors directly, you might need to change some things, but I think most people won’t need to do anything.

In the meantime…..

If you find any problems, file a ticket! The source code is hosted on github. If you’d like to see more examples, check out the readme, and the wiki.

Thanks for reading!