2008-11-18 @ 14:17
Underpant-Free Excitement, or, why Nokogiri works for me.
Nokogiri is an HTML/XML/XSLT parser that can be searched via XPath or CSS3 selectors. The library wraps libxml and libxslt, but the API is based on Hpricot. I wrote this library (with the help of super awesome Ruby hacker Mike Dalessio) because I was unhappy with the API of libxml-ruby, and I was unhappy with the support, speed, and handling of broken HTML in Hpricot. I wanted something with an awesome API like Hpricot, but fast like libxml-ruby and with better XPath and CSS support.
I want to talk about the underpinnings, speed, and some interesting implementation details of Nokogiri. But first, let’s look at a quick example of parsing Google just to whet your appetite.
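~~~ruby
require 'nokogiri'
require 'open-uri'

# Fetch and parse the results page. URI.open is the open-uri entry point on
# current Rubies; the original 2008 code called Kernel#open directly.
doc = Nokogiri.HTML(URI.open('http://google.com/search?q=tenderlove').read)

# Print the text of every link matched by the CSS selector.
doc.search('h3.r > a.l').each do |link|
  puts link.inner_text
end
~~~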
This sample searches Google for the string “tenderlove”, then searches the document with the given CSS selector “h3.r > a.l”, and prints out the inner text of each found node. It’s as simple as that. You can get fancier, with sub-searches, or using XPath, but you’ll have to explore that on your own for now.
Nokogiri is a wrapper around libxml2 and libxslt, but also includes a CSS selector parser. I chose libxml2 because it is very fast, it’s available for many platforms, it corrects broken HTML, it has built-in XPath search support, it is popular, and the list goes on.
Given these reasons, I felt that there was no reason for me to write my own HTML parser and corrector when there is an existing library that has all of these good qualities. The best thing to do in this situation is leverage this existing library and expose a friendly Ruby API. In fact, the only thing that libxml is missing is an API to search documents via CSS. Most of the API calls in Nokogiri are implemented inside libxml except for the CSS selector parser, and even that leverages the XPath API.
Since Nokogiri leverages libxml2, consumers get (among other things) fast parsing, i18n support, fast searching, standards-based XPath support, namespace support, and mature HTML correction algorithms.
Re-using existing popular code like libxml2 also has some nice side benefits. More people are testing, and most importantly, bugs get squashed quickly.
People keep asking me about speed. Is Nokogiri fast? Yes. Is it faster than Hpricot? Yes. Faster than Hpricot 2? Yes. All tests in this benchmark show Nokogiri to be faster in all aspects. But you shouldn’t believe me. I am just some (incredibly attractive) dude on the internet. Try it out for yourself. Clone this gist and run the benchmarks! Write your own benchmarks! I don’t want you to believe me. I want you to find out for yourself.
If you write any benchmarks, send them back to me! I like adding to the list, even if they show Nokogiri to be slower. It helps me know where to improve!
I’ve already touched on the underpinnings of Nokogiri: it wraps libxml2, which gives us parsing and XPath searching for free. One thing I’d like to talk about is the CSS selector implementation. I found this part of Nokogiri to be particularly challenging and fun!
The CSS selector search works in three steps: Nokogiri parses the selector, converts it into XPath, then leverages the XPath search to return results. I was able to take the grammar and lexer from the W3C and generate a tokenizer and parser from them. I used RACC to generate the parser, and FREX (my fork of REX) to generate the tokenizer. The generated parser outputs an AST. I implemented a visitor which walks the AST and turns it into an XPath query. That’s it! Really no magic necessary.
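The tokenize, parse, visit pipeline can be sketched in plain Ruby. To be clear, this is a toy and not Nokogiri’s actual code: it uses a hand-written regex tokenizer instead of RACC and FREX, it only understands tag names, `.class`, `#id`, and the descendant and child combinators, and every name in it (`Simple`, `Combinator`, `XPathVisitor`, `css_to_xpath`) is hypothetical.

```ruby
# Toy CSS-to-XPath translator: tokenize, parse into an AST, then visit the AST.

# AST node types (hypothetical names, not Nokogiri's).
Simple     = Struct.new(:tag, :classes, :id)
Combinator = Struct.new(:kind, :left, :right) # kind is :child or :descendant

# Tokenizer: split into compound selectors and ">" combinators.
def tokenize(selector)
  selector.scan(/>|[^\s>]+/)
end

# Parse one compound selector such as "h3.r" or "#main".
def parse_simple(token)
  raise ArgumentError, 'expected a selector' if token.nil? || token == '>'
  tag     = token[/\A(?:\*|[\w-]+)/] || '*'
  classes = token.scan(/\.([\w-]+)/).flatten
  id      = token[/#([\w-]+)/, 1]
  Simple.new(tag, classes, id)
end

# Parser: build a left-associative tree of combinators.
def parse(selector)
  tokens = tokenize(selector)
  ast = parse_simple(tokens.shift)
  until tokens.empty?
    kind = :descendant
    if tokens.first == '>'
      tokens.shift
      kind = :child
    end
    ast = Combinator.new(kind, ast, parse_simple(tokens.shift))
  end
  ast
end

# Visitor: walk the AST and emit an XPath fragment for each node.
class XPathVisitor
  def visit(node)
    case node
    when Simple     then visit_simple(node)
    when Combinator then visit_combinator(node)
    end
  end

  def visit_simple(node)
    predicates = node.classes.map do |c|
      "[contains(concat(' ', @class, ' '), ' #{c} ')]"
    end
    predicates << "[@id='#{node.id}']" if node.id
    node.tag + predicates.join
  end

  def visit_combinator(node)
    axis = node.kind == :child ? '/' : '//'
    visit(node.left) + axis + visit(node.right)
  end
end

def css_to_xpath(selector)
  '//' + XPathVisitor.new.visit(parse(selector))
end

puts css_to_xpath('h3.r > a.l')
```

To see what the real translation produces, you can call `Nokogiri::CSS.xpath_for` with a selector. The exact XPath it returns differs between Nokogiri versions.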
Nokogiri works for me because it reuses a popular, fast, standards-based, and well-maintained library. But that is why it works for me. I encourage you to download it and try it out yourself. I think you’ll be pleased!
I am so happy with this project that I will eventually be deprecating the use of Hpricot in Mechanize. Nokogiri’s API is so similar to Hpricot that I doubt there will be any surprises. If you are just using Mechanize’s public API, you should not have to change anything. If you dive into the parser and use Hpricot selectors, you might need to change some things, but I think that most people won’t need to do anything.
In the meantime.....
Thanks for reading!