2008-10-30 @ 20:36

Nokogiri Is Released

Hey internet. How are you doing? Ya. It’s been a while. I know, I know. I suck at blogging. Couldn’t you tell by my horrible layout? But seriously, I’ve been really busy lately. We used to have such good times together. I’d write a blog post, you would show it to everyone on the internet. But that spark just doesn’t seem to be there anymore.

Well, I’m doing my best to keep this relationship together. With the help of super awesome Ruby hacker Mike Dalessio, I wrote an XML/HTML parsing library for Ruby called Nokogiri. What is so great about Nokogiri? Well, for one, it is really easy to parse HTML or XML:

```ruby
require 'nokogiri'

doc = Nokogiri::HTML(<<-eohtml)
<html>
  <body>
    <div id="wrapper">
      <h1>Hello world</h1>
      <p>Paragraph</p>
    </div>
  </body>
</html>
eohtml
```

Oh, I know what you're saying, internet.  Ya, sure, it's easy to parse, but is it easy to search?  Well, it is.  I promise.  You know XPath, right?  Well, you can search by XPath very easily:

```ruby
doc.xpath('//p').each do |paragraph|
  puts paragraph.text
end
```

Oh, you don’t know XPath very well? That’s OK. I know you know CSS. You use it everywhere! I’ve viewed your source (wink wink). Well you can search using CSS selectors as well:

```ruby
doc.css('div#wrapper').each do |div_with_wrapper_id|
  puts div_with_wrapper_id['id']
end
```

Oh, I see how it is.  You don't want to commit.  You want to search with CSS selectors *and* XPath.  Well fine.  You can have that too.  Just use the "search" method, and you can mix and match your selectors:

```ruby
doc.search('//p', 'div#wrapper').each do |node|
  puts node.name
end
```

Well, I hope you’re feeling better about our relationship now. I just want to tell you that you shouldn’t worry about that old legacy code that uses Hpricot. Nokogiri can be used as a drop-in replacement! Really! Nokogiri doesn’t reproduce the bugs that are in Hpricot, but should work in most cases. Just use “Nokogiri::Hpricot()” to parse your HTML. Of course, I’ve tried to keep the syntax of Hpricot that I like. For example, you can use slashes for searching and subsearching:

```ruby
(doc/'div').each { |div| puts div.at('p').text }
```

You even get a speed increase. For free!

Want to install Nokogiri? No problem. Just do “gem install nokogiri”. It’s that easy!

Well, now that we’re back together, why don’t you send some twitters if you like it! Thanks innernet. I promise to update you more often. I swear.
