Nokogiri Is Released

Posted by Aaron Patterson on October 30, 2008

Hey internet. How are you doing? Ya. It's been a while. I know, I know. I suck at blogging. Couldn't you tell by my horrible layout? But seriously, I've been really busy lately. We used to have such good times together. I'd write a blog post, you would show it to everyone on the internet. But that spark just doesn't seem to be there anymore.

Well, I'm doing my best to keep this relationship together. With the help of super awesome ruby hacker Mike Dalessio, I wrote an XML/HTML parsing library for ruby called Nokogiri. What is so great about Nokogiri? Well, for one it is really easy to parse HTML or XML:

require 'nokogiri'

doc = Nokogiri::HTML(<<-eohtml)
<html>
  <body>
    <div id="wrapper">
      <h1>Hello world</h1>
      <p>Paragraph</p>
    </div>
  </body>
</html>
eohtml

Oh, I know what you're saying internet. Ya, sure, it's easy to parse, but is it easy to search? Well, it is. I promise. You know XPath, right? Well you can search by XPath very easily:

doc.xpath('//p').each do |paragraph|
  puts paragraph.text
end

Oh, you don't know XPath very well? That's OK. I know you know CSS. You use it everywhere! I've viewed your source (wink wink). Well you can search using CSS selectors as well:

doc.css('div#wrapper').each do |div_with_wrapper_id|
  puts div_with_wrapper_id['id']
end

Oh, I see how it is. You don't want to commit. You want to search with CSS selectors and XPath. Well fine. You can have that too. Just use the "search" method, and you can mix and match your selectors:

doc.search('//p', 'div#wrapper').each do |node|
  puts node.name
end

Well, I hope you're feeling better about our relationship now. I just want to tell you that you shouldn't worry about that old legacy code that uses Hpricot. Nokogiri can be used as a drop-in replacement! Really! Nokogiri doesn't reproduce the bugs that are in Hpricot, but should work in most cases. Just use "Nokogiri::Hpricot()" to parse your HTML. Of course, I've tried to keep the syntax of Hpricot that I like. For example, you can use slashes for searching and subsearching:

(doc/'div').each { |div| puts div.at('p').text }

You even get a speed increase. For free!

Want to install Nokogiri? No problem. Just do "gem install nokogiri". It's that easy!

Well, now that we're back together, why don't you send some twitters if you like it! Thanks innernet. I promise to update you more often. I swear.

Comments

  1. Saimon Moore Fri, 31 Oct 2008 03:33:29 PDT

    Hi Aaron, wanted to try it out but I’m having problems installing it: http://pastie.org/304781

  2. khelll Fri, 31 Oct 2008 05:06:02 PDT

    Well what are the benefits over hpricot?
    Have you benchmarked it?

  3. Saimon Moore Fri, 31 Oct 2008 08:00:31 PDT

    khelll: see http://gist.github.com/18533 (link at bottom of post - 'speed increase')

  4. Aaron Patterson Fri, 31 Oct 2008 08:14:21 PDT

    Saimon: What version of libxml do you have installed? It looks like your build went just fine. Unfortunately that test returns different numbers depending on what version of libxml you have installed.

  5. Saimon Moore Fri, 31 Oct 2008 08:19:59 PDT

    Hi Aaron,

    You mean the gem? libxml-ruby (0.8.3)

  6. Saimon Moore Fri, 31 Oct 2008 08:21:36 PDT

    libxml2 2.6.32, Revision 1, textproc/libxml2 (Variants: universal, debug)

  7. Aaron Patterson Fri, 31 Oct 2008 08:26:19 PDT

    Just plain libxml2. Thanks. I’ll have to add an exception for that version. The problem is that libxml2 corrects broken html differently depending on what version you’re using. I have tests which attempt to correct broken HTML, and so the version of libxml2 may make them break.

    You should be good to go though! :-)

  8. Saimon Moore Fri, 31 Oct 2008 08:31:32 PDT

    cool thanks…

  9. Jerrett Fri, 31 Oct 2008 10:23:05 PDT

    Yay! Thanks, this will be very useful, especially since hpricot seems to be dead on the updates/fixes front lately…

  10. [...] Aaron Patterson (@tenderlove) and Mike Dalessio released Nokogiri (Github repository), a new HTML and XML parser for Ruby. It “parses and searches XML/HTML faster [...]

  11. Evan Fri, 31 Oct 2008 12:55:12 PDT

    Hi–sounds awesome but I’m having a little trouble converting from Hpricot. I have an app that scrapes some nightmare HTML that in turn contains very ugly data. I’m getting a failure on the following expression, which as far as I can tell is valid XPath and works in Hpricot:

    http://pastie.org/305175

    Thanks for any help.

  12. Aaron Patterson Fri, 31 Oct 2008 13:05:10 PDT

    Hi Evan! Can you post a larger snippet of your code?

  13. Evan Fri, 31 Oct 2008 13:21:35 PDT

    Sure, and thanks very much. I updated the original pastie, let me know if you need more:

    http://pastie.org/305175

  14. filterfish Fri, 31 Oct 2008 19:28:05 PDT

    You are fucking heroes!

    I’ve been using hpricot for about a year now and have been getting more concerned about the fact that it appears to be entirely unmaintained. What’s more I’m a mere mortal and as such the innards scare the shit out of me. It has a number of bugs and I’m kinda thinking that ruby 1.9 and hpricot will never happen.

    But no longer. I spent about an hour porting my code this morning and it just works. There is some tweaking I will need to do but given that an html parser is pivotal to the application that’s not really surprising.

    Oh, and just for the record it’s a little over twice as fast as hpricot in my application. I should note that the application probably doesn’t follow the most common use cases so your mileage will almost certainly vary.

    rgh

  15. filterfish Fri, 31 Oct 2008 19:29:31 PDT

    Oh and if you are using a Debian based Linux distro you will need to install libxslt1-dev and libxml2-dev.

    rgh

  16. PhoeniX Fri, 31 Oct 2008 22:02:08 PDT

    Will it be integrated into www::mechanize and how soon?
    Thanks, looks promising.

  17. Jon Sat, 01 Nov 2008 08:44:38 PDT

    How does this compare to libxml-ruby ?

  18. Aaron Patterson Sat, 01 Nov 2008 10:41:39 PDT

    @PhoeniX: Yes, it will be integrated with WWW::Mechanize. I am working on the Hpricot deprecation roadmap now. I hope Nokogiri will be integrated with Mechanize by version 1.0 (possibly around January, I’m not sure).

    @Jon: I found that it is slightly slower than libxml-ruby, but that is because of marshalling costs. And those will diminish over time. Also, I believe my interface to be much cleaner than libxml-ruby, and the implementation to be much nicer.

  19. Luigi Montanez Sat, 01 Nov 2008 18:41:40 PDT

    Thanks for this. I’m in desperate need of an RSS/Atom parser that’s significantly faster than FeedTools. What’s the best way to use Nokogiri to parse feeds? The Reader object?

  20. Aaron Patterson Sat, 01 Nov 2008 18:47:44 PDT

    @Luigi:

    You should be able to modify the HTML example I have in the readme. Just use XML instead of HTML, and use the rss url.

    http://github.com/tenderlove/nokogiri/tree/master

  21. Mike Ferrier Sat, 01 Nov 2008 19:17:12 PDT

    Thanks for releasing this nice drop-in replacement for Hpricot, it’s offering quite a speed improvement over Hpricot in some software I’m working on that parses quite a bit of XML.

    One problem I’m running into, however, is a segfault that I’m having a hard time reproducing. That is to say, one particular Nokogiri::XML::Node is causing a segfault if any search is run on it, but only in the middle of a large XML parsing routine, and if I try to reproduce the segfault by loading up the offending XML into a new Nokogiri::XML::Node in a clean ruby instance, it doesn’t segfault.

    Do you have any advice on how to track a bug like this down?

    Here’s the XML giving me the problem, if you’re interested: http://pastie.org/private/a3zzvk11ddfkqjszdc9ag

    And here’s what the segfault looks like:

    http://pastie.org/private/mddg3ixtzqikju2lw6iisg

  22. Aaron Patterson Sat, 01 Nov 2008 19:30:20 PDT

    Hey Mike. That code doesn’t seem to crash for me. Can you give me your setup information?

    What OS are you using, which version of ruby, and also can you print out these constants from nokogiri:

    Nokogiri::VERSION
    Nokogiri::LIBXML_VERSION

    Thanks!

  23. Mike Ferrier Sun, 02 Nov 2008 10:35:53 PST

    Thanks for the response.

    $ irb -rubygems -rnokogiri
    irb(main):001:0> RUBY_VERSION
    => "1.8.7"
    irb(main):002:0> `uname -a`
    => "Linux mikeulas 2.6.27-7-generic #1 SMP Thu Oct 30 04:18:38 UTC 2008 i686 GNU/Linux\n"
    irb(main):003:0> Nokogiri::VERSION
    => "0.0.0"
    irb(main):004:0> Nokogiri::LIBXML_VERSION
    => "2.6.31"
    irb(main):005:0> `gem list | grep -i noko`
    => "tenderlove-nokogiri (0.0.0.20081021110113)\n"

    Version zero seems weird, did I somehow get a super early gem installed?

  24. Aaron Patterson Sun, 02 Nov 2008 11:58:43 PST

    @Mike: Looks like you’ve got a development version installed. Try uninstalling that, and getting the official released gem. ‘gem install nokogiri’.

    The latest released version is 1.0.2

  25. Mike Ferrier Sun, 02 Nov 2008 12:52:43 PST

    Hmm, ran into the same thing again, and I can reproduce it on my machine; however, I can’t reproduce it on another linux machine running ruby 1.8.6 instead of my 1.8.7, so I’m going to assume this is an incompatibility of some kind with 1.8.7. The libxml versions are slightly different as well (2.6.32 vs 2.6.27).

    crash.xml - http://pastie.org/private/lowzxqpb9vqjq5lwr81pa
    crash.rb - http://pastie.org/private/za4lqzfdo0sxnfun1fyxg

    Versions on crash system
    irb(main):001:0> RUBY_VERSION
    => "1.8.7"
    irb(main):002:0> `uname -a`
    => "Linux mikeulas 2.6.27-7-generic #1 SMP Thu Oct 30 04:18:38 UTC 2008 i686 GNU/Linux\n"
    irb(main):004:0> Nokogiri::VERSION
    => "1.0.2"
    irb(main):005:0> Nokogiri::LIBXML_VERSION
    => "2.6.32"

    Versions on non-crashing system
    irb(main):001:0> RUBY_VERSION
    => "1.8.6"
    irb(main):002:0> `uname -a`
    => "Linux ethan 2.6.20-17-server #2 SMP Wed Aug 20 16:54:26 UTC 2008 i686 GNU/Linux\n"
    irb(main):003:0> Nokogiri::VERSION
    => "1.0.2"
    irb(main):004:0> Nokogiri::LIBXML_VERSION
    => "2.6.27"

    Any insight on this? Did you develop this on 1.8.6 or 1.8.7?

  26. Mike Ferrier Sun, 02 Nov 2008 12:54:19 PST

    Oh, btw, the crash looks like:

    $ ruby crash.rb
    crash.rb:13: [BUG] Segmentation fault
    ruby 1.8.7 (2008-08-11 patchlevel 72) [i486-linux]

    Aborted

  27. Aaron Patterson Sun, 02 Nov 2008 14:03:35 PST

    @Mike: I normally develop with 1.8.6, but all tests pass with 1.8.7. I can’t seem to reproduce this with 1.8.7 either.

    Can you add a puts or something at the end of the script and see if that is outputted? That would help me determine if the problem is a GC problem or something else.

  28. Mike Ferrier Sun, 02 Nov 2008 15:34:15 PST

    I added

    puts "test"

    on the last line of crash.rb, but no change to the output. Still segfault on line 13, no “test” printed.

  29. Aaron Patterson Sun, 02 Nov 2008 15:43:48 PST

    @Mike

    Okay. Try this: ‘gdb ruby’. Then when you’re in the gdb shell do ‘run crash.rb’. When ruby segfaults, it will drop you back in to the gdb shell. Type ‘bt’ in the gdb shell and send me the output.

  30. jiah's me2DAY Sun, 02 Nov 2008 22:59:17 PST

    Jia's thoughts…

    Nokogiri, XML/HTML parsing library for ruby…

  31. Mike Ferrier Mon, 03 Nov 2008 07:37:08 PST

    Okay, emailed you the results.

  32. Tom Taylor Mon, 03 Nov 2008 08:09:14 PST

    Aaron - thanks a lot for Nokogiri - looks excellent.

    I think I’ve found a bug though - is there somewhere better than here for me to file it? Lighthouse?

  33. Aaron Patterson Mon, 03 Nov 2008 09:35:00 PST

    Hey Tom. I’m trying to get a lighthouse project going, but that seems to be a challenge at this point. Just file the ticket on rubyforge, or send it to me for now.

    http://rubyforge.org/tracker/?atid=27368&group_id=7062&func=browse

  34. Mike Ferrier Mon, 03 Nov 2008 11:14:24 PST

    @Aaron I downgraded to ruby 1.8.6 and now it doesn’t crash… if that helps at all :)

  35. Mike Ferrier Mon, 03 Nov 2008 11:36:41 PST

    Hmm, spoke too soon… it just crashes in a different spot. Emailed you the info.
    I can’t reproduce this on the two other machines I tried, so I suppose there’s something stupid about my machine that’s causing this.

  36. The Merbist » Blog Archive » Merb 1.0 RC5 Mon, 03 Nov 2008 19:50:29 PST

    [...] view helpers now use nokogiri (with rexml as fallback [...]

  37. [...] view helpers now use nokogiri (with rexml as fallback [...]

  38. Chris Olsen Sun, 09 Nov 2008 14:21:50 PST

    It is failing two of the tests for me, which kills the installation. http://pastie.org/310956

    I am using libxml v2.6.30.

  39. Aaron Patterson Sun, 09 Nov 2008 23:01:51 PST

    Hey Chris!

    It should work now. Pull down version nokogiri 1.0.4, and it should install just fine.

  40. Chris Olsen Mon, 10 Nov 2008 07:04:04 PST

    It installed smoothly this time :) For future reference, is there a way to install gems without running the tests? I see there is a --no-test param that can be added, but that didn’t seem to stop them from running.

    Thanks Aaron.

  41. Rudolfs Tue, 11 Nov 2008 08:10:44 PST

    I had an error in nokogiri.rb line 13 about using “nil.+”, so I changed it a bit and now it seems to work:

    if RUBY_PLATFORM =~ /mswin/i
      if ENV['PATH'].nil?
        ENV['PATH'] = ";" + File.expand_path(File.join(File.dirname(__FILE__), "..", "ext", "nokogiri"))
      else
        ENV['PATH'] += ";" + File.expand_path(File.join(File.dirname(__FILE__), "..", "ext", "nokogiri"))
      end
    end

    Or am I doing something wrong here?

  42. Thomas Tue, 11 Nov 2008 09:07:36 PST

    “sudo gem install nokogiri” fails with Error: Failed to build gem native extension (Nokogiri 1.0.4), on OS X with all updated gems. Any pointers? Seems like a cool gem.

  43. Derek Wed, 12 Nov 2008 00:09:05 PST

    Apparently the installer attempts to call the rubyforge command that’s installed with the rubyforge gem.

    However, the gem spec doesn’t list rubyforge as a dependency, so the install will fail for folks who don’t have rubyforge installed beforehand.

  44. [...] Nokogiri As already mentioned, we switched to nokogiri before 1.0. Nokogiri is based on libxml2 which is a fast and reliable XML parser. I won’t get into the hpricot vs nokogiri argument, but let’s just say that nokogiri was fitting our needs better. (Nokogiri is only used in the specs, and therefore is not needed to run Merb) [...]

  45. Spencer Miles Thu, 27 Nov 2008 13:19:33 PST

    Hi Aaron,

    I’m wanting to switch from Hpricot to Nokogiri, hoping that it’ll fix the horrible memory leaks in hpricot. I’m running into some problems on OS X though.

    Here’s an example:

    require 'open-uri'; require 'nokogiri'; doc = Nokogiri.HTML(open('http://stereogum.com')); doc.css('*')

    Assertion failed: (node->doc), function Nokogiri_wrap_xml_node, file xml_node.c, line 518.

    Abort trap

    When I call doc.css('*') it’s segfaulting, and drops me out of irb and back into bash.

    On my linux production server, this works fine, but on OS X it’s segfaulting.

    I’m using libxml 2.6.16 and Ruby 1.8.6 patchlevel 114.

    Any ideas?

    Spencer

  46. Crónica de una vida - Ruby HTML/XML Parsers Mon, 22 Dec 2008 15:40:49 PST

    [...] nokogiri (Aaron Patterson): This is a more recent library that is very similar to hpricot. The great [...]
