Nokogiri Is Released

Posted by Aaron Patterson on October 30, 2008

Hey internet. How are you doing? Ya. It’s been a while. I know, I know. I suck at blogging. Couldn’t you tell by my horrible layout? But seriously, I’ve been really busy lately. We used to have such good times together. I’d write a blog post, you would show it to everyone on the internet. But that spark just doesn’t seem to be there anymore.

Well, I’m doing my best to keep this relationship together. With the help of super awesome ruby hacker Mike Dalessio, I wrote an XML/HTML parsing library for ruby called Nokogiri. What is so great about Nokogiri? Well, for one it is really easy to parse HTML or XML:

require 'nokogiri'

doc = Nokogiri::HTML(<<-eohtml)
<html>
  <body>
    <div id="wrapper">
      <h1>Hello world</h1>
      <p>Paragraph</p>
    </div>
  </body>
</html>
eohtml

Oh, I know what you’re saying, internet. Ya, sure, it’s easy to parse, but is it easy to search? Well, it is. I promise. You know XPath, right? Well, you can search by XPath very easily:

doc.xpath('//p').each do |paragraph|
  puts paragraph.text
end

Oh, you don’t know XPath very well? That’s OK. I know you know CSS. You use it everywhere! I’ve viewed your source (*wink* *wink*). Well you can search using CSS selectors as well:

doc.css('div#wrapper').each do |div_with_wrapper_id|
  puts div_with_wrapper_id['id']
end

Oh, I see how it is. You don’t want to commit. You want to search with CSS selectors *and* XPath. Well fine. You can have that too. Just use the “search” method, and you can mix and match your selectors:

doc.search('//p', 'div#wrapper').each do |node|
  puts node.name
end

Well, I hope you’re feeling better about our relationship now. I just want to tell you that you shouldn’t worry about that old legacy code that uses Hpricot. Nokogiri can be used as a drop-in replacement! Really! Nokogiri doesn’t reproduce the bugs that are in Hpricot, but should work in most cases. Just use “Nokogiri::Hpricot()” to parse your HTML. Of course, I’ve tried to keep the syntax of Hpricot that I like. For example, you can use slashes for searching and subsearching:

(doc/'div').each { |div| puts div.at('p').text }

You even get a speed increase. For free!

Want to install Nokogiri? No problem. Just do “gem install nokogiri”. It’s that easy!

Well, now that we’re back together, why don’t you send some twitters if you like it! Thanks innernet. I promise to update you more often. I swear.

Comments

  1. Saimon Moore Fri, 31 Oct 2008 03:33:29 UTC

    Hi Aaron, wanted to try it out but I’m having problems installing it: http://pastie.org/304781

  2. khelll Fri, 31 Oct 2008 05:06:02 UTC

    Well what are the benefits over hpricot?
    Have you benchmarked it?

  3. Saimon Moore Fri, 31 Oct 2008 08:00:31 UTC

    khelll: see http://gist.github.com/18533 (link at bottom of post – ‘speed increase’)

  4. Aaron Patterson Fri, 31 Oct 2008 08:14:21 UTC

    Saimon: What version of libxml do you have installed? It looks like your build went just fine. Unfortunately that test returns different numbers depending on what version of libxml you have installed.

  5. Saimon Moore Fri, 31 Oct 2008 08:19:59 UTC

    Hi Aaron,

    You mean the gem? libxml-ruby (0.8.3)

  6. Saimon Moore Fri, 31 Oct 2008 08:21:36 UTC

    libxml2 2.6.32, Revision 1, textproc/libxml2 (Variants: universal, debug)

  7. Aaron Patterson Fri, 31 Oct 2008 08:26:19 UTC

    Just plain libxml2. Thanks. I’ll have to add an exception for that version. The problem is that libxml2 corrects broken html differently depending on what version you’re using. I have tests which attempt to correct broken HTML, and so the version of libxml2 may make them break.

    You should be good to go though! :-)

  8. Saimon Moore Fri, 31 Oct 2008 08:31:32 UTC

    cool thanks…

  9. Jerrett Fri, 31 Oct 2008 10:23:05 UTC

    Yay! Thanks, this will be very useful, especially since hpricot seems to be dead on the updates/fixes front lately…

  10. [...] Aaron Patterson (@tenderlove) and Mike Dalessio released Nokogiri (Github repository), a new HTML and XML parser for Ruby. It “parses and searches XML/HTML faster [...]

  11. Evan Fri, 31 Oct 2008 12:55:12 UTC

    Hi–sounds awesome but I’m having a little trouble converting from Hpricot. I have an app that scrapes some nightmare HTML that in turn contains very ugly data. I’m getting a failure on the following expression, which as far as I can tell is valid XPath and works in Hpricot:

    http://pastie.org/305175

    Thanks for any help.

  12. Aaron Patterson Fri, 31 Oct 2008 13:05:10 UTC

    Hi Evan! Can you post a larger snippet of your code?

  13. Evan Fri, 31 Oct 2008 13:21:35 UTC

    Sure, and thanks very much. I updated the original pastie, let me know if you need more:

    http://pastie.org/305175

  14. filterfish Fri, 31 Oct 2008 19:28:05 UTC

    You are fucking heroes!

    I’ve been using hpricot for about a year now and have been getting more concerned about the fact that it appears to be entirely unmaintained. What’s more I’m a mere mortal and as such the innards scare the shit out of me. It has a number of bugs and I’m kinda thinking that ruby 1.9 and hpricot will never happen.

    But no longer. I spent about an hour porting my code this morning and it just works. There is some tweaking I will need to do but given that an html parser is pivotal to the application that’s not really surprising.

    Oh, and just for the record it’s a little over twice as fast as hpricot in my application. I should note that the application probably doesn’t follow the most common use cases so your mileage will almost certainly vary.

    rgh

  15. filterfish Fri, 31 Oct 2008 19:29:31 UTC

    Oh and if you are using a Debian based Linux distro you will need to install libxslt1-dev and libxml2-dev.

    rgh

  16. PhoeniX Fri, 31 Oct 2008 22:02:08 UTC

    Will it be integrated into www::mechanize and how soon?
    Thanks, looks promising.

  17. Jon Sat, 01 Nov 2008 08:44:38 UTC

    How does this compare to libxml-ruby ?

  18. Aaron Patterson Sat, 01 Nov 2008 10:41:39 UTC

    @PhoeniX: Yes, it will be integrated with WWW::Mechanize. I am working on the Hpricot deprecation roadmap now. I hope Nokogiri will be integrated with Mechanize by version 1.0 (possibly around January, I’m not sure).

    @Jon: I found that it is slightly slower than libxml-ruby, but that is because of marshalling costs. And those will diminish over time. Also, I believe my interface to be much cleaner than libxml-ruby, and the implementation to be much nicer.

  19. Luigi Montanez Sat, 01 Nov 2008 18:41:40 UTC

    Thanks for this. I’m in desperate need of an RSS/Atom parser that’s significantly faster than FeedTools. What’s the best way to use Nokogiri to parse feeds? The Reader object?

  20. Aaron Patterson Sat, 01 Nov 2008 18:47:44 UTC

    @Luigi:

    You should be able to modify the HTML example I have in the readme. Just use XML instead of HTML, and use the rss url.

    http://github.com/tenderlove/nokogiri/tree/master

  21. Mike Ferrier Sat, 01 Nov 2008 19:17:12 UTC

    Thanks for releasing this nice drop-in replacement for Hpricot, it’s offering quite a speed improvement over Hpricot in some software I’m working on that parses quite a bit of XML.

    One problem I’m running into, however, is a segfault that I’m having a hard time reproducing. That is to say, one particular Nokogiri::XML::Node is causing a segfault if any search is run on it, but only in the middle of a large XML parsing routine, and if I try to reproduce the segfault by loading up the offending XML into a new Nokogiri::XML::Node in a clean ruby instance, it doesn’t segfault.

    Do you have any advice on how to track a bug like this down?

    Here’s the XML giving me the problem, if you’re interested: http://pastie.org/private/a3zzvk11ddfkqjszdc9ag

    And here’s what the segfault looks like:

    http://pastie.org/private/mddg3ixtzqikju2lw6iisg

  22. Aaron Patterson Sat, 01 Nov 2008 19:30:20 UTC

    Hey Mike. That code doesn’t seem to crash for me. Can you give me your setup information?

    What OS are you using, which version of ruby, and also can you print out these constants from nokogiri:

    Nokogiri::VERSION
    Nokogiri::LIBXML_VERSION

    Thanks!

  23. Mike Ferrier Sun, 02 Nov 2008 10:35:53 UTC

    Thanks for the response.

    $ irb -rubygems -rnokogiri
    irb(main):001:0> RUBY_VERSION
    => "1.8.7"
    irb(main):002:0> `uname -a`
    => "Linux mikeulas 2.6.27-7-generic #1 SMP Thu Oct 30 04:18:38 UTC 2008 i686 GNU/Linux\n"
    irb(main):003:0> Nokogiri::VERSION
    => "0.0.0"
    irb(main):004:0> Nokogiri::LIBXML_VERSION
    => "2.6.31"
    irb(main):005:0> `gem list | grep -i noko`
    => "tenderlove-nokogiri (0.0.0.20081021110113)\n"

    Version zero seems weird, did I somehow get a super early gem installed?

  24. Aaron Patterson Sun, 02 Nov 2008 11:58:43 UTC

    @Mike: Looks like you’ve got a development version installed. Try uninstalling that, and getting the official released gem. ‘gem install nokogiri’.

    The latest released version is 1.0.2

  25. Mike Ferrier Sun, 02 Nov 2008 12:52:43 UTC

    Hmm, ran into the same thing again, and I can reproduce it on my machine; however, I can’t reproduce it on another linux machine running ruby 1.8.6 instead of my 1.8.7, so I’m going to assume this is an incompatibility of some kind with 1.8.7. The libxml versions are slightly different as well (2.6.32 vs 2.6.27).

    crash.xml – http://pastie.org/private/lowzxqpb9vqjq5lwr81pa
    crash.rb – http://pastie.org/private/za4lqzfdo0sxnfun1fyxg

    Versions on crash system
    irb(main):001:0> RUBY_VERSION
    => "1.8.7"
    irb(main):002:0> `uname -a`
    => "Linux mikeulas 2.6.27-7-generic #1 SMP Thu Oct 30 04:18:38 UTC 2008 i686 GNU/Linux\n"
    irb(main):004:0> Nokogiri::VERSION
    => "1.0.2"
    irb(main):005:0> Nokogiri::LIBXML_VERSION
    => "2.6.32"

    Versions on non-crashing system
    irb(main):001:0> RUBY_VERSION
    => "1.8.6"
    irb(main):002:0> `uname -a`
    => "Linux ethan 2.6.20-17-server #2 SMP Wed Aug 20 16:54:26 UTC 2008 i686 GNU/Linux\n"
    irb(main):003:0> Nokogiri::VERSION
    => "1.0.2"
    irb(main):004:0> Nokogiri::LIBXML_VERSION
    => "2.6.27"

    Any insight on this? Did you develop this on 1.8.6 or 1.8.7?

  26. Mike Ferrier Sun, 02 Nov 2008 12:54:19 UTC

    Oh, btw, the crash looks like:

    $ ruby crash.rb
    crash.rb:13: [BUG] Segmentation fault
    ruby 1.8.7 (2008-08-11 patchlevel 72) [i486-linux]

    Aborted

  27. Aaron Patterson Sun, 02 Nov 2008 14:03:35 UTC

    @Mike: I normally develop with 1.8.6, but all tests pass with 1.8.7. I can’t seem to reproduce this with 1.8.7 either.

    Can you add a puts or something at the end of the script and see if that is outputted? That would help me determine if the problem is a GC problem or something else.

  28. Mike Ferrier Sun, 02 Nov 2008 15:34:15 UTC

    I added puts "test" on the last line of crash.rb, but no change to the output. Still segfault on line 13, no “test” printed.

  29. Aaron Patterson Sun, 02 Nov 2008 15:43:48 UTC

    @Mike

    Okay. Try this: ‘gdb ruby’. Then when you’re in the gdb shell do ‘run crash.rb’. When ruby segfaults, it will drop you back in to the gdb shell. Type ‘bt’ in the gdb shell and send me the output.

  30. jiah's me2DAY Sun, 02 Nov 2008 22:59:17 UTC

    지아의 생각…

    Nokogiri, XML/HTML parsing library for ruby…

  31. Mike Ferrier Mon, 03 Nov 2008 07:37:08 UTC

    Okay, emailed you the results.

  32. Tom Taylor Mon, 03 Nov 2008 08:09:14 UTC

    Aaron – thanks a lot for Nokogiri – looks excellent.

    I think I’ve found a bug though – is there somewhere better than here for me to file it? Lighthouse?

  33. Aaron Patterson Mon, 03 Nov 2008 09:35:00 UTC

    Hey Tom. I’m trying to get a lighthouse project going, but that seems to be a challenge at this point. Just file the ticket on rubyforge, or send it to me for now.

    http://rubyforge.org/tracker/?atid=27368&group_id=7062&func=browse

  34. Mike Ferrier Mon, 03 Nov 2008 11:14:24 UTC

    @Aaron I downgraded to ruby 1.8.6. and now it doesn’t crash… if that helps at all :)

  35. Mike Ferrier Mon, 03 Nov 2008 11:36:41 UTC

    Hmm, spoke too soon… it just crashes in a different spot. Emailed you the info.
    I can’t reproduce this on the two other machines I tried, so I suppose there’s something stupid about my machine that’s causing this.

  36. The Merbist » Blog Archive » Merb 1.0 RC5 Mon, 03 Nov 2008 19:50:29 UTC

    [...] view helpers now use nokogiri (with rexml as fallback [...]

  37. [...] view helpers now use nokogiri (with rexml as fallback [...]

  38. Chris Olsen Sun, 09 Nov 2008 14:21:50 UTC

    It is failing two of the tests for me, which kills the installation. http://pastie.org/310956

    I am using libxml v2.6.30.

  39. Aaron Patterson Sun, 09 Nov 2008 23:01:51 UTC

    Hey Chris!

    It should work now. Pull down version nokogiri 1.0.4, and it should install just fine.

  40. Chris Olsen Mon, 10 Nov 2008 07:04:04 UTC

    It installed smoothly this time :) For future reference, is there a way to install gems without running the tests? I see there is a --no-test param that can be added, but that didn’t seem to stop them from running.

    Thanks Aaron.

  41. Rudolfs Tue, 11 Nov 2008 08:10:44 UTC

    I had an error in nokogiri.rb line 13 about using “nil.+”, so I changed it a bit and now it seems to work:

    if RUBY_PLATFORM =~ /mswin/i
      if ENV['PATH'].nil?
        ENV['PATH'] = ";" + File.expand_path(File.join(File.dirname(__FILE__), "..", "ext", "nokogiri"))
      else
        ENV['PATH'] += ";" + File.expand_path(File.join(File.dirname(__FILE__), "..", "ext", "nokogiri"))
      end
    end

    Or am I doing something wrong here?

  42. Thomas Tue, 11 Nov 2008 09:07:36 UTC

    “sudo gem install nokogiri” fails with Error: Failed to build gem native extension (Nokogiri 1.0.4), on OS X with all updated gems. Any pointers? Seems like a cool gem.

  43. Derek Wed, 12 Nov 2008 00:09:05 UTC

    Apparently the installer attempts to call the rubyforge command that’s installed with the rubyforge gem.

    However, the gem spec doesn’t list rubyforge as a dependency, so the install will fail for folks who don’t have rubyforge installed beforehand.

  44. [...] Nokogiri As already mentioned, we switched to nokogiri before 1.0. Nokogiri is based on libxml2 which is a fast and reliable XML parser. I won’t get into the hpricot vs nokogiri argument, but let’s just say that nokogiri was fitting our needs better. (Nokogiri is only used in the specs, and therefore is not needed to run Merb) [...]

  45. Spencer Miles Thu, 27 Nov 2008 13:19:33 UTC

    Hi Aaron,

    I’m wanting to switch from Hpricot to Nokogiri, hoping that it’ll fix the horrible memory leaks in hpricot. I’m running into some problems on OS X though.

    Here’s an example:

    require 'open-uri'; require 'nokogiri'; doc = Nokogiri.HTML(open('http://stereogum.com')); doc.css('*')

    Assertion failed: (node->doc), function Nokogiri_wrap_xml_node, file xml_node.c, line 518.

    Abort trap

    When I call doc.css(‘*’) it’s segfaulting, and drops me out of irb and back into bash.

    On my linux production server, this works fine, but on OS X it’s segfaulting.

    I’m using libxml 2.6.16 and Ruby 1.8.6 patchlevel 114.

    Any ideas?

    Spencer

  46. Crónica de una vida - Ruby HTML/XML Parsers Mon, 22 Dec 2008 15:40:49 UTC

    [...] nokogiri ( Aaron Patterson): Esta es una librería más reciente que es muy parecida a hpricot. La gran [...]

  47. Valery Thu, 12 Mar 2009 10:01:32 UTC

    This is an awesome plugin. It is several times faster than hpricot. Thank you.

  48. Kris Thu, 19 Mar 2009 04:48:38 UTC

    I’m having a problem installing on Ubuntu (On my Mac development machine everything was fine and dandy).

    I installed libxml2 and libxml2-dev, but when I go to install the gem it says the headers are missing.

    checking for #include … no
    libxml2 is missing. try 'port install libxml2' or 'yum install libxml2'
    *** extconf.rb failed ***

    Any ideas please?

  49. Kris Thu, 19 Mar 2009 05:54:59 UTC

    Don’t know if this helps:

    sudo find / -name 'parser.h'
    /usr/include/gnome-xml/libxml/parser.h
    /usr/include/gnome-xml/parser.h

  50. Aaron Patterson Thu, 19 Mar 2009 08:28:47 UTC

    Make sure you also install “libxml2-devel” and “libxslt-devel”

  51. Kris Fri, 20 Mar 2009 04:14:39 UTC

    Do you mean libxml2-dev and libxslt-dev?

  52. Kris Fri, 20 Mar 2009 04:25:32 UTC

    [SOLVED] – many thanks!

    sudo apt-get update
    sudo apt-get install libxslt1-dev
    sudo apt-get install libxml2-dev
    sudo gem install nokogiri

  53. GenFoch Sun, 19 Apr 2009 16:01:31 UTC

    Hi, I’m really new to ruby so please bear with me. I have installed ruby and rubygems and other gems I have installed do work. I have been requested to install nokogiri-1.2.3 but I get the following error:

    checking for xmlParseDoc() in -lxml2… no
    libxml2 is missing. try 'port install libxml2' or 'yum install libxml2'

    however looking at /usr/lib I see libxml2.a and looking in it I can see DOCBparser which I think contains the xmlParseDoc, so it seems like everything is there.

    I have ruby 1.8.6 (2007-03-13 patchlevel 0) [i486-linux] on the system and updated to rubygems 1.3.2 in my attempt to solve this issue.

    I see there is a message like “… details. You may need configuration options. Provided configuration options:” followed by a bunch of options, but I’m not sure where to start.

    Thanks for your time.

  54. MikeW Tue, 15 Sep 2009 08:34:41 UTC

    I’m having the same issue as GenFoch after installing Snow Leopard. I’ve even gone so far as to try to install lxml2 manually. I’ve already used port to install libxml2 and libxslt. Every solution I’ve seen says to install libxml2-dev and libxslt-dev, but those are solutions for Linux distros, not OS X. Any way to fix this on a Mac?

  55. Aaron Patterson Tue, 15 Sep 2009 08:41:10 UTC

    @MikeW I run Snow Leopard, and it installs just fine for me.

    Would you mind sending an email to the mailing list that includes the version of nokogiri you’re trying to install, your “ruby -v”, and also the version of XCode you have installed? We can continue debugging it from there!

    Here is a link to the mailing list: http://groups.google.com/group/nokogiri-talk

  56. [...] a shout out to mechanize I will never use another screenscraping library again. It uses nokogiri (so it parses all the html into a nice xpath accessible form), it handles all the cookie session [...]
