Hey internet. How are you doing? Ya. It’s been a while. I know, I know. I suck at blogging. Couldn’t you tell by my horrible layout? But seriously, I’ve been really busy lately. We used to have such good times together. I’d write a blog post, you would show it to everyone on the internet. But that spark just doesn’t seem to be there anymore.
Well, I’m doing my best to keep this relationship together. With the help of super awesome ruby hacker Mike Dalessio, I wrote an XML/HTML parsing library for ruby called Nokogiri. What is so great about Nokogiri? Well, for one it is really easy to parse HTML or XML:
require 'nokogiri'
doc = Nokogiri::HTML(<<-eohtml)
<html>
<body>
<div id="wrapper">
<h1>Hello world</h1>
<p>Paragraph</p>
</div>
</body>
</html>
eohtml
Oh, I know what you’re saying internet. Ya, sure, it’s easy to parse, but is it easy to search? Well, it is. I promise. You know XPath, right? Well you can search by XPath very easily:
doc.xpath('//p').each do |paragraph|
puts paragraph.text
end
Oh, you don’t know XPath very well? That’s OK. I know you know CSS. You use it everywhere! I’ve viewed your source (*wink* *wink*). Well you can search using CSS selectors as well:
doc.css('div#wrapper').each do |div_with_wrapper_id|
puts div_with_wrapper_id['id']
end
Oh, I see how it is. You don’t want to commit. You want to search with CSS selectors *and* XPath. Well fine. You can have that too. Just use the “search” method, and you can mix and match your selectors:
doc.search('//p', 'div#wrapper').each do |node|
puts node.name
end
Well, I hope you’re feeling better about our relationship now. I just want to tell you that you shouldn’t worry about that old legacy code that uses Hpricot. Nokogiri can be used as a drop-in replacement! Really! Nokogiri doesn’t reproduce the bugs that are in Hpricot, but should work in most cases. Just use “Nokogiri::Hpricot()” to parse your HTML. Of course, I’ve tried to keep the syntax of Hpricot that I like. For example, you can use slashes for searching and subsearching:
(doc/'div').each { |div| puts div.at('p').text }
You even get a speed increase. For free!
Want to install Nokogiri? No problem. Just do “gem install nokogiri”. It’s that easy!
Well, now that we’re back together, why don’t you send some twitters if you like it! Thanks innernet. I promise to update you more often. I swear.
Hi Aaron, wanted to try it out but I’m having problems installing it: http://pastie.org/304781
Well what are the benefits over hpricot?
Have you benchmarked it?
khelll: see http://gist.github.com/18533 (link at bottom of post – ’speed increase’)
Saimon: What version of libxml do you have installed? It looks like your build went just fine. Unfortunately that test returns different numbers depending on what version of libxml you have installed.
Hi Aaron,
You mean the gem? libxml-ruby (0.8.3)
libxml2 2.6.32, Revision 1, textproc/libxml2 (Variants: universal, debug)
Just plain libxml2. Thanks. I’ll have to add an exception for that version. The problem is that libxml2 corrects broken html differently depending on what version you’re using. I have tests which attempt to correct broken HTML, and so the version of libxml2 may make them break.
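One way to express that kind of per-version exception is to compare version strings with Gem::Version, which ships with RubyGems. This is only a sketch: the helper name and the list of versions are assumptions for illustration, not what Nokogiri's test suite actually does. In a real suite you would pass in Nokogiri::LIBXML_VERSION.

```ruby
require 'rubygems'

# Hypothetical helper: true when the installed libxml2 is one of the
# versions known to repair broken HTML differently.
# `known_quirky` is an illustrative list, not taken from Nokogiri's tests.
def quirky_libxml?(installed, known_quirky = ['2.6.32'])
  current = Gem::Version.new(installed)
  known_quirky.any? { |v| Gem::Version.new(v) == current }
end

# In a real suite you would pass Nokogiri::LIBXML_VERSION here.
puts quirky_libxml?('2.6.32')
puts quirky_libxml?('2.6.27')
```

A test that depends on how libxml2 corrects broken HTML could then be skipped, or given different expected values, whenever the helper returns true.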
You should be good to go though!
cool thanks…
Yay! Thanks, this will be very useful, especially since hpricot seems to be dead on the updates/fixes front lately…
[...] Aaron Patterson (@tenderlove) and Mike Dalessio released Nokogiri (Github repository), a new HTML and XML parser for Ruby. It “parses and searches XML/HTML faster [...]
Hi–sounds awesome but I’m having a little trouble converting from Hpricot. I have an app that scrapes some nightmare HTML that in turn contains very ugly data. I’m getting a failure on the following expression, which as far as I can tell is valid XPath and works in Hpricot:
http://pastie.org/305175
Thanks for any help.
Hi Evan! Can you post a larger snippet of your code?
Sure, and thanks very much. I updated the original pastie, let me know if you need more:
http://pastie.org/305175
You are fucking heroes!
I’ve been using hpricot for about a year now and have been getting more concerned about the fact that it appears to be entirely unmaintained. What’s more I’m a mere mortal and as such the innards scare the shit out of me. It has a number of bugs and I’m kinda thinking that ruby 1.9 and hpricot will never happen.
But no longer. I spent about an hour porting my code this morning and it just works. There is some tweaking I will need to do but given that an html parser is pivotal to the application that’s not really surprising.
Oh, and just for the record it’s a little over twice as fast as hpricot in my application. I should note that the application probably doesn’t follow the most common use cases so your mileage will almost certainly vary.
rgh
Oh and if you are using a Debian based Linux distro you will need to install libxslt1-dev and libxml2-dev.
rgh
Will it be integrated into www::mechanize and how soon?
Thanks, looks promising.
How does this compare to libxml-ruby ?
@PhoeniX: Yes, it will be integrated with WWW::Mechanize. I am working on the Hpricot deprecation roadmap now. I hope Nokogiri will be integrated with Mechanize by version 1.0 (possibly around January, I’m not sure).
@Jon: I found that it is slightly slower than libxml-ruby, but that is because of marshalling costs. And those will diminish over time. Also, I believe my interface to be much cleaner than libxml-ruby, and the implementation to be much nicer.
Thanks for this. I’m in desperate need of an RSS/Atom parser that’s significantly faster than FeedTools. What’s the best way to use Nokogiri to parse feeds? The Reader object?
@Luigi:
You should be able to modify the HTML example I have in the readme. Just use XML instead of HTML, and use the rss url.
http://github.com/tenderlove/nokogiri/tree/master
Thanks for releasing this nice drop-in replacement for Hpricot, it’s offering quite a speed improvement over Hpricot in some software I’m working on that parses quite a bit of XML.
One problem I’m running into, however, is a segfault that I’m having a hard time reproducing. That is to say, one particular Nokogiri::XML::Node is causing a segfault if any search is run on it, but only in the middle of a large XML parsing routine, and if I try to reproduce the segfault by loading up the offending XML into a new Nokogiri::XML::Node in a clean ruby instance, it doesn’t segfault.
Do you have any advice on how to track a bug like this down?
Here’s the XML giving me the problem, if you’re interested: http://pastie.org/private/a3zzvk11ddfkqjszdc9ag
And here’s what the segfault looks like:
http://pastie.org/private/mddg3ixtzqikju2lw6iisg
Hey Mike. That code doesn’t seem to crash for me. Can you give me your setup information?
What OS are you using, which version of ruby, and also can you print out these constants from nokogiri:
Nokogiri::VERSION
Nokogiri::LIBXML_VERSION
Thanks!
Thanks for the response.
$ irb -rubygems -rnokogiri
irb(main):001:0> RUBY_VERSION
=> "1.8.7"
irb(main):002:0> `uname -a`
=> "Linux mikeulas 2.6.27-7-generic #1 SMP Thu Oct 30 04:18:38 UTC 2008 i686 GNU/Linux\n"
irb(main):003:0> Nokogiri::VERSION
=> "0.0.0"
irb(main):004:0> Nokogiri::LIBXML_VERSION
=> "2.6.31"
irb(main):005:0> `gem list | grep -i noko`
=> "tenderlove-nokogiri (0.0.0.20081021110113)\n"
Version zero seems weird, did I somehow get a super early gem installed?
@Mike: Looks like you’ve got a development version installed. Try uninstalling that, and getting the official released gem. ‘gem install nokogiri’.
The latest released version is 1.0.2
Hmm, ran into the same thing again, and I can reproduce it on my machine; however, I can’t reproduce it on another linux machine running ruby 1.8.6 instead of my 1.8.7, so I’m going to assume this is an incompatibility of some kind with 1.8.7. The libxml versions are slightly different as well (2.6.32 vs 2.6.27).
crash.xml – http://pastie.org/private/lowzxqpb9vqjq5lwr81pa
crash.rb – http://pastie.org/private/za4lqzfdo0sxnfun1fyxg
Versions on crash system
irb(main):001:0> RUBY_VERSION
=> "1.8.7"
irb(main):002:0> `uname -a`
=> "Linux mikeulas 2.6.27-7-generic #1 SMP Thu Oct 30 04:18:38 UTC 2008 i686 GNU/Linux\n"
irb(main):004:0> Nokogiri::VERSION
=> "1.0.2"
irb(main):005:0> Nokogiri::LIBXML_VERSION
=> "2.6.32"
Versions on non-crashing system
irb(main):001:0> RUBY_VERSION
=> "1.8.6"
irb(main):002:0> `uname -a`
=> "Linux ethan 2.6.20-17-server #2 SMP Wed Aug 20 16:54:26 UTC 2008 i686 GNU/Linux\n"
irb(main):003:0> Nokogiri::VERSION
=> "1.0.2"
irb(main):004:0> Nokogiri::LIBXML_VERSION
=> "2.6.27"
Any insight on this? Did you develop this on 1.8.6 or 1.8.7?
Oh, btw, the crash looks like:
$ ruby crash.rb
crash.rb:13: [BUG] Segmentation fault
ruby 1.8.7 (2008-08-11 patchlevel 72) [i486-linux]
Aborted
@Mike: I normally develop with 1.8.6, but all tests pass with 1.8.7. I can’t seem to reproduce this with 1.8.7 either.
Can you add a puts or something at the end of the script and see if that is outputted? That would help me determine if the problem is a GC problem or something else.
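A related trick for telling GC problems apart from other crashes, which is not something suggested in this thread, is to make the garbage collector run as often as possible around the suspect code with GC.stress. The helper name below is made up and the block body is only a stand-in for the real parsing code, so treat this as a sketch rather than a recipe.

```ruby
# Hypothetical helper: run a block with GC.stress enabled so the garbage
# collector runs at every opportunity, then restore the previous setting.
# If a crash shows up sooner or more reliably under stress, a GC-related
# bug in a C extension becomes a more likely suspect.
def with_gc_stress
  previous = GC.stress
  GC.stress = true
  yield
ensure
  GC.stress = previous
end

# Stand-in for the code that crashes; replace with the real parsing calls.
result = with_gc_stress { (1..100).map { |i| i.to_s }.join(',').length }
puts result
```

Code runs much more slowly under GC.stress, so it is best wrapped around the smallest block that still reproduces the problem.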
I added puts "test" on the last line of crash.rb, but no change to the output. Still segfault on line 13, no “test” printed.
@Mike: Okay. Try this: ‘gdb ruby’. Then when you’re in the gdb shell do ‘run crash.rb’. When ruby segfaults, it will drop you back in to the gdb shell. Type ‘bt’ in the gdb shell and send me the output.
Jia’s thoughts…
Nokogiri, XML/HTML parsing library for ruby…
Okay, emailed you the results.
Aaron – thanks a lot for Nokogiri – looks excellent.
I think I’ve found a bug though – is there somewhere better than here for me to file it? Lighthouse?
Hey Tom. I’m trying to get a lighthouse project going, but that seems to be a challenge at this point. Just file the ticket on rubyforge, or send it to me for now.
http://rubyforge.org/tracker/?atid=27368&group_id=7062&func=browse
@Aaron I downgraded to ruby 1.8.6. and now it doesn’t crash… if that helps at all
Hmm, spoke too soon… it just crashes in a different spot. Emailed you the info.
I can’t reproduce this on the two other machines I tried, so I suppose there’s something stupid about my machine that’s causing this.
[...] view helpers now use nokogiri (with rexml as fallback [...]
It is failing two of the tests for me, which kills the installation. http://pastie.org/310956
I am using libxml v2.6.30.
Hey Chris!
It should work now. Pull down version nokogiri 1.0.4, and it should install just fine.
It installed smoothly this time
For future reference, is there a way to install gems without running the tests? I see there is a --no-test param that can be added, but that didn’t seem to stop them from running.
Thanks Aaron.
I had an error in nokogiri.rb line 13 about using “nil.+”, so I changed it a bit and now it seems to work:
if RUBY_PLATFORM =~ /mswin/i
  if ENV['PATH'].nil?
    ENV['PATH'] = ";" + File.expand_path(File.join(File.dirname(__FILE__), "..", "ext", "nokogiri"))
  else
    ENV['PATH'] += ";" + File.expand_path(File.join(File.dirname(__FILE__), "..", "ext", "nokogiri"))
  end
end
Or am I doing something wrong here?
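The same nil-safe append can be written without the if/else branch. This is just one possible shape, and the helper name is made up for illustration. Unlike the snippet above, it does not leave a leading separator when PATH starts out unset.

```ruby
# Directory to append; the path segments mirror the snippet above.
ext_dir = File.expand_path(File.join(File.dirname(__FILE__), '..', 'ext', 'nokogiri'))

# Hypothetical helper: append a directory to a PATH-style string,
# tolerating a nil or empty starting value.
def append_to_path(current, dir, separator = File::PATH_SEPARATOR)
  [current, dir].compact.reject { |part| part.empty? }.join(separator)
end

puts append_to_path(nil, ext_dir)
puts append_to_path('C:\\ruby\\bin', ext_dir, ';')
```

On Windows it could be used as ENV['PATH'] = append_to_path(ENV['PATH'], ext_dir, ';') inside the RUBY_PLATFORM check.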
“Sudo gem install nokogiri” fails with Error: Failed to build gem native extension (Nokogiri 1.0.4), on OS X with all updated gems. Any pointers ? Seems like a cool gem.
Apparently the installer attempts to call the rubyforge command that’s installed with the rubyforge gem.
However, the gem spec doesn’t list rubyforge as a dependency, so the install will fail for folks who don’t have rubyforge installed beforehand.
[...] Nokogiri As already mentioned, we switched to nokogiri before 1.0. Nokogiri is based on libxml2 which is a fast and reliable XML parser. I won’t get into the hpricot vs nokogiri argument, but let’s just say that nokogiri was fitting our needs better. (Nokogiri is only used in the specs, and therefore is not needed to run Merb) [...]
Hi Aaron,
I’m wanting to switch from Hpricot to Nokogiri, hoping that it’ll fix the horrible memory leaks in hpricot. I’m running into some problems on OS X though.
Here’s an example:
Assertion failed: (node->doc), function Nokogiri_wrap_xml_node, file xml_node.c, line 518.
Abort trap
When I call doc.css(‘*’) it’s segfaulting, and drops me out of irb and back into bash.
On my linux production server, this works fine, but on OS X it’s segfaulting.
I’m using libxml 2.6.16 and Ruby 1.8.6 patchlevel 114.
Any ideas?
Spencer
[...] nokogiri (Aaron Patterson): This is a more recent library that is very similar to hpricot. The big [...]
This is an awesome plugin. It is several times faster than hpricot. Thank you.
I’m having a problem installing on Ubuntu (On my Mac development machine everything was fine and dandy).
I installed libxml2 and libxml2-dev, but when I go to install the gem it says the headers are missing.
checking for #include … no
libxml2 is missing. try ‘port install libxml2′ or ‘yum install libxml2′
*** extconf.rb failed ***
Any ideas please?
Don’t know if this helps?
sudo find / -name ‘parser.h’
/usr/include/gnome-xml/libxml/parser.h
/usr/include/gnome-xml/parser.h
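A quicker check than searching the whole filesystem is to look for libxml/parser.h under a handful of likely include directories. The sketch below does that in plain Ruby. The helper name and the candidate directories are guesses for illustration and will differ between systems.

```ruby
# Hypothetical helper: return the include directories (from a list of
# candidates) that actually contain libxml/parser.h.
def dirs_with_libxml_header(candidates)
  candidates.select do |dir|
    File.exist?(File.join(dir, 'libxml', 'parser.h'))
  end
end

# Candidate locations are guesses; adjust them for your system.
candidates = %w[
  /usr/include
  /usr/include/libxml2
  /usr/local/include
  /opt/local/include/libxml2
]

found = dirs_with_libxml_header(candidates)
if found.empty?
  puts 'libxml/parser.h not found in the candidate directories'
else
  puts found
end
```

If nothing turns up, the development package that provides the headers is probably not installed.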
Make sure you also install “libxml2-devel” and “libxslt-devel”
Do you mean libxml2-dev and libxslt-dev?
[SOLVED] – many thanks!
sudo apt-get update
sudo apt-get install libxslt1-dev
sudo apt-get install libxml2-dev
sudo gem install nokogiri
Hi, I’m really new to ruby so please bear with me. I have installed ruby and rubygems and other gems I have installed do work. I have been requested to install nokogiri-1.2.3 but I get the following error:
checking for xmlParseDoc() in -lxml2… no
libxml2 is missing. try ‘port install libxml2′ or ‘yum install libxml2′
however looking at /usr/lib I see libxml2.a and looking in it I can see DOCBparser which I think contains the xmlParseDoc, so it seems like everything is there.
I have ruby 1.8.6 (2007-03-13 patchlevel 0) [i486-linux] on the system and updated to rubygems 1.3.2 in my attempt to solve this issue.
I see there is a message like details. You may need configuration options.
Provided configuration options:
and a bunch of options but I’m not sure where to start.
Thanks for your time.
I’m having the same issue as GenFoch after installing Snow Leopard. I’ve even gone so far as to try to install libxml2 manually. I’ve already used port to install libxml2 and libxslt. Every solution I’ve seen says to install libxml2-dev and libxslt-dev, but those are solutions for Linux distros, not OS X. Any way to fix this on a Mac?
@MikeW I run Snow Leopard, and it installs just fine for me.
Would you mind sending an email to the mailing list that includes the version of nokogiri you’re trying to install, your “ruby -v”, and also the version of XCode you have installed? We can continue debugging it from there!
Here is a link to the mailing list: http://groups.google.com/group/nokogiri-talk
[...] a shout out to mechanize I will never use another screenscraping library again. It uses nokogiri (so it parses all the html into a nice xpath accessible form), it handles all the cookie session [...]