Hey internet. How are you doing? Ya. It’s been a while. I know, I know. I suck at blogging. Couldn’t you tell by my horrible layout? But seriously, I’ve been really busy lately. We used to have such good times together. I’d write a blog post, you would show it to everyone on the internet. But that spark just doesn’t seem to be there anymore.
Well, I’m doing my best to keep this relationship together. With the help of super awesome Ruby hacker Mike Dalessio, I wrote an XML/HTML parsing library for Ruby called Nokogiri. What is so great about Nokogiri? Well, for one, it is really easy to parse HTML or XML:
require 'nokogiri'
doc = Nokogiri::HTML(<<-eohtml)
<html>
<body>
<div id="wrapper">
<h1>Hello world</h1>
<p>Paragraph</p>
</div>
</body>
</html>
eohtml
Oh, I know what you’re saying internet. Ya, sure, it’s easy to parse, but is it easy to search? Well, it is. I promise. You know XPath, right? Well you can search by XPath very easily:
doc.xpath('//p').each do |paragraph|
  puts paragraph.text
end
Oh, you don’t know XPath very well? That’s OK. I know you know CSS. You use it everywhere! I’ve viewed your source (*wink* *wink*). Well you can search using CSS selectors as well:
doc.css('div#wrapper').each do |div_with_wrapper_id|
  puts div_with_wrapper_id['id']
end
Oh, I see how it is. You don’t want to commit. You want to search with CSS selectors *and* XPath. Well fine. You can have that too. Just use the “search” method, and you can mix and match your selectors:
doc.search('//p', 'div#wrapper').each do |node|
  puts node.name
end
Well, I hope you’re feeling better about our relationship now. I just want to tell you that you shouldn’t worry about that old legacy code that uses Hpricot. Nokogiri can be used as a drop-in replacement! Really! Nokogiri doesn’t reproduce the bugs that are in Hpricot, but should work in most cases. Just use “Nokogiri::Hpricot()” to parse your HTML. Of course, I’ve tried to keep the syntax of Hpricot that I like. For example, you can use slashes for searching and subsearching:
(doc/'div').each { |div| puts div.at('p').text }
You even get a speed increase. For free!
Want to install Nokogiri? No problem. Just do “gem install nokogiri”. It’s that easy!
Well, now that we’re back together, why don’t you send some twitters if you like it! Thanks innernet. I promise to update you more often. I swear.


Do you mean libxml2-dev and libxslt-dev?
[SOLVED] – many thanks!
sudo apt-get update
sudo apt-get install libxslt1-dev
sudo apt-get install libxml2-dev
sudo gem install nokogiri
Hi, I’m really new to ruby so please bear with me. I have installed ruby and rubygems and other gems I have installed do work. I have been requested to install nokogiri-1.2.3 but I get the following error:
checking for xmlParseDoc() in -lxml2... no
libxml2 is missing. try 'port install libxml2' or 'yum install libxml2'
However, looking in /usr/lib I see libxml2.a, and inside it I can see DOCBparser, which I think contains xmlParseDoc, so it seems like everything is there.
I have ruby 1.8.6 (2007-03-13 patchlevel 0) [i486-linux] on the system and updated to rubygems 1.3.2 in my attempt to solve this issue.
I see there is a message like “details. You may need configuration options.” followed by “Provided configuration options:” and a bunch of options, but I’m not sure where to start.
Thanks for your time.
I’m having the same issue as GenFoch after installing Snow Leopard. I’ve even gone so far as to try to install lxml2 manually. I’ve already used port to install libxml2 and libxslt. Every solution I’ve seen says to install libxml2-dev and libxslt-dev, but those are solutions for Linux distros, not OS X. Any way to fix this on a Mac?
@MikeW I run Snow Leopard, and it installs just fine for me.
Would you mind sending an email to the mailing list that includes the version of nokogiri you’re trying to install, the output of “ruby -v”, and the version of Xcode you have installed? We can continue debugging it from there!
Here is a link to the mailing list: http://groups.google.com/group/nokogiri-talk
[...] a shout out to mechanize I will never use another screenscraping library again. It uses nokogiri (so it parses all the html into a nice xpath accessible form), it handles all the cookie session [...]