Tenderlove Making

String Encoding in Ruby 1.9 C extensions

One of the challenges of developing Nokogiri has been dealing with String encodings in C. I would like to present one of the problems I encountered, along with a solution. I will be using RubyInline in the examples below, but the C code presented should be easy to port to your own C extensions.

If you’ve developed a C extension before, you’re probably familiar with rb_str_new2 and friends. They all basically turn a char * into a string VALUE. But in Ruby 1.9, what is the encoding of the returned Ruby String? Using RubyInline, it’s easy enough to find out by calling the “encoding” method. Here is a script that works in Ruby 1.8 and Ruby 1.9:

require 'rubygems'
require 'inline'

class HelloWorld
  inline do |builder|
    builder.c '
      static VALUE test() {
        return rb_str_new2("Hello world");
      }
    '
  end
end

string = HelloWorld.new.test

if string.respond_to? :encoding
  puts string.encoding
else
  puts string
end

In Ruby 1.8, this prints the string; in 1.9, it prints the encoding, which is ASCII-8BIT. Now ASCII-8BIT may be the encoding that you want, but then again, it may not. In Nokogiri, the strings coming from libxml2 are already encoded according to the document declaration, so the strings we return must be marked with the appropriate encoding. How can we update the encoding? Ruby 1.9 gives us two C functions for the job.

The first function, rb_enc_find_index, takes a char * naming an encoding, such as “UTF-8”, and returns the magic index number Ruby uses for that encoding.

The second function, rb_enc_associate_index, will associate a string held in a VALUE with the encoding index returned from the first function.
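If it helps to see the same idea without any C, here is a rough Ruby-level analogue. It is only a sketch: label_with is a helper name I made up for illustration, and Encoding.find plus String#force_encoding stand in for the two C functions rather than being the C API itself. The point it shows is that associating an encoding only relabels the string; the bytes are left alone.

```ruby
# Hypothetical helper (not part of Ruby or Nokogiri): relabel a copy of a
# string with the named encoding, leaving its bytes untouched.
def label_with(str, encoding_name)
  encoding = Encoding.find(encoding_name) # look the encoding up by name
  str.dup.force_encoding(encoding)        # relabel only; nothing is transcoded
end

# Start from a binary-labelled string, like the one rb_str_new2 gave us in 1.9.
raw = "Hello world".dup.force_encoding(Encoding::ASCII_8BIT)
labeled = label_with(raw, "UTF-8")

puts labeled.encoding           # UTF-8
puts labeled.bytes == raw.bytes # true
```

Note that this only runs on Ruby 1.9 or later, since 1.8 has no Encoding class.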

Armed with this knowledge, we can modify our original program to return a string encoded with UTF-8. The only modifications are to include <ruby/encoding.h>, get the index for the desired encoding, then associate the VALUE with the returned index:

require 'rubygems'
require 'inline'

class HelloWorld
  inline do |builder|
    builder.include "<ruby/encoding.h>"

    builder.c '
      static VALUE test() {
        VALUE string = rb_str_new2("Hello World");
        int enc = rb_enc_find_index("UTF-8");
        rb_enc_associate_index(string, enc);
        return string;
      }
    '
  end
end

string = HelloWorld.new.test

if string.respond_to? :encoding
  puts string.encoding
else
  puts string
end

Great! When this is run under Ruby 1.9, the encoding returned is UTF-8. Unfortunately, this example is now specific to Ruby 1.9. Ruby 1.8 does not ship with ruby/encoding.h, and definitely does not include the functions for looking up and assigning encodings, so this code simply will not work under Ruby 1.8. Luckily, it can be refactored to work under either version of Ruby by testing for HAVE_RUBY_ENCODING_H, which is defined when the encoding header is available, and hiding the difference behind a macro:

require 'rubygems'
require 'inline'

class HelloWorld
  inline do |builder|

    builder.prefix <<-eoc
#include <ruby.h>

#ifdef HAVE_RUBY_ENCODING_H

#include <ruby/encoding.h>

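/* The ({ ... }) form is a GCC statement expression, so this macro needs
   GCC or a compatible compiler. */
#define ENCODED_STR_NEW2(str, encoding) \
  ({ \
    VALUE _string = rb_str_new2((const char *)(str)); \
    int _enc = rb_enc_find_index(encoding); \
    rb_enc_associate_index(_string, _enc); \
    _string; \
  })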

#else

#define ENCODED_STR_NEW2(str, encoding) \
  rb_str_new2((const char *)str)

#endif
    eoc

    builder.c '
      static VALUE test() {
        return ENCODED_STR_NEW2("Hello world", "UTF-8");
      }
    '
  end
end

string = HelloWorld.new.test

if string.respond_to? :encoding
  puts string.encoding
else
  puts string
end

In 1.8, the macro just returns the new string. In 1.9, the macro returns the string and additionally sets the encoding. Now if we use this macro wherever we create new strings, we’ll be working well with 1.8 and 1.9!
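The same "label it if you can, otherwise hand it back untouched" shape can be sketched in plain Ruby. This is just an illustration, and encoded_str is a hypothetical name of mine: it checks for force_encoding at runtime, where the C macro checks for HAVE_RUBY_ENCODING_H at compile time.

```ruby
# Hypothetical Ruby-level mirror of ENCODED_STR_NEW2.
def encoded_str(str, encoding_name)
  copy = str.dup
  # Ruby 1.8 strings have no force_encoding, so there is nothing to label.
  return copy unless copy.respond_to?(:force_encoding)
  copy.force_encoding(encoding_name)
end

string = encoded_str("Hello world", "UTF-8")
puts string # Hello world
```

Callers never need to know which Ruby they are running on, which is the whole point of routing every new string through one helper.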

Also, if you’re playing along at home, remember to save the file between running it with 1.8 and 1.9. RubyInline examines the mtime of the ruby file, and will only recompile when the rb file has been written to. That means if you run it with 1.8, then immediately run again with 1.9, it won’t recompile it for 1.9. I suppose I should send in a patch. ;-)

One last thing… There may be better ways to do this. I needed to determine the encoding at runtime because XML files declare their encoding scheme. If you parse an XML file that declares its encoding as EUC-JP, it would make sense that the strings you pull out are encoded in EUC-JP, right? If you know that you’re always going to be returning UTF-8 strings from your C extensions, it could be a different story. Either way, using macros and checking for constants should make sure your code works with 1.8 or 1.9.
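To make the runtime case a bit more concrete, here is a small Ruby sketch of labelling bytes with whatever encoding a document declared. string_from_parser is a hypothetical stand-in I invented for this post, not anything Nokogiri actually exposes, and the validity check is my own addition. It needs Ruby 1.9 or later.

```ruby
# Hypothetical stand-in for bytes handed back by a parser, together with the
# encoding name the document declared.
def string_from_parser(bytes, declared_encoding)
  str = bytes.dup.force_encoding(declared_encoding)
  # valid_encoding? checks that the bytes are well formed for the new label.
  unless str.valid_encoding?
    raise ArgumentError, "bytes are not valid #{declared_encoding}"
  end
  str
end

japanese = "\u65E5\u672C\u8A9E" # the word "nihongo", as a UTF-8 string
# Simulate raw EUC-JP bytes arriving with only a binary label.
euc_bytes = japanese.encode("EUC-JP").force_encoding(Encoding::ASCII_8BIT)

str = string_from_parser(euc_bytes, "EUC-JP")
puts str.encoding                    # EUC-JP
puts str.encode("UTF-8") == japanese # true
```

Because the label is correct, Ruby can transcode the string to UTF-8 later if the caller wants that.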
