Tenderlove Making

Guide to String Encoding in Ruby

Encoding issues don’t seem to happen frequently, but that is a blessing and a curse. It’s great not to fix them very frequently, but when you do need to fix them, lack of experience can leave you feeling lost.

This post is meant to be a sort of guide about what to do when you encounter different types of encoding errors in Ruby. First we’ll cover what an encoding object is, then we’ll look at common encoding exceptions and how to fix them.

What are String encodings in Ruby?

In Ruby, strings are a combination of an array of bytes and an encoding object. We can access the encoding object by calling encoding on the string object.

For example:

>> x = 'Hello World'
>> x.encoding
=> #<Encoding:UTF-8>

In my environment, the default encoding object associated with a string is the “UTF-8” encoding object. A graph of the object relationship looks something like this:

[Figure: string points at encoding]
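To make the “bytes plus encoding object” idea concrete, here is a minimal sketch. The helper name describe is made up for illustration, and the string is written with a \u escape so that it is tagged UTF-8 regardless of the source file’s encoding.

```ruby
# Hypothetical helper: show the two parts of a String side by side.
def describe(string)
  { bytes: string.bytes, encoding: string.encoding }
end

greeting = "Hi \u65E5" # "Hi 日", tagged UTF-8 because of the \u escape

p describe(greeting)

# One character can take several bytes in UTF-8, so the character count
# and the byte count differ.
p greeting.length   # => 4
p greeting.bytesize # => 6
```

The bytes never carry any information about how to interpret them. That information lives only in the encoding object the string points at.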

Changing a String’s Encoding

We can change encoding by two different methods:

  • String#force_encoding
  • String#encode

The force_encoding method mutates the string object and only changes which encoding object the string points to. It does nothing to the bytes of the string; it merely changes the encoding object associated with the string. Here we can see that the return value of encoding changes after we call the force_encoding method:

>> x = 'Hello World'
>> x.encoding
=> #<Encoding:UTF-8>
>> x.force_encoding "US-ASCII"
=> "Hello World"
>> x.encoding
=> #<Encoding:US-ASCII>

The encode method creates a new string by translating the bytes of the old string into the target encoding, and associates the target encoding object with the new string.

Here we can see that the encoding of x remains the same, and calling encode returns a new string y which is associated with the new encoding:

>> x = 'Hello World'
>> x.encoding
=> #<Encoding:UTF-8>
>> y = x.encode("US-ASCII")
>> x.encoding
=> #<Encoding:UTF-8>
>> y.encoding
=> #<Encoding:US-ASCII>

Here is a visualization of the difference:

[Figure: changing encoding]

Calling force_encoding mutates the original string, whereas encode creates a new string with a different encoding. Translating a string from one encoding to another is probably the “normal” use of encodings. However, developers will rarely call the encode method because Ruby will typically handle any necessary translations automatically. It’s probably more common to call the force_encoding method, and that is because strings can be associated with the wrong encoding.
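The difference is easiest to see by looking at the bytes. The following is a small sketch; hex_bytes and retag are hypothetical helper names, and retag works on a copy so the argument is not mutated.

```ruby
# Hypothetical helper: render a string's bytes as hex for easy comparison.
def hex_bytes(string)
  string.bytes.map { |byte| format("%02X", byte) }.join(" ")
end

# Hypothetical helper: change only the encoding tag, on a copy.
def retag(string, encoding)
  string.dup.force_encoding(encoding)
end

utf8 = "\u65E5\u672C" # "日本", tagged UTF-8

retagged   = retag(utf8, "Shift_JIS")
transcoded = utf8.encode("Shift_JIS")

puts hex_bytes(utf8)       # E6 97 A5 E6 9C AC
puts hex_bytes(retagged)   # E6 97 A5 E6 9C AC (same bytes, different tag)
puts hex_bytes(transcoded) # 93 FA 96 7B (new bytes for the same characters)
```

Retagging keeps the bytes and changes their interpretation. Transcoding keeps the characters and changes the bytes.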

Strings Can Have the Wrong Encoding

Strings can be associated with the wrong encoding object, and that is the source of most if not all encoding related exceptions. Let’s look at an example:

>> x = "Hello \x93\xfa\x96\x7b"
>> x.encoding
=> #<Encoding:UTF-8>
>> x.valid_encoding?
=> false

In this case, Ruby associated the string "Hello \x93\xfa\x96\x7b" with the default encoding UTF-8. However, several of the bytes in the string do not form valid UTF-8 sequences. We can check whether a string’s bytes are valid for its encoding object by calling the valid_encoding? method. The valid_encoding? method will scan all bytes to see if they are valid for that particular encoding object.

So how do we fix this? The answer depends on the situation. We need to think about where the data came from and where the data is going. Let’s say we’ll display this string on a webpage, but we do not know the correct encoding for the string. In that case we probably want to make sure the string is valid UTF-8, but since we don’t know the correct encoding for the string, our only choice is to remove the bad bytes from the string.

We can remove the unknown bytes by using the scrub method:

>> x = "Hello \x93\xfa\x96\x7b"
>> x.valid_encoding?
=> false
>> y = x.scrub
>> y
=> "Hello ���{"
>> y.encoding
=> #<Encoding:UTF-8>
>> y.valid_encoding?
=> true

The scrub method returns a new string associated with the same encoding, but with all of the invalid bytes replaced by the replacement character (U+FFFD, the diamond question mark thing).
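String#scrub also accepts a replacement string of our choosing. Here is a minimal sketch; scrub_for_utf8 is a hypothetical helper, and its default of "?" is chosen purely for illustration.

```ruby
# Hypothetical helper: tag bytes as UTF-8 (on a copy) and replace anything
# that is not valid UTF-8 with the given replacement string.
def scrub_for_utf8(raw, replacement = "?")
  raw.dup.force_encoding(Encoding::UTF_8).scrub(replacement)
end

input = "Hello \x93\xfa\x96\x7b"

p scrub_for_utf8(input)                 # => "Hello ???{"
p scrub_for_utf8(input, "")             # => "Hello {"
p scrub_for_utf8(input).valid_encoding? # => true
```

Whatever the replacement is, the original bytes are gone from the result, so this only makes sense when the real encoding cannot be recovered.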

What if we do know the encoding of the source string? Actually, the example above uses a string that’s encoded using Shift JIS. Let’s say we know the encoding, and we want to display the string on a webpage. In that case we tag the string by using force_encoding, and transcode to UTF-8:

>> x = "Hello \x93\xfa\x96\x7b"
>> x.force_encoding "Shift_JIS"
=> "Hello \x{93FA}\x{967B}"
>> x.valid_encoding?
=> true
>> x.encode "UTF-8" # display as UTF-8
=> "Hello 日本"

The most important thing to think about when dealing with encoding issues is “where did this data come from?” and “what will we do with this data?” Answering those two questions will drive all decisions about which encoding to use with which string.
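Those two questions can be folded into one small function. This is only a sketch under stated assumptions: to_display_utf8 and its source_encoding keyword are hypothetical names, the destination is always a UTF-8 webpage, and an unknown source is handled by scrubbing.

```ruby
# Hypothetical helper: produce a valid UTF-8 string for display.
# - known source encoding: tag the bytes, then transcode (no data loss)
# - unknown source encoding: tag as UTF-8, then scrub (lossy)
def to_display_utf8(raw, source_encoding: nil)
  if source_encoding
    raw.dup.force_encoding(source_encoding).encode(Encoding::UTF_8)
  else
    raw.dup.force_encoding(Encoding::UTF_8).scrub
  end
end

raw = "Hello \x93\xfa\x96\x7b".b

p to_display_utf8(raw, source_encoding: "Shift_JIS") # => "Hello 日本"
p to_display_utf8(raw)                               # => "Hello ���{"
```

If the caller names an encoding that does not match the bytes, the encode call still raises, which is arguably what we want: a wrong answer to “where did this data come from?” should be loud.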

Encoding Depends on the Context

Before we look at some common errors and their remediation, let’s look at one more example of the encoding context dependency. In this example, we’ll use some user input as a cache key, but we’ll also display the user input on a webpage. We’re going to use our source data (the user input) in two places: as a cache key, and something to display on a web page.

Here’s the code:

require "digest/md5"
require "cgi"

# Make a checksum
def make_checksum string
  Digest::MD5.hexdigest string
end

# Not good HTML escaping (don't use this)
# Returns a string with UTF-8 compatible encoding for display on a webpage
def display_on_web string
  string.gsub(/>/, "&gt;")
end

# User input from an unknown source
x = "Hello \x93\xfa\x96\x7b"
p ENCODING: x.encoding
p VALID_ENCODING: x.valid_encoding?

p display_on_web x
p make_checksum x

If we run this code, we’ll get an exception:

$ ruby thing.rb
{:ENCODING=>#<Encoding:UTF-8>}
{:VALID_ENCODING=>false}
Traceback (most recent call last):
        2: from thing.rb:20:in `<main>'
        1: from thing.rb:12:in `display_on_web'
thing.rb:12:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)

The problem is that we have a string of unknown input with bytes that are not valid UTF-8 characters. We know we want to display this string on a UTF-8 encoded webpage, so let’s scrub the string:

require "digest/md5"
require "cgi"

# Make a checksum
def make_checksum string
  Digest::MD5.hexdigest string
end

# Not good HTML escaping (don't use this)
# Returns a string with UTF-8 compatible encoding for display on a webpage
def display_on_web string
  string.gsub(/>/, "&gt;")
end

# User input from an unknown source
x = "Hello \x93\xfa\x96\x7b".scrub
p ENCODING: x.encoding
p VALID_ENCODING: x.valid_encoding?

p display_on_web x
p make_checksum x

Now when we run the program, the output is like this:

$ ruby thing.rb
{:ENCODING=>#<Encoding:UTF-8>}
{:VALID_ENCODING=>true}
"Hello ���{"
"4dab6f63b4d3ae3279345c9df31091eb"

Great! We’ve built some HTML and generated a checksum. Unfortunately there is a bug in this code (of course the mere fact that we’ve written code means there’s a bug! lol). Let’s introduce a second user input string with slightly different bytes than the first input string:

require "digest/md5"
require "cgi"

# Make a checksum
def make_checksum string
  Digest::MD5.hexdigest string
end

# Not good HTML escaping (don't use this)
# Returns a string with UTF-8 compatible encoding for display on a webpage
def display_on_web string
  string.gsub(/>/, "&gt;")
end

# User input from an unknown source
x = "Hello \x93\xfa\x96\x7b".scrub
p ENCODING: x.encoding
p VALID_ENCODING: x.valid_encoding?

p display_on_web x
p make_checksum x

# Second user input from an unknown source with slightly different bytes
y = "Hello \x94\xfa\x97\x7b".scrub
p ENCODING: y.encoding
p VALID_ENCODING: y.valid_encoding?

p display_on_web y
p make_checksum y

Here is the output from the program:

$ ruby thing.rb
{:ENCODING=>#<Encoding:UTF-8>}
{:VALID_ENCODING=>true}
"Hello ���{"
"4dab6f63b4d3ae3279345c9df31091eb"
{:ENCODING=>#<Encoding:UTF-8>}
{:VALID_ENCODING=>true}
"Hello ���{"
"4dab6f63b4d3ae3279345c9df31091eb"

The program works in the sense that there is no exception. But both user input strings have the same checksum despite the fact that the original strings clearly have different bytes! So what is the correct fix for this program?

Again, we need to think about the source of the data (where did it come from), as well as what we will do with it (where it is going). In this case we have one source, a user, and the user provided us with no encoding information. In other words, the encoding of the source data is unknown, so we can only treat it as a sequence of bytes.

We have two outputs: one is a UTF-8 HTML page, and the other is the input to our checksum function. The HTML requires that our string be UTF-8, so making the string valid UTF-8 (in other words, “scrubbing” it) before displaying it makes sense. However, our checksum function needs to see the original bytes of the string. Since the checksum is only concerned with the bytes in the string, any encoding, including an invalid one, will work. It’s nice to make sure all our strings have valid encodings though, so we’ll fix this example such that everything has a valid encoding.

require "digest/md5"
require "cgi"

# Make a checksum
def make_checksum string
  Digest::MD5.hexdigest string
end

# Not good HTML escaping (don't use this)
# Returns a string with UTF-8 compatible encoding for display on a webpage
def display_on_web string
  string.gsub(/>/, "&gt;")
end

# User input from an unknown source
x = "Hello \x93\xfa\x96\x7b".b
p ENCODING: x.encoding
p VALID_ENCODING: x.valid_encoding?

p display_on_web x.encode("UTF-8", undef: :replace)
p make_checksum x

# Second user input from an unknown source with slightly different bytes
y = "Hello \x94\xfa\x97\x7b".b
p ENCODING: y.encoding
p VALID_ENCODING: y.valid_encoding?

p display_on_web y.encode("UTF-8", undef: :replace)
p make_checksum y

Here is the output of the program:

$ ruby thing.rb
{:ENCODING=>#<Encoding:ASCII-8BIT>}
{:VALID_ENCODING=>true}
"Hello ���{"
"96cf6db2750fd4d2488fac57d8e4d45a"
{:ENCODING=>#<Encoding:ASCII-8BIT>}
{:VALID_ENCODING=>true}
"Hello ���{"
"b92854c0db4f2c2c20eff349a9a8e3a0"

To fix our program, we’ve changed a couple of things. First we tagged the string of unknown encoding as “binary” by using the .b method. The .b method returns a new string that is associated with the ASCII-8BIT encoding. The name ASCII-8BIT is somewhat confusing because it has the word “ASCII” in it. It’s better to think of this encoding as either “unknown” or “binary data”. Unknown meaning we have some data that may have a valid encoding, but we don’t know what it is. Or binary data, as in the bytes read from a JPEG file or some such binary format. Anyway, we pass the binary string into the checksum function because the checksum only cares about the bytes in the string, not about the encoding.

The second change we made is to call encode with the encoding we want (UTF-8) along with undef: :replace meaning that any time Ruby encounters bytes it doesn’t know how to convert to the target encoding, it will replace them with the replacement character (the diamond question thing).

SIDE NOTE: This is probably not important, but it is fun! We can specify what Ruby uses for replacing unknown bytes. Here’s an example:

>> x = "Hello \x94\xfa\x97\x7b".b
>> x.encoding
=> #<Encoding:ASCII-8BIT>
>> x.encode("UTF-8", undef: :replace, replace: "Aaron")
=> "Hello AaronAaronAaron{"
>> x.encode("UTF-8", undef: :replace, replace: "🤣")
=> "Hello 🤣🤣🤣{"
>> [_.encoding, _.valid_encoding?]
=> [#<Encoding:UTF-8>, true]

Now let’s take a look at some common encoding errors in Ruby and what to do about them.

Encoding::InvalidByteSequenceError

This exception occurs when Ruby needs to examine the bytes in a string and the bytes do not match the encoding. Here is an example of this exception:

>> x = "Hello \x93\xfa\x96\x7b"
>> x.encode "UTF-16"
Traceback (most recent call last):
        5: from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:23:in `<main>'
        4: from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:23:in `load'
        3: from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/gems/2.7.0/gems/irb-1.2.0/exe/irb:11:in `<top (required)>'
        2: from (irb):4
        1: from (irb):4:in `encode'
Encoding::InvalidByteSequenceError ("\x93" on UTF-8)
>> x.encoding
=> #<Encoding:UTF-8>
>> x.valid_encoding?
=> false

The string x contains bytes that aren’t valid UTF-8, yet it is associated with the UTF-8 encoding object. When we try to convert x to UTF-16, an exception occurs.

How to fix Encoding::InvalidByteSequenceError

Like most encoding issues, our string x is tagged with the wrong encoding. The way to fix this issue is to tag the string with the correct encoding. But what is the correct encoding? To figure out the correct encoding, you need to know where the string came from. For example, if the string came from a MIME attachment, the MIME attachment should specify the encoding (or the RFC will tell you).
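When the source gives no hint at all, one thing we can do is rule candidates out. The following is a rough sketch, not a detection algorithm: plausible_encodings is a hypothetical helper, and the candidate list is an assumption we have to supply ourselves.

```ruby
# Hypothetical helper: return the candidate encodings for which the bytes
# are at least valid. Validity does NOT prove an encoding is the right one.
def plausible_encodings(raw, candidates)
  candidates.select do |name|
    raw.dup.force_encoding(name).valid_encoding?
  end
end

raw = "Hello \x93\xfa\x96\x7b".b

p plausible_encodings(raw, %w[UTF-8 Shift_JIS ISO-8859-1])
# => ["Shift_JIS", "ISO-8859-1"]
```

Note that ISO-8859-1 survives the check even though the data is really Shift JIS, because every byte is valid in a single-byte encoding. This is why knowing where the string came from beats guessing.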

In this case, the string is a valid Shift JIS string, but I only know that because I looked up the bytes and manually entered them. So we’ll tag this as Shift JIS, and the exception goes away:

>> x = "Hello \x93\xfa\x96\x7b"
>> x.force_encoding "Shift_JIS"
=> "Hello \x{93FA}\x{967B}"
>> x.encode "UTF-16"
=> "\uFEFFHello \u65E5\u672C"
>> x.encoding
=> #<Encoding:Shift_JIS>
>> x.valid_encoding?
=> true

If you don’t know the source of the string, an alternative solution is to tag as UTF-8 and then scrub the bytes:

>> x = "Hello \x93\xfa\x96\x7b"
>> x.force_encoding "UTF-8"
=> "Hello \x93\xFA\x96{"
>> x.scrub!
=> "Hello ���{"
>> x.encode "UTF-16"
=> "\uFEFFHello \uFFFD\uFFFD\uFFFD{"
>> x.encoding
=> #<Encoding:UTF-8>
>> x.valid_encoding?
=> true

Of course this works, but it means that you’ve lost data. The best solution is to figure out what the encoding of the string should be depending on its source and tag it with the correct encoding.
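If losing data is acceptable, String#encode can also do the replacement during the conversion itself by passing invalid: :replace. Here is a minimal sketch; lossy_transcode is a hypothetical helper name. It also shows that undef: :replace alone does not help in this situation, because the problem is malformed input bytes rather than characters with no mapping in the target encoding.

```ruby
# Hypothetical helper: transcode, replacing malformed input bytes and
# unmappable characters instead of raising. Lossy, just like scrub.
def lossy_transcode(string, target)
  string.encode(target, invalid: :replace, undef: :replace)
end

mislabeled = "Hello \x93\xfa\x96\x7b".dup.force_encoding(Encoding::UTF_8)

begin
  mislabeled.encode("UTF-16LE", undef: :replace)
rescue Encoding::InvalidByteSequenceError => e
  puts "undef: :replace alone still raises #{e.class}"
end

utf16 = lossy_transcode(mislabeled, "UTF-16LE")
p utf16.encoding
p utf16.valid_encoding?
```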

Encoding::UndefinedConversionError

This exception occurs when a string of one encoding can’t be converted to another encoding.

Here is an example:

>> x = "四\u2160"
>> x
=> "四Ⅰ"
>> x.encoding
=> #<Encoding:UTF-8>
>> x.valid_encoding?
=> true
>> x.encode "Shift_JIS"
Traceback (most recent call last):
        5: from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:23:in `<main>'
        4: from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:23:in `load'
        3: from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/gems/2.7.0/gems/irb-1.2.0/exe/irb:11:in `<top (required)>'
        2: from (irb):23
        1: from (irb):23:in `encode'
Encoding::UndefinedConversionError (U+2160 from UTF-8 to Shift_JIS)

In this example, we have two characters: “四”, and the Roman numeral 1 (“Ⅰ”). Unicode Roman numeral 1 cannot be converted to Shift JIS because there are two codepoints that represent that character in Shift JIS. This means the conversion is ambiguous, so Ruby will raise an exception.

How to fix Encoding::UndefinedConversionError

Our original string is correctly tagged as UTF-8, but we need to convert to Shift JIS. In this case we’ll use a replacement character when converting to Shift JIS:

>> x = "四\u2160"
>> y = x.encode("Shift_JIS", undef: :replace)
>> y
=> "\x{8E6C}?"
>> y.encoding
=> #<Encoding:Shift_JIS>
>> y.valid_encoding?
=> true
>> y.encode "UTF-8"
=> "四?"

We were able to convert to Shift JIS, but we did lose some data.
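A generic "?" is not the only option. String#encode also takes a fallback: option that lets us pick a stand-in per character. The sketch below assumes a plain ASCII "I" is an acceptable substitute for the Roman numeral; SJIS_FALLBACKS and to_shift_jis are hypothetical names.

```ruby
# Hypothetical substitution table: characters the Shift_JIS conversion
# rejects, mapped to a stand-in we are willing to accept.
SJIS_FALLBACKS = { "\u2160" => "I" }.freeze

# Hypothetical helper: convert to Shift_JIS using the table above.
def to_shift_jis(string)
  string.encode("Shift_JIS", fallback: SJIS_FALLBACKS)
end

converted = to_shift_jis("\u56DB\u2160") # "四Ⅰ"

p converted.encoding
p converted.encode("UTF-8") # => "四I"
```

This is still lossy, since converting back gives a Latin letter rather than the Roman numeral. A character that is missing from the table still raises Encoding::UndefinedConversionError.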

ArgumentError

When a string contains invalid bytes, sometimes Ruby will raise an ArgumentError exception:

>> x = "Hello \x93\xfa\x96\x7b"
>> x.downcase
Traceback (most recent call last):
        5: from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:23:in `<main>'
        4: from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:23:in `load'
        3: from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/gems/2.7.0/gems/irb-1.2.0/exe/irb:11:in `<top (required)>'
        2: from (irb):34
        1: from (irb):34:in `downcase'
ArgumentError (input string invalid)
>> x.gsub(/ello/, "i")
Traceback (most recent call last):
        6: from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:23:in `<main>'
        5: from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:23:in `load'
        4: from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/gems/2.7.0/gems/irb-1.2.0/exe/irb:11:in `<top (required)>'
        3: from (irb):34
        2: from (irb):35:in `rescue in irb_binding'
        1: from (irb):35:in `gsub'
ArgumentError (invalid byte sequence in UTF-8)

Again we use our incorrectly tagged Shift JIS string. Calling downcase or gsub both result in an ArgumentError. I personally think these exceptions are not great. We didn’t pass anything to downcase, so why is it an ArgumentError? There is nothing wrong with the arguments we passed to gsub, so why is it an ArgumentError? Why does one say “input string invalid” where the other gives us a slightly more helpful exception of “invalid byte sequence in UTF-8”? I think these should both result in Encoding::InvalidByteSequenceError exceptions, as it’s a problem with the encoding, not the arguments.

Regardless, these errors both stem from the fact that the Shift JIS string is incorrectly tagged as UTF-8.

Fixing ArgumentError

Fixing this issue is just like fixing Encoding::InvalidByteSequenceError. We need to figure out the correct encoding of the source string, then tag the source string with that encoding. If the encoding of the source string is truly unknown, scrub it.

>> x = "Hello \x93\xfa\x96\x7b"
>> x.force_encoding "Shift_JIS"
=> "Hello \x{93FA}\x{967B}"
>> x.downcase
=> "hello \x{93FA}\x{967B}"
>> x.gsub(/ello/, "i")
=> "Hi \x{93FA}\x{967B}"
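For the case where the encoding is truly unknown, here is a minimal sketch of the scrub route. The helper name text_or_scrubbed is made up for illustration.

```ruby
# Hypothetical helper: make a string safe for character-level operations
# when its real encoding is unknown. Lossy: invalid bytes become U+FFFD.
def text_or_scrubbed(string)
  string.valid_encoding? ? string : string.scrub
end

unknown = "Hello \x93\xfa\x96\x7b".dup.force_encoding(Encoding::UTF_8)
clean   = text_or_scrubbed(unknown)

p clean.downcase          # => "hello ���{"
p clean.gsub(/ello/, "i") # => "Hi ���{"
```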

Encoding::CompatibilityError

This exception occurs when we try to combine strings of two different encodings and those encodings are incompatible. For example:

>> x = "四\u2160"
>> y = "Hello \x93\xfa\x96\x7b".force_encoding("Shift_JIS")
>> [x.encoding, x.valid_encoding?]
=> [#<Encoding:UTF-8>, true]
>> [y.encoding, y.valid_encoding?]
=> [#<Encoding:Shift_JIS>, true]
>> x + y
Traceback (most recent call last):
        5: from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:23:in `<main>'
        4: from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:23:in `load'
        3: from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/gems/2.7.0/gems/irb-1.2.0/exe/irb:11:in `<top (required)>'
        2: from (irb):50
        1: from (irb):50:in `+'
Encoding::CompatibilityError (incompatible character encodings: UTF-8 and Shift_JIS)

In this example we have a valid UTF-8 string and a valid Shift JIS string. However, these two encodings are not compatible, so we get an exception when combining.

Fixing Encoding::CompatibilityError

To fix this exception, we need to manually convert one string to a new string that has a compatible encoding. In the case above, we can choose whether we want the output string to be UTF-8 or Shift JIS, and then call encode on the appropriate string.

In the case we want UTF-8 output, we can do this:

>> x = "四"
>> y = "Hello \x93\xfa\x96\x7b".force_encoding("Shift_JIS")
>> x + y.encode("UTF-8")
=> "四Hello 日本"
>> [_.encoding, _.valid_encoding?]
=> [#<Encoding:UTF-8>, true]

If we wanted Shift JIS, we could do this:

>> x = "四"
>> y = "Hello \x93\xfa\x96\x7b".force_encoding("Shift_JIS")
>> x.encode("Shift_JIS") + y
=> "\x{8E6C}Hello \x{93FA}\x{967B}"
>> [_.encoding, _.valid_encoding?]
=> [#<Encoding:Shift_JIS>, true]

Another possible solution is to scrub bytes and concatenate, but again that results in data loss.

What is a compatible encoding?

If there are incompatible encodings, there must be compatible encodings too (at least I would think that). Here is an example of compatible encodings:

>> x = "Hello World!".force_encoding "US-ASCII"
>> [x.encoding, x.valid_encoding?]
=> [#<Encoding:US-ASCII>, true]
>> y = "こんにちは"
>> [y.encoding, y.valid_encoding?]
=> [#<Encoding:UTF-8>, true]
>> y + x
=> "こんにちはHello World!"
>> [_.encoding, _.valid_encoding?]
=> [#<Encoding:UTF-8>, true]
>> x + y
=> "Hello World!こんにちは"
>> [_.encoding, _.valid_encoding?]
=> [#<Encoding:UTF-8>, true]

The x string is tagged with the “US ASCII” encoding and the y string with UTF-8. US ASCII is fully compatible with UTF-8, so even though these two strings have different encodings, concatenation works fine.

String literals may default to UTF-8, but some functions will return US ASCII encoded strings. For example:

>> require "digest/md5"
=> true
>> Digest::MD5.hexdigest("foo").encoding
=> #<Encoding:US-ASCII>

A hexdigest will only ever contain ASCII characters, so the implementation tags the returned string as US-ASCII.
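We don’t have to concatenate and wait for an exception to find out whether two strings are compatible. Encoding.compatible? answers the question directly. A small sketch follows; concat_encoding is a hypothetical wrapper added only to give the call a descriptive name.

```ruby
# Hypothetical wrapper: the encoding a concatenation would produce, or nil
# when Ruby would raise Encoding::CompatibilityError.
def concat_encoding(left, right)
  Encoding.compatible?(left, right)
end

ascii = "Hello World!".dup.force_encoding("US-ASCII")
utf8  = "\u3053\u3093\u306B\u3061\u306F"                   # "こんにちは"
sjis  = "\x93\xfa\x96\x7b".dup.force_encoding("Shift_JIS") # "日本" in Shift JIS

p concat_encoding(ascii, utf8) # => #<Encoding:UTF-8>
p concat_encoding(utf8, sjis)  # => nil
```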

Encoding Gotchas

Let’s look at a couple of encoding gotchas.

Infectious Invalid Encodings

When a string is incorrectly tagged, Ruby will typically only raise an exception when it needs to actually examine the bytes. Here is an example:

>> x = "Hello \x93\xfa\x96\x7b"
>> x.encoding
=> #<Encoding:UTF-8>
>> x.valid_encoding?
=> false
>> x + "ほげ"
=> "Hello \x93\xFA\x96{ほげ"
>> y = _
>> y
=> "Hello \x93\xFA\x96{ほげ"
>> [y.encoding, y.valid_encoding?]
=> [#<Encoding:UTF-8>, false]

Again we have the incorrectly tagged Shift JIS string. We’re able to append a correctly tagged UTF-8 string and no exception is raised. Why is that? Ruby assumes that if both strings have the same encoding, there is no reason to validate the bytes in either string so it will just append them. That means we can have an incorrectly tagged string “infect” what would otherwise be correctly tagged UTF-8 strings. Say we have some code like this:

def append string
  string + "ほげ"
end

p append("ほげ").valid_encoding? # => true
p append("Hello \x93\xfa\x96\x7b").valid_encoding? # => false

When debugging this code, we may be tempted to think the problem is in the append method. But actually the issue is with the caller. The caller is passing in incorrectly tagged strings, and unfortunately we might not get an exception until the return value of append is used somewhere far away.
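One way to surface the problem closer to the caller is to validate at the boundary. This is only a sketch: InvalidInputEncoding is a hypothetical error class, and whether the extra scan that valid_encoding? may perform is worth it depends on the application.

```ruby
# Hypothetical error class for input whose bytes do not match its encoding tag.
class InvalidInputEncoding < StandardError; end

# Same append as before, but it rejects incorrectly tagged input up front
# instead of silently returning an "infected" string.
def append(string)
  unless string.valid_encoding?
    raise InvalidInputEncoding, "#{string.encoding} string contains invalid bytes"
  end

  string + "\u307B\u3052" # "ほげ"
end

p append("\u307B\u3052").valid_encoding? # => true

begin
  append("Hello \x93\xfa\x96\x7b".dup.force_encoding(Encoding::UTF_8))
rescue InvalidInputEncoding => e
  puts "rejected at the boundary: #{e.message}"
end
```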

ASCII-8BIT is Special

Sometimes ASCII-8BIT is considered to be a “compatible” encoding and sometimes it isn’t. Here is an example:

>> x = "\x93\xfa\x96\x7b".b
>> x.encoding
=> #<Encoding:ASCII-8BIT>
>> y = "ほげ"
>> y + x
Traceback (most recent call last):
        5: from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:23:in `<main>'
        4: from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:23:in `load'
        3: from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/gems/2.7.0/gems/irb-1.2.0/exe/irb:11:in `<top (required)>'
        2: from (irb):89
        1: from (irb):89:in `+'
Encoding::CompatibilityError (incompatible character encodings: UTF-8 and ASCII-8BIT)

Here we have a binary string stored in x. Maybe it came from a JPEG file or something (it didn’t, I just typed it in!) When we try to concatenate the binary string with the UTF-8 string, we get an exception. But this may actually be an exception we want! It doesn’t make sense to be concatenating some JPEG data with an actual string we want to view, so it’s good we got an exception here.

Now here is the same code, but with the contents of x changed somewhat:

>> x = "Hello World".b
>> x.encoding
=> #<Encoding:ASCII-8BIT>
>> y = "ほげ"
>> y + x
=> "ほげHello World"

We have the same code with the same encodings at play. The only thing that changed is the actual contents of the x string.

When Ruby concatenates ASCII-8BIT strings, it will examine the contents of that string. If all bytes in the string are ASCII characters, it will treat it as a US-ASCII string and consider it to be “compatible”. If the string contains non-ASCII characters, it will consider it to be incompatible.

This means that if you had read some data from your JPEG, and that data happened to all be ASCII characters, you would not get an exception even though maybe you really wanted one.

In my personal opinion, concatenating an ASCII-8BIT string with anything besides another ASCII-8BIT string should be an exception.
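Ruby won’t enforce that rule for us, but we can sketch it ourselves. The helpers below are hypothetical (binary? and strict_concat are made-up names), and reusing Encoding::CompatibilityError for the stricter check is just one possible choice.

```ruby
# Hypothetical helper: is this string tagged as binary (ASCII-8BIT)?
def binary?(string)
  string.encoding == Encoding::BINARY
end

# Hypothetical helper: concatenate, but refuse to mix binary data with
# text even when the binary data happens to contain only ASCII bytes.
def strict_concat(left, right)
  if binary?(left) != binary?(right)
    raise Encoding::CompatibilityError,
          "refusing to mix #{left.encoding} with #{right.encoding}"
  end

  left + right
end

text  = "\u307B\u3052" # "ほげ"
bytes = "Hello World".b

p bytes.ascii_only? # => true, which is why a plain + succeeds

begin
  strict_concat(text, bytes)
rescue Encoding::CompatibilityError => e
  puts e.message
end
```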

Anyway, this is all I feel like writing today. I hope you have a good day, and remember to check your encodings!
