Tenderlove Making

Bitmap Matrix and Undirected Graphs in Ruby

I’ve been working my way through Engineering a Compiler. I really enjoy the book, but one part has you build an interference graph for doing register allocation via graph coloring. An interference graph is an undirected graph, and one way you can represent an undirected graph is with a bitmap matrix.

A bitmap matrix is just a matrix but the values in the matrix can only be 1 or 0. If every node in your graph maps to an index, you can use the bitmap matrix to represent edges in the graph.

I made a bitmap matrix implementation that I like, but I think the code is too trivial to put in a Gem. Here is the code I used:

class BitMatrix
  def initialize size
    @size = size
    size = (size + 7) & -8 # round up to the nearest multiple of 8
    @row_bytes = size / 8
    @buffer = "\0".b * (@row_bytes * size)
  end

  def initialize_copy other
    @buffer = @buffer.dup
  end

  def set x, y
    raise IndexError if y >= @size || x >= @size

    x, y = [y, x].sort

    row = x * @row_bytes
    column_byte = y / 8
    column_bit = 1 << (y % 8)

    @buffer.setbyte(row + column_byte, @buffer.getbyte(row + column_byte) | column_bit)
  end

  def set? x, y
    raise IndexError if y >= @size || x >= @size

    x, y = [y, x].sort

    row = x * @row_bytes
    column_byte = y / 8
    column_bit = 1 << (y % 8)

    (@buffer.getbyte(row + column_byte) & column_bit) != 0
  end

  def each_pair
    return enum_for(:each_pair) unless block_given?

    @buffer.bytes.each_with_index do |byte, i|
      row = i / @row_bytes
      column = i % @row_bytes
      8.times do |j|
        if (1 << j) & byte != 0
          yield [row, (column * 8) + j]
        end
      end
    end
  end

  def to_dot
    "graph g {\n" + each_pair.map { |x, y| "#{x} -- #{y};" }.join("\n") + "\n}"
  end
end

I like this implementation because all bits are packed into a binary string. Copying the matrix is trivial because we just have to dup the string. Memory usage is much smaller than if every node in the graph were to store an actual reference to other nodes.
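To make the packing concrete, here is a minimal sketch of the index arithmetic that set and set? share. The helper name bit_location is hypothetical (it is not part of BitMatrix); it only reports which byte and which bit an edge would land on.

```ruby
# Hypothetical helper (not part of BitMatrix): returns the byte offset and
# bit mask that BitMatrix#set would touch for the edge between x and y.
def bit_location(x, y, size)
  row_bytes = ((size + 7) & -8) / 8 # bytes per row after rounding up to 8 bits
  x, y = [x, y].sort                # undirected: (x, y) and (y, x) share a bit
  [x * row_bytes + y / 8, 1 << (y % 8)]
end

p bit_location(9, 3, 10) # => [7, 2]
p bit_location(3, 9, 10) # => [7, 2]
```

Both calls return the same location, which is why the matrix only needs to store each edge once.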

Anyway, this was fun to write and I hope someone finds it useful!

Vim, tmux, and Fish

I do most of my text editing with MacVim, but when I pair with people I like to use tmate. tmate is just an easy way to connect tmux sessions with a remote person. But this means that I go from coding in a GUI to coding in a terminal. Normally this wouldn’t be a problem, but I had made a Fish alias that would open the MacVim GUI every time I typed vim in the terminal. Of course when I’m pairing via tmate, the other people cannot see the GUI, so I would have to remember a different command to open Vim.

Today I did about 10 minutes of research to fix this problem and came up with the following Fish function:

$ cat .config/fish/functions/vim.fish 
function vim --wraps='vim' --description 'open Vim'
  if set -q TMUX # if we're in a TMUX session, open in terminal
    command vim $argv
  else
    # Otherwise open macvim
    open -a MacVim $argv
  end
end

All it does is open terminal Vim if I’m in a tmux session, otherwise it opens the MacVim GUI.

Instead of putting up with this frustration for such a long time, I should have taken the 10 minutes required to fix the situation. This was a good reminder for me, and hopefully I’ll be better about it in the future!

In Memory of a Giant

The Ruby community has lost a giant. As a programmer, I always feel as if I’m standing on the shoulders of giants. Chris Seaton was one of those giants.

I’ve been working at the same company as Chris for the past 2 years. However, I first met him through the open source world many years ago. He was working on a Ruby implementation called TruffleRuby, and got his PhD in Ruby. Can you believe that? A PhD in Ruby? I’d never heard of such a thing. My impression was that nobody in academia cared about Ruby, but here was Chris, the Ruby Doctor. I was impressed.

Patience

As a college dropout, I’ve always felt underqualified. Embarrassment about my lack of knowledge and credentials has driven me to study hard on my own time. But Chris never once made me feel out of place. Any time I had questions, without judgement, he would take the time to explain things to me.

I’ve always looked up to Chris. I was at a bar in London with a few coworkers. We started talking about age, and I found out that Chris was much younger than me. I said “You’re so smart and accomplished! How can I possibly catch up to you?” Chris said “Don’t worry, I’ll just tell you everything I know!”

Meeting Chris in London

Puns

My team is fully remote, so every Friday we have a team meeting over video to just hang out and talk about stuff. Eventually I’ll make a really great pun, most people will sigh, Kevin Menard will get angry, and Chris would just be straight faced. No reaction from Chris. Every. Single. Time.

One time, someone asked Chris “do you know that he’s making a joke? Or do you just not think it’s funny?” Chris responded “I know he’s making a pun, I just don’t react because I don’t want to encourage him.” I said “This just encourages me more because now I feel challenged!”

I wish I had tried harder because now I’ll never get that reaction.

Kindness

My last conversation with Chris was Thursday December 1st at RubyConf in Houston. We all went to dinner at a Ramen shop. I find British English to be extremely adorable, so any time I hear fun British phrases in the news I always ask my British coworkers about it. The latest one was “Wonky Veg” so I asked Chris if he’d been buying any at the store. He said no, but that one of his favorite things to do was find weird things at the local supermarket, take photos of it, then share with his coworkers. He flipped through photos on his phone, showing me pics of him shopping with his daughter. Some of the products he showed me were quite funny and we both had a good laugh.

Dinner with Chris

Memory

I feel honored to have had the opportunity to work with Chris.

I feel grateful for the time that we had together.

I feel angry that I can’t learn more from him.

I feel sad that he is gone from my life.

Chris was an important part of the community, his family, and his country. I will never forget the time I spent with Chris, a Giant.

Cross Platform Machine Code

I hate writing if statements.

I’ve been working on a couple different assemblers for Ruby. Fisk is a pure Ruby x86 assembler. You can use it to generate bytes that can be executed on x86 machines. AArch64 is a pure Ruby ARM64 assembler. You can use it to generate bytes that can be executed on ARM64 machines.

Both of these libraries just generate bytes that can be interpreted by their respective processors. Unfortunately you can’t just generate bytes and expect the CPU to execute them. You first need to put the bytes in executable memory before you can hand them off to the CPU for execution. Executable memory is basically the same thing regardless of CPU architecture, so I decided to make a library called JITBuffer that encapsulates executable memory manipulation.

To use the JITBuffer, you write platform specific bytes to the buffer, then give the buffer to the CPU for execution. Here is an example on the ARM64 platform:

require "aarch64"
require "jit_buffer"
require "fiddle"

asm = AArch64::Assembler.new

# Make some instructions.  These instructions simply
# return the value 0xF00DCAFE
asm.pretty do
  asm.movz x0, 0xCAFE
  asm.movk x0, 0xF00D, lsl(16)
  asm.ret
end

# Write the bytes to executable memory:
buf = JITBuffer.new(4096)
buf.writeable!
asm.write_to buf
buf.executable!

# Point the CPU at the executable memory
func = Fiddle::Function.new(buf.to_i, [], -Fiddle::TYPE_INT)
p func.call.to_s(16) # => "f00dcafe"

The example uses the AArch64 gem to assemble ARM64 specific bytes, the JITBuffer gem to allocate executable memory, and the Fiddle gem to point the CPU at the executable memory and run it.

Tests are important I guess, so I thought it would be a good idea to write tests for the JITBuffer gem. My goal for the test is to ensure that it’s actually possible to execute the bytes in the buffer itself. I’m not a huge fan of stubs or mocks and I try to avoid them if possible, so I wanted to write a test that would actually execute the bytes in the buffer. I also want the test to be “cross platform” (where “cross platform” means “works on x86_64 and ARM64”).

Writing a test like this would mean writing something like the following:

def test_can_execute
  buf = JITBuffer.new(4096)

  platform = figure_out_what_platform_we_are_on()
  if platform == "arm64"
    # write arm64 specific bytes
    buf.write(...)
  else
    # write x86_64 specific bytes
    buf.write(...)
  end

  # Use fiddle to execute
end

As I said at the start though, I hate writing if statements, and I’d rather avoid it if possible. In addition, how do you reliably figure out what platform you’re executing on? I really don’t want to figure that out. Not to mention, I just don’t think this code is cool.

My test requirements:

  • No if statements
  • Self contained (I don’t want to shell out or use other libraries)
  • Must have pizzazz

Since machine code is just bytes that the CPU interprets, it made me wonder “is there a set of bytes that execute both on an x86_64 CPU and an ARM64 CPU?” It turns out there are, and I want to walk through them here.

x86_64 Instructions

First let’s look at the x86_64 instructions we’ll execute. Below is the assembly code (in Intel syntax):

start:
  mov rax, 0x2b ; put 0x2b in the rax register
  ret           ; return from the function
  jmp start     ; jump back to start

This assembly code puts the value 0x2b in the rax register and returns from the current “C” function. I put “C” in quotes because we’re writing assembly code, but the assembly code is conforming to the C calling convention and we’ll treat it as if it’s a C function when we call it. The x86 C calling convention states that the value in the rax register is the “return value” of the C function. So we’ve created a function that returns 0x2b. At the end of the code there is a jmp instruction that jumps to the start of this sequence. However, since we return from the function before getting to the jump, the jump is never used (or is it?????)

Machine code is just bytes, and here are the bytes for the above x86 machine code:

0x48 0xC7 0xC0 0x2b 0x00 0x00 0x00  ; mov rax, 0x2b
0xC3                                ; ret
0xEB 0xF6                           ; jmp start

x86 uses a “variable width” encoding, meaning that the number of bytes each instruction uses can vary. In this example, the mov instruction used 7 bytes, and the ret instruction used 1 byte. This means that the jmp instruction starts at the 9th byte, or offset 8.
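The second byte of the jmp, 0xF6, is a signed 8-bit displacement measured from the end of the jmp instruction. Here is a quick sketch of that arithmetic in Ruby; the variable names are only for illustration.

```ruby
# 0xEB is a short jmp; the byte after it is a signed 8-bit displacement
# relative to the address of the instruction that follows the jmp.
jmp_offset   = 8                             # where the jmp starts
jmp_length   = 2                             # 0xEB plus one displacement byte
displacement = [0xF6].pack("C").unpack1("c") # reinterpret the byte as signed
target       = jmp_offset + jmp_length + displacement

p displacement # => -10
p target       # => 0, the mov at the start of the buffer
```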

ARM64 Instructions

Below are some ARM64 instructions we can execute:

movz X11, 0x7b7 ; put 0x7b7 in the X11 register
movz X0, 0x2b   ; put 0x2b in the X0 register
ret             ; return from the function

This machine code puts the value 0x7b7 into the X11 register. Then it puts the value 0x2b in the X0 register. The third instruction returns from the function. Again we are abiding by the C calling convention, but this time on the ARM64 platform. On the ARM64 platform, the value stored in X0 is the return value. So the above machine code will return the value 0x2b to the caller just like the x86_64 machine code did.

Here are the bytes that represent the above ARM64 machine code:

0xEB 0xF6 0x80 0xD2  ; movz X11, 0x7b7
0x60 0x05 0x80 0xD2  ; movz X0, 0x2b
0xC0 0x03 0x5F 0xD6  ; ret

ARM64 uses fixed width instructions. All instructions on ARM64 are 32 bits wide.
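Because every instruction is one 32-bit word, the bytes above can be reproduced with a little bit twiddling. This is only a sketch of the 64-bit MOVZ encoding with no shift, assuming the usual field layout (16-bit immediate in bits 5 through 20, destination register in bits 0 through 4). The movz_bytes helper is hypothetical and is not part of the AArch64 gem.

```ruby
# Hypothetical helper: encode `movz Xd, #imm16` (64-bit, no shift) and
# return its bytes in little-endian order, as they sit in memory.
def movz_bytes(rd, imm16)
  insn = 0xD2800000 | ((imm16 & 0xFFFF) << 5) | (rd & 0x1F)
  [insn].pack("L<").bytes
end

p movz_bytes(11, 0x7b7).map { |b| "0x%02X" % b } # => ["0xEB", "0xF6", "0x80", "0xD2"]
p movz_bytes(0, 0x2b).map { |b| "0x%02X" % b }   # => ["0x60", "0x05", "0x80", "0xD2"]
```

The immediate 0x7b7 and the register X11 are what make the low two bytes come out as 0xEB 0xF6.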

Cross Platform Machine Code

Let’s look at the byte blocks next to each other:

; x86_64 bytes
0x48 0xC7 0xC0 0x2b 0x00 0x00 0x00  ; mov rax, 0x2b
0xC3                                ; ret
0xEB 0xF6                           ; jmp start
; ARM64 bytes
0xEB 0xF6 0x80 0xD2  ; movz X11, 0x7b7
0x60 0x05 0x80 0xD2  ; movz X0, 0x2b
0xC0 0x03 0x5F 0xD6  ; ret

Looking at the bytes, you’ll notice that the first two bytes of the ARM64 code (0xEB 0xF6) are exactly the same as the last two bytes of the x86_64 code. The first movz instruction in the ARM64 code was specially crafted to have the same leading bytes as the final jmp instruction in the x86 code.

If we combine these bytes, then tell the CPU to execute starting at a particular offset, then the interpretation of the bytes will change depending on the CPU, but the result is the same.

Here are the bytes combined:

          0x48 0xC7 0xC0 0x2b 0x00 0x00 0x00  ; mov rax, 0x2b
          0xC3                                ; ret
start ->  0xEB 0xF6 0x80 0xD2                 ; (jmp start, or movz X11, 0x7b7)
          0x60 0x05 0x80 0xD2                 ; movz X0, 0x2b
          0xC0 0x03 0x5F 0xD6                 ; ret

Regardless of platform, we’ll tell the CPU to start executing from offset 8 in the byte buffer. If it’s an x86 CPU, it will interpret the bytes as a jump, execute the top bytes, return at the ret, and ignore the rest of the bytes in the buffer (as they are never reached). If it’s an ARM64 machine, then it will interpret the bytes as “put 0x7b7 in the X11 register” and continue, never seeing the x86 specific bytes at the start of the buffer.

Both x86_64 and ARM64 platforms will return the same value 0x2b.

Now we can write a test without if statements like this:

def test_execute
  # Cross platform bytes
  bytes = [0x48, 0xc7, 0xc0, 0x2b, 0x00, 0x00, 0x00, # x86_64 mov rax, 0x2b
           0xc3,                                     # x86_64 ret
           0xeb, 0xf6,                               # x86 jmp
           0x80, 0xd2,                               # ARM movz X11, 0x7b7
           0x60, 0x05, 0x80, 0xd2,                   # ARM movz X0, #0x2b
           0xc0, 0x03, 0x5f, 0xd6]                   # ARM ret

  # Write them to the buffer
  jit = JITBuffer.new(4096)
  jit.writeable!
  jit.write bytes.pack("C*")
  jit.executable!

  # start at offset 8
  offset = 8
  func = Fiddle::Function.new(jit.to_i + offset, [], Fiddle::TYPE_INT)

  # Check the return value
  assert_equal 0x2b, func.call
end

So simple!

So cool!

Tons of pizzazz!

This test will execute machine code on both x86_64 and ARM64, and the machine code will return the same value. Not to mention, there’s no way RuboCop or Flay could possibly complain about this code. 🤣

I hope this inspires you to try writing cross platform machine code. This code only supports 2 platforms, but it does make me wonder how far we could stretch this and how many platforms we could support.

Anyway, hope you have a good day!

Homebrew, Rosetta, and Ruby

Hi everyone! I finally upgraded to an M1. It’s really really great, but the main problem is that some projects I work on like TenderJIT and YJIT only really work on x86_64 and these new M1 machines use ARM chips. Fortunately we can run x86_64 software via Rosetta, so we can still do development work on x86 specific software.

I’ve seen some solutions for setting up a dev environment that uses Rosetta, but I’d like to share what I did.

Installing Homebrew

I think most people recommend that you install two different versions of Homebrew, one that targets ARM, and the other that targets x86.

So far, I’ve found this to be the best solution, so I went with it. Just do the normal Homebrew installation for ARM like this:

$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Then run the installer again under Rosetta like this:

$ arch -x86_64 /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

After doing this, I ended up with a Homebrew installation in /opt/homebrew (the ARM version), and another installation in /usr/local (the x86 version).

Configuring your Terminal

I read many places on the web that recommend you duplicate Terminal.app, rename the copy, and modify the renamed version to run under Rosetta.

I really didn’t like this solution. The problem for me is that I’d end up with two different terminal icons when doing cmd-tab, and I really can’t be bothered to read whether the terminal is the Rosetta one or not. It makes switching to the right terminal take way too long.

Instead I decided to make my shell figure out what architecture I’m using, then update $PATH depending on whether I’m using x86 or ARM. To accomplish this, I installed Fish (I use Fish shell) in both the x86 and ARM installations of Homebrew:

$ /opt/homebrew/bin/brew install fish
$ arch -x86_64 /usr/local/bin/brew install fish

If you’re not using Fish you don’t need to do this step. 😆

Next is the “interesting” part. In my config.fish, I added this:

if test (arch) = "i386"
  set HOMEBREW_PREFIX /usr/local
else
  set HOMEBREW_PREFIX /opt/homebrew
end

# Add the Homebrew prefix to $PATH. -m flag ensures it's at the beginning
# of the path since the path might already be in $PATH (just not at the start)
fish_add_path -m --path $HOMEBREW_PREFIX/bin

alias intel 'arch -x86_64 /usr/local/bin/fish'

The arch command will tell you which architecture you’re on. If I’m on i386, set the Homebrew prefix to /usr/local, otherwise set it to /opt/homebrew. Then use fish_add_path to prepend the bin directory under the Homebrew prefix to my $PATH environment variable. The -m switch moves the path to the front if $PATH already contained the path I’m trying to add.

Finally I added an alias, intel, that just starts a new shell under Rosetta. So my default workflow is to open a terminal under ARM, and if I need to work on an Intel project, just run intel.

How do I know my current architecture?

The arch command will tell you the current architecture, but I don’t want to run that every time I want to verify my current architecture. My solution was to add an emoji to my prompt. I don’t like adding more text to my prompt, but this seems important enough to warrant the addition.

My fish_prompt function looks like this:

function fish_prompt --description 'Write out the prompt'
    if not set -q __fish_prompt_normal
        set -g __fish_prompt_normal (set_color normal)
    end

    if not set -q __fish_prompt_cwd
        set -g __fish_prompt_cwd (set_color $fish_color_cwd)
    end

    if test (arch) = "i386"
      set emote 🧐
    else
      set emote 💪
    end

    echo -n -s "[$USER" @ (prompt_hostname) $emote ' ' "$__fish_prompt_cwd" (prompt_pwd) (__fish_vcs_prompt) "$__fish_prompt_normal" ']$ '
end

If I’m on ARM, the prompt will have a 💪 emoji, and if I’m on x86, the prompt will have a 🧐 emoji.

Just to give an example, here is a sample session in my terminal:

Last login: Fri Jan  7 12:37:59 on ttys001
Welcome to fish, the friendly interactive shell
[aaron@tc-lan-adapter💪 ~]$ which brew
/opt/homebrew/bin/brew
[aaron@tc-lan-adapter💪 ~]$ arch
arm64
[aaron@tc-lan-adapter💪 ~]$ intel
Welcome to fish, the friendly interactive shell
[aaron@tc-lan-adapter🧐 ~]$ which brew
/usr/local/bin/brew
[aaron@tc-lan-adapter🧐 ~]$ arch
i386
[aaron@tc-lan-adapter🧐 ~]$ exit
[aaron@tc-lan-adapter💪 ~]$ arch
arm64
[aaron@tc-lan-adapter💪 ~]$ which brew
/opt/homebrew/bin/brew
[aaron@tc-lan-adapter💪 ~]$

Now I can easily switch back and forth between x86 and ARM and my prompt tells me which I’m using.

Ruby with chruby

My Ruby dev environment is still a work in progress. I use chruby for changing Ruby versions. The problem is that all Ruby versions live in the same directory. chruby doesn’t know the difference between ARM versions and x86 versions. So for now I’m adding the architecture name to the directory:

[aaron@tc-lan-adapter💪 ~]$ chruby
   ruby-3.0.2
   ruby-arm64
   ruby-i386
   ruby-trunk
[aaron@tc-lan-adapter💪 ~]$

So I have to be careful which Ruby I switch to. I’ve filed a ticket on ruby-install, and I think we can make this nicer.

Specifically I’d like to add a subfolder in ~/.rubies for each architecture, then point chruby at the right subfolder depending on my current architecture. Essentially the same trick I used for $PATH and Homebrew, but for pointing chruby at the right place given my current architecture.

For now I just have to be careful though!

One huge caveat for Fish users is that the current version of chruby-fish is broken such that changes to $PATH end up getting lost (see this issue).

To work around that issue, I’m using @ioquatix’s fork of chruby-fish which can be found here. I just checked out that version of chruby-fish in my git project folder and added this to my config.fish:

# Point Fish at our local checkout of chruby-fish
set fish_function_path $fish_function_path $HOME/git/chruby-fish/share/fish/vendor_functions.d

Conclusion

Getting a dev environment up and running with Rosetta wasn’t too bad, but I think having the shell fix up $PATH is a better solution than having two copies of Terminal.app.

The scripts I presented here were all Fish specific, but I don’t think it should be too hard to translate them to whatever shell you use.

Anyway, I hope you have a good weekend!

Publishing Gems With Your YubiKey

The recent compromise of ua-parser-js has put the security and trust of published packages at the top of my mind lately. To reduce the risk of any Ruby Gems I manage being hijacked, I enabled 2FA on my RubyGems.org account. This means that whenever I publish a Ruby Gem, I have to enter a one time passcode.

I have to admit, I find this to be a pain. Whenever I do a release of Rails, I have to enter a passcode over and over again because you can only push one Gem at a time.

Finally I’ve found a way to deal with this. I can maintain account security and also not be hassled with OTP codes again, thanks to my YubiKey.

This is just a short post about how to set up your YubiKey as an authenticator for RubyGems.org, and how to publish Gems without getting an OTP prompt.

Install ykman

ykman is a command line utility for interacting with your YubiKey. I installed it on my Mac with Homebrew:

$ brew install ykman

Set up 2FA as usual

If you already have 2FA enabled, you’ll have to temporarily disable it.

Just go through the normal 2FA setup process and when you’re presented with a QR code, you’ll use the text key to configure your YubiKey.

Just do:

$ ykman oath accounts add -t -o TOTP rubygems.org:youremail@example.org 123456

Use your own email address, and replace 123456 with the text key you got from RubyGems.org. The -t flag will require you to touch the YubiKey when you want to generate an OTP.

Generate an OTP

You can now generate an OTP like this:

$ ykman oath accounts code -s rubygems.org

Publishing a Gem without OTP Prompts

You can supply an OTP code to the gem interface via an environment variable or a command line argument.

The environment variable version is like this:

$ GEM_HOST_OTP_CODE=$(ykman oath accounts code -s rubygems.org) gem push cool-gem-0.0.0.gem

The command line argument is like this:

$ gem push cool-gem-0.0.0.gem --otp $(ykman oath accounts code -s rubygems.org)

I’ve used the environment variable version, but I haven’t tried the command line argument.
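Since the pain point is pushing several Gems in a row, the environment variable version can be wrapped in a loop. Below is a rough sketch that only builds and prints the command lines (a dry run, nothing is executed). The push_commands helper and the second gem file name are hypothetical, and the sketch assumes each push should ask the YubiKey for its own freshly generated code.

```ruby
require "shellwords"

# Hypothetical helper: build one `gem push` command line per gem file,
# each asking the YubiKey for its own OTP code. Nothing is executed here.
def push_commands(gem_files, account: "rubygems.org")
  gem_files.map do |file|
    "GEM_HOST_OTP_CODE=$(ykman oath accounts code -s #{account.shellescape}) " \
      "gem push #{file.shellescape}"
  end
end

puts push_commands(["cool-gem-0.0.0.gem", "other-gem-0.0.0.gem"])
```

Because the account was added with the -t flag, running the printed commands would still require one touch of the YubiKey per Gem.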

Final Thoughts

I also did this for NPM, but I haven’t tried pushing a package yet so I’ll see how that goes. I don’t really have any other thoughts except that everyone should enable 2FA so that we can prevent situations like ua-parser-js. I’m not particularly interested in installing someone’s Bitcoin miner on my machine, and I’m also not interested in being hassled because my package was hijacked.

Everyone, please stay safe and enable 2FA!

–Aaron

<3 <3

Debugging an Assertion Error in Ruby

I hope nobody runs into a problem where they need the information in this post, but in case you do, I hope this post is helpful. (I’m talking to you, future Aaron! lol)

I committed a patch to Ruby that caused the tests to start failing. This was the patch:

commit 1be84e53d76cff30ae371f0b397336dee934499d
Author: Aaron Patterson <tenderlove@ruby-lang.org>
Date:   Mon Feb 1 10:42:13 2021 -0800

    Don't pin `val` passed in to `rb_define_const`.
    
    The caller should be responsible for holding a pinned reference (if they
    need that)

diff --git a/variable.c b/variable.c
index 92d7d11eab..ff4f7964a7 100644
--- a/variable.c
+++ b/variable.c
@@ -3154,7 +3154,6 @@ rb_define_const(VALUE klass, const char *name, VALUE val)
     if (!rb_is_const_id(id)) {
        rb_warn("rb_define_const: invalid name `%s' for constant", name);
     }
-    rb_gc_register_mark_object(val);
     rb_const_set(klass, id, val);
 }

This patch is supposed to allow objects passed in to rb_define_const to move. As the commit message says, the caller should be responsible for keeping the value pinned. At the time I committed the patch, I thought that most callers of the function were marking the value passed in (as val), so we were pinning objects that something else would already pin. In other words, this code was being wasteful by chewing up GC time by pinning objects that were already pinned.

Unfortunately the CI started to error shortly after I committed this patch. Clearly the patch was related, but how?

In this post I am going to walk through the debugging tricks I used to find the error.

Reproduction

I was able to reproduce the error on my Linux machine by running the same command CI ran. Unfortunately since this bug is related to GC, the error was intermittent. To reproduce it, I just ran the tests in a loop until the process crashed like this:

$ while test $status -eq 0
    env RUBY_TESTOPTS='-q --tty=no' make -j16 -s check
  end

Before running this loop though, I made sure to do ulimit -c unlimited so that I would get a core file when the process crashed.

The Error

After the process crashed, the top of the error looked like this:

<OBJ_INFO:rb_ractor_confirm_belonging@./ractor_core.h:327> 0x000055be8657f180 [0     ] T_NONE 
/home/aaron/git/ruby/lib/bundler/environment_preserver.rb:47: [BUG] id == 0 but not shareable
ruby 3.1.0dev (2021-02-03T17:35:37Z master 6b4814083b) [x86_64-linux]

The Ractor verification routines crashed the process because a T_NONE object is “not shareable”. In other words, you can’t share an object of type T_NONE between Ractors. This makes sense because T_NONE objects are actually empty slots in the GC. If a Ractor, or any other Ruby code, sees a T_NONE object, then it’s clearly an error. Only the GC internals should ever be dealing with this type.

The top of the C backtrace looked like this:

-- C level backtrace information -------------------------------------------
/home/aaron/git/ruby/ruby(rb_print_backtrace+0x14) [0x55be856e9816] vm_dump.c:758
/home/aaron/git/ruby/ruby(rb_vm_bugreport) vm_dump.c:1020
/home/aaron/git/ruby/ruby(bug_report_end+0x0) [0x55be854e2a69] error.c:778
/home/aaron/git/ruby/ruby(rb_bug_without_die) error.c:778
/home/aaron/git/ruby/ruby(rb_bug+0x7d) [0x55be854e2bb0] error.c:786
/home/aaron/git/ruby/ruby(rb_ractor_confirm_belonging+0x102) [0x55be856cf6e2] ./ractor_core.h:328
/home/aaron/git/ruby/ruby(vm_exec_core+0x4ff3) [0x55be856b0003] vm.inc:2224
/home/aaron/git/ruby/tool/lib/test/unit/parallel.rb(rb_vm_exec+0x886) [0x55be856c9946]
/home/aaron/git/ruby/ruby(load_iseq_eval+0xbb) [0x55be8554f66b] load.c:594
/home/aaron/git/ruby/ruby(require_internal+0x394) [0x55be8554e3e4] load.c:1065
/home/aaron/git/ruby/ruby(rb_require_string+0x973c4) [0x55be8554d8a4] load.c:1142
/home/aaron/git/ruby/ruby(rb_f_require) load.c:838
/home/aaron/git/ruby/ruby(vm_call_cfunc_with_frame+0x11a) [0x55be856dd6fa] ./vm_insnhelper.c:2897
/home/aaron/git/ruby/ruby(vm_call_method_each_type+0xaa) [0x55be856d4d3a] ./vm_insnhelper.c:3387
/home/aaron/git/ruby/ruby(vm_call_alias+0x87) [0x55be856d68e7] ./vm_insnhelper.c:3037
/home/aaron/git/ruby/ruby(vm_sendish+0x200) [0x55be856d08e0] ./vm_insnhelper.c:4498

The backtrace shows that rb_ractor_confirm_belonging was the function that called rb_bug and brought the process down.

Debugging the Core File with LLDB

I usually use clang / lldb when debugging. I’ve added scripts to Ruby’s lldb tools that let me track down problems more easily, so I prefer it over gcc / gdb.

First I inspected the backtrace in the corefile:

(lldb) target create "./ruby" --core "core.456156"
Core file '/home/aaron/git/ruby/core.456156' (x86_64) was loaded.
(lldb) bt
* thread #1, name = 'ruby', stop reason = signal SIGABRT
  * frame #0: 0x00007fdc5fc8918b libc.so.6`raise + 203
    frame #1: 0x00007fdc5fc68859 libc.so.6`abort + 299
    frame #2: 0x000056362ac38bc6 ruby`die at error.c:765:5
    frame #3: 0x000056362ac38bb5 ruby`rb_bug(fmt=<unavailable>) at error.c:788:5
    frame #4: 0x000056362ae256e2 ruby`rb_ractor_confirm_belonging(obj=<unavailable>) at ractor_core.h:328:13
    frame #5: 0x000056362ae06003 ruby`vm_exec_core(ec=<unavailable>, initial=<unavailable>) at vm.inc:2224:5
    frame #6: 0x000056362ae1f946 ruby`rb_vm_exec(ec=<unavailable>, mjit_enable_p=<unavailable>) at vm.c:0
    frame #7: 0x000056362aca566b ruby`load_iseq_eval(ec=0x000056362b176710, fname=0x000056362ce96660) at load.c:594:5
    frame #8: 0x000056362aca43e4 ruby`require_internal(ec=<unavailable>, fname=<unavailable>, exception=1) at load.c:1065:21
    frame #9: 0x000056362aca38a4 ruby`rb_f_require [inlined] rb_require_string(fname=0x00007fdc38033178) at load.c:1142:18
    frame #10: 0x000056362aca3880 ruby`rb_f_require(obj=<unavailable>, fname=0x00007fdc38033178) at load.c:838
    frame #11: 0x000056362ae336fa ruby`vm_call_cfunc_with_frame(ec=0x000056362b176710, reg_cfp=0x00007fdc5f958de0, calling=<unavailable>) at vm_insnhelper.c:2897:11
    frame #12: 0x000056362ae2ad3a ruby`vm_call_method_each_type(ec=0x000056362b176710, cfp=0x00007fdc5f958de0, calling=0x00007ffe3b552128) at vm_insnhelper.c:3387:16
    frame #13: 0x000056362ae2c8e7 ruby`vm_call_alias(ec=0x000056362b176710, cfp=0x00007fdc5f958de0, calling=0x00007ffe3b552128) at vm_insnhelper.c:3037:12

It’s very similar to the backtrace in the crash report. The first thing that was interesting to me was frame 5 in vm_exec_core. vm_exec_core is the main loop for the YARV VM. This program was crashing when executing some kind of instruction in the virtual machine.

(lldb) f 5
frame #5: 0x000056362ae06003 ruby`vm_exec_core(ec=<unavailable>, initial=<unavailable>) at vm.inc:2224:5
   2221	    /* ### Instruction trailers. ### */
   2222	    CHECK_VM_STACK_OVERFLOW_FOR_INSN(VM_REG_CFP, INSN_ATTR(retn));
   2223	    CHECK_CANARY(leaf, INSN_ATTR(bin));
-> 2224	    PUSH(val);
   2225	    if (leaf) ADD_PC(INSN_ATTR(width));
   2226	#   undef INSN_ATTR
   2227	
(lldb) 

Checking frame 5, we can see that it’s crashing when we push a value on to the stack. The Ractor function checks the value of objects being pushed on the VM stack, and in this case we have an object that is a T_NONE. The question is where did this value come from?

The crash happened in the file vm.inc, line 2224. This file is a generated file, so I can’t link to it, but I wanted to know which instruction was being executed, so I pulled up that file.

Line 2224 happened to be inside the opt_send_without_block instruction. So something is calling a method, and the return value of the method is a T_NONE object.
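To see what that instruction looks like from the Ruby side, you can ask the VM to disassemble a small method call. This is just an illustration: the snippet being compiled is made up, and the exact disassembly (including instruction names) can vary between Ruby versions. On the CRuby versions discussed here, a plain method call with no block should show up as opt_send_without_block.

```ruby
# Compile a tiny, made-up snippet and print its YARV instructions.
# Compiling does not run the code, so `foo` doesn't need to exist.
iseq = RubyVM::InstructionSequence.compile("foo.bar")
disasm = iseq.disasm
puts disasm
```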

But what method is being called, and on what object?

Finding the called method

The value ec, or “Execution Context” contains information about the virtual machine at runtime. On the ec, we can find the cfp or “Control Frame Pointer” which is a data structure representing the current executing stack frame. In lldb, I could see that frame 7 had the ec available, so I went to that frame to look at the cfp:

(lldb) f 7
frame #7: 0x000056362aca566b ruby`load_iseq_eval(ec=0x000056362b176710, fname=0x000056362ce96660) at load.c:594:5
   591 	        rb_ast_dispose(ast);
   592 	    }
   593 	    rb_exec_event_hook_script_compiled(ec, iseq, Qnil);
-> 594 	    rb_iseq_eval(iseq);
   595 	}
   596 	
   597 	static inline enum ruby_tag_type
(lldb) p *ec->cfp
(rb_control_frame_t) $1 = {
  pc = 0x000056362c095d58
  sp = 0x00007fdc5f859330
  iseq = 0x000056362ca051f0
  self = 0x000056362b1d92c0
  ep = 0x00007fdc5f859328
  block_code = 0x0000000000000000
  __bp__ = 0x00007fdc5f859330
}

The control frame pointer has a pointer to the iseq or “Instruction Sequence” that is currently being executed. It also has a pc or “Program Counter”, and the program counter usually points at the instruction that will be executed next (in other words, not the currently executing instruction). Of other interest, the iseq also has the source location that corresponds to those instructions.

Getting the Source File

If we examine the iseq structure, we can find the source location of the code that is currently being executed:

(lldb) p ec->cfp->iseq->body->location
(rb_iseq_location_t) $4 = {
  pathobj = 0x000056362ca06960
  base_label = 0x000056362ce95a30
  label = 0x000056362ce95a30
  first_lineno = 0x0000000000000051
  node_id = 137
  code_location = {
    beg_pos = (lineno = 40, column = 4)
    end_pos = (lineno = 50, column = 7)
  }
}
(lldb) command script import -r ~/git/ruby/misc/lldb_cruby.py
lldb scripts for ruby has been installed.
(lldb) rp 0x000056362ca06960
bits [     ]
T_STRING: [FROZEN] (const char [57]) $6 = "/home/aaron/git/ruby/lib/bundler/environment_preserver.rb"
(lldb) 

The location info clearly shows us that the instructions are on line 40. The pathobj member contains the file name, but it is stored as a Ruby string. To print out the string, I imported the lldb CRuby extensions, then used the rp command and gave it the address of the path object.

From the output, we can see that it’s crashing in the “environment_preserver.rb” file inside of the instructions that are defined on line 40. We’re not crashing on line 40, but the instructions are defined there.

Those instructions are this method:

    def replace_with_backup
      ENV.replace(backup) unless Gem.win_platform?

      # Fallback logic for Windows below to workaround
      # https://bugs.ruby-lang.org/issues/16798. Can be dropped once all
      # supported rubies include the fix for that.

      ENV.clear

      backup.each {|k, v| ENV[k] = v }
    end

It’s still not clear which of these method calls is breaking. In this function we have some method call that is returning a T_NONE.

Finding The Method Call

To find the method call, I disassembled the instruction sequence and checked the program counter:

(lldb) command script import -r misc/lldb_disasm.py
lldb Ruby disasm installed.
(lldb) rbdisasm ec->cfp->iseq
PC             IDX  insn_name(operands) 
0x56362c095c20 0000 opt_getinlinecache( 6, (struct iseq_inline_cache_entry *)0x56362c095ee0 )
0x56362c095c38 0003 putobject( (VALUE)0x14 )
0x56362c095c48 0005 getconstant( ID: 0x807b )
0x56362c095c58 0007 opt_setinlinecache( (struct iseq_inline_cache_entry *)0x56362c095ee0 )
0x56362c095c68 0009 opt_send_without_block( (struct rb_call_data *)0x56362c095f20 )
0x56362c095c78 0011 branchif( 15 )
0x56362c095c88 0013 opt_getinlinecache( 6, (struct iseq_inline_cache_entry *)0x56362c095ef0 )
0x56362c095ca0 0016 putobject( (VALUE)0x14 )
0x56362c095cb0 0018 getconstant( ID: 0x370b )
0x56362c095cc0 0020 opt_setinlinecache( (struct iseq_inline_cache_entry *)0x56362c095ef0 )
0x56362c095cd0 0022 putself
0x56362c095cd8 0023 opt_send_without_block( (struct rb_call_data *)0x56362c095f30 )
0x56362c095ce8 0025 opt_send_without_block( (struct rb_call_data *)0x56362c095f40 )
0x56362c095cf8 0027 pop
0x56362c095d00 0028 opt_getinlinecache( 6, (struct iseq_inline_cache_entry *)0x56362c095f00 )
0x56362c095d18 0031 putobject( (VALUE)0x14 )
0x56362c095d28 0033 getconstant( ID: 0x370b )
0x56362c095d38 0035 opt_setinlinecache( (struct iseq_inline_cache_entry *)0x56362c095f00 )
0x56362c095d48 0037 opt_send_without_block( (struct rb_call_data *)0x56362c095f50 )
0x56362c095d58 0039 pop
0x56362c095d60 0040 putself
0x56362c095d68 0041 opt_send_without_block( (struct rb_call_data *)0x56362c095f60 )
0x56362c095d78 0043 send( (struct rb_call_data *)0x56362c095f70, (rb_iseq_t *)0x56362ca05178 )
0x56362c095d90 0046 leave
(lldb) p ec->cfp->pc
(const VALUE *) $9 = 0x000056362c095d58

First I loaded the disassembly helper script. It provides the rbdisasm function. Then I used rbdisasm on the instruction sequence. This printed out the instructions in mostly human readable form. Printing the PC showed a value of 0x000056362c095d58. Looking at the PC list in the disassembly shows that 0x000056362c095d58 corresponds to a pop instruction. But the PC always points at the next instruction that will execute, not the currently executing instruction. The currently executing instruction is the one right before the PC. In this case we can see it is opt_send_without_block, which lines up with the information we discovered from vm.inc.

This is the 3rd from last method call in the block. At 0041 there is another opt_send_without_block, and then at 0043 there is a generic send call.

Looking at the Ruby code, from the bottom of the method, we see a call to backup. It’s not a local variable, so it must be a method call. The code calls each on that, and each takes a block. These must correspond to the opt_send_without_block and the send at the end of the instruction sequence. Our crash is happening just before these two, so it must be the call to ENV.clear.
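
The same mapping from Ruby code to instructions can be explored without a debugger. Here is a minimal sketch, assuming CRuby (RubyVM::InstructionSequence is implementation specific), that compiles a trimmed copy of the method and prints its disassembly. The exact instruction names and layout vary between Ruby versions, so treat the output as illustrative rather than a match for the listing above.

```ruby
# A sketch assuming CRuby; other Ruby implementations do not provide RubyVM.
source = <<~RUBY
  def replace_with_backup
    ENV.clear
    backup.each { |k, v| ENV[k] = v }
  end
RUBY

# Compile without executing anything. The disassembly also includes the
# nested instruction sequences for the method body and the block.
iseq = RubyVM::InstructionSequence.compile(source)
disasm = iseq.disasm
puts disasm
```

In the printed listing, the calls to ENV.clear and backup should show up as send-style instructions without a block, followed by a send that carries the block for each.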

If we read the implementation of ENV.clear, we can see that it returns a global variable called envtbl:

VALUE
rb_env_clear(void)
{
    VALUE keys;
    long i;

    keys = env_keys(TRUE);
    for (i=0; i<RARRAY_LEN(keys); i++) {
        VALUE key = RARRAY_AREF(keys, i);
        const char *nam = RSTRING_PTR(key);
        ruby_setenv(nam, 0);
    }
    RB_GC_GUARD(keys);
    return envtbl;
}

This object is allocated here:

    envtbl = rb_obj_alloc(rb_cObject);

And then it calls rb_define_global_const to define the ENV constant as a global:

    /*
     * ENV is a Hash-like accessor for environment variables.
     *
     * See ENV (the class) for more details.
     */
    rb_define_global_const("ENV", envtbl);

If we read rb_define_global_const we can see that it just calls rb_define_const:

void
rb_define_global_const(const char *name, VALUE val)
{
    rb_define_const(rb_cObject, name, val);
}

Before my patch, any object passed to rb_define_const would be pinned. Once I removed the pinning, that allowed the ENV variable to move around even though it shouldn’t.

I reverted that patch here, and then sent a pull request to make rb_gc_register_mark_object a little bit smarter here.

Conclusion

TBH I don’t know what to conclude this with. Debugging errors kind of sucks, but I hope that the LLDB scripts I wrote make it suck a little less. Hope you’re having a good day!!!


Counting Write Barrier Unprotected Objects

This is just a quick post mostly as a note to myself (because I forget the jq commands). Ruby objects that are not protected with a write barrier must be examined on every minor GC. That means that any objects in your system that live for a long time and don’t have write barrier protection will cause unnecessary overhead on every minor collection.

Heap dumps will tell you which objects have a write barrier. In Rails apps I use a small script to get a dump of the heap after boot:

require 'objspace'
require_relative 'config/environment' # assumes this script is saved in the Rails app root

GC.start

File.open("heap.dump", "wb") do |f|
  ObjectSpace.dump_all(output: f)
end

The heap.dump file will have a list of all of the objects in the heap.

Here is an example of an object with a write barrier:

{"address":"0x7fec1b2ff940", "type":"IMEMO", "class":"0x7fec1b2ffd50", "imemo_type":"ment", "references":["0x7fec1b314908", "0x7fec1b2ffcd8"], "memsize":48, "flags":{"wb_protected":true, "old":true, "uncollectible":true, "marked":true}}

Here is an example of an object without a write barrier:

{"address":"0x7fec1b2ff760", "type":"ICLASS", "class":"0x7fec1a8c0f60", "references":["0x7fec1a8c9250", "0x7fec1b2fefe0"], "memsize":40}

Objects with a write barrier will have "wb_protected":true in their flags section.

I like to use jq to process heap dumps. Here is a command to find all of the unprotected objects, group them by type, then count them up:

$ jq 'select(.flags.wb_protected | not) | .type' heap.dump  | sort | uniq -c | sort -n
   1 "MATCH"
   2 "ARRAY"
   5 "ROOT"
   9 "FILE"
 323 "MODULE"
 927 "ICLASS"
1631 "DATA"

All of the objects listed here will be examined on every minor GC. If my Rails app is spending a lot of time in minor GCs, this is a good place to look.
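
If jq isn't handy, the same count can be done in Ruby. This is a rough sketch: count_unprotected is a hypothetical helper name, and the sample lines are trimmed versions of the two objects shown earlier. It relies only on each line of the dump being one JSON object.

```ruby
require "json"
require "stringio"

# Hypothetical helper: tallies the types of objects that lack the
# wb_protected flag, reading one JSON object per line.
def count_unprotected(io)
  counts = Hash.new(0)
  io.each_line do |line|
    obj = JSON.parse(line)
    next if obj.dig("flags", "wb_protected") # protected, skip it
    counts[obj["type"]] += 1
  end
  counts.sort_by { |_type, count| count }.to_h
end

# Two sample lines standing in for a real heap dump
sample = StringIO.new(<<~DUMP)
  {"address":"0x7fec1b2ff940", "type":"IMEMO", "flags":{"wb_protected":true, "old":true}}
  {"address":"0x7fec1b2ff760", "type":"ICLASS", "memsize":40}
DUMP

unprotected = count_unprotected(sample)
p unprotected
```

For a real dump, open heap.dump with File.open and pass the file object in place of the StringIO.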

Ruby 2.8 (or 3.0) will eliminate ICLASS from this list (here is the commit).


Guide to String Encoding in Ruby

Encoding issues don’t seem to happen frequently, but that is a blessing and a curse. It’s great not to fix them very frequently, but when you do need to fix them, lack of experience can leave you feeling lost.

This post is meant to be a sort of guide about what to do when you encounter different types of encoding errors in Ruby. First we’ll cover what an encoding object is, then we’ll look at common encoding exceptions and how to fix them.

What are String encodings in Ruby?

In Ruby, strings are a combination of an array of bytes, and an encoding object. We can access the encoding object on the string by calling encoding on the string object.

For example:

>> x = 'Hello World'
>> x.encoding
=> #<Encoding:UTF-8>

In my environment, the default encoding object associated with a string is the “UTF-8” encoding object. A graph of the object relationship looks something like this:

[Figure: string points at encoding]

Changing a String’s Encoding

We can change encoding by two different methods:

  • String#force_encoding
  • String#encode

The force_encoding method will mutate the string object and only change which encoding object the string points to. It does nothing to the bytes of the string; it merely changes the encoding object associated with the string. Here we can see that the return value of encoding changes after we call the force_encoding method:

>> x = 'Hello World'
>> x.encoding
=> #<Encoding:UTF-8>
>> x.force_encoding "US-ASCII"
=> "Hello World"
>> x.encoding
=> #<Encoding:US-ASCII>

The encode method will create a new string based on the bytes of the old string and associate the encoding object with the new string.

Here we can see that the encoding of x remains the same, and calling encode returns a new string y which is associated with the new encoding:

>> x = 'Hello World'
>> x.encoding
=> #<Encoding:UTF-8>
>> y = x.encode("US-ASCII")
>> x.encoding
=> #<Encoding:UTF-8>
>> y.encoding
=> #<Encoding:US-ASCII>

Here is a visualization of the difference:

[Figure: changing encoding]

Calling force_encoding mutates the original string, where encode creates a new string with a different encoding. Translating a string from one encoding to another is probably the “normal” use of encodings. However, developers will rarely call the encode method because Ruby will typically handle any necessary translations automatically. It’s probably more common to call the force_encoding method, and that is because strings can be associated with the wrong encoding.
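
One way to see the difference is to compare the bytes before and after each call. This is a small sketch that uses the string "日本" (written with \u escapes so the source file encoding doesn't matter). The variable names are only for illustration.

```ruby
utf8 = "\u65E5\u672C" # "日本", tagged as UTF-8

# force_encoding only swaps the tag, so the bytes stay the same.
# We dup first so the original string is left alone.
retagged = utf8.dup.force_encoding("Shift_JIS")
same_bytes = retagged.bytes == utf8.bytes

# encode builds a new string whose bytes are translated for the new encoding
transcoded = utf8.encode("Shift_JIS")

p utf8.bytes
p transcoded.bytes
p same_bytes
```

The transcoded bytes are different from the UTF-8 bytes, while the retagged string has exactly the bytes it started with.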

Strings Can Have the Wrong Encoding

Strings can be associated with the wrong encoding object, and that is the source of most if not all encoding related exceptions. Let’s look at an example:

>> x = "Hello \x93\xfa\x96\x7b"
>> x.encoding
=> #<Encoding:UTF-8>
>> x.valid_encoding?
=> false

In this case, Ruby associated the string "Hello \x93\xfa\x96\x7b" with the default encoding UTF-8. However, several of the bytes in the string are not valid UTF-8. We can check whether the bytes of a string are valid for its encoding object by calling the valid_encoding? method. The valid_encoding? method will scan all bytes to see if they are valid for that particular encoding object.

So how do we fix this? The answer depends on the situation. We need to think about where the data came from and where the data is going. Let’s say we’ll display this string on a webpage, but we do not know the correct encoding for the string. In that case we probably want to make sure the string is valid UTF-8, but since we don’t know the correct encoding for the string, our only choice is to remove the bad bytes from the string.

We can remove the unknown bytes by using the scrub method:

>> x = "Hello \x93\xfa\x96\x7b"
>> x.valid_encoding?
=> false
>> y = x.scrub
>> y
=> "Hello ���{"
>> y.encoding
=> #<Encoding:UTF-8>
>> y.valid_encoding?
=> true

The scrub method will return a new string associated with the encoding but with all of the invalid bytes replaced by a replacement character, the diamond question mark thing.

What if we do know the encoding of the source string? Actually the example above is using a string that’s encoded using Shift JIS. Let’s say we know the encoding, and we want to display the string on a webpage. In that case we tag the string by using force_encoding, and transcode to UTF-8:

>> x = "Hello \x93\xfa\x96\x7b"
>> x.force_encoding "Shift_JIS"
=> "Hello \x{93FA}\x{967B}"
>> x.valid_encoding?
=> true
>> x.encode "UTF-8" # display as UTF-8
=> "Hello 日本"

The most important thing to think about when dealing with encoding issues is “where did this data come from?” and “what will we do with this data?” Answering those two questions will drive all decisions about which encoding to use with which string.

Encoding Depends on the Context

Before we look at some common errors and their remediation, let’s look at one more example of the encoding context dependency. In this example, we’ll use some user input as a cache key, but we’ll also display the user input on a webpage. We’re going to use our source data (the user input) in two places: as a cache key, and something to display on a web page.

Here’s the code:

require "digest/md5"
require "cgi"

# Make a checksum
def make_checksum string
  Digest::MD5.hexdigest string
end

# Not good HTML escaping (don't use this)
# Returns a string with UTF-8 compatible encoding for display on a webpage
def display_on_web string
  string.gsub(/>/, "&gt;")
end

# User input from an unknown source
x = "Hello \x93\xfa\x96\x7b"
p ENCODING: x.encoding
p VALID_ENCODING: x.valid_encoding?

p display_on_web x
p make_checksum x

If we run this code, we’ll get an exception:

$ ruby thing.rb
{:ENCODING=>#<Encoding:UTF-8>}
{:VALID_ENCODING=>false}
Traceback (most recent call last):
        2: from thing.rb:20:in `<main>'
        1: from thing.rb:12:in `display_on_web'
thing.rb:12:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)

The problem is that we have a string of unknown input with bytes that are not valid UTF-8 characters. We know we want to display this string on a UTF-8 encoded webpage, so let’s scrub the string:

require "digest/md5"
require "cgi"

# Make a checksum
def make_checksum string
  Digest::MD5.hexdigest string
end

# Not good HTML escaping (don't use this)
# Returns a string with UTF-8 compatible encoding for display on a webpage
def display_on_web string
  string.gsub(/>/, "&gt;")
end

# User input from an unknown source
x = "Hello \x93\xfa\x96\x7b".scrub
p ENCODING: x.encoding
p VALID_ENCODING: x.valid_encoding?

p display_on_web x
p make_checksum x

Now when we run the program, the output is like this:

$ ruby thing.rb
{:ENCODING=>#<Encoding:UTF-8>}
{:VALID_ENCODING=>true}
"Hello ���{"
"4dab6f63b4d3ae3279345c9df31091eb"

Great! We’ve built some HTML and generated a checksum. Unfortunately there is a bug in this code (of course the mere fact that we’ve written code means there’s a bug! lol). Let’s introduce a second user input string with slightly different bytes than the first input string:

require "digest/md5"
require "cgi"

# Make a checksum
def make_checksum string
  Digest::MD5.hexdigest string
end

# Not good HTML escaping (don't use this)
# Returns a string with UTF-8 compatible encoding for display on a webpage
def display_on_web string
  string.gsub(/>/, "&gt;")
end

# User input from an unknown source
x = "Hello \x93\xfa\x96\x7b".scrub
p ENCODING: x.encoding
p VALID_ENCODING: x.valid_encoding?

p display_on_web x
p make_checksum x

# Second user input from an unknown source with slightly different bytes
y = "Hello \x94\xfa\x97\x7b".scrub
p ENCODING: y.encoding
p VALID_ENCODING: y.valid_encoding?

p display_on_web y
p make_checksum y

Here is the output from the program:

$ ruby thing.rb
{:ENCODING=>#<Encoding:UTF-8>}
{:VALID_ENCODING=>true}
"Hello ���{"
"4dab6f63b4d3ae3279345c9df31091eb"
{:ENCODING=>#<Encoding:UTF-8>}
{:VALID_ENCODING=>true}
"Hello ���{"
"4dab6f63b4d3ae3279345c9df31091eb"

The program works in the sense that there is no exception. But both user input strings have the same checksum despite the fact that the original strings clearly have different bytes! So what is the correct fix for this program? Again, we need to think about the source of the data (where did it come from), as well as what we will do with it (where it is going). In this case we have one source, a user, and the user provided us with no encoding information. In other words, the encoding of the source data is unknown, so we can only treat it as a sequence of bytes.

We have two outputs: one is a UTF-8 HTML page, and the other is the input to our checksum function. The HTML requires that our string be UTF-8, so making the string valid UTF-8 (in other words, “scrubbing” it) before displaying makes sense. However, our checksum function needs to see the original bytes of the string. Since the checksum is only concerned with the bytes in the string, any encoding, including an invalid one, will work. It’s nice to make sure all our strings have valid encodings though, so we’ll fix this example such that everything has a valid encoding.

require "digest/md5"
require "cgi"

# Make a checksum
def make_checksum string
  Digest::MD5.hexdigest string
end

# Not good HTML escaping (don't use this)
# Returns a string with UTF-8 compatible encoding for display on a webpage
def display_on_web string
  string.gsub(/>/, "&gt;")
end

# User input from an unknown source
x = "Hello \x93\xfa\x96\x7b".b
p ENCODING: x.encoding
p VALID_ENCODING: x.valid_encoding?

p display_on_web x.encode("UTF-8", undef: :replace)
p make_checksum x

# Second user input from an unknown source with slightly different bytes
y = "Hello \x94\xfa\x97\x7b".b
p ENCODING: y.encoding
p VALID_ENCODING: y.valid_encoding?

p display_on_web y.encode("UTF-8", undef: :replace)
p make_checksum y

Here is the output of the program:

$ ruby thing.rb
{:ENCODING=>#<Encoding:ASCII-8BIT>}
{:VALID_ENCODING=>true}
"Hello ���{"
"96cf6db2750fd4d2488fac57d8e4d45a"
{:ENCODING=>#<Encoding:ASCII-8BIT>}
{:VALID_ENCODING=>true}
"Hello ���{"
"b92854c0db4f2c2c20eff349a9a8e3a0"

To fix our program, we’ve changed a couple things. First we tagged the string of unknown encoding as “binary” by using the .b method. The .b method returns a new string that is associated with the ASCII-8BIT encoding. The name ASCII-8BIT is somewhat confusing because it has the word “ASCII” in it. It’s better to think of this encoding as either “unknown” or “binary data”. Unknown meaning we have some data that may have a valid encoding, but we don’t know what it is. Or binary data, as in the bytes read from a JPEG file or some such binary format. Anyway, we pass the binary string in to the checksum function because the checksum only cares about the bytes in the string, not about the encoding.
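
Here is a short sketch of what .b does and doesn't do. It shows that the original string keeps its tag, that the returned string is a different object with the same bytes, and that Encoding::BINARY is another name for the same encoding object. Newer Ruby versions may print this encoding under the name BINARY, so the inspect output can differ from what is shown in this post.

```ruby
x = "Hello \x93\xfa\x96\x7b"
binary = x.b

p x.encoding       # the original string keeps its UTF-8 tag
p binary.encoding  # the copy is tagged as binary
p binary.equal?(x) # .b returned a different object
p binary.bytes == x.bytes

# BINARY and ASCII_8BIT are two constants for one encoding object
p Encoding::BINARY.equal?(Encoding::ASCII_8BIT)
```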

The second change we made is to call encode with the encoding we want (UTF-8) along with undef: :replace meaning that any time Ruby encounters bytes it doesn’t know how to convert to the target encoding, it will replace them with the replacement character (the diamond question thing).

SIDE NOTE: This is probably not important, but it is fun! We can specify what Ruby uses for replacing unknown bytes. Here’s an example:

>> x = "Hello \x94\xfa\x97\x7b".b
>> x.encoding
=> #<Encoding:ASCII-8BIT>
>> x.encode("UTF-8", undef: :replace, replace: "Aaron")
=> "Hello AaronAaronAaron{"
>> x.encode("UTF-8", undef: :replace, replace: "🤣")
=> "Hello 🤣🤣🤣{"
>> [_.encoding, _.valid_encoding?]
=> [#<Encoding:UTF-8>, true]

Now let’s take a look at some common encoding errors in Ruby and what to do about them.

Encoding::InvalidByteSequenceError

This exception occurs when Ruby needs to examine the bytes in a string and the bytes do not match the encoding. Here is an example of this exception:

>> x = "Hello \x93\xfa\x96\x7b"
>> x.encode "UTF-16"
Traceback (most recent call last):
        5: from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:23:in `<main>'
        4: from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:23:in `load'
        3: from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/gems/2.7.0/gems/irb-1.2.0/exe/irb:11:in `<top (required)>'
        2: from (irb):4
        1: from (irb):4:in `encode'
Encoding::InvalidByteSequenceError ("\x93" on UTF-8)
>> x.encoding
=> #<Encoding:UTF-8>
>> x.valid_encoding?
=> false

The string x contains bytes that aren’t valid UTF-8, yet it is associated with the UTF-8 encoding object. When we try to convert x to UTF-16, an exception occurs.

How to fix Encoding::InvalidByteSequenceError

Like most encoding issues, our string x is tagged with the wrong encoding. The way to fix this issue is to tag the string with the correct encoding. But what is the correct encoding? To figure out the correct encoding, you need to know where the string came from. For example, if the string came from a MIME attachment, the MIME attachment should specify the encoding (or the RFC will tell you).

In this case, the string is a valid Shift JIS string, but I know that because I looked up the bytes and manually entered them. So we’ll tag this as Shift JIS, and the exception goes away:

>> x = "Hello \x93\xfa\x96\x7b"
>> x.force_encoding "Shift_JIS"
=> "Hello \x{93FA}\x{967B}"
>> x.encode "UTF-16"
=> "\uFEFFHello \u65E5\u672C"
>> x.encoding
=> #<Encoding:Shift_JIS>
>> x.valid_encoding?
=> true

If you don’t know the source of the string, an alternative solution is to tag as UTF-8 and then scrub the bytes:

>> x = "Hello \x93\xfa\x96\x7b"
>> x.force_encoding "UTF-8"
=> "Hello \x93\xFA\x96{"
>> x.scrub!
=> "Hello ���{"
>> x.encode "UTF-16"
=> "\uFEFFHello \uFFFD\uFFFD\uFFFD{"
>> x.encoding
=> #<Encoding:UTF-8>
>> x.valid_encoding?
=> true

Of course this works, but it means that you’ve lost data. The best solution is to figure out what the encoding of the string should be depending on its source and tag it with the correct encoding.
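
When the source gives you a short list of likely encodings, one rough way to narrow it down is to tag a copy of the string with each candidate and keep the ones that validate. This is only a sketch: plausible_encodings is a hypothetical helper, and passing valid_encoding? doesn't prove an encoding is the right one (some encodings accept nearly any byte sequence), so the result is a hint and not an answer.

```ruby
# Hypothetical helper: returns the candidate encoding names under which
# the bytes of the string are structurally valid.
def plausible_encodings(string, candidates)
  candidates.select do |name|
    # dup so the caller's string keeps its original tag
    string.dup.force_encoding(name).valid_encoding?
  end
end

x = "Hello \x93\xfa\x96\x7b"
matches = plausible_encodings(x, ["UTF-8", "Shift_JIS"])
p matches
```

For the example string, UTF-8 is ruled out and Shift_JIS remains, which agrees with what we saw above.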

Encoding::UndefinedConversionError

This exception occurs when a string of one encoding can’t be converted to another encoding.

Here is an example:

>> x = "四\u2160"
>> x
=> "四Ⅰ"
>> x.encoding
=> #<Encoding:UTF-8>
>> x.valid_encoding?
=> true
>> x.encode "Shift_JIS"
Traceback (most recent call last):
        5: from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:23:in `<main>'
        4: from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:23:in `load'
        3: from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/gems/2.7.0/gems/irb-1.2.0/exe/irb:11:in `<top (required)>'
        2: from (irb):23
        1: from (irb):23:in `encode'
Encoding::UndefinedConversionError (U+2160 from UTF-8 to Shift_JIS)

In this example, we have two characters: “四”, and the Roman numeral 1 (“Ⅰ”). Unicode Roman numeral 1 cannot be converted to Shift JIS because there are two codepoints that represent that character in Shift JIS. This means the conversion is ambiguous, so Ruby will raise an exception.

How to fix Encoding::UndefinedConversionError

Our original string is correctly tagged as UTF-8, but we need to convert to Shift JIS. In this case we’ll use a replacement character when converting to Shift JIS:

>> x = "四\u2160"
>> y = x.encode("Shift_JIS", undef: :replace)
>> y
=> "\x{8E6C}?"
>> y.encoding
=> #<Encoding:Shift_JIS>
>> y.valid_encoding?
=> true
>> y.encode "UTF-8"
=> "四?"

We were able to convert to Shift JIS, but we did lose some data.
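
If a question mark loses too much, there are two other things worth trying. The sketch below first rescues the exception and asks it which character failed, then uses the fallback: option of encode to pick a replacement for that specific character. Replacing Roman numeral 1 with a plain "I" is just an assumption for the example; the right substitute depends on your data.

```ruby
x = "四\u2160"

begin
  x.encode("Shift_JIS")
rescue Encoding::UndefinedConversionError => e
  # The exception knows which character could not be converted
  failed_char = e.error_char
  p [e.source_encoding_name, e.destination_encoding_name, failed_char]
end

# fallback: maps an unconvertible character to a replacement we choose
y = x.encode("Shift_JIS", fallback: { "\u2160" => "I" })
p y.encoding
p y.encode("UTF-8")
```

We still lose the original character, but at least we decide what takes its place.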

ArgumentError

When a string contains invalid bytes, sometimes Ruby will raise an ArgumentError exception:

>> x = "Hello \x93\xfa\x96\x7b"
>> x.downcase
Traceback (most recent call last):
        5: from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:23:in `<main>'
        4: from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:23:in `load'
        3: from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/gems/2.7.0/gems/irb-1.2.0/exe/irb:11:in `<top (required)>'
        2: from (irb):34
        1: from (irb):34:in `downcase'
ArgumentError (input string invalid)
>> x.gsub(/ello/, "i")
Traceback (most recent call last):
        6: from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:23:in `<main>'
        5: from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:23:in `load'
        4: from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/gems/2.7.0/gems/irb-1.2.0/exe/irb:11:in `<top (required)>'
        3: from (irb):34
        2: from (irb):35:in `rescue in irb_binding'
        1: from (irb):35:in `gsub'
ArgumentError (invalid byte sequence in UTF-8)

Again we use our incorrectly tagged Shift JIS string. Calling downcase or gsub both result in an ArgumentError. I personally think these exceptions are not great. We didn’t pass anything to downcase, so why is it an ArgumentError? There is nothing wrong with the arguments we passed to gsub, so why is it an ArgumentError? Why does one say “input string invalid” where the other gives us a slightly more helpful exception of “invalid byte sequence in UTF-8”? I think these should both result in Encoding::InvalidByteSequenceError exceptions, as it’s a problem with the encoding, not the arguments.

Regardless, these errors both stem from the fact that the Shift JIS string is incorrectly tagged as UTF-8.

Fixing ArgumentError

Fixing this issue is just like fixing Encoding::InvalidByteSequenceError. We need to figure out the correct encoding of the source string, then tag the source string with that encoding. If the encoding of the source string is truly unknown, scrub it.

>> x = "Hello \x93\xfa\x96\x7b"
>> x.force_encoding "Shift_JIS"
=> "Hello \x{93FA}\x{967B}"
>> x.downcase
=> "hello \x{93FA}\x{967B}"
>> x.gsub(/ello/, "i")
=> "Hi \x{93FA}\x{967B}"
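
And here is a sketch of the other branch, where the encoding really is unknown and we scrub. Once the invalid bytes are replaced, downcase and gsub no longer raise:

```ruby
x = "Hello \x93\xfa\x96\x7b"

# scrub returns a new, valid UTF-8 string; x itself is untouched
clean = x.scrub

lowered = clean.downcase
replaced = clean.gsub(/ello/, "i")

p lowered
p replaced
```

As before, the price is that the bytes we couldn't interpret are gone.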

Encoding::CompatibilityError

This exception occurs when we try to combine strings of two different encodings and those encodings are incompatible. For example:

>> x = "四\u2160"
>> y = "Hello \x93\xfa\x96\x7b".force_encoding("Shift_JIS")
>> [x.encoding, x.valid_encoding?]
=> [#<Encoding:UTF-8>, true]
>> [y.encoding, y.valid_encoding?]
=> [#<Encoding:Shift_JIS>, true]
>> x + y
Traceback (most recent call last):
        5: from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:23:in `<main>'
        4: from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:23:in `load'
        3: from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/gems/2.7.0/gems/irb-1.2.0/exe/irb:11:in `<top (required)>'
        2: from (irb):50
        1: from (irb):50:in `+'
Encoding::CompatibilityError (incompatible character encodings: UTF-8 and Shift_JIS)

In this example we have a valid UTF-8 string and a valid Shift JIS string. However, these two encodings are not compatible, so we get an exception when combining.

Fixing Encoding::CompatibilityError

To fix this exception, we need to manually convert one string to a new string that has a compatible encoding. In the case above, we can choose whether we want the output string to be UTF-8 or Shift JIS, and then call encode on the appropriate string.

In the case we want UTF-8 output, we can do this:

>> x = "四"
>> y = "Hello \x93\xfa\x96\x7b".force_encoding("Shift_JIS")
>> x + y.encode("UTF-8")
=> "四Hello 日本"
>> [_.encoding, _.valid_encoding?]
=> [#<Encoding:UTF-8>, true]

If we wanted Shift JIS, we could do this:

>> x = "四"
>> y = "Hello \x93\xfa\x96\x7b".force_encoding("Shift_JIS")
>> x.encode("Shift_JIS") + y
=> "\x{8E6C}Hello \x{93FA}\x{967B}"
>> [_.encoding, _.valid_encoding?]
=> [#<Encoding:Shift_JIS>, true]

Another possible solution is to scrub bytes and concatenate, but again that results in data loss.

What is a compatible encoding?

If there are incompatible encodings, there must be compatible encodings too (at least I would think that). Here is an example of compatible encodings:

>> x = "Hello World!".force_encoding "US-ASCII"
>> [x.encoding, x.valid_encoding?]
=> [#<Encoding:US-ASCII>, true]
>> y = "こんにちは"
>> [y.encoding, y.valid_encoding?]
=> [#<Encoding:UTF-8>, true]
>> y + x
=> "こんにちはHello World!"
>> [_.encoding, _.valid_encoding?]
=> [#<Encoding:UTF-8>, true]
>> x + y
=> "Hello World!こんにちは"
>> [_.encoding, _.valid_encoding?]
=> [#<Encoding:UTF-8>, true]

The x string is encoded with “US ASCII” encoding and the y string with UTF-8. US ASCII is fully compatible with UTF-8, so even though these two strings have different encodings, concatenation works fine.
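
We can also ask Ruby about compatibility before concatenating by using Encoding.compatible?. It returns the encoding the combined string would have, or nil if the two can't be combined. A small sketch using the strings from this post (the variable names are just for illustration):

```ruby
ascii = "Hello World!".dup.force_encoding("US-ASCII")
utf8  = "こんにちは"
sjis  = "Hello \x93\xfa\x96\x7b".dup.force_encoding("Shift_JIS")

# Compatible: the result tells us which encoding concatenation would use
ascii_and_utf8 = Encoding.compatible?(ascii, utf8)
p ascii_and_utf8

# Incompatible: nil means concatenating would raise
utf8_and_sjis = Encoding.compatible?(utf8, sjis)
p utf8_and_sjis
```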

String literals may default to UTF-8, but some functions will return US ASCII encoded strings. For example:

>> require "digest/md5"
=> true
>> Digest::MD5.hexdigest("foo").encoding
=> #<Encoding:US-ASCII>

A hexdigest will only ever contain ASCII characters, so the implementation tags the returned string as US-ASCII.

Encoding Gotchas

Let’s look at a couple of encoding gotchas.

Infectious Invalid Encodings

When a string is incorrectly tagged, Ruby will typically only raise an exception when it needs to actually examine the bytes. Here is an example:

>> x = "Hello \x93\xfa\x96\x7b"
>> x.encoding
=> #<Encoding:UTF-8>
>> x.valid_encoding?
=> false
>> x + "ほげ"
=> "Hello \x93\xFA\x96{ほげ"
>> y = _
>> y
=> "Hello \x93\xFA\x96{ほげ"
>> [y.encoding, y.valid_encoding?]
=> [#<Encoding:UTF-8>, false]

Again we have the incorrectly tagged Shift JIS string. We’re able to append a correctly tagged UTF-8 string and no exception is raised. Why is that? Ruby assumes that if both strings have the same encoding, there is no reason to validate the bytes in either string so it will just append them. That means we can have an incorrectly tagged string “infect” what would otherwise be correctly tagged UTF-8 strings. Say we have some code like this:

def append string
  string + "ほげ"
end

p append("ほげ").valid_encoding? # => true
p append("Hello \x93\xfa\x96\x7b").valid_encoding? # => false

When debugging this code, we may be tempted to think the problem is in the append method. But actually the issue is with the caller. The caller is passing in incorrectly tagged strings, and unfortunately we might not get an exception until the return value of append is used somewhere far away.
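
One way to surface the problem closer to the caller is to check the input at the boundary. This is a sketch of that idea and not something the original code does. The check and the error message are my own additions, and valid_encoding? may need to scan the bytes, so it isn't free.

```ruby
# Hypothetical guard: reject mis-tagged strings at the boundary so the
# failure points at the caller instead of some code far away.
def append(string)
  unless string.valid_encoding?
    raise ArgumentError, "invalid #{string.encoding} string: #{string.inspect}"
  end
  string + "ほげ"
end

p append("ほげ")

begin
  append("Hello \x93\xfa\x96\x7b")
rescue ArgumentError => e
  rejected = e.message
  puts rejected
end
```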

ASCII-8BIT is Special

Sometimes ASCII-8BIT is considered to be a “compatible” encoding and sometimes it isn’t. Here is an example:

>> x = "\x93\xfa\x96\x7b".b
>> x.encoding
=> #<Encoding:ASCII-8BIT>
>> y = "ほげ"
>> y + x
Traceback (most recent call last):
        5: from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:23:in `<main>'
        4: from /Users/aaron/.rbenv/versions/ruby-trunk/bin/irb:23:in `load'
        3: from /Users/aaron/.rbenv/versions/ruby-trunk/lib/ruby/gems/2.7.0/gems/irb-1.2.0/exe/irb:11:in `<top (required)>'
        2: from (irb):89
        1: from (irb):89:in `+'
Encoding::CompatibilityError (incompatible character encodings: UTF-8 and ASCII-8BIT)

Here we have a binary string stored in x. Maybe it came from a JPEG file or something (it didn’t, I just typed it in!) When we try to concatenate the binary string with the UTF-8 string, we get an exception. But this may actually be an exception we want! It doesn’t make sense to be concatenating some JPEG data with an actual string we want to view, so it’s good we got an exception here.

Now here is the same code, but with the contents of x changed somewhat:

>> x = "Hello World".b
>> x.encoding
=> #<Encoding:ASCII-8BIT>
>> y = "ほげ"
>> y + x
=> "ほげHello World"

We have the same code with the same encodings at play. The only thing that changed is the actual contents of the x string.

When Ruby concatenates ASCII-8BIT strings, it will examine the contents of that string. If all bytes in the string are ASCII characters, it will treat it as a US-ASCII string and consider it to be “compatible”. If the string contains non-ASCII characters, it will consider it to be incompatible.

This means that if you had read some data from your JPEG, and that data happened to all be ASCII characters, you would not get an exception even though maybe you really wanted one.

In my personal opinion, concatenating an ASCII-8BIT string with anything besides another ASCII-8BIT string should be an exception.
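Ruby doesn't work that way, but you can get that behavior in your own code. Here is a rough sketch, assuming a hypothetical `strict_concat` helper (the name and the error message are made up for this example):

```ruby
# Hypothetical helper: only allow binary + binary or text + text.
def strict_concat(a, b)
  a_binary = a.encoding == Encoding::BINARY # Encoding::BINARY is ASCII-8BIT
  b_binary = b.encoding == Encoding::BINARY

  if a_binary != b_binary
    raise Encoding::CompatibilityError,
          "refusing to mix #{a.encoding} with #{b.encoding}"
  end

  a + b # Ruby's normal compatibility rules still apply here
end

strict_concat("\x93\xfa".b, "\x96\x7b".b) # fine, both are binary

begin
  # Plain `+` would allow this because the binary string is all ASCII.
  strict_concat("\u307B\u3052", "Hello World".b)
rescue Encoding::CompatibilityError => e
  warn e.message
end
```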

Anyway, this is all I feel like writing today. I hope you have a good day, and remember to check your encodings!


My Career Goals

I was going to tweet about this, but then I thought I’d have to make a bunch of tweets, and writing a blurgh post just seemed easier. Plus I don’t really have any puns in this post, so I can’t tweet it!

My Career Goals

I think many people aren’t sure what they want to do in their career. When I first started programming, I wasn’t sure what I wanted to do with my career. But after years of experience, my career aspirations have become crystal clear. I would like my job to be:

  • Improving Ruby and Rails internals
  • Teaching people

Improving Ruby and Rails internals

I got my first job programming in 1999. At that time, I didn’t know I wanted to be a programmer; it was just a way for me to pay for school. It turned out that I was pretty good at programming, so I decided that would be my career. To be honest, at that time I didn’t really love programming. I just found that I was good at it, and I could make decent money. In 2005 I found Ruby and Rails and that’s when I actually learned that I love programming. I loved Ruby so much that I learned Japanese so I could read blog posts about Ruby. 14 years later, I can easily read those blog posts, but I don’t actually need them. Oops!

The reason I want to work on Ruby and Rails internals is that I want the language and framework to be performant, stable, and easy to use. I want Ruby and Rails to be a great choice for people to use in production. I want others to experience the same joy I felt writing Ruby, and I want to make sure there are businesses that will employ those people.

Teaching People

I love to teach people things I know. I also love learning new things. As I hack on language and framework internals, I try to take that knowledge and disseminate it to as many people as I can.

Why?

First, I don’t think people can feel the joy of programming in Ruby/Rails unless they know how to actually program with Ruby/Rails. So I’m happy to help new folks get into the language and framework.

Second, I realize I’m not going to be around forever, and I want to make sure that these technologies will outlive me. If these technologies are going to survive into the future, people need to understand how they work. Simply put: it’s an insurance policy for the future.

Third, it’s just fun.

Summary

My dream job is to hack Ruby/Rails internals and teach people everything I know. Doing it is fun for me, and it’s the best way I can use my skills to make a real impact on the world.

The End.
