Tenderlove Making

Monkey Patch Detection in Ruby

My last post detailed one way that CRuby will eliminate some intermediate array allocations when using methods like Array#hash and Array#max. Part of the technique hinges on detecting when someone monkey patches Array. Today, I thought we’d dive a little bit into how CRuby detects and de-optimizes itself when these “important” methods get monkey patched.

Monkey Patching Problem

The optimization in the previous post made the assumption that the implementation of Array#max was the original definition (as defined in Ruby itself). But the Ruby language allows us to reopen classes and redefine any methods we want, and we expect those redefined methods to “just work”.

For example, if someone were to reopen Array and define a new max method, we would need to respect that monkey patch:

class Array
  def max
    "hello!"
  end
end

puts [1, 2].max # => "hello!"

In fact, a monkey patch implementation could mutate the array itself, so we’re definitely required to allocate an array in the case that someone added their own max method:

class Array
  def max
    self << :neat
    self
  end
end

x = [1, 2].max
p x # => [1, 2, :neat]

So how does CRuby detect that a method has been monkey patched?

Method Definition Time

Every time a method is defined, an entry is stored in a hash table pointed to by the current class. We call this the “method table”, but you’ll see it referred to as M_TBL or RCLASS_M_TBL in the code. The key to the hash is simply the method name as an ID type (an integer which represents a Ruby Symbol), and the value of the hash is a method entry structure. If there was already an entry in the table, then we know it’s a “redefinition” (a.k.a. “monkey patch”), and we end up calling rb_vm_check_redefinition_opt_method here.

rb_vm_check_redefinition_opt_method checks to see if this is a method we “care” about. Methods we “care” about are typically ones where we’ve made some kind of optimization and we need to deoptimize if someone redefines them.

If the redefined method is something we care to detect, then we set a flag in a global variable ruby_vm_redefined_flag, which is an array of integers.

The indexes of the ruby_vm_redefined_flag array correspond to “basic operators”, or BOPs. So for example, the 0th element is for BOP_PLUS, the 1st element is BOP_MINUS, etc. You can see the full list of basic operators here. These basic operators correspond to method names that we care about. So if someone monkey patches the + operator, we’ll set a flag in ruby_vm_redefined_flag[BOP_PLUS].

The values of the ruby_vm_redefined_flag array correspond to a bitmap that maps to classes we care about. You can see the list of classes and their corresponding bits here, as well as a function for mapping “classes we care about” to their corresponding bit flag.

For example, if someone monkey patches Array#pack, we would set a bit in ruby_vm_redefined_flag like this:

ruby_vm_redefined_flag[BOP_PACK] |= ARRAY_REDEFINED_OP_FLAG;

Then, when we execute our optimized instruction (opt_newarray_send which was introduced in the last post), we can check the bitmap to decide whether or not to take our fast path:

if ((ruby_vm_redefined_flag[BOP_PACK] & ARRAY_REDEFINED_OP_FLAG) == 0) {
  // It _hasn't_ been monkey patched, so take the fast path
}
else {
  // It _has_ been monkey patched, do the slow path
}

Of course this bitmask checking is wrapped in a macro that looks more like this:

if (BASIC_OP_UNREDEFINED_P(BOP_PACK, ARRAY_REDEFINED_OP_FLAG)) {
  // It _hasn't_ been monkey patched, so take the fast path
}
else {
  // It _has_ been monkey patched, do the slow path
}

You can see the actual code for Array#pack redefinition checking here.
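To make the bitmap idea concrete, here is a toy model of it in plain Ruby. This is only a sketch: the index and bit values, and the helper names mark_redefined and basic_op_unredefined?, are made up for illustration and are not the values or functions CRuby actually uses.

```ruby
# A toy Ruby model of the redefinition bitmap. The constant values below
# are illustrative assumptions, not the ones CRuby uses.
BOP_PLUS, BOP_MINUS, BOP_PACK = 0, 1, 2   # hypothetical indexes

INTEGER_REDEFINED_OP_FLAG = 1 << 0        # hypothetical bit positions
ARRAY_REDEFINED_OP_FLAG   = 1 << 1

# One integer per basic operator; every bit starts out clear.
REDEFINED_FLAGS = Array.new(3, 0)

# Models what happens at method definition time.
def mark_redefined(bop, class_flag)
  REDEFINED_FLAGS[bop] |= class_flag
end

# Models the BASIC_OP_UNREDEFINED_P check.
def basic_op_unredefined?(bop, class_flag)
  (REDEFINED_FLAGS[bop] & class_flag) == 0
end

p basic_op_unredefined?(BOP_PACK, ARRAY_REDEFINED_OP_FLAG) # => true
mark_redefined(BOP_PACK, ARRAY_REDEFINED_OP_FLAG)
p basic_op_unredefined?(BOP_PACK, ARRAY_REDEFINED_OP_FLAG) # => false
p basic_op_unredefined?(BOP_PLUS, ARRAY_REDEFINED_OP_FLAG) # => true
```

Setting the Array bit for one operator leaves every other operator, and every other class bit for the same operator, untouched.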

Bonus Stuff

A cool thing (at least I think it’s cool) is that the function rb_vm_check_redefinition_opt_method not only sets up the “monkey patch detection” bits, it’s also a natural place to inform the JIT compiler that someone has done something catastrophic and that it should de-optimize. In fact, you can see those calls right here.

A weird thing is that since ruby_vm_redefined_flag is just a list of bitmaps, it’s technically possible for us to track the redefinition of Integer#pack even though that method doesn’t exist:

ruby_vm_redefined_flag[BOP_PACK] |= INTEGER_REDEFINED_OP_FLAG;

I guess that means there’s a lot of bit space that isn’t used, but I don’t really think it’s a big deal.

Anyway, have a good day!

Eliminating Intermediate Array Allocations

Recently I gave a talk at RailsWorld (hopefully they’ll post the video soon), and part of my presentation was about eliminating allocations in tokenizers. I presented a simple function for measuring allocations:

def allocations
  x = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - x
end

Everything in Ruby is an object, but not all objects actually make allocations. We can use the above function to measure allocations made in a block. Here are some examples of code that never allocate:

p allocations { true }                  # => 0
p allocations { false }                 # => 0
p allocations { nil }                   # => 0
p allocations { :hello }                # => 0
p allocations { 1 }                     # => 0
p allocations { 2.3 }                   # => 0
p allocations { 0xFFFF_FFFF_FFFF_FFFF } # => 0

Literals like booleans, nil, symbols, integers, and floats are represented internally to CRuby as “tagged pointers” and they don’t allocate anything when executed.

Here is an example of code that sometimes allocates:

# Depends on the size of the number
p allocations { 1 + 2 }                     # => 0
p allocations { 0x3FFF_FFFF_FFFF_FFFF + 1 } # => 1

# Depends on `frozen_string_literal`
p allocations { "hello!" }                  # => 0 or 1

Math on integers generally doesn’t allocate anything, but it depends on the integer. When a number gets large enough, CRuby will allocate an object to represent that number. On 64 bit platforms, the largest whole number we can represent without allocating is 0x3FFF_FFFF_FFFF_FFFF.
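If you want to poke at that boundary yourself, something like the following should work. It assumes 64 bit CRuby, and I’ve left the allocation counts as expectations rather than promises since they depend on the platform.

```ruby
def allocations
  x = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - x
end

# Assumption: 64 bit CRuby, where immediate integers are 63 bit signed.
fixnum_max = (1 << 62) - 1
p fixnum_max == 0x3FFF_FFFF_FFFF_FFFF # => true

small = fixnum_max - 1
p allocations { small + 1 }      # result still fits, so no allocation expected
p allocations { fixnum_max + 1 } # result no longer fits, so expect one allocation
```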

String literals will sometimes allocate, but it depends on the frozen_string_literal setting in your program.

Here is an example of code that always allocates:

p allocations { [1, 2] }      # => 1
p allocations { { a: :b } }   # => 1
p allocations { Object.new }  # => 1
p allocations { "foo"[0, 1] } # => 1

Hopefully these examples are fairly straightforward. Arrays, hashes, objects, string slices, etc will allocate an object.

Eliminating Intermediate Array Allocations

At the Shopify after-party at RailsWorld, someone asked me a really great question. Their codebase has a RuboCop rule that says that when doing min or max calculations, you should always have code like this:

def foo(x, y)
  [x, y].max
end

They were concerned this is wasteful as it has an Array literal, so it will be allocating an array every time!

I think this is a really great question, and if you read my earlier allocation measurement examples, I think it’s a very reasonable conclusion. However, it’s actually not the case. This code in particular will not allocate an array, and I thought we’d look into how that works.

The compiler in Ruby is able to tell a few important things about this code. First, we’re calling a method on an array literal which means that we’re guaranteed that the max method will be sent to an array object. Second, we know statically that we’re calling the max method. Third, the max method that is implemented in core Ruby will not mutate its receiver, and it returns some value (an integer) that isn’t the array literal.

Since the compiler knows that the array literal is ephemeral, it allocates the array on the stack, does the max calculation, then throws away the array, never asking the GC for a new object.

To get a more concrete picture, let’s look at the instruction sequences for the above code:

def foo(x, y)
  [x, y].max
end

insn = RubyVM::InstructionSequence.of(method(:foo))
puts insn.disasm

== disasm: #<ISeq:foo@x.rb:1 (1,0)-(1,30)>
local table (size: 2, argc: 2 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
[ 2] x@0<Arg>   [ 1] y@1<Arg>
0000 getlocal_WC_0                          x@0                       (   1)[LiCa]
0002 getlocal_WC_0                          y@1
0004 opt_newarray_send                      2, 1
0007 leave                                  [Re]

The first two instructions fetch the locals x and y, and push them on the stack. Next we have a special instruction opt_newarray_send. This instruction takes two parameters, 2, 1. It’s a bit cryptic, but the 2 means that this instruction is going to operate on two stack elements. The 1 is an enum and means “we want to call the max method”.

The opt_newarray_send instruction will first check to see if Array#max has been monkey patched. If it has been monkey patched, then the instruction will allocate a regular array and call the monkey patched method. If it hasn’t been monkey patched, then it calls a “max” function which uses Ruby’s stack as an array buffer.

Here is what the stack looks like before executing opt_newarray_send:

+----+-------------+-------------+
|    | Stack Index | Stack Value |
+----+-------------+-------------+
|    | -2          | x           |
|    | -1          | y           |
| SP | 0           | Undef       |
+----+-------------+-------------+

The opt_newarray_send instruction was passed the value 2, so it knows to start the array at negative 2 relative to the stack pointer (SP). Since the stack is just an array, it calls the same function that the max function would normally call, popping 2 values from the stack, then pushing the return value of the max function.

In this way we can calculate the max value without allocating the intermediate array.

If we use our allocations function, we can confirm that the foo method indeed does not allocate anything:

def foo(x, y)
  [x, y].max
end

allocations { foo(3, 4) } # heat inline caches
p allocations { foo(3, 4) } # => 0
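
We can also watch the slow path kick in. The sketch below measures the same method before and after monkey patching Array#max. The alias name original_max is just something I made up for this example, and the exact counts may vary between Ruby versions, so it only prints the two numbers.

```ruby
def allocations
  x = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - x
end

def foo(x, y)
  [x, y].max
end

allocations { foo(3, 4) }          # heat inline caches
before = allocations { foo(3, 4) }

class Array
  alias_method :original_max, :max # hypothetical name for the saved method
  def max
    original_max
  end
end

allocations { foo(3, 4) }          # heat inline caches again
after = allocations { foo(3, 4) }

p before: before, after: after     # expect after > before on CRuby
```

Once the patch is in place, the instruction has to build a real array so that the patched method has a receiver.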

Aaron’s Opinion Corner

I don’t really know RuboCop very well, but I think that in cases like this it would be really helpful if the linter were to tell you why a particular rule is a rule. Personally, I dislike following rules unless I understand the reason behind them. Even if the reasoning is simply “this is just how our team styles our code”. If such a feature is already available in RuboCop, then please feel free to link to this blog post for this particular rule.

I can only assume the rule that enforced this style was “performance” related. I’m not a huge fan of linting, but I’m even less of a fan when it comes to rules around “performance”. If idiomatic Ruby is not performant, then I think there can be a strong case to be made that the CRuby team (which I am a part of) should make that code performant. If the CRuby team does make the code performant, then there is no need for the performance rule because most people write idiomatic Ruby code (by definition).

Of course there are cases where you may need to write non-idiomatic Ruby for performance reasons, but hopefully those cases are few and far between. Should the time arrive when you need to write odd code for performance reasons, it will require knowledge, experience, and nuance that neither a linter nor an AI can provide. Fortunately, this is a case where idiomatic Ruby is also “the fast way to do things”, so I definitely recommend people use the [x, y].max pattern.

More Stuff

Array#max isn’t the only method that uses this trick. It works with Array#min, Array#pack and Array#hash. If you need to implement a custom hash method on an object, then I highly recommend doing something like this:

def hash
  [@ivar1, @ivar2, ...].hash
end
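
Here is a slightly fuller sketch of that pattern. The Point class is a made-up example; the important part is that eql? and hash agree with each other so the object works as a Hash key.

```ruby
class Point
  attr_reader :x, :y

  def initialize(x, y)
    @x = x
    @y = y
  end

  # Hash lookups use `eql?`, so it needs to agree with `hash`.
  def eql?(other)
    other.is_a?(Point) && x == other.x && y == other.y
  end
  alias_method :==, :eql?

  def hash
    [@x, @y].hash
  end
end

table = { Point.new(1, 2) => :found }
p table[Point.new(1, 2)] # => :found
p table[Point.new(2, 1)] # => nil
```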

Finally, there are cases where CRuby won’t apply this trick. Let’s look at the instructions for the following method:

def foo
  [3, 4].max
end

insn = RubyVM::InstructionSequence.of(method(:foo))
puts insn.disasm
== disasm: #<ISeq:foo@x.rb:1 (1,0)-(3,3)>
0000 duparray                               [3, 4]                    (   2)[LiCa]
0002 opt_send_without_block                 <calldata!mid:max, argc:0, ARGS_SIMPLE>
0004 leave                                                            (   3)[Re]

If you read these instructions carefully, you’ll see it has a duparray instruction. This instruction allocates an array, and then we call the max method on the array.

When all of the elements of the array are static, CRuby applies an optimization to allocate the array once, embed it in the instructions, and then do a dup on the array. Copying an existing array is much faster than building a new one from its elements. Unfortunately, this optimization is applied before the “max” method optimization, so CRuby doesn’t apply both.

For those of you at home saying “the compiler could calculate the max of [3, 4] and eliminate the array altogether!” just remember that someone could monkey patch Array#max and we’d need to respect it. Argh!! Fixing this particular case is not worth the code complexity, in my opinion. We all know that 4 is greater than 3, so we could “manually inline” this case and just write 4.

Anyway, all this is to say that these optimizations are context dependent. Attempting to “prescribe” more optimal code seems like it could become a hairy situation, especially since the linter can’t know what the Ruby compiler will do.

I do like the idea of language servers possibly suggesting faster code, but only as a teaching opportunity for the developer. The real goal should be to help build understanding so that this type of linting becomes unnecessary.

Anyway, I had a really great time at RailsWorld. I am very happy I got this question, and I hope that this post helps someone!

Using Serial Ports with Ruby

Let’s mess around with serial ports today! I love doing hardware hacking, and dealing with serial ports is a common thing you have to do when working with embedded systems. Of course I want to do everything with Ruby, and I had found Ruby serial port libraries to be either lacking or too complex, so I decided to write my own. I feel like I’ve not done a good enough job promoting the library, so today we’re going to mess with serial ports using the UART gem. Don’t let the last commit date on the repo fool you: despite the last commit being over 6 years ago, this library is actively maintained (and I use it every day!).

I’ve got a GMC-320 Geiger counter. Not only is the price pretty reasonable, but it also has a serial port interface! You can log data, then download the logged data via serial port. Today we’re just going to write a very simple program that gets the firmware version via serial port, and then gets live updates from the device. This will allow us to start with UART basics in Ruby, and then work with streaming data and timeouts.

The company that makes the Geiger counter published a spec for the UART commands that the device supports, so all we need to do is send the commands and read the results.

According to the spec, the default UART config for my Geiger counter is 115200 BPS, Data bit: 8, no parity, Stop bit: 1, and no control. This is pretty easy to configure with the UART gem, all of these values are default except for the baud rate. The UART gem defaults to 9600 for the baud rate, so that’s the only thing we’ll have to configure.

Getting the hardware version

To get the hardware model and version, we just have to send <GETVER>> over the serial port and then read the response. Let’s write a small program that will fetch the hardware model and version and print them out.

require "uart"

UART.open ARGV[0], 115200 do |serial|
  # Send the "get version" command
  serial.write "<GETVER>>"

  # read and print the result
  puts serial.read
end

The first thing we do is require the uart library (make sure to gem install uart if you haven’t done so yet). Then we open the serial interface. We’ll pass the tty file in via the command line, so ARGV[0] will have the path to the tty. When I plug my Geiger counter in, it shows up as /dev/tty.usbserial-111240. We also configure the baud rate to 115200.

Once the serial port is open, we are free to read and write to it as if it were a Ruby file object. In fact, this is because it really is just a regular file object.

First we’ll send the command <GETVER>>, then we’ll read the response from the serial port.

Here’s what it looks like when I run it on my machine:

$ ruby rad.rb /dev/tty.usbserial-111240
GMC-320Re 4.09

Live updates

According to the documentation, we can get live updates from the hardware. To do that, we just need to send the <HEARTBEAT1>> command. Once we send that command, the hardware will write a value to the serial port every second, and it’s our job to read the data when it becomes available. We can use IO#wait_readable to wait until there is data to be read from the serial port.

According to the specification, there are two bytes (a 16 bit integer), and we need to ignore the top 2 bits. We’ll create a mask to ignore the top two bits, and combine that with the two bytes we read to get our value:

require "uart"

MASK = (~(3 << 14)) & 0xFFFF

UART.open ARGV[0], 115200 do |serial|
  # turn on heartbeat
  serial.write "<HEARTBEAT1>>"

  loop do
    if serial.wait_readable
      count = ((serial.readbyte << 8) | serial.readbyte) & MASK
      p count
    end
  end
ensure
  # make sure to turn off heartbeat
  serial.write "<HEARTBEAT0>>"
end

After we’ve sent the “start heartbeat” command, we enter a loop. Inside the loop, we block until there is data available to read by calling serial.wait_readable. Once there is data to read, we’ll read two bytes and combine them to a 16 bit integer. Then we mask off the integer using the MASK constant so that the two top bits are ignored. Finally we just print out the count.
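If you don’t have the hardware handy, the byte math can be tried on its own. This is just a sketch with made-up sample bytes, and decode is a helper name I’m introducing for the example.

```ruby
MASK = (~(3 << 14)) & 0xFFFF
p MASK == 0x3FFF # => true

# Combine a high byte and a low byte, then drop the top two bits.
def decode(high, low)
  ((high << 8) | low) & MASK
end

p decode(0x00, 0x05) # => 5
p decode(0xC0, 0x05) # => 5 (the top two bits are discarded)
p decode(0x3F, 0xFF) # => 16383
```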

The ensure section ensures that when the program exits, we’ll tell the hardware “hey, we don’t want to stream data anymore!”

When I run this on my machine, the output is like this (I hit Ctrl-C to stop the program):

$ ruby rad.rb /dev/tty.usbserial-111240
0
0
0
0
0
0
0
1
0
1
0
0
0
1
^Crad.rb:10:in 'IO#wait_readable': Interrupt
	from rad.rb:10:in 'block (2 levels) in <main>'
	from <internal:kernel>:191:in 'Kernel#loop'
	from rad.rb:9:in 'block in <main>'
	from /Users/aaron/.rubies/arm64/ruby-trunk/lib/ruby/gems/3.4.0+0/gems/uart-1.0.0/lib/uart.rb:57:in 'UART.open'
	from rad.rb:5:in '<main>'

Let’s make two improvements, and then call it a day. First, let’s specify a timeout, then let’s calculate the CPM.

Specifying a timeout

Currently, serial.wait_readable will block forever, but we expect an update from the hardware about every second. If it takes longer than say 2 seconds for data to be available, then something must be wrong and we should print a message or exit the program.

Specifying a timeout is quite easy, we just pass the timeout (in seconds) to the wait_readable method like below:

require "uart"

MASK = (~(3 << 14)) & 0xFFFF

UART.open ARGV[0], 115200 do |serial|
  # turn on heartbeat
  serial.write "<HEARTBEAT1>>"

  loop do
    if serial.wait_readable(2)
      count = ((serial.readbyte << 8) | serial.readbyte) & MASK
      p count
    else
      $stderr.puts "oh no, something went wrong!"
      exit(1)
    end
  end
ensure
  # make sure to turn off heartbeat
  serial.write "<HEARTBEAT0>>"
end

When data becomes available, wait_readable will return a truthy value, and if the timeout was reached, then it will return a falsy value. So, if it takes more than 2 seconds for data to become available wait_readable will return nil, and we print an error message and exit the program.
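The timeout behavior is easy to try without a serial device. The sketch below uses IO.pipe as a stand-in for the serial port, which is an assumption on my part: a pipe is just another IO object, so wait_readable should behave the same way.

```ruby
begin
  require "io/wait" # needed on older Rubies; harmless or absent on newer ones
rescue LoadError
end

reader, writer = IO.pipe

# Nothing has been written yet, so the 0.1 second timeout expires.
p reader.wait_readable(0.1) # => nil

writer.write "\x00\x01"

# Data is sitting in the pipe now, so we get a truthy value right away.
p !!reader.wait_readable(0.1) # => true
p reader.readbyte             # => 0
p reader.readbyte             # => 1
```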

Calculating CPM

CPM stands for “counts per minute”, meaning the number of ionization events the hardware has detected within one minute. However, the value that we’re reading from the serial port is actually the “counts per second” (or ionization events the hardware detected in the last second). Most of the time that value is 0 so it’s not super fun to read. Let’s calculate the CPM and print that instead.

We know the samples are arriving about every second, so I’m just going to modify this code to keep a list of the last 60 samples and just sum those:

require "uart"

MASK = (~(3 << 14)) & 0xFFFF

UART.open ARGV[0], 115200 do |serial|
  # turn on heartbeat
  serial.write "<HEARTBEAT1>>"

  samples = []

  loop do
    if serial.wait_readable(2)
      # Push the sample on the list
      samples.push(((serial.readbyte << 8) | serial.readbyte) & MASK)

      # Make sure we only have 60 samples in the list
      while samples.length > 60; samples.shift; end

      # Print a sum of the samples (if we have 60)
      p CPM: samples.sum if samples.length == 60
    else
      $stderr.puts "oh no, something went wrong!"
      exit(1)
    end
  end
ensure
  # make sure to turn off heartbeat
  serial.write "<HEARTBEAT0>>"
end

Here is the output on my machine:

$ ruby rad.rb /dev/tty.usbserial-111240
{:CPM=>9}
{:CPM=>8}
{:CPM=>8}
{:CPM=>8}
{:CPM=>8}
{:CPM=>9}
{:CPM=>9}
{:CPM=>9}

After about a minute or so, it starts printing the CPM.
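The “last 60 samples” bookkeeping can also be pulled out and tested without the Geiger counter. RollingWindow is a hypothetical helper I’m sketching here, not something from the UART gem.

```ruby
# Hypothetical helper that keeps the most recent `size` samples.
class RollingWindow
  def initialize(size)
    @size = size
    @samples = []
  end

  # Add a sample, dropping the oldest ones once the window is full.
  def push(sample)
    @samples.push(sample)
    @samples.shift while @samples.length > @size
    self
  end

  def full?
    @samples.length == @size
  end

  def sum
    @samples.sum
  end
end

window = RollingWindow.new(60)
59.times { window.push(0) }
p window.full? # => false
window.push(1)
p window.full? # => true
p window.sum   # => 1
60.times { window.push(0) } # the lone 1 scrolls out of the window
p window.sum   # => 0
```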

Conclusion

I love playing with embedded systems as well as dealing with UART. Next time you need to do any serial port communications in Ruby, UART to consider using my gem. Thanks for your time, and I hope you have a great weekend!

Fast Tokenizers with StringScanner

Lately I’ve been messing around with writing a GraphQL parser called TinyGQL. I wanted to see how fast I could make a GraphQL parser without writing any C extensions. I think I did pretty well, but I’ve learned some tricks for speeding up parsers and I want to share them.

Today we’re going to specifically look at the lexing part of parsing. Lexing is just breaking down an input string into a series of tokens. It’s the parser’s job to interpret those tokens. My favorite tool for tokenizing documents in Ruby is StringScanner. Today we’re going to look at a few tricks for speeding up StringScanner based lexers. We’ll start with a very simple GraphQL lexer and apply a few tricks to speed it up.

A very basic lexer

Here is the lexer we’re going to work with today:

require "strscan"

class Lexer
  IDENTIFIER =    /[_A-Za-z][_0-9A-Za-z]*\b/
  WHITESPACE =  %r{ [, \c\r\n\t]+ }x
  COMMENTS   = %r{ \#.*$ }x
  INT =           /[-]?(?:[0]|[1-9][0-9]*)/
  FLOAT_DECIMAL = /[.][0-9]+/
  FLOAT_EXP =     /[eE][+-]?[0-9]+/
  FLOAT =  /#{INT}#{FLOAT_DECIMAL}#{FLOAT_EXP}|#{FLOAT_DECIMAL}|#{FLOAT_EXP}/

  KEYWORDS = [ "on", "fragment", "true", "false", "null", "query", "mutation",
    "subscription", "schema", "scalar", "type", "extend", "implements",
    "interface", "union", "enum", "input", "directive", "repeatable"
  ].freeze

  KW_RE = /#{Regexp.union(KEYWORDS.sort)}\b/
  KW_TABLE = Hash[KEYWORDS.map { |kw| [kw, kw.upcase.to_sym] }]

  module Literals
    LCURLY =        '{'
    RCURLY =        '}'
    LPAREN =        '('
    RPAREN =        ')'
    LBRACKET =      '['
    RBRACKET =      ']'
    COLON =         ':'
    VAR_SIGN =      '$'
    DIR_SIGN =      '@'
    EQUALS =        '='
    BANG =          '!'
    PIPE =          '|'
    AMP =           '&'
  end

  ELLIPSIS =      '...'

  include Literals

  PUNCTUATION = Regexp.union(Literals.constants.map { |name|
    Literals.const_get(name)
  })

  PUNCTUATION_TABLE = Literals.constants.each_with_object({}) { |x,o|
    o[Literals.const_get(x)] = x
  }

  def initialize doc
    @doc = doc
    @scan = StringScanner.new doc
  end

  def next_token
    return if @scan.eos?

    case
    when s = @scan.scan(WHITESPACE)  then [:WHITESPACE, s]
    when s = @scan.scan(COMMENTS)    then [:COMMENT, s]
    when s = @scan.scan(ELLIPSIS)    then [:ELLIPSIS, s]
    when s = @scan.scan(PUNCTUATION) then [PUNCTUATION_TABLE[s], s]
    when s = @scan.scan(KW_RE)       then [KW_TABLE[s], s]
    when s = @scan.scan(IDENTIFIER)  then [:IDENTIFIER, s]
    when s = @scan.scan(FLOAT)       then [:FLOAT, s]
    when s = @scan.scan(INT)         then [:INT, s]
    else
      [:UNKNOWN_CHAR, @scan.getch]
    end
  end
end

It only tokenizes a subset of GraphQL. Namely, it omits string literals. Matching string literals is kind of gross, and I wanted to keep this example small, so I removed them. I have a large document that I’ll use to measure some performance aspects of this lexer, and if you want to try it out, you can find the document here.

To use the lexer, just pass the document you want to tokenize, then repeatedly call next_token on the lexer until it returns nil:

lexer = Lexer.new input
while tok = lexer.next_token
  # do something
end

GraphQL documents look something like this:

mutation {
  a: likeStory(storyID: 12345) {
    b: story {
      c: likeCount
    }
  }
}

And with this lexer implementation, the tokens come out as tuples and they look something like this:

[:IDENTIFIER, "likeStory"]
[:LPAREN, "("]
[:IDENTIFIER, "storyID"]

Our benchmarking code is going to be very simple, we’re just going to use the lexer to pull all of the tokens out of the test document:

require "benchmark/ips"

def allocations
  x = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - x
end

def go doc
  lexer = Lexer.new doc
  while tok = lexer.next_token
    # do nothing
  end
end

doc = ARGF.read

Benchmark.ips { |x| x.report { go doc } }
p ALLOCATIONS: allocations { go doc }

With this implementation of the lexer, here are the benchmark results on my machine:

$ ruby -I lib test.rb benchmark/fixtures/negotiate.gql
Warming up --------------------------------------
                        21.000  i/100ms
Calculating -------------------------------------
                        211.043  (± 0.9%) i/s -      1.071k in   5.075133s
{:ALLOCATIONS=>20745}

We can do a little over 200 iterations per second, and tokenizing the document allocates a bit over 20k objects.

StringScanner context

Before we get to optimizing this lexer, let’s get a little background on StringScanner. StringScanner is one of my favorite utilities that ships with Ruby. You can think of this object as basically a “cursor” that points inside a string. When you successfully scan a token from the beginning of the cursor, StringScanner will move the cursor forward. If scanning fails, the cursor doesn’t move.

The inspect method on the StringScanner object makes this behavior very clear, so let’s just look at some code in IRB:

>> scanner = StringScanner.new("the quick brown fox jumped over the lazy dog")
=> #<StringScanner 0/44 @ "the q...">
>> scanner.scan(/the /)
=> "the "
>> scanner
=> #<StringScanner 4/44 "the " @ "quick...">
>> scanner.scan(/qui/)
=> "qui"
>> scanner
=> #<StringScanner 7/44 "...e qui" @ "ck br...">
>> scanner.scan(/hello/)
=> nil
>> scanner
=> #<StringScanner 7/44 "...e qui" @ "ck br...">

The @ symbol in the inspect output shows where the cursor currently points, and the ratio at the beginning gives you kind of a “progress” counter. As I scanned through the string, the cursor moved forward. Near the end, you can see where I tried to scan “hello”, it returned nil, and the cursor stayed in place.

Pairing StringScanner with a linear case / when in Ruby makes it really easy to write tokenizers.

StringScanner also allows us to skip particular values, as well as ask for the current cursor position:

>> scanner
=> #<StringScanner 7/44 "...e qui" @ "ck br...">
>> scanner.skip(/ck /)
=> 3
>> scanner
=> #<StringScanner 10/44 "...uick " @ "brown...">
>> scanner.skip(/hello/)
=> nil
>> scanner
=> #<StringScanner 10/44 "...uick " @ "brown...">
>> scanner.pos
=> 10

Calling skip will try to skip a pattern. If skipping works, it returns the length of the string it matched, and if it fails, it returns nil. You can also get and set the position of the cursor using the pos and pos= methods.
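Getting and setting the position is handy for rewinding. Here is a small sketch (the input string is just an example) that saves the cursor, skips ahead, and then jumps back:

```ruby
require "strscan"

scanner = StringScanner.new("the quick brown fox")

p scanner.skip(/the /)   # => 4
p scanner.pos            # => 4

start = scanner.pos      # remember where we are
p scanner.skip(/quick /) # => 6
p scanner.pos            # => 10

scanner.pos = start      # rewind the cursor
p scanner.scan(/\w+/)    # => "quick"
```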

Now let’s try to speed up this lexer!

Speeding up this lexer

The name of the game for speeding up lexers (or really any code) is to reduce the number of method calls as well as the number of allocations. So we’re going to try applying some tricks to reduce both.

Whenever I’m trying to improve the performance of any code, I find it is important to think about the context of how that code is used. For example, our lexer currently yields tokens for comments and whitespace. However, the GraphQL grammar ignores comments and whitespace. Since the parser doesn’t actually need to know about whitespace or comments in order to understand the document, it is fine for the lexer to just skip them.

Our first optimization is to combine the whitespace and comment check, and then quit returning tokens:

diff --git a/test.rb b/test.rb
index 2c1e874..9130a54 100644
--- a/test.rb
+++ b/test.rb
@@ -2,8 +2,12 @@ require "strscan"
 
 class Lexer
   IDENTIFIER =    /[_A-Za-z][_0-9A-Za-z]*\b/
-  WHITESPACE =  %r{ [, \c\r\n\t]+ }x
-  COMMENTS   = %r{ \#.*$ }x
+  IGNORE   =       %r{
+    (?:
+      [, \c\r\n\t]+ |
+      \#.*$
+    )*
+  }x
   INT =           /[-]?(?:[0]|[1-9][0-9]*)/
   FLOAT_DECIMAL = /[.][0-9]+/
   FLOAT_EXP =     /[eE][+-]?[0-9]+/
@@ -51,11 +55,11 @@ class Lexer
   end
 
   def next_token
+    @scan.skip(IGNORE)
+
     return if @scan.eos?
 
     case
-    when s = @scan.scan(WHITESPACE)  then [:WHITESPACE, s]
-    when s = @scan.scan(COMMENTS)    then [:COMMENT, s]
     when s = @scan.scan(ELLIPSIS)    then [:ELLIPSIS, s]
     when s = @scan.scan(PUNCTUATION) then [PUNCTUATION_TABLE[s], s]
     when s = @scan.scan(KW_RE)       then [KW_TABLE[s], s]

By combining the whitespace and comment regex, we could eliminate one method call. We also changed the scan to a skip which eliminated string object allocations. Let’s check the benchmark after this change:

$ ruby -I lib test.rb benchmark/fixtures/negotiate.gql
Warming up --------------------------------------
                        32.000  i/100ms
Calculating -------------------------------------
                        322.100  (± 0.3%) i/s -      1.632k in   5.066846s
{:ALLOCATIONS=>10527}

This is great! Our iterations per second (IPS) went from 211 to 322, and our allocations went from about 20k down to around 10k. So we cut our allocations in half and increased speed by about 50%.
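To see what the combined pattern does on its own, here is a small sketch. The IGNORE regex below is a simplified stand-in for the one in the diff (I’ve left out the control character escape, which is my own assumption for brevity), and the input string is made up.

```ruby
require "strscan"

# Simplified stand-in for the IGNORE pattern: runs of ignorable characters
# or comments, repeated any number of times.
IGNORE = %r{
  (?:
    [, \r\n\t]+ |
    \#.*$
  )*
}x

scanner = StringScanner.new("  # a comment\n  query { }")

p scanner.skip(IGNORE) # => 16
p scanner.rest         # => "query { }"
p scanner.skip(IGNORE) # => 0
```

Because of the trailing `*`, the pattern also matches the empty string, so skip returns 0 rather than nil when there is nothing to ignore.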

Thinking Bigger

This lexer returns a tuple for each token. The tuple looks like this: [:LPAREN, "("]. But when the parser looks at the token, how often does it actually need the string value of the token?

When the parser looks at the first element, it is able to understand that the lexer found a left parenthesis just by looking at the symbol :LPAREN. The parser gets no benefit from the "(" string that is in the tuple.

Just by looking at the token name, the parser can tell what string the lexer found. This is true for all punctuation, as well as keywords.

Identifiers and numbers are a little bit different though. The parser doesn’t particularly care about the actual string value of any identifier or number. It only cares that an identifier or number was found. However, if we think one level up, it’s quite likely that consumers of the parser will care what field name or number was in the GraphQL document.

Since the parser doesn’t care about the actual token value, but the user does care about the token value, let’s split the next_token method in two:

  1. One method to get the token (:INT, :LCURLY, etc)
  2. One method to get the token value

When the parser encounters a token where the token value actually matters, the parser can ask the lexer for the token value. For example, something like this:

lexer = Lexer.new doc
while tok = lexer.next_token
  if tok == :IDENTIFIER
    p lexer.token_value
  end
end

__END__
mutation {
  a: likeStory(storyID: 12345) {
    b: story {
      c: likeCount
    }
  }
}

This split buys us two really big wins. The first is that next_token doesn’t need to return an array anymore. That’s already one object per token saved. The second win is that we only ever allocate a string when we really need it.

Here is the new next_token method, and the token_value helper method:

  def next_token
    @scan.skip(IGNORE)

    return if @scan.eos?

    case
    when @scan.skip(ELLIPSIS)        then :ELLIPSIS
    when s = @scan.scan(PUNCTUATION) then PUNCTUATION_TABLE[s]
    when s = @scan.scan(KW_RE)       then KW_TABLE[s]
    when @scan.skip(IDENTIFIER)      then :IDENTIFIER
    when @scan.skip(FLOAT)           then :FLOAT
    when @scan.skip(INT)             then :INT
    else
      @scan.getch
      :UNKNOWN_CHAR
    end
  end

  def token_value
    @doc.byteslice(@scan.pos - @scan.matched_size, @scan.matched_size)
  end

We’ve changed the method to only return a symbol that identifies the token. We also changed most scan calls to skip calls. scan will return the string it matched (an allocation), but skip simply returns the length of the string it matched (not an allocation).
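To see the difference between scan and skip outside of the lexer, here is a minimal sketch. The IDENT pattern and the sample document are assumptions for illustration only, not the lexer’s real IDENTIFIER regex:

```ruby
require "strscan"

# Hypothetical identifier pattern, just for this demo
IDENT = /[_A-Za-z][_0-9A-Za-z]*/

sample  = "likeStory(storyID: 12345)"
scanner = StringScanner.new(sample)

# scan returns the matched text as a new String (an allocation)
scanned = scanner.scan(IDENT)

# Rewind and match the same text with skip, which returns only the length
scanner.pos = 0
length = scanner.skip(IDENT)

# Recover the text lazily, the same way token_value does
value = sample.byteslice(scanner.pos - scanner.matched_size, scanner.matched_size)

p scanned # => "likeStory"
p length  # => 9
p value   # => "likeStory"
```

The string is only built in the last step, so a caller that never asks for the value never pays for it.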

As the parser requests tokens from the lexer, if it encounters a token where it actually cares about the string value, it just calls token_value. This makes our benchmark a little bit awkward now because we’ve shifted the blame of “identifier” allocations from the lexer to the parser. If the parser wants an allocation, it’ll have to ask the lexer for it. But let’s keep pushing forward with the same benchmark (just remembering that once we integrate the lexer with the parser, we’ll have allocations for identifiers).

With this change, our benchmark results look like this:

$ ruby -I lib test.rb benchmark/fixtures/negotiate.gql
Warming up --------------------------------------
                        35.000  i/100ms
Calculating -------------------------------------
                        360.209  (± 0.6%) i/s -      1.820k in   5.052764s
{:ALLOCATIONS=>1915}

We went from 322 IPS to 360 IPS, and from 10k allocations down to about 2k allocations.

Punctuation Lookup Table

Unfortunately we’ve still got two lines in the tokenizer that are doing allocations:

    when s = @scan.scan(PUNCTUATION) then PUNCTUATION_TABLE[s]
    when s = @scan.scan(KW_RE)       then KW_TABLE[s]

Let’s tackle the punctuation line first. We still extract a string from the scanner in order to do a hash lookup to find the symbol name for the token. We need the string ")" so that we can map it to the symbol :RPAREN. One interesting feature about these punctuation characters is that they are all only one byte and thus limited to values between 0 and 255. Instead of extracting a substring, we can get the byte at the current scanner position, then use the byte as an array index. If there is a value at that index in the array, then we know we’ve found a token.

First we’ll build the lookup table like this:

  PUNCTUATION_TABLE = Literals.constants.each_with_object([]) { |n, o|
    o[Literals.const_get(n).ord] = n
  }

This will create an array. The array will have a symbol at the index corresponding to the byte value of our punctuation. Any other index will return nil. And since we’re only dealing with one byte, we know the maximum value can only ever be 255. The code below gives us a sample of how this lookup table works:

  '()ab'.bytes.each do |byte|
    p PUNCTUATION_TABLE[byte]
  end

The output is like this:

$ ruby -I lib test.rb benchmark/fixtures/negotiate.gql
:LPAREN
:RPAREN
nil
nil

We can use the pos method on the StringScanner object to get our current cursor position (no allocation), then use that information to extract a byte from the string (also no allocation). If the byte has a value in the lookup table, we know we’ve found a token and we can push the StringScanner forward one byte.
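Here is that idea as a small standalone sketch. The Literals module below is a cut-down stand-in I made up for the demo (the real one has more entries), and punctuation_at is a hypothetical helper name:

```ruby
require "strscan"

# Hypothetical, cut-down stand-in for the lexer's Literals module
module Literals
  LPAREN = "("
  RPAREN = ")"
  LCURLY = "{"
  RCURLY = "}"
  COLON  = ":"
end

# Array indexed by byte value; any index without punctuation is nil
PUNCTUATION_TABLE = Literals.constants.each_with_object([]) { |n, o|
  o[Literals.const_get(n).ord] = n
}

# Look at the byte under the cursor without extracting a substring.
# Only advance the scanner when the byte is a known punctuation character.
def punctuation_at(doc, scan)
  byte = doc.getbyte(scan.pos)          # nil when the cursor is at the end
  tok  = byte && PUNCTUATION_TABLE[byte]
  scan.pos += 1 if tok
  tok
end

doc  = "(a)"
scan = StringScanner.new(doc)

first  = punctuation_at(doc, scan) # "(" is in the table, cursor moves to 1
second = punctuation_at(doc, scan) # "a" is not, cursor stays put
```

The nil guard is only needed here because the sketch has no eos? check in front of it the way next_token does.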

After incorporating the punctuation lookup table, our next_token method looks like this:

  def next_token
    @scan.skip(IGNORE)

    return if @scan.eos?

    case
    when @scan.skip(ELLIPSIS)        then :ELLIPSIS
    when tok = PUNCTUATION_TABLE[@doc.getbyte(@scan.pos)] then
      # If the byte at the current position is inside our lookup table, push
      # the scanner position forward 1 and return the token
      @scan.pos += 1
      tok
    when s = @scan.scan(KW_RE)       then KW_TABLE[s]
    when @scan.skip(IDENTIFIER)      then :IDENTIFIER
    when @scan.skip(FLOAT)           then :FLOAT
    when @scan.skip(INT)             then :INT
    else
      @scan.getch
      :UNKNOWN_CHAR
    end
  end

Rerunning our benchmarks gives us these results:

$ ruby -I lib test.rb benchmark/fixtures/negotiate.gql
Warming up --------------------------------------
                        46.000  i/100ms
Calculating -------------------------------------
                        459.031  (± 1.1%) i/s -      2.300k in   5.011232s
{:ALLOCATIONS=>346}

We’ve gone from 360 IPS up to 459 IPS, and from about 2k allocations down to only 350 allocations.

Perfect Hashes and GraphQL Keywords

We have one more line in our lexer that is allocating objects:

    when s = @scan.scan(KW_RE)       then KW_TABLE[s]

This line is allocating objects because it needs to map the keyword it found in the source to a symbol:

>> Lexer::KW_TABLE["query"]
=> :QUERY
>> Lexer::KW_TABLE["schema"]
=> :SCHEMA

It would be great if we had a hash table that didn’t require us to extract a string from the source document. And that’s exactly what we’re going to build.

When this particular regular expression matches, we know that the lexer has found 1 of the 19 keywords listed in the KW_TABLE; we just don’t know which one. What we’d like to do is figure out which keyword matched, and do it without allocating any objects.

Here is the list of 19 GraphQL keywords we could possibly match:

["on",
 "true",
 "null",
 "enum",
 "type",
 "input",
 "false",
 "query",
 "union",
 "extend",
 "scalar",
 "schema",
 "mutation",
 "fragment",
 "interface",
 "directive",
 "implements",
 "repeatable",
 "subscription"]

StringScanner#skip will return the length of the match, and we know that if the length is 2 we unambiguously matched on, and if the length is 12 we unambiguously matched subscription. So if the matched length is 2 or 12, we can just return :ON or :SUBSCRIPTION. That leaves 17 other keywords we need to disambiguate.

For each of the 17 remaining keywords, the 2nd and 3rd bytes uniquely identify that keyword:

>> (Lexer::KW_TABLE.keys - ["on", "subscription"]).length
=> 17
>> (Lexer::KW_TABLE.keys - ["on", "subscription"]).map { |w| w[1, 2] }
=> ["ra", "ru", "al", "ul", "ue", "ut", "ch", "ca", "yp", "xt", "mp", "nt", "ni", "nu", "np", "ir", "ep"]
>> (Lexer::KW_TABLE.keys - ["on", "subscription"]).map { |w| w[1, 2] }.uniq.length
=> 17

We can use these two bytes as a key to a hash table and design a “perfect hash” to look up the right token. A perfect hash is a hash table where the possible keys for the hash are known in advance, and the hashing function never produces a collision. In other words, no two hash keys will result in the same bucket index.

We know that the word we found is one of a limited set, so this seems like a good application for a perfect hash.

Building a Perfect Hash

A perfect hash function uses a pre-computed “convenient” constant that lets us uniquely identify each key, but also limit the hash table to a small size. Basically we have a function like this:

def _hash key
  (key * SOME_CONSTANT) >> 27 & 0x1f
end

But we must figure out the right constant to use such that each entry in our perfect hash gets a unique bucket index. We’re going to use the upper 5 bits of a “32 bit integer” (it’s not actually 32 bits, we’re just going to treat it that way) as our bucket index. The reason we’re going to use 5 bits is because we have 17 keys, and 4 bits only gives us 16 buckets. To find the value of SOME_CONSTANT, we’re just going to use a brute force method.

First let’s convert the two bytes from each GraphQL keyword to a 16 bit integer:

>> keys = (Lexer::KW_TABLE.keys - ["on", "subscription"]).map { |w| w[1, 2].unpack1("s") }
=> [24946, 30066, 27745, 27765, 25973, 29813, 26723, 24931, 28793, 29816, 28781, 29806, 26990, 30062, 28782, 29289, 28773]

Next we’re going to use a brute force method to find a constant value such that we can convert these 16 bit numbers in to unique 5 bit numbers:

>> c = 13
=> 13
?> loop do
?>   z = keys.map { |k| ((k * c) >> 27) & 0x1f }
?>   break if z.uniq.length == z.length
?>   c += 1
>> end
=> nil
>> c
=> 18592990

We start our search at 13. Our loop tries applying the hashing function to all keys. If the hashing function returns unique values for all keys, then we found the right value for c, otherwise we increment c by one and try the next number.

After this loop finishes (it takes a while), we check c and that’s the value for our perfect hash!

Now we can write our hashing function like this:

def _hash key
  (key * 18592990) >> 27 & 0x1f
end

This function will return a unique value based on the 2nd and 3rd bytes of each GraphQL keyword. Let’s prove that to ourselves in IRB:

?> def _hash key
?>   (key * 18592990) >> 27 & 0x1f
>> end
=> :_hash
>> keys = (Lexer::KW_TABLE.keys - ["on", "subscription"]).map { |w| w[1, 2].unpack1("s") }
=> [24946, 30066, 27745, 27765, 25973, 29813, 26723, 24931, 28793, 29816, 28781, 29806, 26990, 30062, 28782, 29289, 28773]
>> keys.map { |key| _hash(key) }
=> [31, 5, 3, 6, 14, 1, 21, 29, 20, 2, 18, 0, 26, 4, 19, 25, 17]

We’ll use these integers as an index in to an array that stores the symbol name associated with that particular keyword:

>> # Get a list of the array indices for each keyword
=> nil
>> array_indexes = keys.map { |key| _hash(key) }
=> [31, 5, 3, 6, 14, 1, 21, 29, 20, 2, 18, 0, 26, 4, 19, 25, 17]
>> # Insert a symbol in to an array at each index
=> nil
>> table = kws.zip(array_indexes).each_with_object([]) { |(kw, key),o| o[key] = kw.upcase.to_sym }
=> 
[:INTERFACE,
...

Now we have a table we can use to look up the symbol for a particular keyword given the keyword’s 2nd and 3rd bytes.
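For anyone who wants to replay this outside of IRB, here is a rough script version of the same steps. KEYWORDS stands in for the 17-entry keyword list used above, and perfect_hash is just my name for the _hash function. I’ve also used unpack1("v") instead of "s" as an assumption for portability: "v" is explicitly little-endian, which gives the same numbers shown above no matter what machine runs it.

```ruby
# The 17 keywords left after removing "on" and "subscription"
KEYWORDS = %w[
  fragment true false null query mutation schema scalar type
  extend implements interface union enum input directive repeatable
]

# Same function as _hash above, using the constant found by brute force
def perfect_hash(key)
  (key * 18592990) >> 27 & 0x1f
end

# 2nd and 3rd bytes of each keyword as a little-endian 16 bit integer
keys = KEYWORDS.map { |w| w[1, 2].unpack1("v") }

# One bucket index per keyword
indexes = keys.map { |k| perfect_hash(k) }

# Store the symbol for each keyword at its bucket index
table = KEYWORDS.zip(indexes).each_with_object([]) { |(kw, i), o|
  o[i] = kw.upcase.to_sym
}

p indexes.uniq.length == KEYWORDS.length # no two keywords share a bucket
p table[perfect_hash("interface"[1, 2].unpack1("v"))]
```

This only checks the constant; it doesn’t redo the brute force search, which is the slow part.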

Take a breather

I think this is getting a little complicated so I want to step back and take a breather. What we’ve done so far is write a function that, given the 2nd and 3rd bytes of a string returns an index to an array.

Let’s take the keyword interface as an example. The 2nd and 3rd bytes are nt:

>> "interface"[1, 2]
=> "nt"

We can use unpack1 to convert nt in to a 16 bit integer:

>> "interface"[1, 2].unpack1("s")
=> 29806

Now we pass that integer to our hashing function (I called it _hash in IRB):

>> _hash("interface"[1, 2].unpack1("s"))
=> 0

And now we have the array index where to find the :INTERFACE symbol:

>> table[_hash("interface"[1, 2].unpack1("s"))]
=> :INTERFACE

This will work for any of the strings we used to build the perfect hash function. Let’s try a few:

>> table[_hash("union"[1, 2].unpack1("s"))]
=> :UNION
>> table[_hash("scalar"[1, 2].unpack1("s"))]
=> :SCALAR
>> table[_hash("repeatable"[1, 2].unpack1("s"))]
=> :REPEATABLE

Integrating the Perfect Hash in to the Lexer

We’ve built our hash table and hash function, so the next step is to add them to the lexer:

  KW_TABLE = [:INTERFACE, :MUTATION, :EXTEND, :FALSE, :ENUM, :TRUE, :NULL,
              nil, nil, nil, nil, nil, nil, nil, :QUERY, nil, nil, :REPEATABLE,
              :IMPLEMENTS, :INPUT, :TYPE, :SCHEMA, nil, nil, nil, :DIRECTIVE,
              :UNION, nil, nil, :SCALAR, nil, :FRAGMENT]

  def _hash key
    (key * 18592990) >> 27 & 0x1f
  end

Remember we derived the magic constant 18592990 earlier via brute force.

In the next_token method, we need to extract the 2nd and 3rd bytes of the keyword, combine them to a 16 bit int, use the _hash method to convert the 16 bit int to a 5 bit array index, then look up the symbol (I’ve omitted the rest of the next_token method):

    when len = @scan.skip(KW_RE) then
      # Return early if uniquely identifiable via length
      return :ON if len == 2
      return :SUBSCRIPTION if len == 12

      # Get the position of the start of the keyword in the main document
      start = @scan.pos - len

      # Get the 2nd and 3rd byte of the keyword and combine to a 16 bit int
      key = (@doc.getbyte(start + 2) << 8) | @doc.getbyte(start + 1)

      # Get the array index
      index = _hash(key)

      # Return the symbol
      KW_TABLE[index]

We know the length of the token because it’s the return value of StringScanner#skip. If the token is uniquely identifiable based on its length, then we’ll return early. Otherwise, ask StringScanner for the cursor position and then use the length to calculate the index of the beginning of the token (remember StringScanner pushed the cursor forward when skip matched).

Once we have the beginning of the token, we’ll use getbyte (which doesn’t allocate) to get the 2nd and 3rd bytes of the keyword. Then we’ll combine the two bytes to a 16 bit int. Finally we pass the int to the hash function and use the return value of the hash function to look up the token symbol in the array.
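To try the lookup on its own, here is a sketch that wraps the same steps in a method. keyword_symbol is a hypothetical name, and it takes the document, start offset, and length as arguments instead of reading them from the scanner:

```ruby
KW_TABLE = [:INTERFACE, :MUTATION, :EXTEND, :FALSE, :ENUM, :TRUE, :NULL,
            nil, nil, nil, nil, nil, nil, nil, :QUERY, nil, nil, :REPEATABLE,
            :IMPLEMENTS, :INPUT, :TYPE, :SCHEMA, nil, nil, nil, :DIRECTIVE,
            :UNION, nil, nil, :SCALAR, nil, :FRAGMENT]

def _hash key
  (key * 18592990) >> 27 & 0x1f
end

# Hypothetical wrapper: doc is the source string, start is the byte offset of
# the keyword, len is what StringScanner#skip returned for it
def keyword_symbol(doc, start, len)
  return :ON if len == 2
  return :SUBSCRIPTION if len == 12

  # 3rd byte goes in the high 8 bits, 2nd byte in the low 8 bits
  key = (doc.getbyte(start + 2) << 8) | doc.getbyte(start + 1)
  KW_TABLE[_hash(key)]
end

doc = "query { a }"
p keyword_symbol(doc, 0, 5) # => :QUERY

# The manual byte combination is a little-endian read of the two bytes
p doc.byteslice(1, 2).unpack1("v") == ((doc.getbyte(2) << 8) | doc.getbyte(1))
```

One thing worth noting: the "s" directive used in IRB earlier reads the bytes in the machine’s native byte order, so it only agrees with this shift-and-or on little-endian machines. The getbyte version doesn’t depend on the host at all.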

Let’s check our benchmarks now!

$ ruby -I lib test.rb benchmark/fixtures/negotiate.gql
Warming up --------------------------------------
                        46.000  i/100ms
Calculating -------------------------------------
                        468.978  (± 0.4%) i/s -      2.346k in   5.002449s
{:ALLOCATIONS=>3}

We went from 459 IPS up to 468 IPS, and from 346 allocations down to 3 allocations. 1 allocation for the Lexer object, 1 allocation for the StringScanner object, and 1 allocation for ????

Actually, if we run the allocation benchmark twice we’ll get different results:

require "benchmark/ips"

def allocations
  x = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - x
end

def go doc
  lexer = Lexer.new doc
  while tok = lexer.next_token
  end
end

doc = ARGF.read

Benchmark.ips { |x| x.report { go doc } }
p ALLOCATIONS: allocations { go doc }
p ALLOCATIONS: allocations { go doc }

Output is this:

$ ruby -I lib test.rb benchmark/fixtures/negotiate.gql
Warming up --------------------------------------
                        46.000  i/100ms
Calculating -------------------------------------
                        465.071  (± 0.6%) i/s -      2.346k in   5.044626s
{:ALLOCATIONS=>3}
{:ALLOCATIONS=>2}

Ruby uses GC allocated objects to store some inline caches. Since it was the first time we called the allocations method, a new inline cache was allocated, and that dinged us. We’re actually able to tokenize this entire document with only 2 allocations: the lexer and the string scanner.
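One way to keep that one-time cost out of the number is to run the measured block once as a warm-up and only report the second run. This is just a sketch: the work lambda is a stand-in for go doc, and exact counts can vary between Ruby versions, so I’m not showing any output here.

```ruby
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

# Stand-in workload that allocates 100 objects
work = -> { 100.times { Object.new } }

# First call: may include one-time allocations such as inline caches
warmup = allocations(&work)

# Second call: closer to the steady-state cost of the block itself
steady = allocations(&work)

p WARMUP: warmup, STEADY: steady
```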

One more hack

Let’s do one more trick. We want to reduce the number of method calls the scanner makes as much as we can. The case / when statement in next_token checks each when clause one at a time. One trick I like to do is rearrange the clauses so that the most popular tokens come first.

If we tokenize our benchmark program and tally up all of the tokens that come out, it looks like this:

>> lexer = Lexer.new File.read "benchmark/fixtures/negotiate.gql"
=> 
#<Lexer:0x0000000105c33c90
...
>> list = []
=> []
?> while tok = lexer.next_token
?>   list << tok
>> end
=> nil
>> list.tally
=> {:QUERY=>1, :IDENTIFIER=>2976, :LPAREN=>15, :VAR_SIGN=>6, :COLON=>56, :BANG=>1,
    :RPAREN=>15, :LCURLY=>738, :RCURLY=>738, :ELLIPSIS=>350, :ON=>319, :INT=>24,
    :TYPE=>4, :INPUT=>1, :FRAGMENT=>18}

From this data, it looks like ELLIPSIS tokens aren’t as popular as punctuation or IDENTIFIER tokens. Yet we’re always checking for ELLIPSIS tokens first. Let’s move the ELLIPSIS check below the identifier check. This makes looking for ELLIPSIS more expensive, but it makes finding punctuation and identifiers cheaper. Since punctuation and identifiers occur more frequently in our document, we should get a speedup.

I applied this patch:

diff --git a/test.rb b/test.rb
index ac147c2..275b8ba 100644
--- a/test.rb
+++ b/test.rb
@@ -59,7 +59,6 @@ class Lexer
     return if @scan.eos?
 
     case
-    when @scan.skip(ELLIPSIS)        then :ELLIPSIS
     when tok = PUNCTUATION_TABLE[@doc.getbyte(@scan.pos)] then
       # If the byte at the current position is inside our lookup table, push
       # the scanner position forward 1 and return the token
@@ -78,6 +77,7 @@ class Lexer
 
       KW_TABLE[_hash(key)]
     when @scan.skip(IDENTIFIER)      then :IDENTIFIER
+    when @scan.skip(ELLIPSIS)        then :ELLIPSIS
     when @scan.skip(FLOAT)           then :FLOAT
     when @scan.skip(INT)             then :INT
     else

Now when we rerun the benchmark, we get this:

$ ruby -I lib test.rb benchmark/fixtures/negotiate.gql
Warming up --------------------------------------
                        48.000  i/100ms
Calculating -------------------------------------
                        486.798  (± 0.4%) i/s -      2.448k in   5.028884s
{:ALLOCATIONS=>3}
{:ALLOCATIONS=>2}

Great, we went from 465 IPS to 486 IPS!

Conclusion

The lexer we started with tokenized the 80kb GraphQL document at 211 IPS, and where we left off it was running at 486 IPS. More than a 2x speed improvement!

Our starting lexer allocated over 20k objects, and when we finished we got it down to just 2 objects. Of course the parser may ask the lexer to allocate something, but we know that we’re only allocating the bare minimum. In fact, if the parser only records positional offsets, it could very well never ask the lexer to allocate anything!

When I’m doing this stuff I try to use as many tricks as possible for increasing speed. But I think the biggest and best payoffs come from trying to think about the problem from a higher level and adjust the problem space itself. Converting next_token to return only a symbol rather than a tuple cut our object allocations by more than half. Questioning the code’s design itself is much harder, but I think it reaps a greater benefit.

Anyway, these are hacks I like to use! If you want to play around with the lexer we’ve been building in this post, I’ve put the source code here.

I hope you enjoyed this, and have a good day!

Bitmap Matrix and Undirected Graphs in Ruby

I’ve been working my way through Engineering a Compiler. I really enjoy the book, but one part has you build an interference graph for doing register allocation via graph coloring. An interference graph is an undirected graph, and one way you can represent an undirected graph is with a bitmap matrix.

A bitmap matrix is just a matrix but the values in the matrix can only be 1 or 0. If every node in your graph maps to an index, you can use the bitmap matrix to represent edges in the graph.

I made a bitmap matrix implementation that I like, but I think the code is too trivial to put in a Gem. Here is the code I used:

class BitMatrix
  def initialize size
    @size = size
    size = (size + 7) & -8 # round up to the nearest multiple of 8
    @row_bytes = size / 8
    @buffer = "\0".b * (@row_bytes * size)
  end

  def initialize_copy other
    @buffer = @buffer.dup
  end

  def set x, y
    raise IndexError if y >= @size || x >= @size

    x, y = [y, x].sort

    row = x * @row_bytes
    column_byte = y / 8
    column_bit = 1 << (y % 8)

    @buffer.setbyte(row + column_byte, @buffer.getbyte(row + column_byte) | column_bit)
  end

  def set? x, y
    raise IndexError if y >= @size || x >= @size

    x, y = [y, x].sort

    row = x * @row_bytes
    column_byte = y / 8
    column_bit = 1 << (y % 8)

    (@buffer.getbyte(row + column_byte) & column_bit) != 0
  end

  def each_pair
    return enum_for(:each_pair) unless block_given?

    @buffer.bytes.each_with_index do |byte, i|
      row = i / @row_bytes
      column = i % @row_bytes
      8.times do |j|
        if (1 << j) & byte != 0
          yield [row, (column * 8) + j]
        end
      end
    end
  end

  def to_dot
    "graph g {\n" + each_pair.map { |x, y| "#{x} -- #{y};" }.join("\n") + "\n}"
  end
end

I like this implementation because all bits are packed in to a binary string. Copying the matrix is trivial because we just have to dup the string. Memory usage is much smaller than if every node in the graph were to store an actual reference to other nodes.
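If the index math in set and set? looks opaque, this small sketch pulls it out on its own. bit_location is a hypothetical helper that isn’t part of the class; it only mirrors the arithmetic, using a 10 node graph as an example:

```ruby
# Hypothetical helper mirroring the arithmetic in BitMatrix#set and #set?
def bit_location(x, y, row_bytes)
  x, y = [y, x].sort                 # undirected edge, so normalize the pair
  byte_index = x * row_bytes + y / 8 # which byte in the packed string
  bit_mask   = 1 << (y % 8)          # which bit inside that byte
  [byte_index, bit_mask]
end

size      = 10
padded    = (size + 7) & -8          # round up to a multiple of 8
row_bytes = padded / 8
buffer    = "\0".b * (row_bytes * padded)

# Record an edge between nodes 9 and 3
byte_index, bit_mask = bit_location(9, 3, row_bytes)
buffer.setbyte(byte_index, buffer.getbyte(byte_index) | bit_mask)

p [byte_index, bit_mask] # => [7, 2]
p bit_location(3, 9, row_bytes) == bit_location(9, 3, row_bytes) # => true
```

Because the pair is sorted first, (9, 3) and (3, 9) land on the same bit, which is what makes the matrix undirected.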

Anyway, this was fun to write and I hope someone finds it useful!

Vim, tmux, and Fish

I do most of my text editing with MacVim, but when I pair with people I like to use tmate. tmate is just an easy way to connect tmux sessions with a remote person. But this means that I go from coding in a GUI to coding in a terminal. Normally this wouldn’t be a problem, but I had made a Fish alias that would open the MacVim GUI every time I typed vim in the terminal. Of course when I’m pairing via tmate, the other people cannot see the GUI, so I would have to remember a different command to open Vim.

Today I did about 10min of research to fix this problem and came up with the following Fish command:

$ cat .config/fish/functions/vim.fish 
function vim --wraps='vim' --description 'open Vim'
  if set -q TMUX # if we're in a TMUX session, open in terminal
    command vim $argv
  else
    # Otherwise open macvim
    open -a MacVim $argv; 
  end
end

All it does is open terminal Vim if I’m in a TMUX session, otherwise it opens the MacVim GUI.

Instead of putting up with this frustration for such a long time, I should have taken the 10 min required to fix the situation. This was a good reminder for me, and hopefully I’ll be better about it in the future!

In Memory of a Giant

The Ruby community has lost a giant. As a programmer, I always feel as if I’m standing on the shoulders of giants. Chris Seaton was one of those giants.

I’ve been working at the same company as Chris for the past 2 years. However, I first met him through the open source world many years ago. He was working on a Ruby implementation called TruffleRuby, and got his PhD in Ruby. Can you believe that? A PhD in Ruby? I’d never heard of such a thing. My impression was that nobody in academia cared about Ruby, but here was Chris, the Ruby Doctor. I was impressed.

Patience

As a college dropout, I’ve always felt underqualified. Embarrassment about my lack of knowledge and credentials has driven me to study hard on my own time. But Chris never once made me feel out of place. Any time I had questions, without judgement, he would take the time to explain things to me.

I’ve always looked up to Chris. I was at a bar in London with a few coworkers. We started talking about age, and I found out that Chris was much younger than me. I said “You’re so smart and accomplished! How can I possibly catch up to you?” Chris said “Don’t worry, I’ll just tell you everything I know!”

Meeting Chris in London

Puns

My team is fully remote, so every Friday we have a team meeting over video to just hang out and talk about stuff. Eventually I’ll make a really great pun, most people will sigh, Kevin Menard will get angry, and Chris would just be straight faced. No reaction from Chris. Every. Single. Time.

One time, someone asked Chris “do you know that he’s making a joke? Or do you just not think it’s funny?” Chris responded “I know he’s making a pun, I just don’t react because I don’t want to encourage him.” I said “This just encourages me more because now I feel challenged!”

I wish I had tried harder because now I’ll never get that reaction.

Kindness

My last conversation with Chris was Thursday December 1st at RubyConf in Houston. We all went to dinner at a Ramen shop. I find British English to be extremely adorable, so any time I hear fun British phrases in the news I always ask my British coworkers about it. The latest one was “Wonky Veg” so I asked Chris if he’d been buying any at the store. He said no, but that one of his favorite things to do was find weird things at the local supermarket, take photos of them, then share them with his coworkers. He flipped through photos on his phone, showing me pics of him shopping with his daughter. Some of the products he showed me were quite funny and we both had a good laugh.

Dinner with Chris

Memory

I feel honored to have had the opportunity to work with Chris.

I feel grateful for the time that we had together.

I feel angry that I can’t learn more from him.

I feel sad that he is gone from my life.

Chris was an important part of the community, his family, and his country. I will never forget the time I spent with Chris, a Giant.

Cross Platform Machine Code

I hate writing if statements.

I’ve been working on a couple different assemblers for Ruby. Fisk is a pure Ruby x86 assembler. You can use it to generate bytes that can be executed on x86 machines. AArch64 is a pure Ruby ARM64 assembler. You can use it to generate bytes that can be executed on ARM64 machines.

Both of these libraries just generate bytes that can be interpreted by their respective processors. Unfortunately you can’t just generate bytes and expect the CPU to execute them. You first need to put the bytes in executable memory before you can hand them off to the CPU for execution. Executable memory is basically the same thing regardless of CPU architecture, so I decided to make a library called JITBuffer that encapsulates executable memory manipulation.

To use the JITBuffer, you write platform specific bytes to the buffer, then give the buffer to the CPU for execution. Here is an example on the ARM64 platform:

require "aarch64"
require "jit_buffer"
require "fiddle"

asm = AArch64::Assembler.new

# Make some instructions.  These instructions simply
# return the value 0xF00DCAFE
asm.pretty do
  asm.movz x0, 0xCAFE
  asm.movk x0, 0xF00D, lsl(16)
  asm.ret
end

# Write the bytes to executable memory:
buf = JITBuffer.new(4096)
buf.writeable!
asm.write_to buf
buf.executable!

# Point the CPU at the executable memory
func = Fiddle::Function.new(buf.to_i, [], -Fiddle::TYPE_INT)
p func.call.to_s(16) # => "f00dcafe"

The example uses the AArch64 gem to assemble ARM64 specific bytes, the JITBuffer gem to allocate executable memory, and the Fiddle gem to point the CPU at the executable memory and run it.

Tests are important I guess, so I thought it would be a good idea to write tests for the JITBuffer gem. My goal for the test is to ensure that it’s actually possible to execute the bytes in the buffer itself. I’m not a huge fan of stubs or mocks and I try to avoid them if possible, so I wanted to write a test that would actually execute the bytes in the buffer. I also want the test to be “cross platform” (where “cross platform” means “works on x86_64 and ARM64”).

Writing a test like this would mean writing something like the following:

def test_can_execute
  buf = JITBuffer.new(4096)

  platform = figure_out_what_platform_we_are_on()
  if platform == "arm64"
    # write arm64 specific bytes
    buf.write(...)
  else
    # write x86_64 specific bytes
    buf.write(...)
  end

  # Use fiddle to execute
end

As I said at the start though, I hate writing if statements, and I’d rather avoid it if possible. In addition, how do you reliably figure out what platform you’re executing on? I really don’t want to figure that out. Not to mention, I just don’t think this code is cool.

My test requirements:

  • No if statements
  • Self contained (I don’t want to shell out or use other libraries)
  • Must have pizzazz

Since machine code is just bytes that the CPU interprets, it made me wonder “is there a set of bytes that execute both on an x86_64 CPU and an ARM64 CPU?” It turns out there are, and I want to walk through them here.

x86_64 Instructions

First let’s look at the x86_64 instructions we’ll execute. Below is the assembly code (in Intel syntax):

start:
  mov rax, 0x2b ; put 0x2b in the rax register
  ret           ; return from the function
  jmp start     ; jump to start

This assembly code puts the value 0x2b in the rax register and returns from the current “C” function. I put “C” in quotes because we’re writing assembly code, but the assembly code is conforming to the C calling convention and we’ll treat it as if it’s a C function when we call it. The x86 C calling convention states that the value in the rax register is the “return value” of the C function. So we’ve created a function that returns 0x2b. At the end of the code there is a jmp instruction that jumps to the start of this sequence. However, since we return from the function before getting to the jump, the jump is never used (or is it?????)

Machine code is just bytes, and here are the bytes for the above x86 machine code:

0x48 0xC7 0xC0 0x2b 0x00 0x00 0x00  ; mov rax, 0x2b
0xC3                                ; ret
0xEB 0xF6                           ; jmp start

x86 uses a “variable width” encoding, meaning that the number of bytes each instruction uses can vary. In this example, the mov instruction used 7 bytes, and the ret instruction used 1 byte. This means that the jmp instruction starts at the 9th byte, or offset 8.
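The 0xF6 after the 0xEB opcode is a signed 8 bit displacement, measured from the end of the jmp instruction. A quick sketch of the arithmetic (the variable names are mine) shows why it lands back on the mov:

```ruby
jmp_offset = 8 # the jmp starts at offset 8
jmp_length = 2 # 0xEB plus a one byte displacement

# Reinterpret 0xF6 as a signed 8 bit integer
displacement = [0xF6].pack("C").unpack1("c")

# Short jumps are relative to the address of the next instruction
target = jmp_offset + jmp_length + displacement

p displacement # => -10
p target       # => 0
```

Offset 0 is the first byte of the mov, so the jump goes right back to the top of the buffer.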

ARM64 Instructions

Below are some ARM64 instructions we can execute:

movz X11, 0x7b7 ; put 0x7b7 in the X11 register
movz X0, 0x2b   ; put 0x2b in the X0 register
ret             ; return from the function

This machine code puts the value 0x7b7 in to the register X11. Then it puts the value 0x2b in the X0 register. The third instruction returns from the function. Again we are abiding by the C calling convention, but this time on the ARM64 platform. On the ARM64 platform, the value stored in X0 is the return value. So the above machine code will return the value 0x2b to the caller just like the x86_64 machine code did.

Here are the bytes that represent the above ARM64 machine code:

0xEB 0xF6 0x80 0xD2  ; movz X11, 0x7b7
0x60 0x05 0x80 0xD2  ; movz X0, 0x2b
0xC0 0x03 0x5F 0xD6  ; ret

ARM64 uses fixed width instructions. All instructions on ARM64 are 32 bits wide.

Cross Platform Machine Code

Let’s look at the byte blocks next to each other:

; x86_64 bytes
0x48 0xC7 0xC0 0x2b 0x00 0x00 0x00  ; mov rax, 0x2b
0xC3                                ; ret
0xEB 0xF6                           ; jmp start
; ARM64 bytes
0xEB 0xF6 0x80 0xD2  ; movz X11, 0x7b7
0x60 0x05 0x80 0xD2  ; movz X0, 0x2b
0xC0 0x03 0x5F 0xD6  ; ret

Looking at the bytes, you’ll notice that the first two bytes of the ARM64 code (0xEB 0xF6) are exactly the same as the last two bytes of the x86_64 code. The first movz instruction in the ARM64 code was specially crafted to have the same leading bytes as the final jmp instruction in the x86 code.
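To see where the odd looking 0x7b7 and X11 come from, we can decode the movz bytes by hand. This is a sketch that assumes the standard A64 MOVZ field layout (destination register in bits 0 to 4, 16 bit immediate in bits 5 to 20, shift selector in bits 21 to 22); decode_movz is a hypothetical helper:

```ruby
# Hypothetical decoder for the fields of a 64 bit MOVZ instruction
def decode_movz(bytes)
  # ARM64 instructions are stored as little-endian 32 bit words
  word = bytes.pack("C*").unpack1("V")

  {
    rd:    word & 0x1f,           # destination register number
    imm16: (word >> 5) & 0xffff,  # immediate value
    hw:    (word >> 21) & 0x3     # shift amount selector (0 means no shift)
  }
end

first  = decode_movz([0xEB, 0xF6, 0x80, 0xD2])
second = decode_movz([0x60, 0x05, 0x80, 0xD2])

p first[:rd]               # => 11
p first[:imm16].to_s(16)   # => "7b7"
p second[:rd]              # => 0
p second[:imm16].to_s(16)  # => "2b"
```

The low 16 bits of the first instruction hold the register number and part of the immediate, so picking X11 and 0x7b7 is what makes those two bytes come out as 0xEB 0xF6.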

If we combine these bytes, then tell the CPU to execute starting at a particular offset, then the interpretation of the bytes will change depending on the CPU, but the result is the same.

Here are the bytes combined:

          0x48 0xC7 0xC0 0x2b 0x00 0x00 0x00  ; mov rax, 0x2b
          0xC3                                ; ret
start ->  0xEB 0xF6 0x80 0xD2                 ; (jmp start, or movz X11, 0x7b7)
          0x60 0x05 0x80 0xD2                 ; movz X0, 0x2b
          0xC0 0x03 0x5F 0xD6                 ; ret

Regardless of platform, we’ll tell the CPU to start executing from offset 8 in the byte buffer. If it’s an x86 CPU, it will interpret the bytes as a jump, execute the top bytes, return at the ret, and ignore the rest of the bytes in the buffer (as they are never reached). If it’s an ARM64 machine, then it will interpret the bytes as “put 0x7b7 in the X11 register” and continue, never seeing the x86 specific bytes at the start of the buffer.

Both x86_64 and ARM64 platforms will return the same value 0x2b.

Now we can write a test without if statements like this:

def test_execute
  # Cross platform bytes
  bytes = [0x48, 0xc7, 0xc0, 0x2b, 0x00, 0x00, 0x00, # x86_64 mov rax, 0x2b
           0xc3,                                     # x86_64 ret
           0xeb, 0xf6,                               # x86 jmp
           0x80, 0xd2,                               # ARM movz X11, 0x7b7
           0x60, 0x05, 0x80, 0xd2,                   # ARM movz X0, #0x2b
           0xc0, 0x03, 0x5f, 0xd6]                   # ARM ret

  # Write them to the buffer
  jit = JITBuffer.new(4096)
  jit.writeable!
  jit.write bytes.pack("C*")
  jit.executable!

  # start at offset 8
  offset = 8
  func = Fiddle::Function.new(jit.to_i + offset, [], Fiddle::TYPE_INT)

  # Check the return value
  assert_equal 0x2b, func.call
end

So simple!

So cool!

Tons of pizzazz!

This test will execute machine code on both x86_64 as well as ARM64 and the machine code will return the same value. Not to mention, there’s no way RuboCop or Flay could possibly complain about this code. 🤣

I hope this inspires you to try writing cross platform machine code. This code only supports 2 platforms, but it does make me wonder how far we could stretch this and how many platforms we could support.

Anyway, hope you have a good day!

Homebrew, Rosetta, and Ruby

Hi everyone! I finally upgraded to an M1. It’s really really great, but the main problem is that some projects I work on like TenderJIT and YJIT only really work on x86_64 and these new M1 machines use ARM chips. Fortunately we can run x86_64 software via Rosetta, so we can still do development work on x86 specific software.

I’ve seen some solutions for setting up a dev environment that uses Rosetta, but I’d like to share what I did.

Installing Homebrew

I think most people recommend that you install two different versions of Homebrew, one that targets ARM, and the other that targets x86.

So far, I’ve found this to be the best solution, so I went with it. Just do the normal Homebrew installation for ARM like this:

$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Then run the installer again under Rosetta like this:

$ arch -x86_64 /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

After doing this, I ended up with a Homebrew installation in /opt/homebrew (the ARM version), and another installation in /usr/local (the x86 version).

Configuring your Terminal

I’ve read in many places on the web that you should duplicate Terminal.app, rename the copy, and configure the renamed copy to run under Rosetta.

I really didn’t like this solution. The problem for me is that I’d end up with two different terminal icons when doing cmd-tab, and I really can’t be bothered to read whether the terminal is the Rosetta one or not. It makes switching to the right terminal take way too long.

Instead I decided to make my shell figure out what architecture I’m using, then update $PATH depending on whether I’m using x86 or ARM. To accomplish this, I installed Fish (I use Fish shell) in both the x86 and ARM installations of Homebrew:

$ /opt/homebrew/bin/brew install fish
$ arch -x86_64 /usr/local/bin/brew install fish

If you’re not using Fish you don’t need to do this step. 😆

Next is the “interesting” part. In my config.fish, I added this:

if test (arch) = "i386"
  set HOMEBREW_PREFIX /usr/local
else
  set HOMEBREW_PREFIX /opt/homebrew
end

# Add the Homebrew prefix to $PATH. -m flag ensures it's at the beginning
# of the path since the path might already be in $PATH (just not at the start)
fish_add_path -m --path $HOMEBREW_PREFIX/bin

alias intel 'arch -x86_64 /usr/local/bin/fish'

The arch command will tell you which architecture you’re on. If I’m on i386, set the Homebrew prefix to /usr/local, otherwise set it to /opt/homebrew. Then use fish_add_path to prepend the Homebrew prefix to my $PATH environment variable. The -m switch moves the path to the front if $PATH already contained the path I’m trying to add.

Finally I added an alias intel that just starts a new shell but under Rosetta. So my default workflow is to open a terminal under ARM, and if I need to work on an intel project, just run intel.

How do I know my current architecture?

The arch command will tell you the current architecture, but I don’t want to run that every time I want to verify my current architecture. My solution was to add an emoji to my prompt. I don’t like adding more text to my prompt, but this seems important enough to warrant the addition.

My fish_prompt function looks like this:

function fish_prompt --description 'Write out the prompt'
    if not set -q __fish_prompt_normal
        set -g __fish_prompt_normal (set_color normal)
    end

    if not set -q __fish_prompt_cwd
        set -g __fish_prompt_cwd (set_color $fish_color_cwd)
    end

    if test (arch) = "i386"
      set emote 🧐
    else
      set emote 💪
    end

    echo -n -s "[$USER" @ (prompt_hostname) $emote ' ' "$__fish_prompt_cwd" (prompt_pwd) (__fish_vcs_prompt) "$__fish_prompt_normal" ']$ '
end

If I’m on ARM, the prompt will have a 💪 emoji, and if I’m on x86, the prompt will have a 🧐 emoji.

Just to give an example, here is a sample session in my terminal:

Last login: Fri Jan  7 12:37:59 on ttys001
Welcome to fish, the friendly interactive shell
[aaron@tc-lan-adapter💪 ~]$ which brew
/opt/homebrew/bin/brew
[aaron@tc-lan-adapter💪 ~]$ arch
arm64
[aaron@tc-lan-adapter💪 ~]$ intel
Welcome to fish, the friendly interactive shell
[aaron@tc-lan-adapter🧐 ~]$ which brew
/usr/local/bin/brew
[aaron@tc-lan-adapter🧐 ~]$ arch
i386
[aaron@tc-lan-adapter🧐 ~]$ exit
[aaron@tc-lan-adapter💪 ~]$ arch
arm64
[aaron@tc-lan-adapter💪 ~]$ which brew
/opt/homebrew/bin/brew
[aaron@tc-lan-adapter💪 ~]$

Now I can easily switch back and forth between x86 and ARM and my prompt tells me which I’m using.

Ruby with chruby

My Ruby dev environment is still a work in progress. I use chruby for changing Ruby versions. The problem is that all Ruby versions live in the same directory. chruby doesn’t know the difference between ARM versions and x86 versions. So for now I’m adding the architecture name to the directory:

[aaron@tc-lan-adapter💪 ~]$ chruby
   ruby-3.0.2
   ruby-arm64
   ruby-i386
   ruby-trunk
[aaron@tc-lan-adapter💪 ~]$

So I have to be careful which Ruby I switch to. I’ve filed a ticket on ruby-install, and I think we can make this nicer.

Specifically I’d like to add a subfolder in ~/.rubies for each architecture, then point chruby at the right subfolder depending on my current architecture. Essentially the same trick I used for $PATH and Homebrew, but for pointing chruby at the right place given my current architecture.

For now I just have to be careful though!
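One way to be careful is to ask Ruby itself which architecture it was built for. Below is a minimal sketch that reads the CPU part of RUBY_PLATFORM and maps it to a per-architecture subfolder of ~/.rubies. The helper names (ruby_arch, rubies_dir) and the folder layout are assumptions for illustration only; chruby and ruby-install don't work this way today.

```ruby
# Hypothetical helper: map a RUBY_PLATFORM-style string such as
# "arm64-darwin21" to a short architecture label.
def ruby_arch(platform = RUBY_PLATFORM)
  cpu = platform.split("-").first
  case cpu
  when "arm64", "aarch64" then "arm64"
  when "x86_64", "amd64"  then "x86_64"
  else cpu
  end
end

# Hypothetical layout: one subfolder of ~/.rubies per architecture.
def rubies_dir(arch = ruby_arch, home: Dir.home)
  File.join(home, ".rubies", arch)
end

puts "This Ruby was built for #{ruby_arch}"
puts "Its siblings would live in #{rubies_dir}"
```

Note that the label comes from how the Ruby binary was built, not from the shell that launched it, so it is a decent sanity check after switching with chruby.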

One huge caveat for Fish users is that the current version of chruby-fish is broken such that changes to $PATH end up getting lost (see this issue).

To work around that issue, I’m using @ioquatix’s fork of chruby-fish which can be found here. I just checked out that version of chruby-fish in my git project folder and added this to my config.fish:

# Point Fish at our local checkout of chruby-fish
set fish_function_path $fish_function_path $HOME/git/chruby-fish/share/fish/vendor_functions.d

Conclusion

Getting a dev environment up and running with Rosetta wasn’t too bad, and I think having the shell fix up $PATH is a better solution than having two copies of Terminal.app.

The scripts I presented here were all Fish specific, but I don’t think it should be too hard to translate them to whatever shell you use.

Anyway, I hope you have a good weekend!

Publishing Gems With Your YubiKey

The recent compromise of ua-parser-js has put the security and trust of published packages at the top of my mind lately. To reduce the risk of any Ruby Gems I manage being hijacked, I enabled 2FA on my RubyGems.org account. This means that whenever I publish a Ruby Gem, I have to enter a one time passcode.

I have to admit, I find this to be a pain. Whenever I do a release of Rails, I have to enter a passcode over and over again because you can only push one Gem at a time.

Finally I’ve found a way to deal with this. I can maintain account security and also not be hassled with OTP codes again, thanks to my YubiKey.

This is just a short post about how to set up your YubiKey as an authenticator for RubyGems.org, and how to publish Gems without getting an OTP prompt.

Install ykman

ykman is a command line utility for interacting with your YubiKey. I installed it on my Mac with Homebrew:

$ brew install ykman

Set up 2FA as usual

If you already have 2FA enabled, you’ll have to temporarily disable it.

Just go through the normal 2FA setup process and when you’re presented with a QR code, you’ll use the text key to configure your YubiKey.

Just do:

$ ykman oath accounts add -t -o TOTP rubygems.org:youremail@example.org 123456

But use your own email address, and replace 123456 with the text key you got from RubyGems.org. The -t flag will require you to touch the YubiKey when you want to generate an OTP.

Generate an OTP

You can now generate an OTP like this:

$ ykman oath accounts code -s rubygems.org

Publishing a Gem without OTP Prompts

You can supply an OTP code to the gem interface via an environment variable or a command line argument.

The environment variable version is like this:

$ GEM_HOST_OTP_CODE=$(ykman oath accounts code -s rubygems.org) gem push cool-gem-0.0.0.gem

The command line argument is like this:

$ gem push cool-gem-0.0.0.gem --otp $(ykman oath accounts code -s rubygems.org)

I’ve used the environment variable version, but I haven’t tried the command line argument.
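Since a release like Rails means pushing many Gems one at a time, the environment variable version is easy to wrap in a loop. Here is a rough sketch, not something I've battle tested: the push_gems method, its keyword arguments, and the push_all.rb file name are all made up for this example. The only real commands involved are the ykman and gem push invocations shown above.

```ruby
# Ask the YubiKey for a fresh code. With the -t flag set on the account,
# each call will wait for a touch.
YKMAN_OTP = -> { `ykman oath accounts code -s rubygems.org`.strip }

# Hypothetical helper: push each gem file with its own OTP supplied via
# GEM_HOST_OTP_CODE. The otp and run arguments exist so the behavior can
# be swapped out (for example, in a test).
def push_gems(gem_files, otp: YKMAN_OTP, run: ->(env, *cmd) { system(env, *cmd) })
  gem_files.each do |gem_file|
    env = { "GEM_HOST_OTP_CODE" => otp.call }
    run.call(env, "gem", "push", gem_file) or raise "failed to push #{gem_file}"
  end
end

# Usage: ruby push_all.rb pkg/*.gem
push_gems(ARGV) if $PROGRAM_NAME == __FILE__
```

Generating a new code per Gem keeps each push independent, at the cost of one touch per Gem.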

Final Thoughts

I also did this for NPM, but I haven’t tried pushing a package yet so I’ll see how that goes. I don’t really have any other thoughts except that everyone should enable 2FA so that we can prevent situations like ua-parser-js. I’m not particularly interested in installing someone’s Bitcoin miner on my machine, and I’m also not interested in being hassled because my package was hijacked.

Everyone, please stay safe and enable 2FA!

–Aaron

<3 <3