Fast Tokenizers with StringScanner

Lately I’ve been messing around with writing a GraphQL parser called TinyGQL. I wanted to see how fast I could make a GraphQL parser without writing any C extensions. I think I did pretty well, but I’ve learned some tricks for speeding up parsers and I want to share them.

Today we’re going to look specifically at the lexing part of parsing. Lexing is just breaking an input string down into a series of tokens; it’s the parser’s job to interpret those tokens. My favorite tool for tokenizing documents in Ruby is StringScanner, so today we’re going to look at a few tricks for speeding up StringScanner-based lexers. We’ll start with a very simple GraphQL lexer and apply those tricks to speed it up.

A very basic lexer

Here is the lexer we’re going to work with today:

require "strscan"

class Lexer
  IDENTIFIER =    /[_A-Za-z][_0-9A-Za-z]*\b/
  WHITESPACE =  %r{ [, \c\r\n\t]+ }x
  COMMENTS   = %r{ \#.*$ }x
  INT =           /[-]?(?:[0]|[1-9][0-9]*)/
  FLOAT_DECIMAL = /[.][0-9]+/
  FLOAT_EXP =     /[eE][+-]?[0-9]+/
  FLOAT =  /#{INT}#{FLOAT_DECIMAL}#{FLOAT_EXP}|#{FLOAT_DECIMAL}|#{FLOAT_EXP}/

  KEYWORDS = [ "on", "fragment", "true", "false", "null", "query", "mutation",
    "subscription", "schema", "scalar", "type", "extend", "implements",
    "interface", "union", "enum", "input", "directive", "repeatable"
  ].freeze

  KW_RE = /#{Regexp.union(KEYWORDS.sort)}\b/
  KW_TABLE = Hash[KEYWORDS.map { |kw| [kw, kw.upcase.to_sym] }]

  module Literals
    LCURLY =        '{'
    RCURLY =        '}'
    LPAREN =        '('
    RPAREN =        ')'
    LBRACKET =      '['
    RBRACKET =      ']'
    COLON =         ':'
    VAR_SIGN =      '$'
    DIR_SIGN =      '@'
    EQUALS =        '='
    BANG =          '!'
    PIPE =          '|'
    AMP =           '&'
  end

  ELLIPSIS =      '...'

  include Literals

  PUNCTUATION = Regexp.union(Literals.constants.map { |name|
    Literals.const_get(name)
  })

  PUNCTUATION_TABLE = Literals.constants.each_with_object({}) { |x,o|
    o[Literals.const_get(x)] = x
  }

  def initialize doc
    @doc = doc
    @scan = StringScanner.new doc
  end

  def next_token
    return if @scan.eos?

    case
    when s = @scan.scan(WHITESPACE)  then [:WHITESPACE, s]
    when s = @scan.scan(COMMENTS)    then [:COMMENT, s]
    when s = @scan.scan(ELLIPSIS)    then [:ELLIPSIS, s]
    when s = @scan.scan(PUNCTUATION) then [PUNCTUATION_TABLE[s], s]
    when s = @scan.scan(KW_RE)       then [KW_TABLE[s], s]
    when s = @scan.scan(IDENTIFIER)  then [:IDENTIFIER, s]
    when s = @scan.scan(FLOAT)       then [:FLOAT, s]
    when s = @scan.scan(INT)         then [:INT, s]
    else
      [:UNKNOWN_CHAR, @scan.getch]
    end
  end
end

It only tokenizes a subset of GraphQL. Namely, it omits string literals. Matching string literals is kind of gross, and I wanted to keep this example small, so I removed them. I have a large document that I’ll use to measure some performance aspects of this lexer, and if you want to try it out, you can find the document here.

To use the lexer, just pass the document you want to tokenize to the constructor, then repeatedly call next_token on the lexer until it returns nil:

lexer = Lexer.new input
while tok = lexer.next_token
  # do something
end

GraphQL documents look something like this:

mutation {
  a: likeStory(storyID: 12345) {
    b: story {
      c: likeCount
    }
  }
}

And with this lexer implementation, the tokens come out as tuples and they look something like this:

[:IDENTIFIER, "likeStory"]
[:LPAREN, "("]
[:IDENTIFIER, "storyID"]

Our benchmarking code is going to be very simple. We’re just going to use the lexer to pull all of the tokens out of the test document:

require "benchmark/ips"

def allocations
  x = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - x
end

def go doc
  lexer = Lexer.new doc
  while tok = lexer.next_token
    # do nothing
  end
end

doc = ARGF.read

Benchmark.ips { |x| x.report { go doc } }
p ALLOCATIONS: allocations { go doc }

With this implementation of the lexer, here are the benchmark results on my machine:

$ ruby -I lib test.rb benchmark/fixtures/negotiate.gql
Warming up --------------------------------------
                        21.000  i/100ms
Calculating -------------------------------------
                        211.043  (± 0.9%) i/s -      1.071k in   5.075133s
{:ALLOCATIONS=>20745}

We can do a little over 200 iterations per second, and tokenizing the document allocates a bit over 20k objects.

StringScanner context

Before we get to optimizing this lexer, let’s get a little background on StringScanner. StringScanner is one of my favorite utilities that ships with Ruby. You can think of this object as basically a “cursor” that points inside a string. When you successfully scan a token starting at the cursor, StringScanner will move the cursor forward. If scanning fails, the cursor doesn’t move.

The inspect method on the StringScanner object makes this behavior very clear, so let’s just look at some code in IRB:

>> scanner = StringScanner.new("the quick brown fox jumped over the lazy dog")
=> #<StringScanner 0/44 @ "the q...">
>> scanner.scan(/the /)
=> "the "
>> scanner
=> #<StringScanner 4/44 "the " @ "quick...">
>> scanner.scan(/qui/)
=> "qui"
>> scanner
=> #<StringScanner 7/44 "...e qui" @ "ck br...">
>> scanner.scan(/hello/)
=> nil
>> scanner
=> #<StringScanner 7/44 "...e qui" @ "ck br...">

The @ symbol in the inspect output shows where the cursor currently points, and the ratio at the beginning gives you kind of a “progress” counter. As I scanned through the string, the cursor moved forward. Near the end, you can see where I tried to scan “hello”, it returned nil, and the cursor stayed in place.

Combining StringScanner with a linear case / when statement in Ruby makes it really easy to write tokenizers.

StringScanner also allows us to skip particular values, as well as ask for the current cursor position:

>> scanner
=> #<StringScanner 7/44 "...e qui" @ "ck br...">
>> scanner.skip(/ck /)
=> 3
>> scanner
=> #<StringScanner 10/44 "...uick " @ "brown...">
>> scanner.skip(/hello/)
=> nil
>> scanner
=> #<StringScanner 10/44 "...uick " @ "brown...">
>> scanner.pos
=> 10

Calling skip will try to skip a pattern. If skipping works, it returns the length of the string it matched, and if it fails, it returns nil. You can also get and set the position of the cursor using the pos and pos= methods.
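
The pos= method is handy when you need to rewind the scanner, for example to re-scan something you already matched. Here’s a tiny illustration (the string and patterns are just made up for the example):

scanner = StringScanner.new("the quick brown fox")
scanner.skip(/the /)   # => 4
saved = scanner.pos    # => 4, remember where we are
scanner.scan(/quick/)  # => "quick"
scanner.pos = saved    # rewind the cursor
scanner.scan(/quick/)  # => "quick" again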

Now let’s try to speed up this lexer!

Speeding up this lexer

The name of the game for speeding up lexers (or really any code) is to reduce the number of method calls as well as the number of allocations. So we’re going to try applying some tricks to reduce both.

Whenever I’m trying to improve the performance of any code, I find it is important to think about the context of how that code is used. For example, our lexer currently yields tokens for comments and whitespace. However, the GraphQL grammar ignores comments and whitespace. Since the parser doesn’t actually need to know about whitespace or comments in order to understand the document, it is fine for the lexer to just skip them.

Our first optimization is to combine the whitespace and comment checks, and stop returning tokens for them:

diff --git a/test.rb b/test.rb
index 2c1e874..9130a54 100644
--- a/test.rb
+++ b/test.rb
@@ -2,8 +2,12 @@ require "strscan"
 
 class Lexer
   IDENTIFIER =    /[_A-Za-z][_0-9A-Za-z]*\b/
-  WHITESPACE =  %r{ [, \c\r\n\t]+ }x
-  COMMENTS   = %r{ \#.*$ }x
+  IGNORE   =       %r{
+    (?:
+      [, \c\r\n\t]+ |
+      \#.*$
+    )*
+  }x
   INT =           /[-]?(?:[0]|[1-9][0-9]*)/
   FLOAT_DECIMAL = /[.][0-9]+/
   FLOAT_EXP =     /[eE][+-]?[0-9]+/
@@ -51,11 +55,11 @@ class Lexer
   end
 
   def next_token
+    @scan.skip(IGNORE)
+
     return if @scan.eos?
 
     case
-    when s = @scan.scan(WHITESPACE)  then [:WHITESPACE, s]
-    when s = @scan.scan(COMMENTS)    then [:COMMENT, s]
     when s = @scan.scan(ELLIPSIS)    then [:ELLIPSIS, s]
     when s = @scan.scan(PUNCTUATION) then [PUNCTUATION_TABLE[s], s]
     when s = @scan.scan(KW_RE)       then [KW_TABLE[s], s]

By combining the whitespace and comment regexes, we were able to eliminate one method call. We also changed the scan to a skip, which eliminated string object allocations. Let’s check the benchmark after this change:

$ ruby -I lib test.rb benchmark/fixtures/negotiate.gql
Warming up --------------------------------------
                        32.000  i/100ms
Calculating -------------------------------------
                        322.100  (± 0.3%) i/s -      1.632k in   5.066846s
{:ALLOCATIONS=>10527}

This is great! Our iterations per second (IPS) went from 211 to 322, and our allocations went from about 20k down to around 10k. So we cut our allocations in half and increased speed by about 50%.

Thinking Bigger

This lexer returns a tuple for each token. The tuple looks like this: [:LPAREN, "("]. But when the parser looks at the token, how often does it actually need the string value of the token?

When the parser looks at the first element, it is able to understand that the lexer found a left parenthesis just by looking at the symbol :LPAREN. The parser gets no benefit from the "(" string that is in the tuple.

Just by looking at the token name, the parser can tell what string the lexer found. This is true for all punctuation, as well as keywords.

Identifiers and numbers are a little bit different though. The parser doesn’t particularly care about the actual string value of any identifier or number. It only cares that an identifier or number was found. However, if we think one level up, it’s quite likely that consumers of the parser will care what field name or number was in the GraphQL document.

Since the parser doesn’t care about the actual token value, but the user does, let’s split the next_token method in two:

  1. One method to get the token (:INT, :LCURLY, etc)
  2. One method to get the token value

When the parser encounters a token where the token value actually matters, the parser can ask the lexer for the token value. For example, something like this:

lexer = Lexer.new doc
while tok = lexer.next_token
  if tok == :IDENTIFIER
    p lexer.token_value
  end
end

__END__
mutation {
  a: likeStory(storyID: 12345) {
    b: story {
      c: likeCount
    }
  }
}

This split buys us two really big wins. The first is that next_token doesn’t need to return an array anymore. That’s already one object per token saved. The second win is that we only ever allocate a string when we really need it.

Here is the new next_token method, and the token_value helper method:

  def next_token
    @scan.skip(IGNORE)

    return if @scan.eos?

    case
    when @scan.skip(ELLIPSIS)        then :ELLIPSIS
    when s = @scan.scan(PUNCTUATION) then PUNCTUATION_TABLE[s]
    when s = @scan.scan(KW_RE)       then KW_TABLE[s]
    when @scan.skip(IDENTIFIER)      then :IDENTIFIER
    when @scan.skip(FLOAT)           then :FLOAT
    when @scan.skip(INT)             then :INT
    else
      @scan.getch
      :UNKNOWN_CHAR
    end
  end

  def token_value
    @doc.byteslice(@scan.pos - @scan.matched_size, @scan.matched_size)
  end

We’ve changed the method to only return a symbol that identifies the token. We also changed most scan calls to skip calls. scan will return the string it matched (an allocation), but skip simply returns the length of the string it matched (not an allocation).
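
If you want to see the difference yourself, here’s a quick (slightly contrived) check using the same GC.stat trick from our benchmark. The exact numbers aren’t important; what matters is that scan allocates the matched string and skip doesn’t:

require "strscan"

def allocations
  x = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - x
end

str = "query { a }"
scan_scanner = StringScanner.new(str)
skip_scanner = StringScanner.new(str)

# scan returns the matched string (an allocation), skip returns its length
p SCAN: allocations { scan_scanner.scan(/query/) }
p SKIP: allocations { skip_scanner.skip(/query/) }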

As the parser requests tokens from the lexer, if it encounters a token where it actually cares about the string value, it just calls token_value. This makes our benchmark a little bit awkward now because we’ve shifted the blame for “identifier” allocations from the lexer to the parser. If the parser wants an allocation, it’ll have to ask the lexer for it. But let’s keep pushing forward with the same benchmark (just remembering that once we integrate the lexer with the parser, we’ll have allocations for identifiers).

With this change, our benchmark results look like this:

$ ruby -I lib test.rb benchmark/fixtures/negotiate.gql
Warming up --------------------------------------
                        35.000  i/100ms
Calculating -------------------------------------
                        360.209  (± 0.6%) i/s -      1.820k in   5.052764s
{:ALLOCATIONS=>1915}

We went from 322 IPS to 360 IPS, and from 10k allocations down to about 2k allocations.

Punctuation Lookup Table

Unfortunately we’ve still got two lines in the tokenizer that are doing allocations:

    when s = @scan.scan(PUNCTUATION) then PUNCTUATION_TABLE[s]
    when s = @scan.scan(KW_RE)       then KW_TABLE[s]

Let’s tackle the punctuation line first. We still extract a string from the scanner in order to do a hash lookup to find the symbol name for the token. We need the string ")" so that we can map it to the symbol :RPAREN. One interesting feature of these punctuation characters is that they are all only one byte, and thus limited to values between 0 and 255. Instead of extracting a substring, we can get the byte at the current scanner position, then use the byte as an array index. If there is a value at that index in the array, then we know we’ve found a token.

First we’ll build the lookup table like this:

  PUNCTUATION_TABLE = Literals.constants.each_with_object([]) { |n, o|
    o[Literals.const_get(n).ord] = n
  }

This will create an array. The array will have a symbol at the index corresponding to the byte value of our punctuation. Any other index will return nil. And since we’re only dealing with one byte, we know the maximum value can only ever be 255. The code below gives us a sample of how this lookup table works:

  '()ab'.bytes.each do |byte|
    p PUNCTUATION_TABLE[byte]
  end

The output is like this:

$ ruby -I lib test.rb benchmark/fixtures/negotiate.gql
:LPAREN
:RPAREN
nil
nil

We can use the pos method on the StringScanner object to get our current cursor position (no allocation), then use that information to extract a byte from the string (also no allocation). If the byte has a value in the lookup table, we know we’ve found a token and we can push the StringScanner forward one byte.

After incorporating the punctuation lookup table, our next_token method looks like this:

  def next_token
    @scan.skip(IGNORE)

    return if @scan.eos?

    case
    when @scan.skip(ELLIPSIS)        then :ELLIPSIS
    when tok = PUNCTUATION_TABLE[@doc.getbyte(@scan.pos)] then
      # If the byte at the current position is inside our lookup table, push
      # the scanner position forward 1 and return the token
      @scan.pos += 1
      tok
    when s = @scan.scan(KW_RE)       then KW_TABLE[s]
    when @scan.skip(IDENTIFIER)      then :IDENTIFIER
    when @scan.skip(FLOAT)           then :FLOAT
    when @scan.skip(INT)             then :INT
    else
      @scan.getch
      :UNKNOWN_CHAR
    end
  end

Rerunning our benchmarks gives us these results:

$ ruby -I lib test.rb benchmark/fixtures/negotiate.gql
Warming up --------------------------------------
                        46.000  i/100ms
Calculating -------------------------------------
                        459.031  (± 1.1%) i/s -      2.300k in   5.011232s
{:ALLOCATIONS=>346}

We’ve gone from 360 IPS up to 459 IPS, and from about 2k allocations down to only 350 allocations.

Perfect Hashes and GraphQL Keywords

We have one more line in our lexer that is allocating objects:

    when s = @scan.scan(KW_RE)       then KW_TABLE[s]

This line is allocating objects because it needs to map the keyword it found in the source to a symbol:

>> Lexer::KW_TABLE["query"]
=> :QUERY
>> Lexer::KW_TABLE["schema"]
=> :SCHEMA

It would be great if we had a hash table that didn’t require us to extract a string from the source document. And that’s exactly what we’re going to build.

When this particular regular expression matches, we know that the lexer has found 1 of the 19 keywords listed in the KW_TABLE; we just don’t know which one. What we’d like to do is figure out which keyword matched, and do it without allocating any objects.

Here is the list of 19 GraphQL keywords we could possibly match:

["on",
 "true",
 "null",
 "enum",
 "type",
 "input",
 "false",
 "query",
 "union",
 "extend",
 "scalar",
 "schema",
 "mutation",
 "fragment",
 "interface",
 "directive",
 "implements",
 "repeatable",
 "subscription"]

StringScanner#skip will return the length of the match, and we know that if the length is 2 we unambiguously matched on, and if the length is 12 we unambiguously matched subscription. So if the matched length is 2 or 12, we can just return :ON or :SUBSCRIPTION. That leaves 17 other keywords we need to disambiguate.
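
We can quickly double check that 2 and 12 are the only lengths that identify a keyword on their own:

>> Lexer::KEYWORDS.group_by(&:length).select { |_, kws| kws.length == 1 }
=> {2=>["on"], 12=>["subscription"]}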

For each of the 17 remaining keywords, the 2nd and 3rd bytes uniquely identify it:

>> (Lexer::KW_TABLE.keys - ["on", "subscription"]).length
=> 17
>> (Lexer::KW_TABLE.keys - ["on", "subscription"]).map { |w| w[1, 2] }
=> ["ra", "ru", "al", "ul", "ue", "ut", "ch", "ca", "yp", "xt", "mp", "nt", "ni", "nu", "np", "ir", "ep"]
>> (Lexer::KW_TABLE.keys - ["on", "subscription"]).map { |w| w[1, 2] }.uniq.length
=> 17

We can use these two bytes as a key to a hash table and design a “perfect hash” to look up the right token. A perfect hash is a hash table where the possible keys for the hash are known in advance, and the hashing function will never make a collision. In other words, no two hash keys will result in the same bucket index.

We know that the word we found is one of a limited set, so this seems like a good application for a perfect hash.

Building a Perfect Hash

A perfect hash function uses a pre-computed “convenient” constant that lets us uniquely identify each key while also limiting the hash table to a small size. Basically we have a function like this:

def _hash key
  (key * SOME_CONSTANT) >> 27 & 0x1f
end

But we must figure out the right constant to use such that each entry in our perfect hash gets a unique bucket index. We’re going to use the upper 5 bits of a “32 bit integer” (it’s not actually 32 bits, we’re just going to treat it that way) to find our hash key. The reason we’re using 5 bits is that we have 17 keys, and 17 won’t fit in 4 bits. To find the value of SOME_CONSTANT, we’re just going to use a brute force method.

First let’s convert the 2nd and 3rd bytes of each GraphQL keyword to a 16 bit integer:

>> keys = (Lexer::KW_TABLE.keys - ["on", "subscription"]).map { |w| w[1, 2].unpack1("s") }
=> [24946, 30066, 27745, 27765, 25973, 29813, 26723, 24931, 28793, 29816, 28781, 29806, 26990, 30062, 28782, 29289, 28773]

Next we’re going to use a brute force method to find a constant value such that we can convert these 16 bit numbers into unique 5 bit numbers:

>> c = 13
=> 13
?> loop do
?>   z = keys.map { |k| ((k * c) >> 27) & 0x1f }
?>   break if z.uniq.length == z.length
?>   c += 1
>> end
=> nil
>> c
=> 18592990

We start our search at 13. Our loop tries applying the hashing function to all keys. If the hashing function returns unique values for all keys, then we’ve found the right value for c; otherwise we increment c by one and try the next number.

After this loop finishes (it takes a while), we check c and that’s the value for our perfect hash!
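
If you ever need to regenerate the constant (say, after adding a new keyword), the same brute force search is easy to wrap up in a helper method. This is just a sketch of the loop above:

# Brute force search for a multiplier that maps every key to a unique
# 5 bit bucket. Same loop as above, wrapped in a method.
def find_hash_constant keys, start = 13
  c = start
  c += 1 until keys.map { |k| ((k * c) >> 27) & 0x1f }.uniq.length == keys.length
  c
end

# find_hash_constant(keys) # => 18592990 (takes a while)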

Now we can write our hashing function like this:

def _hash key
  (key * 18592990) >> 27 & 0x1f
end

This function will return a unique value based on the 2nd and 3rd bytes of each GraphQL keyword. Let’s prove that to ourselves in IRB:

?> def _hash key
?>   (key * 18592990) >> 27 & 0x1f
>> end
=> :_hash
>> keys = (Lexer::KW_TABLE.keys - ["on", "subscription"]).map { |w| w[1, 2].unpack1("s") }
=> [24946, 30066, 27745, 27765, 25973, 29813, 26723, 24931, 28793, 29816, 28781, 29806, 26990, 30062, 28782, 29289, 28773]
>> keys.map { |key| _hash(key) }
=> [31, 5, 3, 6, 14, 1, 21, 29, 20, 2, 18, 0, 26, 4, 19, 25, 17]

We’ll use these integers as indexes into an array that stores the symbol name associated with each particular keyword:

>> # Get a list of the array indices for each keyword
=> nil
>> array_indexes = keys.map { |key| _hash(key) }
=> [31, 5, 3, 6, 14, 1, 21, 29, 20, 2, 18, 0, 26, 4, 19, 25, 17]
>> # The keywords themselves, in the same order we used to build `keys`
=> nil
>> kws = Lexer::KW_TABLE.keys - ["on", "subscription"]
=> ["fragment", "true", "false", ...]
>> # Insert a symbol into an array at each index
=> nil
>> table = kws.zip(array_indexes).each_with_object([]) { |(kw, key),o| o[key] = kw.upcase.to_sym }
=> 
[:INTERFACE,
...

Now we have a table we can use to look up the symbol for a particular keyword given the keyword’s 2nd and 3rd bytes.

Take a breather

I think this is getting a little complicated, so I want to step back and take a breather. What we’ve done so far is write a function that, given the 2nd and 3rd bytes of a string, returns an index into an array.

Let’s take the keyword interface as an example. The 2nd and 3rd bytes are nt:

>> "interface"[1, 2]
=> "nt"

We can use unpack1 to convert nt into a 16 bit integer:

>> "interface"[1, 2].unpack1("s")
=> 29806

Now we pass that integer to our hashing function (I called it _hash in IRB):

>> _hash("interface"[1, 2].unpack1("s"))
=> 0

And now we have the array index where we can find the :INTERFACE symbol:

>> table[_hash("interface"[1, 2].unpack1("s"))]
=> :INTERFACE

This will work for any of the strings we used to build the perfect hash function. Let’s try a few:

>> table[_hash("union"[1, 2].unpack1("s"))]
=> :UNION
>> table[_hash("scalar"[1, 2].unpack1("s"))]
=> :SCALAR
>> table[_hash("repeatable"[1, 2].unpack1("s"))]
=> :REPEATABLE

Integrating the Perfect Hash into the Lexer

We’ve built our hash table and hash function, so the next step is to add them to the lexer:

  KW_TABLE = [:INTERFACE, :MUTATION, :EXTEND, :FALSE, :ENUM, :TRUE, :NULL,
              nil, nil, nil, nil, nil, nil, nil, :QUERY, nil, nil, :REPEATABLE,
              :IMPLEMENTS, :INPUT, :TYPE, :SCHEMA, nil, nil, nil, :DIRECTIVE,
              :UNION, nil, nil, :SCALAR, nil, :FRAGMENT]

  def _hash key
    (key * 18592990) >> 27 & 0x1f
  end

Remember we derived the magic constant 18592990 earlier via brute force.

In the next_token method, we need to extract the 2nd and 3rd bytes of the keyword, combine them into a 16 bit int, use the _hash method to convert the 16 bit int to a 5 bit array index, then look up the symbol (I’ve omitted the rest of the next_token method):

    when len = @scan.skip(KW_RE) then
      # Return early if uniquely identifiable via length
      return :ON if len == 2
      return :SUBSCRIPTION if len == 12

      # Get the position of the start of the keyword in the main document
      start = @scan.pos - len

      # Get the 2nd and 3rd byte of the keyword and combine to a 16 bit int
      key = (@doc.getbyte(start + 2) << 8) | @doc.getbyte(start + 1)

      # Get the array index
      index = _hash(key)

      # Return the symbol
      KW_TABLE[index]

We know the length of the token because it’s the return value of StringScanner#skip. If the token is uniquely identifiable based on its length, then we’ll return early. Otherwise, ask StringScanner for the cursor position and then use the length to calculate the index of the beginning of the token (remember StringScanner pushed the cursor forward when skip matched).

Once we have the beginning of the token, we’ll use getbyte (which doesn’t allocate) to get the 2nd and 3rd bytes of the keyword. Then we’ll combine the two bytes into a 16 bit int. Finally, we pass the int to the hash function and use its return value to look up the token symbol in the array.
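
We can also double check that this getbyte math lines up with the unpack1("s") encoding we used when building the table (assuming a little-endian machine, where "s" puts the low byte first):

>> "interface"[1, 2].unpack1("s")
=> 29806
>> ("interface".getbyte(2) << 8) | "interface".getbyte(1)
=> 29806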

Let’s check our benchmarks now!

$ ruby -I lib test.rb benchmark/fixtures/negotiate.gql
Warming up --------------------------------------
                        46.000  i/100ms
Calculating -------------------------------------
                        468.978  (± 0.4%) i/s -      2.346k in   5.002449s
{:ALLOCATIONS=>3}

We went from 459 IPS up to 468 IPS, and from 346 allocations down to 3 allocations. 1 allocation for the Lexer object, 1 allocation for the StringScanner object, and 1 allocation for ????

Actually, if we run the allocation benchmark twice we’ll get different results:

require "benchmark/ips"

def allocations
  x = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - x
end

def go doc
  lexer = Lexer.new doc
  while tok = lexer.next_token
  end
end

doc = ARGF.read

Benchmark.ips { |x| x.report { go doc } }
p ALLOCATIONS: allocations { go doc }
p ALLOCATIONS: allocations { go doc }

Output is this:

$ ruby -I lib test.rb benchmark/fixtures/negotiate.gql
Warming up --------------------------------------
                        46.000  i/100ms
Calculating -------------------------------------
                        465.071  (± 0.6%) i/s -      2.346k in   5.044626s
{:ALLOCATIONS=>3}
{:ALLOCATIONS=>2}

Ruby uses GC allocated objects to store some inline caches. Since it was the first time we called the allocations method, a new inline cache was allocated, and that dinged us. We’re actually able to tokenize this entire document with only 2 allocations: the lexer and the string scanner.

One more hack

Let’s do one more trick. We want to reduce the number of method calls we make on the scanner as much as we can. The case / when statement in next_token checks each when clause one at a time. One trick I like to do is rearrange the clauses so that the most popular tokens come first.

If we tokenize our benchmark document and tally up all of the tokens that come out, it looks like this:

>> lexer = Lexer.new File.read "benchmark/fixtures/negotiate.gql"
=> 
#<Lexer:0x0000000105c33c90
...
>> list = []
=> []
?> while tok = lexer.next_token
?>   list << tok
>> end
=> nil
>> list.tally
=> {:QUERY=>1, :IDENTIFIER=>2976, :LPAREN=>15, :VAR_SIGN=>6, :COLON=>56, :BANG=>1,
    :RPAREN=>15, :LCURLY=>738, :RCURLY=>738, :ELLIPSIS=>350, :ON=>319, :INT=>24,
    :TYPE=>4, :INPUT=>1, :FRAGMENT=>18}

From this data, it looks like ELLIPSIS tokens aren’t as popular as punctuation or IDENTIFIER tokens, yet we’re always checking for ELLIPSIS tokens first. Let’s move the ELLIPSIS check below the identifier check. This makes looking for ELLIPSIS more expensive, but it makes finding punctuation and identifiers cheaper. Since punctuation and identifiers occur more frequently in our document, we should get a speedup.

I applied this patch:

diff --git a/test.rb b/test.rb
index ac147c2..275b8ba 100644
--- a/test.rb
+++ b/test.rb
@@ -59,7 +59,6 @@ class Lexer
     return if @scan.eos?
 
     case
-    when @scan.skip(ELLIPSIS)        then :ELLIPSIS
     when tok = PUNCTUATION_TABLE[@doc.getbyte(@scan.pos)] then
       # If the byte at the current position is inside our lookup table, push
       # the scanner position forward 1 and return the token
@@ -78,6 +77,7 @@ class Lexer
 
       KW_TABLE[_hash(key)]
     when @scan.skip(IDENTIFIER)      then :IDENTIFIER
+    when @scan.skip(ELLIPSIS)        then :ELLIPSIS
     when @scan.skip(FLOAT)           then :FLOAT
     when @scan.skip(INT)             then :INT
     else

Now when we rerun the benchmark, we get this:

$ ruby -I lib test.rb benchmark/fixtures/negotiate.gql
Warming up --------------------------------------
                        48.000  i/100ms
Calculating -------------------------------------
                        486.798  (± 0.4%) i/s -      2.448k in   5.028884s
{:ALLOCATIONS=>3}
{:ALLOCATIONS=>2}

Great, we went from 465 IPS to 486 IPS!

Conclusion

The lexer we started with tokenized the 80kb GraphQL document at 211 IPS, and where we left off it was running at 486 IPS. More than a 2x speed improvement!

Our starting lexer allocated over 20k objects, and when we finished we got it down to just 2 objects. Of course the parser may ask the lexer to allocate something, but we know that we’re only allocating the bare minimum. In fact, if the parser only records positional offsets, it could very well never ask the lexer to allocate anything!
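
For example, the lexer could expose the position of the last match instead of (or in addition to) token_value. These helpers are hypothetical, they aren’t part of the lexer we built above, but they fall out of the same matched_size trick:

  # Hypothetical helpers: where the last matched token starts and how long
  # it is. Integers are immediates in Ruby, so returning them doesn't
  # allocate anything.
  def token_start
    @scan.pos - @scan.matched_size
  end

  def token_length
    @scan.matched_size
  end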

When I’m doing this stuff I try to use as many tricks as possible for increasing speed. But I think the biggest and best payoffs come from thinking about the problem from a higher level and adjusting the problem space itself. Converting next_token to return only a symbol rather than a tuple cut our object allocations by more than half. Questioning the code’s design is much harder than applying local tricks, but I think it reaps a greater benefit.

Anyway, these are hacks I like to use! If you want to play around with the lexer we’ve been building in this post, I’ve put the source code here.

I hope you enjoyed this, and have a good day!
