Inline caching in MRI
Dec 23, 2015 @ 9:58 am
Every time you call a method, Ruby must look up the method you want to call before actually calling that method. Figuring out which method needs to be called can be a costly operation, so one thing that virtual machine authors try to do is speed up that process by caching the method lookup. MRI (like most other Ruby virtual machines) stores this cache “inline”. Most of us know what a cache is, but what is “inline”?
What is an “inline cache”?
Inline caches are just caches that are stored “inline” with the bytecode generated from your Ruby program. If we write a simple program and disassemble it, we can see the inline caches.
Try out this program:
def foo bar, baz
bar.baz
baz.baz
end
ins = RubyVM::InstructionSequence.of method(:foo)
puts ins.disasm
If you run the above program, you’ll get output that looks very similar to this:
$ ruby x.rb
== disasm: #<ISeq:foo@x.rb>=============================================
local table (size: 3, argc: 2 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
[ 3] bar<Arg> [ 2] baz<Arg>
0000 trace 8 ( 1)
0002 trace 1 ( 2)
0004 getlocal_OP__WC__0 3
0006 opt_send_without_block <callinfo!mid:baz, argc:0, ARGS_SIMPLE>, <callcache>
0009 pop
0010 trace 1 ( 3)
0012 getlocal_OP__WC__0 2
0014 opt_send_without_block <callinfo!mid:baz, argc:0, ARGS_SIMPLE>, <callcache>
0017 trace 16 ( 4)
0019 leave ( 3)
On lines 0006 and 0014, you’ll see three values. The first value, opt_send_without_block, is the virtual machine instruction, and the next two values are the parameters to the instruction. The first parameter is the method you want to call along with some information about the method parameters. The second parameter, <callcache>, is an instance of a “call cache”, and it is stored inline with the bytecode (hence “inline cache”).
What is the cache key and value?
This bit differs from VM to VM, but in MRI, there are two keys to the cache. The two keys are the “global method state” and a “class serial number”. The value of the cache is the method we’re going to call.
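To make the mechanics concrete, here is a toy Ruby model of that two-key cache. This is purely an illustration of the idea: the class and method names here are made up, and MRI’s real cache is a C struct checked by the VM, not Ruby code.

```ruby
# A toy model of MRI's call cache. Illustrative only: the names are made up,
# and the real cache lives in C inside the VM.
class ToyCallCache
  attr_reader :misses

  def initialize
    @method_state = nil # cached "global method state"
    @class_serial = nil # cached class serial number
    @method       = nil # the cached value: the method to call
    @misses       = 0
  end

  # Return the cached method on a hit; otherwise do the slow lookup
  # and refill the cache (a miss).
  def lookup(current_method_state, current_class_serial)
    if @method_state == current_method_state &&
       @class_serial == current_class_serial
      @method
    else
      @misses      += 1
      @method_state = current_method_state
      @class_serial = current_class_serial
      @method       = slow_method_lookup
    end
  end

  private

  # Stand-in for MRI's real walk up the receiver's ancestor chain.
  def slow_method_lookup
    :some_method
  end
end

cache = ToyCallCache.new
cache.lookup(1, 100) # miss: cache is empty
cache.lookup(1, 100) # hit: both keys match
cache.lookup(1, 101) # miss: the class serial changed
```

Bumping either key invalidates the entry, which is exactly why monkey patching (global method state) and reopening classes (class serial number) are costly.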
Global Method State
The “global method state” is a global serial number that increments every time certain classes get mutated (for example with monkey patching).
You can see the current value of the global method state with RubyVM.stat.
Here is an example to demonstrate changing the global state:
p RubyVM.stat
module Kernel
def aaron
end
end
p RubyVM.stat
If you run the above code, you’ll see output like this:
{:global_method_state=>132, :global_constant_state=>824, :class_serial=>5664}
{:global_method_state=>133, :global_constant_state=>824, :class_serial=>5664}
Adding a method to Kernel caused the global method state number to change.
Changing this value at runtime is very bad because it invalidates all of the call caches (remember the call cache key is the “class serial number” and the “global method state”).
Class Serial Number
The other cache key is called the “class serial number”.
This number is derived from the class (including any singleton classes associated with the object) of the instance that is the receiver of the method call at a particular call site.
Each class in Ruby has a serial number associated with that class.
We can see the class serial number change by looking at RubyVM.stat again:
p RubyVM.stat
class A
end
p RubyVM.stat
class B
end
p RubyVM.stat
If you run the above program, it will look something like this:
{:global_method_state=>132, :global_constant_state=>824, :class_serial=>5664}
{:global_method_state=>132, :global_constant_state=>825, :class_serial=>5666}
{:global_method_state=>132, :global_constant_state=>826, :class_serial=>5668}
You’ll notice that the class serial number is increasing. Each class gets assigned a serial number. Now, doing certain things to the class will increase the serial number. For example, if we reopen the class, the serial number will change:
p RubyVM.stat
class A
def foo; end
end
p RubyVM.stat
class A
def bar; end
end
p RubyVM.stat
Output:
{:global_method_state=>132, :global_constant_state=>824, :class_serial=>5664}
{:global_method_state=>132, :global_constant_state=>825, :class_serial=>5667}
{:global_method_state=>132, :global_constant_state=>825, :class_serial=>5668}
We can’t inspect the serial number associated with each class, but inspecting the output of RubyVM.stat does show the side effects of reopening the class on the serial numbers.
The lesson from this is don’t reopen classes at runtime because this will invalidate caches that use that class as a key.
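Singleton classes deserve a special mention here. Because the serial number comes from the receiver’s class including any singleton class, giving an object a singleton method gives it a distinct class of its own. This is my own illustration, not something from the tracing above:

```ruby
class A
  def baz; end
end

a = A.new
b = A.new

# Give `a` a singleton method. Ruby creates an anonymous singleton class
# for `a` to hold it, and that singleton class has its own serial number.
def a.extra; end

# `a.class` still reports A, but the class used for method lookup (and for
# the cache key) is now `a`'s singleton class, which differs both from A
# and from every other object's singleton class.
p a.class                                # => A
p a.singleton_class == A                 # => false
p a.singleton_class == b.singleton_class # => false
```

So per-object `def obj.foo` definitions have the same cache-busting effect at a call site as mixing unrelated classes.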
Recap
Let’s have a short recap. Given this code:
class A
def baz; end
end
def foo bar
bar.baz
end
foo(A.new)
There will be one call cache at the bar.baz line. The key for that cache will be the “global method state” (just some global value) and the class serial number for A. The value of the cache will be the location of the method baz for that class.
Types of call sites and caches
So far, we’ve talked about where the inline caches are, as well as the key and value for the cache. But we haven’t talked about the size of the cache, or different types of caches. The size of the cache in MRI is one. Yes, one. It caches exactly one key and one value. This is called a “monomorphic” cache. Let’s take the following example:
class A
def baz; end
end
def foo bar
bar.baz
end
foo(A.new)
foo(A.new)
The first time foo is called, MRI compares the cache entry’s global method state with the current global method state, and the serial number for the A class with the cache’s serial number.
You can see the code for that here.
Since there is no entry in the cache at this point, it is a cache miss, so Ruby goes through the slow path and looks up the method and caches it.
You can see the cache value being stored here.
The second time the method is called in this example, it’s a hit and we don’t have to do the slow path.
Now consider the following code:
class A
def baz; end
end
class B
def baz; end
end
def foo bar
bar.baz
end
loop do
foo(A.new)
foo(B.new)
end
The above example will never hit the cache because the type for bar changes every time foo is called. Remember that part of the cache key is the class serial number. Since the serial number for A is different than the serial number for B, the cache at the bar.baz call site is never hit. We can call this call site “polymorphic” because it has multiple types for the variable bar.
Let’s do one more example:
def foo bar
bar.baz
end
loop do
klass = Class.new {
def baz
end
}
foo(klass.new)
end
This example never hits the cache either, because every class created by Class.new has its own serial number. Also, in the previous example, the bar.baz call site only ever saw two types, A and B. This case sees an unbounded number of types. We can call this case “megamorphic”. I consider this case to be “too hot to handle”.
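If you find yourself with a call site like that, one fix (my own sketch, not part of the original example) is to create the class once, outside the loop, so the call site goes back to seeing a single class:

```ruby
def foo bar
  bar.baz
end

# Create the class once, outside the loop. Every instance then shares one
# class (and one class serial number), so the inline cache at the bar.baz
# call site misses only on the first call and hits on every later one.
klass = Class.new {
  def baz
    :baz
  end
}

results = 10.times.map { foo(klass.new) }
```

The receivers are still freshly allocated objects each time, but they all report the same class, which is all the cache key cares about.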
Recap
So, we essentially have three types of call sites: “monomorphic” is a call site that only sees one type, “polymorphic” sees many types, and “megamorphic” sees too many types. Megamorphic call sites are also polymorphic; they just see so many types that we don’t want to deal with them.
Ruby’s inline cache only caches one value, and is most effective when the call site is monomorphic. However, Ruby’s cache always stores the last value seen. This means that if a call site is polymorphic, but doesn’t switch types very frequently, then the cost of method lookup when switching types will be amortized.
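One practical consequence of that last-value-wins behavior (my own illustration, not a claim from MRI’s source): if you control the iteration order over a mixed collection, processing objects grouped by class keeps each run of calls monomorphic, so the single-entry cache only misses when the class actually switches.

```ruby
class Circle
  def area; 3; end # simplified stand-in value
end

class Square
  def area; 4; end # simplified stand-in value
end

shapes = [Circle.new, Square.new, Circle.new, Square.new]

# Interleaved iteration flips the receiver's class on every call, so the
# single-entry cache at `s.area` misses every time.
interleaved = shapes.map { |s| s.area }

# Grouping by class first means the class only switches once across the
# whole pass, so the cache misses twice (once per class) and hits the rest.
grouped = shapes.sort_by { |s| s.class.name }.map { |s| s.area }

p interleaved # => [3, 4, 3, 4]
p grouped     # => [3, 3, 4, 4]
```

Both orders compute the same results; only the cache behavior at the call site differs.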
Inspecting cache hits / misses
I’ve posted a fork of Ruby that you can find here that has a patch to let you see cache hits and misses. It adds a tracepoint for tracking those inline cache hits and misses. Don’t use this in production as it slows down your interpreter, but it is fun for testing.
Let’s take our first example, and trace the cache hits and misses:
class A
def baz; end
end
def foo bar
bar.baz
end
tp = TracePoint.new(:inline_cache_hit, :inline_cache_miss) do |x|
p x => x.callcache_id
end
tp.enable
foo(A.new)
foo(A.new)
If you run with my fork, you’ll see this output:
{#<TracePoint:inline_cache_miss@x.rb:14>=>[3386, 5666, "Class"]}
{#<TracePoint:inline_cache_miss@x.rb:14>=>[3387, 4052, "Object"]}
{#<TracePoint:inline_cache_miss@x.rb:6>=>[3380, 5667, "A"]}
{#<TracePoint:inline_cache_miss@x.rb:15>=>[3388, 5666, "Class"]}
{#<TracePoint:inline_cache_miss@x.rb:15>=>[3389, 4052, "Object"]}
{#<TracePoint:inline_cache_hit@x.rb:6>=>[3380, 5667, "A"]}
The left side of the hash is the tracepoint type, and the right side is an array containing a unique ID for that call site, the serial number of the class, and the class name itself. You can see the very last line is a cache hit for A (as we expected).
Let’s try with the second example:
class A
def baz; end
end
class B
def baz; end
end
def foo bar
bar.baz
end
tp = TracePoint.new(:inline_cache_hit, :inline_cache_miss) do |x|
p x => x.callcache_id
end
tp.enable
2.times do
foo(A.new)
foo(B.new)
end
The output looks like this:
{#<TracePoint:inline_cache_miss@x.rb:18>=>[3391, 4869, "Fixnum"]}
{#<TracePoint:inline_cache_miss@x.rb:19>=>[3384, 5666, "Class"]}
{#<TracePoint:inline_cache_miss@x.rb:19>=>[3385, 4052, "Object"]}
{#<TracePoint:inline_cache_miss@x.rb:10>=>[3381, 5667, "A"]}
{#<TracePoint:inline_cache_miss@x.rb:20>=>[3386, 5669, "Class"]}
{#<TracePoint:inline_cache_miss@x.rb:20>=>[3387, 4052, "Object"]}
{#<TracePoint:inline_cache_miss@x.rb:10>=>[3381, 5670, "B"]}
{#<TracePoint:inline_cache_hit@x.rb:19>=>[3384, 5666, "Class"]}
{#<TracePoint:inline_cache_hit@x.rb:19>=>[3385, 4052, "Object"]}
{#<TracePoint:inline_cache_miss@x.rb:10>=>[3381, 5667, "A"]}
{#<TracePoint:inline_cache_hit@x.rb:20>=>[3386, 5669, "Class"]}
{#<TracePoint:inline_cache_hit@x.rb:20>=>[3387, 4052, "Object"]}
{#<TracePoint:inline_cache_miss@x.rb:10>=>[3381, 5670, "B"]}
You’ll see some cache hits in here, but none of them are inside the foo method.
END
I’m getting tired of writing this post. For those that want a sneak peek at what I’ll write about next, you can find a patch I wrote to give MRI a polymorphic inline cache here. The TL;DR for that post will be “yes the PIC works, no it doesn’t speed up our app”.