Eliminating Intermediate Array Allocations

Sep 29, 2024 @ 7:06 pm

Recently I gave a talk at RailsWorld (hopefully they’ll post the video soon), and part of my presentation was about eliminating allocations in tokenizers. I presented a simple function for measuring allocations:

def allocations
  x = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - x
end

Everything in Ruby is an object, but not all objects actually make allocations. We can use the above function to measure allocations made in a block. Here are some examples of code that never allocate:

p allocations { true }                  # => 0
p allocations { false }                 # => 0
p allocations { nil }                   # => 0
p allocations { :hello }                # => 0
p allocations { 1 }                     # => 0
p allocations { 2.3 }                   # => 0
p allocations { 0xFFFF_FFFF_FFFF_FFFF } # => 0

Literals like booleans, nil, symbols, integers, and floats are represented internally to CRuby as “tagged pointers” and they don’t allocate anything when executed.

Here is an example of code that sometimes allocates:

# Depends on the size of the number
p allocations { 1 + 2 }                     # => 0
p allocations { 0x3FFF_FFFF_FFFF_FFFF + 1 } # => 1

# Depends on `frozen_string_literal`
p allocations { "hello!" }                  # => 0 or 1

Math on integers generally doesn’t allocate anything, but it depends on the integer. When a number gets large enough, CRuby will allocate an object to represent that number. On 64 bit platforms, the largest whole number we can represent without allocating is 0x3FFF_FFFF_FFFF_FFFF.

String literals will sometimes allocate, but it depends on the frozen_string_literal setting in your program.

Here is an example of code that always allocates:

p allocations { [1, 2] }      # => 1
p allocations { { a: :b } }   # => 1
p allocations { Object.new }  # => 1
p allocations { "foo"[0, 1] } # => 1

Hopefully these examples are fairly straightforward. Arrays, hashes, objects, string slices, etc will allocate an object.

Eliminating Intermediate Array Allocations

At the Shopify after-party at RailsWorld, someone asked me a really great question. Their codebase has a RuboCop rule that says that when doing min or max calculations, you should always have code like this:

def foo(x, y)
  [x, y].max
end

They were concerned this is wasteful as it has an Array literal, so it will be allocating an array every time!

I think this is a really great question, and if you read my earlier allocation measurement examples, I think it’s a very reasonable conclusion. However, it’s actually not the case. This code in particular will not allocate an array, and I thought we’d look in to how that works.

The compiler in Ruby is able to tell a few important things about this code. First, we’re calling a method on an array literal which means that we’re guaranteed that the max method will be sent to an array object. Second, we know statically that we’re calling the max method. Third, the max method that is implemented in core Ruby will not mutate its receiver, and it returns some value (an integer) that isn’t the array literal.

Since the compiler knows that the array literal is ephemeral, it allocates the array on the stack, does the max calculation, then throws away the array, never asking the GC for a new object.

To get a more concrete picture, lets look at the instruction sequences for the above code:

def foo(x, y)
  [x, y].max
end

insn = RubyVM::InstructionSequence.of(method(:foo))
puts insn.disasm

== disasm: #<ISeq:foo@x.rb:1 (1,0)-(1,30)>
local table (size: 2, argc: 2 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1])
[ 2] x@0<Arg>   [ 1] y@1<Arg>
0000 getlocal_WC_0                          x@0                       (   1)[LiCa]
0002 getlocal_WC_0                          y@1
0004 opt_newarray_send                      2, 1
0007 leave                                  [Re]

The first two instructions fetch the locals x and y, and push them on the stack. Next we have a special instruction opt_newarray_send. This instruction takes two parameters, 2, 1. It’s a bit cryptic, but the 2 means that this instruction is going to operate on two stack elements. The 1 is an enum and means “we want to call the max method”.

The opt_newarray_send instruction will first check to see if Array#max has been monkey patched. If it has been monkey patched, then the instruction will allocate a regular array and call the monkey patched method. If it hasn’t been monkey patched, then it calls a “max” function which uses Ruby’s stack as an array buffer.

Here is what the stack looks like before executing opt_newarray_send:

+----+-------------+-------------+
|    | Stack Index | Stack Value |
+----+-------------+-------------+
|    | -2          | x           |
|    | -1          | y           |
| SP | 0           | Undef       |
+----+-------------+-------------+

The opt_newarray_send instruction was passed the value 2, so it knows to start the array at negative 2 relative to the stack pointer (SP). Since the stack is just an array, it calls the same function that the max function would normally call, popping 2 values from the stack, then pushing the return value of the max function.

In this way we can calculate the max value without allocating the intermediate array.

If we use our allocations function, we can confirm that the foo method indeed does not allocate anything:

def foo(x, y)
  [x, y].max
end

allocations { foo(3, 4) } # heat inline caches
p allocations { foo(3, 4) } # => 0

Aaron’s Opinion Corner

I don’t really know RuboCop very well, but I think that in cases like this it would be really helpful if the linter were to tell you why a particular rule is a rule. Personally, I dislike following rules unless I understand the reason behind them. Even if the reasoning is simply “this is just how our team styles our code”. If such a feature is already available in RuboCop, then please feel free to link to this blog post for this particular rule.

I can only assume the rule that enforced this style was “performance” related. I’m not a huge fan of linting, but I’m even less of a fan when it comes to rules around “performance”. If idiomatic Ruby is not performant, then I think there can be a strong case to be made that the CRuby team (which I am a part of) should make that code performant. If the CRuby team does make the code performant, then there is no need for the performance rule because most people write idiomatic Ruby code (by definition).

Of course there are cases where you may need to write non-idiomatic Ruby for performance reasons, but hopefully those cases are few and far between. Should the time arrive when you need to write odd code for performance reasons, it will require knowledge, experience, and nuance that neither a linter nor an AI can provide. Fortunately, this is a case where idiomatic Ruby is also “the fast way to do things”, so I definitely recommend people use the [x, y].max pattern.

More Stuff

Array#max isn’t the only method that uses this trick. It works with Array#min, Array#pack and Array#hash. If you need to implement a custom hash method on an object, then I highly recommend doing something like this:

def hash
  [@ivar1, @ivar2, ...].hash
end

Finally, there are cases where CRuby won’t apply this trick. Lets look at the instructions for the following method:

def foo
  [3, 4].max
end

insn = RubyVM::InstructionSequence.of(method(:foo))
puts insn.disasm

== disasm: #<ISeq:foo@x.rb:1 (1,0)-(3,3)>
0000 duparray                               [3, 4]                    (   2)[LiCa]
0002 opt_send_without_block                 <calldata!mid:max, argc:0, ARGS_SIMPLE>
0004 leave                                                            (   3)[Re]

If you read these instructions carefully, you’ll see it has a duparray instruction. This instruction allocates an array, and then we call the max method on the array.

When all of the elements of the array are static, CRuby applies an optimization to allocate the array once, embed it in the instructions, and then do a dup on the array. Copying an existing array is much faster than allocating a new one. Unfortunately, this optimization is applied before the “max” method optimization, so it doesn’t apply both.

For those of you at home saying “the compiler could calculate the max of [3, 4] and eliminate the array all together!” just remember that someone could monkey patch Array#max and we’d need to respect it. Argh!! Fixing this particular case is not worth the code complexity, in my opinion. We all know that 4 is greater than 3, so we could “manually inline” this case and just write 4.

Anyway, all this to say is that these optimizations are context dependent. Attempting to “prescribe” more optimal code seems like it could become a hairy situation, especially since the linter can’t know what the Ruby compiler will do.

I do like the idea of language servers possible suggesting possibly faster code, but only as a teaching opportunity for the developer. The real goal should be to help build understanding so that this type of linting becomes unnecessary.

Anyway, I had a really great time at RailsWorld. I am very happy I got this question, and I hope that this post helps someone!