Debugging an Assertion Error in Ruby
2021-02-03 @ 17:13
I hope nobody runs into a problem where they need the information in this post, but in case you do, I hope this post is helpful. (I’m talking to you, future Aaron! lol)
I committed a patch to Ruby that caused the tests to start failing. This was the patch:
commit 1be84e53d76cff30ae371f0b397336dee934499d
Author: Aaron Patterson <tenderlove@ruby-lang.org>
Date:   Mon Feb 1 10:42:13 2021 -0800

    Don't pin `val` passed in to `rb_define_const`.

    The caller should be responsible for holding a pinned reference (if they
    need that)

diff --git a/variable.c b/variable.c
index 92d7d11eab..ff4f7964a7 100644
--- a/variable.c
+++ b/variable.c
@@ -3154,7 +3154,6 @@ rb_define_const(VALUE klass, const char *name, VALUE val)
     if (!rb_is_const_id(id)) {
         rb_warn("rb_define_const: invalid name `%s' for constant", name);
     }
-    rb_gc_register_mark_object(val);
     rb_const_set(klass, id, val);
 }
This patch is supposed to allow objects passed in to rb_define_const to move.
As the commit message says, the caller should be responsible for keeping the value pinned.
At the time I committed the patch, I thought that most callers of the function were already marking the value passed in (as val), so we were pinning objects that something else would already pin.
In other words, this code was wasting GC time by pinning objects that were already pinned.
Unfortunately the CI started to error shortly after I committed this patch. Clearly the patch was related, but how?
In this post I am going to walk through the debugging tricks I used to find the error.
Reproduction
I was able to reproduce the error on my Linux machine by running the same command CI ran. Unfortunately since this bug is related to GC, the error was intermittent. To reproduce it, I just ran the tests in a loop until the process crashed like this:
$ while test $status -eq 0
env RUBY_TESTOPTS='-q --tty=no' make -j16 -s check
end
Before running this loop though, I made sure to do ulimit -c unlimited so that I would get a core file when the process crashed.
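The loop above is written for the fish shell. For anyone who would rather drive the same "rerun until it breaks" idea from Ruby, here is a minimal sketch. The helper name run_until_failure and the max_runs cap are my own inventions for illustration, not anything from the Ruby build.

```ruby
# Call the block repeatedly until it returns a falsy value. Returns the number
# of the attempt that failed, or nil if every attempt up to max_runs succeeded.
def run_until_failure(max_runs: 1_000)
  1.upto(max_runs) do |attempt|
    return attempt unless yield(attempt)
  end
  nil
end

# Hypothetical usage for the command above. It is left commented out so that
# loading this file does not kick off a full test run:
#
#   env = { "RUBY_TESTOPTS" => "-q --tty=no" }
#   failed_on = run_until_failure { system(env, "make", "-j16", "-s", "check") }
#   puts "crashed on run #{failed_on}" if failed_on
```

Kernel#system returns true only when the command exits with status 0, so the block goes falsy on the first failing run. The core file limit still has to be raised in the shell that launches this (ulimit -c unlimited), since the child processes inherit it.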
The Error
After the process crashed, the top of the error looked like this:
<OBJ_INFO:rb_ractor_confirm_belonging@./ractor_core.h:327> 0x000055be8657f180 [0 ] T_NONE
/home/aaron/git/ruby/lib/bundler/environment_preserver.rb:47: [BUG] id == 0 but not shareable
ruby 3.1.0dev (2021-02-03T17:35:37Z master 6b4814083b) [x86_64-linux]
The Ractor verification routines crashed the process because a T_NONE object is “not shareable”.
In other words, you can’t share an object of type T_NONE between Ractors.
This makes sense because T_NONE objects are actually empty slots in the GC.
If a Ractor, or any other Ruby code, sees a T_NONE object, then it’s clearly an error.
Only the GC internals should ever be dealing with this type.
The top of the C backtrace looked like this:
-- C level backtrace information -------------------------------------------
/home/aaron/git/ruby/ruby(rb_print_backtrace+0x14) [0x55be856e9816] vm_dump.c:758
/home/aaron/git/ruby/ruby(rb_vm_bugreport) vm_dump.c:1020
/home/aaron/git/ruby/ruby(bug_report_end+0x0) [0x55be854e2a69] error.c:778
/home/aaron/git/ruby/ruby(rb_bug_without_die) error.c:778
/home/aaron/git/ruby/ruby(rb_bug+0x7d) [0x55be854e2bb0] error.c:786
/home/aaron/git/ruby/ruby(rb_ractor_confirm_belonging+0x102) [0x55be856cf6e2] ./ractor_core.h:328
/home/aaron/git/ruby/ruby(vm_exec_core+0x4ff3) [0x55be856b0003] vm.inc:2224
/home/aaron/git/ruby/tool/lib/test/unit/parallel.rb(rb_vm_exec+0x886) [0x55be856c9946]
/home/aaron/git/ruby/ruby(load_iseq_eval+0xbb) [0x55be8554f66b] load.c:594
/home/aaron/git/ruby/ruby(require_internal+0x394) [0x55be8554e3e4] load.c:1065
/home/aaron/git/ruby/ruby(rb_require_string+0x973c4) [0x55be8554d8a4] load.c:1142
/home/aaron/git/ruby/ruby(rb_f_require) load.c:838
/home/aaron/git/ruby/ruby(vm_call_cfunc_with_frame+0x11a) [0x55be856dd6fa] ./vm_insnhelper.c:2897
/home/aaron/git/ruby/ruby(vm_call_method_each_type+0xaa) [0x55be856d4d3a] ./vm_insnhelper.c:3387
/home/aaron/git/ruby/ruby(vm_call_alias+0x87) [0x55be856d68e7] ./vm_insnhelper.c:3037
/home/aaron/git/ruby/ruby(vm_sendish+0x200) [0x55be856d08e0] ./vm_insnhelper.c:4498
The function rb_ractor_confirm_belonging was the one that called rb_bug and aborted the process.
Debugging the Core File with LLDB
I usually use clang / lldb when debugging. I’ve added scripts to Ruby’s lldb tools that let me track down problems more easily, so I prefer it over gcc / gdb.
First I inspected the backtrace in the core file:
(lldb) target create "./ruby" --core "core.456156"
Core file '/home/aaron/git/ruby/core.456156' (x86_64) was loaded.
(lldb) bt
* thread #1, name = 'ruby', stop reason = signal SIGABRT
* frame #0: 0x00007fdc5fc8918b libc.so.6`raise + 203
frame #1: 0x00007fdc5fc68859 libc.so.6`abort + 299
frame #2: 0x000056362ac38bc6 ruby`die at error.c:765:5
frame #3: 0x000056362ac38bb5 ruby`rb_bug(fmt=<unavailable>) at error.c:788:5
frame #4: 0x000056362ae256e2 ruby`rb_ractor_confirm_belonging(obj=<unavailable>) at ractor_core.h:328:13
frame #5: 0x000056362ae06003 ruby`vm_exec_core(ec=<unavailable>, initial=<unavailable>) at vm.inc:2224:5
frame #6: 0x000056362ae1f946 ruby`rb_vm_exec(ec=<unavailable>, mjit_enable_p=<unavailable>) at vm.c:0
frame #7: 0x000056362aca566b ruby`load_iseq_eval(ec=0x000056362b176710, fname=0x000056362ce96660) at load.c:594:5
frame #8: 0x000056362aca43e4 ruby`require_internal(ec=<unavailable>, fname=<unavailable>, exception=1) at load.c:1065:21
frame #9: 0x000056362aca38a4 ruby`rb_f_require [inlined] rb_require_string(fname=0x00007fdc38033178) at load.c:1142:18
frame #10: 0x000056362aca3880 ruby`rb_f_require(obj=<unavailable>, fname=0x00007fdc38033178) at load.c:838
frame #11: 0x000056362ae336fa ruby`vm_call_cfunc_with_frame(ec=0x000056362b176710, reg_cfp=0x00007fdc5f958de0, calling=<unavailable>) at vm_insnhelper.c:2897:11
frame #12: 0x000056362ae2ad3a ruby`vm_call_method_each_type(ec=0x000056362b176710, cfp=0x00007fdc5f958de0, calling=0x00007ffe3b552128) at vm_insnhelper.c:3387:16
frame #13: 0x000056362ae2c8e7 ruby`vm_call_alias(ec=0x000056362b176710, cfp=0x00007fdc5f958de0, calling=0x00007ffe3b552128) at vm_insnhelper.c:3037:12
It’s very similar to the backtrace in the crash report.
The first thing that was interesting to me was frame 5 in vm_exec_core.
vm_exec_core is the main loop for the YARV VM.
This program was crashing when executing some kind of instruction in the virtual machine.
(lldb) f 5
frame #5: 0x000056362ae06003 ruby`vm_exec_core(ec=<unavailable>, initial=<unavailable>) at vm.inc:2224:5
2221 /* ### Instruction trailers. ### */
2222 CHECK_VM_STACK_OVERFLOW_FOR_INSN(VM_REG_CFP, INSN_ATTR(retn));
2223 CHECK_CANARY(leaf, INSN_ATTR(bin));
-> 2224 PUSH(val);
2225 if (leaf) ADD_PC(INSN_ATTR(width));
2226 # undef INSN_ATTR
2227
(lldb)
Checking frame 5, we can see that it’s crashing when we push a value onto the stack.
The Ractor function checks the value of objects being pushed on the VM stack, and in this case we have an object that is a T_NONE.
The question is where did this value come from?
The crash happened in the file vm.inc, line 2224. This file is a generated file, so I can’t link to it, but I wanted to know which instruction was being executed, so I pulled up that file.
Line 2224 happened to be inside the opt_send_without_block instruction.
So something is calling a method, and the return value of the method is a T_NONE object.
But what method is being called, and on what object?
Finding the Called Method
The value ec, or “Execution Context”, contains information about the virtual machine at runtime.
On the ec, we can find the cfp, or “Control Frame Pointer”, which is a data structure representing the currently executing stack frame.
In lldb, I could see that frame 7 had the ec available, so I went to that frame to look at the cfp:
(lldb) f 7
frame #7: 0x000056362aca566b ruby`load_iseq_eval(ec=0x000056362b176710, fname=0x000056362ce96660) at load.c:594:5
591 rb_ast_dispose(ast);
592 }
593 rb_exec_event_hook_script_compiled(ec, iseq, Qnil);
-> 594 rb_iseq_eval(iseq);
595 }
596
597 static inline enum ruby_tag_type
(lldb) p *ec->cfp
(rb_control_frame_t) $1 = {
pc = 0x000056362c095d58
sp = 0x00007fdc5f859330
iseq = 0x000056362ca051f0
self = 0x000056362b1d92c0
ep = 0x00007fdc5f859328
block_code = 0x0000000000000000
__bp__ = 0x00007fdc5f859330
}
The control frame pointer has a pointer to the iseq, or “Instruction Sequence”, that is currently being executed.
It also has a pc, or “Program Counter”, and the program counter usually points at the instruction that will be executed next (in other words, not the currently executing instruction).
Also of interest, the iseq has the source location that corresponds to those instructions.
Getting the Source File
If we examine the iseq structure, we can find the source location of the code that is currently being executed:
(lldb) p ec->cfp->iseq->body->location
(rb_iseq_location_t) $4 = {
pathobj = 0x000056362ca06960
base_label = 0x000056362ce95a30
label = 0x000056362ce95a30
first_lineno = 0x0000000000000051
node_id = 137
code_location = {
beg_pos = (lineno = 40, column = 4)
end_pos = (lineno = 50, column = 7)
}
}
(lldb) command script import -r ~/git/ruby/misc/lldb_cruby.py
lldb scripts for ruby has been installed.
(lldb) rp 0x000056362ca06960
bits [ ]
T_STRING: [FROZEN] (const char [57]) $6 = "/home/aaron/git/ruby/lib/bundler/environment_preserver.rb"
(lldb)
The location info clearly shows us that the instructions are on line 40.
The pathobj member contains the file name, but it is stored as a Ruby string.
To print out the string, I imported the lldb CRuby extensions, then used the rp command and gave it the address of the path object.
From the output, we can see that it’s crashing in the “environment_preserver.rb” file inside of the instructions that are defined on line 40. We’re not crashing on line 40, but the instructions are defined there.
Those instructions are this method:
def replace_with_backup
  ENV.replace(backup) unless Gem.win_platform?

  # Fallback logic for Windows below to workaround
  # https://bugs.ruby-lang.org/issues/16798. Can be dropped once all
  # supported rubies include the fix for that.

  ENV.clear

  backup.each {|k, v| ENV[k] = v }
end
It’s still not clear which of these method calls is breaking.
In this function we have some method call that is returning a T_NONE.
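A side note on the location output earlier: first_lineno prints as 0x51 rather than 40. In this version of Ruby that field appears to be a VALUE holding a Fixnum, and CRuby tags small integers as (n << 1) | 1, so the raw bits have to be shifted to recover the line number. A minimal sketch of that decoding (the helper name fixnum_value is made up for illustration):

```ruby
# Decode a CRuby Fixnum-tagged VALUE. Small integers are stored as
# (n << 1) | 1, so the low bit is a tag and the rest is the number.
def fixnum_value(raw)
  raise ArgumentError, "low bit not set, so this is not a Fixnum VALUE" unless raw.odd?
  raw >> 1
end

puts fixnum_value(0x51) # prints 40
```

That agrees with the lineno = 40 reported in code_location.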
Finding the Method Call
To find the method call, I disassembled the instruction sequence and checked the program counter:
(lldb) command script import -r misc/lldb_disasm.py
lldb Ruby disasm installed.
(lldb) rbdisasm ec->cfp->iseq
PC IDX insn_name(operands)
0x56362c095c20 0000 opt_getinlinecache( 6, (struct iseq_inline_cache_entry *)0x56362c095ee0 )
0x56362c095c38 0003 putobject( (VALUE)0x14 )
0x56362c095c48 0005 getconstant( ID: 0x807b )
0x56362c095c58 0007 opt_setinlinecache( (struct iseq_inline_cache_entry *)0x56362c095ee0 )
0x56362c095c68 0009 opt_send_without_block( (struct rb_call_data *)0x56362c095f20 )
0x56362c095c78 0011 branchif( 15 )
0x56362c095c88 0013 opt_getinlinecache( 6, (struct iseq_inline_cache_entry *)0x56362c095ef0 )
0x56362c095ca0 0016 putobject( (VALUE)0x14 )
0x56362c095cb0 0018 getconstant( ID: 0x370b )
0x56362c095cc0 0020 opt_setinlinecache( (struct iseq_inline_cache_entry *)0x56362c095ef0 )
0x56362c095cd0 0022 putself
0x56362c095cd8 0023 opt_send_without_block( (struct rb_call_data *)0x56362c095f30 )
0x56362c095ce8 0025 opt_send_without_block( (struct rb_call_data *)0x56362c095f40 )
0x56362c095cf8 0027 pop
0x56362c095d00 0028 opt_getinlinecache( 6, (struct iseq_inline_cache_entry *)0x56362c095f00 )
0x56362c095d18 0031 putobject( (VALUE)0x14 )
0x56362c095d28 0033 getconstant( ID: 0x370b )
0x56362c095d38 0035 opt_setinlinecache( (struct iseq_inline_cache_entry *)0x56362c095f00 )
0x56362c095d48 0037 opt_send_without_block( (struct rb_call_data *)0x56362c095f50 )
0x56362c095d58 0039 pop
0x56362c095d60 0040 putself
0x56362c095d68 0041 opt_send_without_block( (struct rb_call_data *)0x56362c095f60 )
0x56362c095d78 0043 send( (struct rb_call_data *)0x56362c095f70, (rb_iseq_t *)0x56362ca05178 )
0x56362c095d90 0046 leave
(lldb) p ec->cfp->pc
(const VALUE *) $9 = 0x000056362c095d58
First I loaded the disassembly helper script. It provides the rbdisasm command.
Then I used rbdisasm on the instruction sequence.
This printed out the instructions in mostly human readable form.
Printing the PC showed a value of 0x000056362c095d58.
Looking at the PC list in the disassembly shows that 0x000056362c095d58 corresponds to a pop instruction.
But the PC always points at the next instruction that will execute, not the currently executing instruction.
The currently executing instruction is the one right before the PC.
In this case we can see it is opt_send_without_block, which lines up with the information we discovered from vm.inc.
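The "one right before the PC" lookup can be sketched in a few lines of Ruby. The rows are copied by hand from the tail of the rbdisasm output above. The constant and method names are mine, and this is only an illustration of the lookup, not part of the lldb scripts.

```ruby
# [address, index, instruction name] rows from the end of the disassembly.
DISASM_TAIL = [
  [0x56362c095d38, 35, "opt_setinlinecache"],
  [0x56362c095d48, 37, "opt_send_without_block"],
  [0x56362c095d58, 39, "pop"],
  [0x56362c095d60, 40, "putself"],
  [0x56362c095d68, 41, "opt_send_without_block"],
  [0x56362c095d78, 43, "send"],
  [0x56362c095d90, 46, "leave"],
].freeze

# The PC points at the next instruction, so the instruction in flight is the
# row with the highest address that is still strictly below the PC.
def executing_instruction(rows, pc)
  rows.select { |addr, _idx, _name| addr < pc }.max_by { |addr, _idx, _name| addr }
end

addr, idx, name = executing_instruction(DISASM_TAIL, 0x000056362c095d58)
puts format("%#x %04d %s", addr, idx, name)
```

Instructions have different widths (operands take up slots too), which is why the lookup goes by address rather than subtracting a fixed amount from the PC.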
This is the third from last method call in the instruction sequence.
At 0041 there is another opt_send_without_block, and then at 0043 there is a generic send call.
Looking at the Ruby code, from the bottom of the method, we see a call to backup.
It’s not a local variable, so it must be a method call.
The code calls each on that, and each takes a block.
These must correspond to the opt_send_without_block and the send at the end of the instruction sequence.
Our crash is happening just before these two, so it must be the call to ENV.clear.
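When a live Ruby is available (rather than only a core file), the same mapping between source and instructions can be checked with RubyVM::InstructionSequence, which is specific to CRuby. This is just a sketch. The snippet compiles a trimmed copy of the method without running it, and the exact instruction names and operands in the output vary between Ruby versions.

```ruby
# A trimmed copy of the method, compiled but never executed.
source = <<~RUBY
  def replace_with_backup
    ENV.clear
    backup.each {|k, v| ENV[k] = v }
  end
RUBY

disasm = RubyVM::InstructionSequence.compile(source).disasm

# Keep only the method-call instructions: calls without a block typically show
# up as opt_send_without_block, and calls with a block as a plain send.
calls = disasm.lines.grep(/\b(opt_send_without_block|send)\b/)
puts calls
```

On the Ruby versions I would expect this to be run on, the call to each (which takes a block) is the plain send, and the calls to clear and backup come before it, in the same order as at the end of the rbdisasm listing.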
If we read the implementation of ENV.clear, we can see that it returns a global variable called envtbl:
VALUE
rb_env_clear(void)
{
    VALUE keys;
    long i;

    keys = env_keys(TRUE);
    for (i=0; i<RARRAY_LEN(keys); i++) {
        VALUE key = RARRAY_AREF(keys, i);
        const char *nam = RSTRING_PTR(key);
        ruby_setenv(nam, 0);
    }
    RB_GC_GUARD(keys);
    return envtbl;
}
This object is allocated here:
envtbl = rb_obj_alloc(rb_cObject);
And then it calls rb_define_global_const to define the ENV constant as a global:
/*
 * ENV is a Hash-like accessor for environment variables.
 *
 * See ENV (the class) for more details.
 */
rb_define_global_const("ENV", envtbl);
If we read rb_define_global_const we can see that it just calls rb_define_const:
void
rb_define_global_const(const char *name, VALUE val)
{
    rb_define_const(rb_cObject, name, val);
}
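Before my patch, any object passed to rb_define_const would be pinned.
Once I removed the pinning, that allowed the object behind ENV to move around even though it shouldn’t.
The C global envtbl kept pointing at the object’s old address, and once the object moved, that slot was empty, which is how ENV.clear ended up returning a T_NONE.
I reverted that patch, and then sent a pull request to make rb_gc_register_mark_object a little bit smarter.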
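To picture why a moved object breaks a raw C global like envtbl, here is a toy model in Ruby. It is emphatically not how CRuby’s GC is implemented. Slot indexes stand in for addresses, :T_NONE marks an empty slot, and the class and variable names are all made up for illustration.

```ruby
# Toy compacting heap: an Array of slots where :T_NONE means "empty".
class ToyHeap
  attr_reader :slots

  def initialize(size)
    @slots  = Array.new(size, :T_NONE)
    @pinned = []
  end

  # Put an object in the first free slot and return its "address".
  def allocate(obj)
    addr = @slots.index(:T_NONE)
    raise "heap full" if addr.nil?
    @slots[addr] = obj
    addr
  end

  # Pinned objects are never moved.
  def pin(addr)
    @pinned << addr
  end

  # Move an object and fix up only the references the collector knows about
  # (known_refs, a Hash of name => address). Returns false if it was pinned.
  def move(from, to, known_refs)
    return false if @pinned.include?(from)
    raise "destination in use" unless @slots[to] == :T_NONE
    @slots[to] = @slots[from]
    @slots[from] = :T_NONE
    known_refs.transform_values! { |addr| addr == from ? to : addr }
    true
  end
end

heap = ToyHeap.new(4)
addr = heap.allocate("the ENV object")

constant_table = { ENV: addr } # a reference the collector can see and update
raw_global     = addr          # stands in for a C global it never updates

heap.move(addr, 3, constant_table)

puts heap.slots[constant_table[:ENV]].inspect # the constant followed the move
puts heap.slots[raw_global].inspect           # the raw global sees an empty slot
```

Calling heap.pin(addr) before the move makes move return false and leaves both references valid, which is roughly the guarantee the removed rb_gc_register_mark_object call was providing.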
Conclusion
TBH I don’t know what to conclude this with. Debugging errors kind of sucks, but I hope that the LLDB scripts I wrote make it suck a little less. Hope you’re having a good day!!!