Debugging an Assertion Error in Ruby

2021-02-03 @ 17:13

I hope nobody runs in to a problem where they need the information in this post, but in case you do, I hope this post is helpful. (I’m talking to you, future Aaron! lol)

I committed a patch to Ruby that caused the tests to start failing. This was the patch:

commit 1be84e53d76cff30ae371f0b397336dee934499d
Author: Aaron Patterson <tenderlove@ruby-lang.org>
Date:   Mon Feb 1 10:42:13 2021 -0800

    Don't pin `val` passed in to `rb_define_const`.
    
    The caller should be responsible for holding a pinned reference (if they
    need that)

diff --git a/variable.c b/variable.c
index 92d7d11eab..ff4f7964a7 100644
--- a/variable.c
+++ b/variable.c
@@ -3154,7 +3154,6 @@ rb_define_const(VALUE klass, const char *name, VALUE val)
     if (!rb_is_const_id(id)) {
        rb_warn("rb_define_const: invalid name `%s' for constant", name);
     }
-    rb_gc_register_mark_object(val);
     rb_const_set(klass, id, val);
 }

This patch is supposed to allow objects passed in to rb_define_const to move. As the commit message says, the caller should be responsible for keeping the value pinned. At the time I committed the patch, I thought that most callers of the function were marking the value passed in (as val), so we were pinning objects that something else would already pin. In other words, this code was being wasteful by chewing up GC time by pinning objects that were already pinned.

Unfortunately the CI started to error shortly after I committed this patch. Clearly the patch was related, but how?

In this post I am going to walk through the debugging tricks I used to find the error.

Reproduction

I was able to reproduce the error on my Linux machine by running the same command CI ran. Unfortunately since this bug is related to GC, the error was intermittent. To reproduce it, I just ran the tests in a loop until the process crashed like this:

$ while test $status -eq 0
    env RUBY_TESTOPTS='-q --tty=no' make -j16 -s check
  end

Before running this loop though, I made sure to do ulimit -c unlimited so that I would get a core file when the process crashed.

The Error

After the process crashed, the top of the error looked like this:

<OBJ_INFO:rb_ractor_confirm_belonging@./ractor_core.h:327> 0x000055be8657f180 [0     ] T_NONE 
/home/aaron/git/ruby/lib/bundler/environment_preserver.rb:47: [BUG] id == 0 but not shareable
ruby 3.1.0dev (2021-02-03T17:35:37Z master 6b4814083b) [x86_64-linux]

The Ractor verification routines crashed the process because a T_NONE object is “not sharable”. In other words you can’t share an object of type T_NONE between Ractors. This makes sense because T_NONE objects are actually empty slots in the GC. If a Ractor, or any other Ruby code sees a T_NONE object, then it’s clearly an error. Only the GC internals should ever be dealing with this type.

The top of the C backtrace looked like this:

-- C level backtrace information -------------------------------------------
/home/aaron/git/ruby/ruby(rb_print_backtrace+0x14) [0x55be856e9816] vm_dump.c:758
/home/aaron/git/ruby/ruby(rb_vm_bugreport) vm_dump.c:1020
/home/aaron/git/ruby/ruby(bug_report_end+0x0) [0x55be854e2a69] error.c:778
/home/aaron/git/ruby/ruby(rb_bug_without_die) error.c:778
/home/aaron/git/ruby/ruby(rb_bug+0x7d) [0x55be854e2bb0] error.c:786
/home/aaron/git/ruby/ruby(rb_ractor_confirm_belonging+0x102) [0x55be856cf6e2] ./ractor_core.h:328
/home/aaron/git/ruby/ruby(vm_exec_core+0x4ff3) [0x55be856b0003] vm.inc:2224
/home/aaron/git/ruby/tool/lib/test/unit/parallel.rb(rb_vm_exec+0x886) [0x55be856c9946]
/home/aaron/git/ruby/ruby(load_iseq_eval+0xbb) [0x55be8554f66b] load.c:594
/home/aaron/git/ruby/ruby(require_internal+0x394) [0x55be8554e3e4] load.c:1065
/home/aaron/git/ruby/ruby(rb_require_string+0x973c4) [0x55be8554d8a4] load.c:1142
/home/aaron/git/ruby/ruby(rb_f_require) load.c:838
/home/aaron/git/ruby/ruby(vm_call_cfunc_with_frame+0x11a) [0x55be856dd6fa] ./vm_insnhelper.c:2897
/home/aaron/git/ruby/ruby(vm_call_method_each_type+0xaa) [0x55be856d4d3a] ./vm_insnhelper.c:3387
/home/aaron/git/ruby/ruby(vm_call_alias+0x87) [0x55be856d68e7] ./vm_insnhelper.c:3037
/home/aaron/git/ruby/ruby(vm_sendish+0x200) [0x55be856d08e0] ./vm_insnhelper.c:4498

The function rb_ractor_confirm_belonging was the function raising an exception.

Debugging the Core File with LLDB

I usually use clang / lldb when debugging. I’ve added scripts to Ruby’s lldb tools that let me track down problems more easily, so I prefer it over gcc / gdb.

First I inspected the backtrace in the corefile:

(lldb) target create "./ruby" --core "core.456156"
Core file '/home/aaron/git/ruby/core.456156' (x86_64) was loaded.
(lldb) bt
* thread #1, name = 'ruby', stop reason = signal SIGABRT
  * frame #0: 0x00007fdc5fc8918b libc.so.6`raise + 203
    frame #1: 0x00007fdc5fc68859 libc.so.6`abort + 299
    frame #2: 0x000056362ac38bc6 ruby`die at error.c:765:5
    frame #3: 0x000056362ac38bb5 ruby`rb_bug(fmt=<unavailable>) at error.c:788:5
    frame #4: 0x000056362ae256e2 ruby`rb_ractor_confirm_belonging(obj=<unavailable>) at ractor_core.h:328:13
    frame #5: 0x000056362ae06003 ruby`vm_exec_core(ec=<unavailable>, initial=<unavailable>) at vm.inc:2224:5
    frame #6: 0x000056362ae1f946 ruby`rb_vm_exec(ec=<unavailable>, mjit_enable_p=<unavailable>) at vm.c:0
    frame #7: 0x000056362aca566b ruby`load_iseq_eval(ec=0x000056362b176710, fname=0x000056362ce96660) at load.c:594:5
    frame #8: 0x000056362aca43e4 ruby`require_internal(ec=<unavailable>, fname=<unavailable>, exception=1) at load.c:1065:21
    frame #9: 0x000056362aca38a4 ruby`rb_f_require [inlined] rb_require_string(fname=0x00007fdc38033178) at load.c:1142:18
    frame #10: 0x000056362aca3880 ruby`rb_f_require(obj=<unavailable>, fname=0x00007fdc38033178) at load.c:838
    frame #11: 0x000056362ae336fa ruby`vm_call_cfunc_with_frame(ec=0x000056362b176710, reg_cfp=0x00007fdc5f958de0, calling=<unavailable>) at vm_insnhelper.c:2897:11
    frame #12: 0x000056362ae2ad3a ruby`vm_call_method_each_type(ec=0x000056362b176710, cfp=0x00007fdc5f958de0, calling=0x00007ffe3b552128) at vm_insnhelper.c:3387:16
    frame #13: 0x000056362ae2c8e7 ruby`vm_call_alias(ec=0x000056362b176710, cfp=0x00007fdc5f958de0, calling=0x00007ffe3b552128) at vm_insnhelper.c:3037:12

It’s very similar to the backtrace in the crash report. The first thing that was interesting to me was frame 5 in vm_exec_core. vm_exec_core is the main loop for the YARV VM. This program was crashing when executing some kind of instruction in the virtual machine.

(lldb) f 5
frame #5: 0x000056362ae06003 ruby`vm_exec_core(ec=<unavailable>, initial=<unavailable>) at vm.inc:2224:5
   2221	    /* ### Instruction trailers. ### */
   2222	    CHECK_VM_STACK_OVERFLOW_FOR_INSN(VM_REG_CFP, INSN_ATTR(retn));
   2223	    CHECK_CANARY(leaf, INSN_ATTR(bin));
-> 2224	    PUSH(val);
   2225	    if (leaf) ADD_PC(INSN_ATTR(width));
   2226	#   undef INSN_ATTR
   2227	
(lldb)

Checking frame 5, we can see that it’s crashing when we push a value on to the stack. The Ractor function checks the value of objects being pushed on the VM stack, and in this case we have an object that is a T_NONE. The question is where did this value come from?

The crash happened in the file vm.inc, line 2224. This file is a generated file, so I can’t link to it, but I wanted to know which instruction was being executed, so I pulled up that file.

Line 2224 happened to be inside the opt_send_without_block instruction. So something is calling a method, and the return value of the method is a T_NONE object.

But what method is being called, and on what object?

Finding the called method

The value ec, or “Execution Context” contains information about the virtual machine at runtime. On the ec, we can find the cfp or “Control Frame Pointer” which is a data structure representing the current executing stack frame. In lldb, I could see that frame 7 had the ec available, so I went to that frame to look at the cfp:

(lldb) f 7
frame #7: 0x000056362aca566b ruby`load_iseq_eval(ec=0x000056362b176710, fname=0x000056362ce96660) at load.c:594:5
   591 	        rb_ast_dispose(ast);
   592 	    }
   593 	    rb_exec_event_hook_script_compiled(ec, iseq, Qnil);
-> 594 	    rb_iseq_eval(iseq);
   595 	}
   596 	
   597 	static inline enum ruby_tag_type
(lldb) p *ec->cfp
(rb_control_frame_t) $1 = {
  pc = 0x000056362c095d58
  sp = 0x00007fdc5f859330
  iseq = 0x000056362ca051f0
  self = 0x000056362b1d92c0
  ep = 0x00007fdc5f859328
  block_code = 0x0000000000000000
  __bp__ = 0x00007fdc5f859330
}

The control frame pointer has a pointer to the iseq or “Instruction Sequence” that is currently being executed. It also has a pc or “Program Counter”, and the program counter usually points at the instruction that will be executed next (in other words, not the currently executing instruction). Of other interest, the iseq also has the source location that corresponds to those instructions.

Getting the Source File

If we examine the iseq structure, we can find the source location of the code that is currently being executed:

(lldb) p ec->cfp->iseq->body->location
(rb_iseq_location_t) $4 = {
  pathobj = 0x000056362ca06960
  base_label = 0x000056362ce95a30
  label = 0x000056362ce95a30
  first_lineno = 0x0000000000000051
  node_id = 137
  code_location = {
    beg_pos = (lineno = 40, column = 4)
    end_pos = (lineno = 50, column = 7)
  }
}
(lldb) command script import -r ~/git/ruby/misc/lldb_cruby.py
lldb scripts for ruby has been installed.
(lldb) rp 0x000056362ca06960
bits [     ]
T_STRING: [FROZEN] (const char [57]) $6 = "/home/aaron/git/ruby/lib/bundler/environment_preserver.rb"
(lldb)

The location info clearly shows us that the instructions are on line 40. The pathobj member contains the file name, but it is stored as a Ruby string. To print out the string, I imported the lldb CRuby extensions, then used the rp command and gave it the address of the path object.

From the output, we can see that it’s crashing in the “environment_preserver.rb” file inside of the instructions that are defined on line 40. We’re not crashing on line 40, but the instructions are defined there.

Those instructions are this method:

    def replace_with_backup
      ENV.replace(backup) unless Gem.win_platform?

      # Fallback logic for Windows below to workaround
      # https://bugs.ruby-lang.org/issues/16798. Can be dropped once all
      # supported rubies include the fix for that.

      ENV.clear

      backup.each {|k, v| ENV[k] = v }
    end

It’s still not clear which of these method calls is breaking. In this function we have some method call that is returning a T_NONE.

Finding The Method Call

To find the method call, I disassembled the instruction sequence and checked the program counter:

(lldb) command script import -r misc/lldb_disasm.py
lldb Ruby disasm installed.
(lldb) rbdisasm ec->cfp->iseq
PC             IDX  insn_name(operands) 
0x56362c095c20 0000 opt_getinlinecache( 6, (struct iseq_inline_cache_entry *)0x56362c095ee0 )
0x56362c095c38 0003 putobject( (VALUE)0x14 )
0x56362c095c48 0005 getconstant( ID: 0x807b )
0x56362c095c58 0007 opt_setinlinecache( (struct iseq_inline_cache_entry *)0x56362c095ee0 )
0x56362c095c68 0009 opt_send_without_block( (struct rb_call_data *)0x56362c095f20 )
0x56362c095c78 0011 branchif( 15 )
0x56362c095c88 0013 opt_getinlinecache( 6, (struct iseq_inline_cache_entry *)0x56362c095ef0 )
0x56362c095ca0 0016 putobject( (VALUE)0x14 )
0x56362c095cb0 0018 getconstant( ID: 0x370b )
0x56362c095cc0 0020 opt_setinlinecache( (struct iseq_inline_cache_entry *)0x56362c095ef0 )
0x56362c095cd0 0022 putself
0x56362c095cd8 0023 opt_send_without_block( (struct rb_call_data *)0x56362c095f30 )
0x56362c095ce8 0025 opt_send_without_block( (struct rb_call_data *)0x56362c095f40 )
0x56362c095cf8 0027 pop
0x56362c095d00 0028 opt_getinlinecache( 6, (struct iseq_inline_cache_entry *)0x56362c095f00 )
0x56362c095d18 0031 putobject( (VALUE)0x14 )
0x56362c095d28 0033 getconstant( ID: 0x370b )
0x56362c095d38 0035 opt_setinlinecache( (struct iseq_inline_cache_entry *)0x56362c095f00 )
0x56362c095d48 0037 opt_send_without_block( (struct rb_call_data *)0x56362c095f50 )
0x56362c095d58 0039 pop
0x56362c095d60 0040 putself
0x56362c095d68 0041 opt_send_without_block( (struct rb_call_data *)0x56362c095f60 )
0x56362c095d78 0043 send( (struct rb_call_data *)0x56362c095f70, (rb_iseq_t *)0x56362ca05178 )
0x56362c095d90 0046 leave
(lldb) p ec->cfp->pc
(const VALUE *) $9 = 0x000056362c095d58

First I loaded the disassembly helper script. It provides the rbdisasm function. Then I used rbdisasm on the instruction sequence. This printed out the instructions in mostly human readable form. Printing the PC showed a value of 0x000056362c095d58. Looking at the PC list in the disassembly shows that 0x000056362c095d58 corresponds to a pop instruction. But the PC always points at the next instruction that will execute, not the currently executing instruction. The currently executing instruction is the one right before the PC. In this case we can see it is opt_send_without_block, which lines up with the information we discovered from vm.inc.

This is the 3rd from last method call in the block. At 0041 there is another opt_send_without_block, and then at 0043 there is a generic send call.

Looking at the Ruby code, from the bottom of the method, we see a call to backup. It’s not a local variable, so it must be a method call. The code calls each on that, and each takes a block. These must correspond to the opt_send_without_block and the send at the end of the instruction sequence. Our crash is happening just before these two, so it must be the call to ENV.clear.

If we read the implementation of ENV.clear, we can see that it returns a global variable called envtbl:

VALUE
rb_env_clear(void)
{
    VALUE keys;
    long i;

    keys = env_keys(TRUE);
    for (i=0; i<RARRAY_LEN(keys); i++) {
        VALUE key = RARRAY_AREF(keys, i);
        const char *nam = RSTRING_PTR(key);
        ruby_setenv(nam, 0);
    }
    RB_GC_GUARD(keys);
    return envtbl;
}

This object is allocated here:

    envtbl = rb_obj_alloc(rb_cObject);

And then it calls rb_define_global_const to define the ENV constant as a global:

    /*
     * ENV is a Hash-like accessor for environment variables.
     *
     * See ENV (the class) for more details.
     */
    rb_define_global_const("ENV", envtbl);

If we read rb_define_global_const we can see that it just calls rb_define_const:

void
rb_define_global_const(const char *name, VALUE val)
{
    rb_define_const(rb_cObject, name, val);
}

Before my patch, any object passed to rb_define_const would be pinned. Once I removed the pinning, that allowed the ENV variable to move around even though it shouldn’t.

I reverted that patch here, and then sent a pull request to make rb_gc_register_mark_object a little bit smarter here.

Conclusion

TBH I don’t know what to conclude this with. Debugging errors kind of sucks, but I hope that the LLDB scripts I wrote make it suck a little less. Hope you’re having a good day!!!

« go back