Speeding up Ruby with Shared Strings
Feb 12, 2018 @ 10:00 am

It’s not often I am able to write a patch that not only reduces memory usage,
but increases speed as well. Usually I find myself trading memory for speed, so
it’s a real treat when I can improve both in one patch. Today I want to talk
about the patch I submitted to Ruby in this ticket.
It decreases “after boot” memory usage of a Rails application by 4% and speeds
up require
by about 35%.
When I was writing this patch, I was actually focusing on trying to reduce memory usage. It just happens that reducing memory usage also resulted in faster runtime. So really I wanted to title this post “Reducing Memory Usage in Ruby”, but I already made a post with that title.
Shared String Optimization
As I mentioned in previous posts, Ruby objects are limited to 40
bytes. But a string can be much longer than 40 bytes, so how are they
stored? If we look at the struct that represents strings, we’ll find there is a char *
pointer:
struct RString {
    struct RBasic basic;
    union {
        struct {
            long len;
            char *ptr;
            union {
                long capa;
                VALUE shared;
            } aux;
        } heap;
        char ary[RSTRING_EMBED_LEN_MAX + 1];
    } as;
};
The ptr
field in the string struct points to a byte array which is our string.
So the actual memory usage of a string is approximately 40 bytes for the object,
plus however long the string is. If we were to visualize the layout, it would
look something like this:
In this case, there are really two allocations: the RString
object and the
“hello world” character array. The RString
object is the 40 byte Ruby object
allocated using the GC, and the character array was allocated using the system’s
malloc
implementation.
Side note: There is another optimization called “embedding”. Without getting
too far off track, “embedding” is just keeping strings that are “small enough”
stored directly inside the RString
structure. We can talk about that in a
different post, but today pretend there are always two distinct allocations.
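Just to give a quick taste of embedding before we set it aside (a hedged example: these numbers are from a 64-bit Ruby around 2.5 and may differ between versions and builds), ObjectSpace reports only the 40 byte object for a small string, while a longer string also pays for its separate buffer:
>> require 'objspace'
=> true
>> ObjectSpace.memsize_of("hello")   # short enough to be embedded in the RString
=> 40
>> ObjectSpace.memsize_of("x" * 100) # too long to embed: 40 + 100 + 1 terminator
=> 141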
We can take advantage of this character array and represent substrings by just pointing at a different location. For example, we can have two Ruby objects, one representing the string “hello world” and the other representing the string “world” and only allocate one character array buffer:
This example only has 3 allocations: 2 from the GC for the Ruby string objects,
and one malloc
for the character array. Using ObjectSpace
, we can actually
observe this optimization by measuring memory size of the objects after slicing
them:
>> require 'objspace'
=> true
>> str = "x" * 9000; nil
=> nil
>> ObjectSpace.memsize_of str
=> 9041
>> substr = str[30, str.length - 30]; nil
=> nil
>> str.length
=> 9000
>> substr.length
=> 8970
>> ObjectSpace.memsize_of substr
=> 40
The example above first allocates a string that is 9000 characters. Next we measure the memory size of the string. The total size is 9000 for the characters, plus some overhead for the Ruby object for a total of 9041. Next we take a substring, slicing off the first 30 characters of the original. As expected, the original string is 9000 characters, and the substring is 8970. However, if we measure the size of the substring it is only 40 bytes! This is because the new string only requires a new Ruby object to be allocated, and the new object just points at a different location in the original string’s character buffer, just like the graph above showed.
This optimization isn’t limited to just strings; we can use it with arrays too:
>> list = ["x"] * 9000; nil
=> nil
>> ObjectSpace.memsize_of(list)
=> 72040
>> list2 = list[30, list.length - 30]; nil
=> nil
>> ObjectSpace.memsize_of(list2)
=> 40
In fact, functional languages, where data structures are immutable, can take great advantage of this optimization. In languages that allow mutation, we have to deal with the case where the original string might be mutated, whereas languages with immutable data structures can be even more aggressive about this optimization.
Limits of the Shared String Optimization
This shared string optimization isn’t without limits though. To take advantage of it, the substring has to extend all the way to the end of the original string. In other words, we can’t take a slice from the middle of the string and get the optimization. Let’s take our sample string and slice 15 characters off each side and see what the memsize is:
>> str = "x" * 9000; nil
=> nil
>> str.length
=> 9000
>> substr = str[15, str.length - 30]; nil
=> nil
>> substr.length
=> 8970
>> ObjectSpace.memsize_of(substr)
=> 9011
We can see in the above example that the memsize of the substring is much larger than in the first example. That is because Ruby had to create a new buffer to store the substring. So our lesson here is: if you have to slice strings, start from the left and go all the way to the end.
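Sticking with the same 9000 character string, we can double-check that rule (sizes again from a 64-bit Ruby 2.5; they may vary by version): a slice that stops short of the end is copied, while one that starts in the middle but runs to the end is shared:
>> ObjectSpace.memsize_of str[0, 1000]     # stops short of the end: copied
=> 1041
>> ObjectSpace.memsize_of str[8000, 1000]  # runs all the way to the end: shared
=> 40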
Here is an interesting thing to think about. At the end of the following
program, what is the memsize of substr
? How much memory is this program
actually consuming? Is the str
object still alive, and how can we find out?
require 'objspace'
str = "x" * 9000
substr = str[30, str.length - 30]
str = nil
GC.start
# What is the memsize of substr?
# How much memory is this program actually consuming?
# Is `str` still alive even though we did a GC?
# Hint: use `ObjectSpace.dump_all`
# (if you try this out, I recommend running the program with `--disable-gems`)
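If you want to try answering those questions, one way to poke at the heap is sketched below (only a sketch: ObjectSpace.dump_all writes one JSON document per live object, and the file name and grep pattern here are just examples):
require 'objspace'

str = "x" * 9000
substr = str[30, str.length - 30]
str = nil
GC.start

# Dump every live object as one JSON document per line, then search the
# dump for string entries with a large bytesize to see whether the
# original 9000 byte buffer survived the GC.
File.open("heap.json", "w") { |f| ObjectSpace.dump_all(output: f) }
# From a shell: grep '"bytesize":9000' heap.json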
The optimization I explained above works exactly the same way for strings in C
as it does in Ruby. We will use this optimization to reduce memory usage and
speed up require
in Ruby.
Reducing Memory Usage and Speeding Up require
I’ve already described the technique we’re going to use to speed up require,
so let’s take a look at the problem. After that, we’ll apply the shared string
optimization to improve the performance of require.
Every time a program requires a file, Ruby has to check to see if that file has
already been required. The global variable $LOADED_FEATURES
is a list of all
the files that have been required so far. Of course, searching through a list
for a file would be quite slow and get slower as the list grows, so Ruby keeps a
hash to look up entries in the $LOADED_FEATURES
list. This hash is called the
loaded_features_index
, and it’s stored on the virtual machine structure
here.
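We can see the $LOADED_FEATURES side of this from plain Ruby (the path in the output is just an example and will depend on where your Ruby is installed):
>> require 'set'
=> true
>> $LOADED_FEATURES.grep(/set\.rb/)
=> ["/usr/local/lib/ruby/2.5.0/set.rb"]
>> require 'set'   # already in $LOADED_FEATURES (and the index), so nothing is reloaded
=> false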
The keys of this hash are strings that could be passed to require
to require a
particular file, and the value is the index in the $LOADED_FEATURES
array of
the file that actually got required. So, for example, if you have a file on your
system at /a/b/c.rb, the keys to the hash will be:
- “/a/b/c.rb”
- “a/b/c.rb”
- “b/c.rb”
- “c.rb”
- “/a/b/c”
- “a/b/c”
- “b/c”
- “c”
Given a well-crafted load path, any of the strings above could be used to load
the /a/b/c.rb file, so the index needs to keep all of them. For example, you
could do ruby -I / -e"require 'a/b/c'", or ruby -I /a -e"require 'b/c'",
etc., and they all point to the same file.
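To make that list concrete, here is a rough Ruby sketch (not the real implementation, which we’ll look at next) that produces the same set of keys for a path:
# Rough sketch: generate every suffix of the path (with and without the
# file extension) starting at the beginning or just after a '/'.
def index_keys(path)
  no_ext = path.chomp(File.extname(path))
  [path, no_ext].flat_map do |p|
    starts = [0] + (0...p.length).select { |i| p[i] == '/' }.map { |i| i + 1 }
    starts.map { |i| p[i..-1] }
  end.uniq
end

index_keys("/a/b/c.rb")
# => ["/a/b/c.rb", "a/b/c.rb", "b/c.rb", "c.rb", "/a/b/c", "a/b/c", "b/c", "c"]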
The loaded_features_index
hash is built in the features_index_add
function.
Let’s pick apart this function a little.
static void
features_index_add(VALUE feature, VALUE offset)
{
    VALUE short_feature;
    const char *feature_str, *feature_end, *ext, *p;

    feature_str = StringValuePtr(feature);
    feature_end = feature_str + RSTRING_LEN(feature);

    for (ext = feature_end; ext > feature_str; ext--)
        if (*ext == '.' || *ext == '/')
            break;
    if (*ext != '.')
        ext = NULL;
    /* Now `ext` points to the only string matching %r{^\.[^./]*$} that is
       at the end of `feature`, or is NULL if there is no such string. */
This function takes a feature
and an offset
as parameters. The feature
is
the full name of the file that was required, extension and everything. offset
is the index in the loaded features list where this string is. The first part
of this function starts at the end of the string and scans backwards looking for
a period or a forward slash. If it finds a period, we know the file has an
extension (it is possible to require a Ruby file without an extension!); if it
finds a forward slash first, it gives up and assumes there is no extension.
    while (1) {
        long beg;

        p--;
        while (p >= feature_str && *p != '/')
            p--;
        if (p < feature_str)
            break;
        /* Now *p == '/'. We reach this point for every '/' in `feature`. */
        beg = p + 1 - feature_str;
        short_feature = rb_str_subseq(feature, beg, feature_end - p - 1);
        features_index_add_single(short_feature, offset);
        if (ext) {
            short_feature = rb_str_subseq(feature, beg, ext - p - 1);
            features_index_add_single(short_feature, offset);
        }
    }
Next the function scans backwards through the string looking for forward slashes. Every time
it finds a forward slash, it uses rb_str_subseq
to get a substring and then
calls features_index_add_single
to register that substring. rb_str_subseq
gets substrings in the same way we were doing above in Ruby, and applies the
same optimizations.
The if (ext)
conditional deals with files that have an extension, and this is
really where our problems begin. This conditional gets a substring of
feature
, but it doesn’t go all the way to the end of the string. It must
exclude the file extension. This means it will copy the underlying string.
So these two calls to rb_str_subseq
do 3 allocations total: 2 Ruby objects
(the function returns a Ruby object) and one malloc to copy the string for the
“no extension substring” case.
This function calls features_index_add_single
to add the substring to the
index. I want to call out one excerpt from the features_index_add_single
function:
    features_index = get_loaded_features_index_raw();
    st_lookup(features_index, (st_data_t)short_feature_cstr, (st_data_t *)&this_feature_index);

    if (NIL_P(this_feature_index)) {
        st_insert(features_index, (st_data_t)ruby_strdup(short_feature_cstr), (st_data_t)offset);
    }
This code looks up the string in the index, and if the string isn’t in the
index, it adds it to the index. The caller allocated a new Ruby
string, and that string could get garbage collected, so this function calls
ruby_strdup
to copy the string for the hash key. It’s important to note that the
keys to this hash aren’t Ruby objects, but char *
pointers that came from
Ruby objects (the char *ptr
field that we were looking at earlier).
Let’s count the allocations. So far, we have 2 Ruby objects: one with a file
extension and one without, 1 malloc for the non-sharable substring, then 2 more
mallocs to copy the string in to the hash. So each iteration of the while loop
in features_index_add
will do 5 allocations: 2 Ruby objects, and 3 mallocs.
In cases like this, a picture might help explain better. Below is a diagram of the allocated memory and how the allocations relate to each other.
This diagram shows what the memory layout looks like when adding the path
/a/b/c.rb
to the index, resulting in 8 hash entries.
Blue nodes are allocations that were alive before the call to add the path to
the index. Red nodes are intermediate allocations done while populating the
index, and will be freed at some point. Black nodes are allocations made while
adding the path to the index but live after we’ve finished adding the path to
the index. Solid arrows represent actual references, while dotted lines
indicate a relationship but not actually a reference (like one string was
ruby_strdup
’d from another).
The graph has lots of nodes and is very complicated, but we will clean it up!
Applying the Shared String Optimization
I’ve translated the C code to Ruby code so that we can more easily see the optimization at work:
$features_index = {}

def features_index_add(feature, index)
  ext = feature.index('.')
  p = ext ? ext : feature.length

  loop do
    p -= 1
    while p > 0 && feature[p] != '/'
      p -= 1
    end
    break if p == 0

    short_feature = feature[p + 1, feature.length - p - 1] # New Ruby Object
    features_index_add_single(short_feature, index)

    if ext # slice out the file extension if there is one
      short_feature = feature[p + 1, ext - p - 1] # New Ruby Object + malloc
      features_index_add_single(short_feature, index)
    end
  end
end

def features_index_add_single(str, index)
  return if $features_index.key?(str)
  $features_index[str.dup] = index # malloc
end

features_index_add "/a/b/c.rb", 1
As we already learned, the shared string optimization only works when the substrings include the end of the shared string. That is, we can only take substrings from the left side of the string.
The first change we can make is to split the string into two cases: one with an extension, and one without. Since the slice in the “no extension” branch does not extend to the end of the string, it always allocates a new buffer. If we make a new string up front that doesn’t contain the extension, then we can eliminate one of the malloc cases:
$features_index = {}

def features_index_add(feature, index)
  no_ext_feature = nil
  p = feature.length
  ext = feature.index('.')

  if ext
    p = ext
    no_ext_feature = feature[0, ext] # New Ruby Object + malloc
  end

  loop do
    p -= 1
    while p > 0 && feature[p] != '/'
      p -= 1
    end
    break if p == 0

    short_feature = feature[p + 1, feature.length - p - 1] # New Ruby Object
    features_index_add_single(short_feature, index)

    if ext
      len = no_ext_feature.length
      short_feature = no_ext_feature[p + 1, len - p - 1] # New Ruby Object
      features_index_add_single(short_feature, index)
    end
  end
end

def features_index_add_single(str, index)
  return if $features_index.key?(str)
  $features_index[str.dup] = index # malloc
end

features_index_add "/a/b/c.rb", 1
This changes the function to allocate one new string, but always slice to the end of both strings. Now that we have two strings we can “slice from the left”, we’re able to avoid new substring mallocs inside the loop. You can see this change, where I allocate a new string without an extension, here.
Below is a graph of what the memory layout and relationships look like after pulling up one slice, then sharing the string:
You can see from this graph that we were able to eliminate string buffers by allocating the “extensionless” substring first, then taking slices from it.
There are two more optimizations I applied in this patch. Unfortunately they are specific to the C language and not easy to explain using Ruby.
Eliminating Ruby Object Allocations
The existing code uses Ruby to slice strings. This allocates a new Ruby object. Now that we have two strings, we can always take substrings from the left, and that means we can use pointers in C to “create” substrings. Rather than asking Ruby APIs to slice the string for us, we simply use a pointer in C to point at where we want the substring to start. The hash table that maintains the index uses C strings as keys, so instead of passing Ruby objects around, we’ll just pass a pointer into the string:
-        short_feature = rb_str_subseq(feature, beg, feature_end - p - 1);
-        features_index_add_single(short_feature, offset);
+        features_index_add_single(feature_str + beg, offset);
         if (ext) {
-            short_feature = rb_str_subseq(feature, beg, ext - p - 1);
-            features_index_add_single(short_feature, offset);
+            features_index_add_single(feature_no_ext_str + beg, offset);
         }
     }
-    features_index_add_single(feature, offset);
+    features_index_add_single(feature_str, offset);
     if (ext) {
-        short_feature = rb_str_subseq(feature, 0, ext - feature_str);
-        features_index_add_single(short_feature, offset);
+        features_index_add_single(feature_no_ext_str, offset);
In this case, using a pointer into the string simplifies our code.
feature_str
is a pointer to the head of the string that has a file
extension, and feature_no_ext_str
is a pointer to the head of the string that
doesn’t have a file extension. beg
is the number of characters from the
head of the string where we want to slice. All we have to do now is just add
beg
to the head of each pointer and pass that to features_index_add_single
.
In this graph you can see we no longer need the intermediate Ruby objects
because the “add single” function directly accesses the underlying char *
pointer:
Eliminating malloc Calls
Finally, let’s eliminate the ruby_strdup calls. As we covered earlier, the caller
allocates new Ruby strings, and those strings could get free’d by the
garbage collector, so we had to call ruby_strdup to keep a copy around inside
the index hash. The feature
string passed in is also stored in the
$LOADED_FEATURES
global array, so there is no need to copy that string as the
array will prevent the GC from collecting it. However, we created a new string
that does not have an extension, and that object could get collected. If we can
prevent the GC from collecting those strings, then we don’t need to copy
anything.
To keep these new strings alive, I added an array to the virtual machine (the virtual machine lives for the life of the process):
     vm->loaded_features = rb_ary_new();
     vm->loaded_features_snapshot = rb_ary_tmp_new(0);
     vm->loaded_features_index = st_init_strtable();
+    vm->loaded_features_index_pool = rb_ary_new();
Then I add the new string to the array via rb_ary_push
right after allocation:
+ short_feature_no_ext = rb_fstring(rb_str_freeze(rb_str_subseq(feature, 0, ext - feature_str)));
+ feature_no_ext_str = StringValuePtr(short_feature_no_ext);
+ rb_ary_push(get_loaded_features_index_pool_raw(), short_feature_no_ext);
Now all strings in the index hash are shared and kept alive. This means we can
safely remove the ruby_strdup
without any strings getting free’d by the GC:
     if (NIL_P(this_feature_index)) {
-        st_insert(features_index, (st_data_t)ruby_strdup(short_feature_cstr), (st_data_t)offset);
+        st_insert(features_index, (st_data_t)short_feature_cstr, (st_data_t)offset);
     }
After this change, we don’t need to copy any memory because the hash keys can point directly in to the underlying character array inside the Ruby string object:
This new algorithm does 2 allocations: one to create a “no extension” copy of
the original string, and one RString
object to wrap it. The “loaded features
index pool” array keeps the newly created string from being garbage collected,
and now we can point directly in to the string arrays without needing to copy
the strings.
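If we map this back onto the Ruby translation from earlier, the end result looks roughly like this (only a sketch: in the real patch the hash keys are char * pointers into these strings, which is exactly why the pool array has to keep them alive):
$features_index      = {}
$features_index_pool = []   # plays the role of vm->loaded_features_index_pool

def features_index_add_single(str, index)
  str = -str                          # freeze (and dedup), like rb_fstring in the patch
  return if $features_index.key?(str)
  $features_index_pool << str         # pin the string so the GC can't collect it
  $features_index[str] = index        # frozen key, so Hash stores it without copying
end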
For any file added to the “loaded features” array, we changed it from requiring O(N) allocations (where N is the number of slashes in a string) to always requiring only two allocations regardless of the number of slashes in the string.
The End
By using shared strings I was able to eliminate over 76000 system calls during
the Rails boot process on a basic app, reduce the memory footprint by 4%, and
speed up require
by 35%. Next week I will try to get some statistics from a
large application and see how well it performs there!
Thanks for reading!