Ruby Internals

Writing a Ruby script in your text editor is a simple task. Do you know what happens after you execute it?

Ruby takes your script and transforms it three times before it actually runs on the machine.

Your Ruby script gets tokenized, parsed, and finally compiled into bytecode instructions that Ruby's virtual machine runs.

Tokenize

Let’s start with an example Ruby script that you might write. We will use this simple snippet throughout this article.

# Simple ruby script: snippet.rb

def add(a, b)
  a + b
end

puts add 1, 2

Ripper.lex

What happens when Ruby tokenizes this script? It breaks the source text into its individual keywords, identifiers, operators and whitespace, and turns them into a list of tokens.

# Sending it through Ripper for tokenization

require 'ripper'
require 'pp'

pp Ripper.lex("def add(a, b)\n  a + b\nend\n\nputs add 1, 2\n")

# Output

[[[1, 0], :on_kw, "def"],
 [[1, 3], :on_sp, " "],
 [[1, 4], :on_ident, "add"],
 [[1, 7], :on_lparen, "("],
 [[1, 8], :on_ident, "a"],
 [[1, 9], :on_comma, ","],
 [[1, 10], :on_sp, " "],
 [[1, 11], :on_ident, "b"],
 [[1, 12], :on_rparen, ")"],
 [[1, 13], :on_ignored_nl, "\n"],
 [[2, 0], :on_sp, "  "],
 [[2, 2], :on_ident, "a"],
 [[2, 3], :on_sp, " "],
 [[2, 4], :on_op, "+"],
 [[2, 5], :on_sp, " "],
 [[2, 6], :on_ident, "b"],
 [[2, 7], :on_nl, "\n"],
 [[3, 0], :on_kw, "end"],
 [[3, 3], :on_nl, "\n"],
 [[4, 0], :on_ignored_nl, "\n"],
 [[5, 0], :on_ident, "puts"],
 [[5, 4], :on_sp, " "],
 [[5, 5], :on_ident, "add"],
 [[5, 8], :on_sp, " "],
 [[5, 9], :on_int, "1"],
 [[5, 10], :on_comma, ","],
 [[5, 11], :on_sp, " "],
 [[5, 12], :on_int, "2"],
 [[5, 13], :on_nl, "\n"]]

As you can see, each token is printed on its own line. On the left, we have the line number together with the column number. Next to it is the token type as a symbol: :on_kw, :on_sp, etc. Following that is the text that matched the token.
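To make the token list easier to scan, one option is to drop the layout tokens (spaces and newlines) and print only each token's type and text. This is a minimal sketch: the names LAYOUT_TOKENS and significant are made up for illustration. Newer Ruby versions append a lexer state as a fourth element of each entry, so the code reads fields by index rather than destructuring.

```ruby
require 'ripper'

code = "def add(a, b)\n  a + b\nend\n\nputs add 1, 2\n"

# Token types that only carry layout information (illustrative constant).
LAYOUT_TOKENS = [:on_sp, :on_nl, :on_ignored_nl].freeze

# Each entry looks like [[line, column], type, text, ...].
# Keep only the type and the text of the non-layout tokens.
significant = Ripper.lex(code)
                    .reject { |token| LAYOUT_TOKENS.include?(token[1]) }
                    .map { |token| [token[1], token[2]] }

significant.each { |type, text| puts "#{type.to_s.ljust(12)} #{text}" }
```

What remains is the stream of keywords, identifiers, punctuation and literals that the parser has to make sense of.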

Tokenization will not reflect any syntax errors. yylex will just tokenize the code without any objections. It is in the next step, parsing, that the parser checks for syntax errors.
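A small experiment shows the difference between the two steps. The sketch below feeds a deliberately broken version of the snippet (the variable name broken is made up for illustration) to both Ripper.lex and Ripper.sexp. The lexer still returns tokens, while Ripper.sexp, which runs the parser, returns nil when parsing fails.

```ruby
require 'ripper'

# Deliberately broken: the closing parenthesis and `end` are missing.
broken = "def add(a, b\n  a + b\n"

tokens = Ripper.lex(broken)   # tokenizing only
tree   = Ripper.sexp(broken)  # tokenizing and parsing

puts "tokens produced: #{tokens.length}"
puts "parse result:    #{tree.inspect}"
```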

Parse

After tokenization, Ruby will parse your code with a parser generated by Bison. Bison takes a grammar file and produces a parser that follows its rules. In Ruby, the rules are in a 10,000-line file, parse.y. Taking a look at that file, we can see how Ripper hooks directly into the parser to pull out the list of tokens.

yydebug

To catch a glimpse of the complexities of parsing Ruby code, you can use Ruby’s -y, --yydebug or --dump yydebug option. It will output internal debug information each time the parser changes state.

Running it on the small snippet of code we had earlier gives:

$ ruby -y snippet.rb
Starting parse
Entering state 0
Reducing stack by rule 1 (line 859):
-> $$ = nterm $@1 ()
Stack now 0
Entering state 2
Reading a token: Next token is token keyword_def ()
Shifting token keyword_def ()
Entering state 7
Reducing stack by rule 347 (line 3091):
   $1 = token keyword_def ()
-> $$ = nterm k_def ()
Stack now 0 2
Entering state 96
Reading a token: Next token is token tIDENTIFIER ()
Shifting token tIDENTIFIER ()
Entering state 393
...
...

Ripper.sexp

Earlier, we learned how to use Ripper to show the tokens that Ruby uses to transform your code. Now let’s take a look at how to use Ripper to display information about how Ruby parses your code.

# Sending it through Ripper for parsing

require 'ripper'
require 'pp'

pp Ripper.sexp("def add(a, b)\n  a + b\nend\n\nputs add 1, 2\n")

# Output
[:program,
 [[:def,
   [:@ident, "add", [1, 4]],
   [:paren,
    [:params,
     [[:@ident, "a", [1, 8]], [:@ident, "b", [1, 11]]],
     nil,
     nil,
     nil,
     nil,
     nil,
     nil]],
   [:bodystmt,
    [[:binary,
      [:var_ref, [:@ident, "a", [2, 2]]],
      :+,
      [:var_ref, [:@ident, "b", [2, 6]]]]],
    nil,
    nil,
    nil]],
  [:command,
   [:@ident, "puts", [5, 0]],
   [[:command,
     [:@ident, "add", [5, 5]],
     [:args_add_block,
      [[:@int, "1", [5, 9]], [:@int, "2", [5, 12]]],
      false]]]]]]

What happened there? As Ruby parses your code, it transforms the tokens into a data structure called an Abstract Syntax Tree (AST). This AST holds the structure and meaning of your code. You can still see some of your code in the output.
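Because Ripper.sexp returns plain nested arrays, the tree can be explored with ordinary Ruby. As a rough sketch, the hypothetical helper method_names below (not part of Ripper) walks every nested array and collects the names of the :def nodes it finds.

```ruby
require 'ripper'

code = "def add(a, b)\n  a + b\nend\n\nputs add 1, 2\n"

# Recursively visit every nested array and collect the names of :def nodes.
def method_names(node, found = [])
  return found unless node.is_a?(Array)

  # A :def node looks like [:def, [:@ident, "add", [1, 4]], params, body].
  found << node[1][1] if node.first == :def
  node.each { |child| method_names(child, found) }
  found
end

p method_names(Ripper.sexp(code)) # prints ["add"]
```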

parsetree

Ruby has a few hidden things that are quite fun to explore. Another option to show some debug information about the AST of your code is the --dump parsetree option. You can also use --dump parsetree_with_comment to show comment annotations in the tree.

Using the same snippet of code we had earlier:

$ ruby --dump parsetree snippet.rb
###########################################################
## Do NOT use this node dump for any purpose other than  ##
## debug and research.  Compatibility is not guaranteed. ##
###########################################################

# @ NODE_SCOPE (line: 5)
# +- nd_tbl: (empty)
# +- nd_args:
# |   (null node)
# +- nd_body:
#     @ NODE_BLOCK (line: 1)
#     +- nd_head:
#     |   @ NODE_DEFN (line: 1)
#     |   +- nd_mid: :add
#     |   +- nd_defn:
#     |       @ NODE_SCOPE (line: 3)
#     |       +- nd_tbl: :a,:b
#     |       +- nd_args:
#     |       |   @ NODE_ARGS (line: 1)
#     |       |   +- nd_ainfo->pre_args_num: 2
#     |       |   +- nd_ainfo->pre_init:
#     |       |   |   (null node)
#     |       |   +- nd_ainfo->post_args_num: 0
#     |       |   +- nd_ainfo->post_init:
#     |       |   |   (null node)
#     |       |   +- nd_ainfo->first_post_arg: (null)
#     |       |   +- nd_ainfo->rest_arg: (null)
#     |       |   +- nd_ainfo->block_arg: (null)
#     |       |   +- nd_ainfo->opt_args:
#     |       |   |   (null node)
#     |       |   +- nd_ainfo->kw_args:
#     |       |       (null node)
#     |       |   +- nd_ainfo->kw_rest_arg:
#     |       |       (null node)
#     |       +- nd_body:
#     |           @ NODE_CALL (line: 2)
#     |           +- nd_mid: :+
#     |           +- nd_recv:
#     |           |   @ NODE_LVAR (line: 2)
#     |           |   +- nd_vid: :a
#     |           +- nd_args:
#     |               @ NODE_ARRAY (line: 2)
#     |               +- nd_alen: 1
#     |               +- nd_head:
#     |               |   @ NODE_LVAR (line: 2)
#     |               |   +- nd_vid: :b
#     |               +- nd_next:
#     |                   (null node)
#     +- nd_next:
#         @ NODE_BLOCK (line: 5)
#         +- nd_head:
#         |   @ NODE_FCALL (line: 5)
#         |   +- nd_mid: :puts
#         |   +- nd_args:
#         |       @ NODE_ARRAY (line: 5)
#         |       +- nd_alen: 1
#         |       +- nd_head:
#         |       |   @ NODE_FCALL (line: 5)
#         |       |   +- nd_mid: :add
#         |       |   +- nd_args:
#         |       |       @ NODE_ARRAY (line: 5)
#         |       |       +- nd_alen: 2
#         |       |       +- nd_head:
#         |       |       |   @ NODE_LIT (line: 5)
#         |       |       |   +- nd_lit: 1
#         |       |       +- nd_next:
#         |       |           @ NODE_ARRAY (line: 5)
#         |       |           +- nd_alen: 139711399246960
#         |       |           +- nd_head:
#         |       |           |   @ NODE_LIT (line: 5)
#         |       |           |   +- nd_lit: 2
#         |       |           +- nd_next:
#         |       |               (null node)
#         |       +- nd_next:
#         |           (null node)
#         +- nd_next:
#             (null node)

We can see some similarities in this structure compared to the output of Ripper.sexp.
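For a programmatic view of a similar tree, MRI 2.6 and later ship RubyVM::AbstractSyntaxTree. It is not covered in the original walkthrough, it is marked experimental, and node type names vary between Ruby versions, so treat the following as a sketch under those assumptions. The helper node_types is made up for illustration.

```ruby
code = "def add(a, b)\n  a + b\nend\n\nputs add 1, 2\n"

# Collect the type of every node, depth first.
# Children can also be symbols, arrays, literals or nil, so only
# recurse into real AST nodes.
def node_types(node, types = [])
  types << node.type
  node.children.each do |child|
    node_types(child, types) if child.is_a?(RubyVM::AbstractSyntaxTree::Node)
  end
  types
end

root = RubyVM::AbstractSyntaxTree.parse(code)
p node_types(root)
```

The type names drop the NODE_ prefix seen in the --dump parsetree output, so NODE_SCOPE shows up as :SCOPE and NODE_DEFN as :DEFN.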

Compile

After your code is tokenized and parsed, it has one more step to go before Ruby is able to run it. That is the compilation step. Before we start on this step, we have to understand a bit about Ruby’s underlying infrastructure.

Ruby runs on Yet Another Ruby Virtual Machine (YARV). YARV executes your Ruby code. To use it, we first have to compile our code into bytecode: instructions that the virtual machine can understand and execute. This is done in compile.c.
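The compile and execute steps can also be driven by hand from Ruby itself. The sketch below compiles a variant of the snippet with RubyVM::InstructionSequence.compile and then asks YARV to run the result with eval. The final puts is dropped here so that the value of add 1, 2 becomes the return value of the script. RubyVM is specific to MRI, so this is not expected to work on other Ruby implementations.

```ruby
# Variant of snippet.rb without puts, so the last expression is add 1, 2.
code = "def add(a, b)\n  a + b\nend\n\nadd 1, 2\n"

# Step 1: tokenize, parse and compile the source into a bytecode object.
iseq = RubyVM::InstructionSequence.compile(code)

# Step 2: hand the compiled instructions to the virtual machine.
result = iseq.eval

puts result # prints 3
```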

Bytecode dump

To check out the bytecode generated after your code has been tokenized and parsed, we have two methods. We can use the option --dump insns or run our code through RubyVM::InstructionSequence.compile.

The output is a disassembly of the Ruby bytecode that is read and executed by YARV.

# dump insns
$ ruby --dump insns snippet.rb

# RubyVM
code = "def add(a, b)\n  a + b\nend\n\nputs add 1, 2\n"
puts RubyVM::InstructionSequence.compile(code).disasm

# Output
== disasm: <RubyVM::InstructionSequence:<main>@snippet.rb>==============
0000 trace            1                                               (   1)
0002 putspecialobject 1
0004 putspecialobject 2
0006 putobject        :add
0008 putiseq          add
0010 opt_send_simple  <callinfo!mid:core#define_method, argc:3, ARGS_SKIP>
0012 pop
0013 trace            1                                               (   5)
0015 putself
0016 putself
0017 putobject_OP_INT2FIX_O_1_C_
0018 putobject        2
0020 opt_send_simple  <callinfo!mid:add, argc:2, FCALL|ARGS_SKIP>
0022 opt_send_simple  <callinfo!mid:puts, argc:1, FCALL|ARGS_SKIP>
0024 leave
== disasm: <RubyVM::InstructionSequence:add@snippet.rb>=================
local table (size: 3, argc: 2 [opts: 0, rest: -1, post: 0, block: -1, keyword: 0@4] s1)
[ 3] a<Arg>     [ 2] b<Arg>
0000 trace            8                                               (   1)
0002 trace            1                                               (   2)
0004 getlocal_OP__WC__0 3
0006 getlocal_OP__WC__0 2
0008 opt_plus         <callinfo!mid:+, argc:1, ARGS_SKIP>
0010 trace            16                                              (   3)
0012 leave                                                            (   2)
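The dump above contains two instruction sequences: one for the top level of the script and one for the add method. As a sketch, assuming MRI 2.6 or later where RubyVM::InstructionSequence#each_child is available, the nested sequences can be listed and disassembled one at a time. The variable names are illustrative.

```ruby
code = "def add(a, b)\n  a + b\nend\n\nputs add 1, 2\n"

iseq = RubyVM::InstructionSequence.compile(code)

# Collect the direct child instruction sequences of the top-level one.
children = []
iseq.each_child { |child| children << child }

children.each do |child|
  puts "child sequence: #{child.label}"
  puts child.disasm
end
```

Instruction names and operands differ between Ruby versions, so the output will not match the listing above line for line.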

That’s not all

We took a walk through Ruby internals and figured out what happens when you run a piece of Ruby code. Everything from Rails to your one-liner goes through the same process. It is amazing what the Ruby core team has done to allow us to write Ruby.

However, that’s not all. We’ve only scratched the surface of what goes on behind the scenes. If you’re interested in learning more, check out Pat Shaughnessy’s awesome book, Ruby Under a Microscope.