What’s the Deal With Ruby GC and Copy-on-write

This post aims at answering the following questions:

Holy COW

The fork functionality in Unix systems uses an optimization strategy where memory is shared between the parent and child processes. The memory stays shared until either the parent or one of the children modifies it. At that point, a private copy is created for the modifying process so that the change is not visible to the other processes. The primary advantage is that if none of the processes makes any modifications, no private copy ever needs to be created. This technique is called copy-on-write (COW).

fork_process_shared.rb
shared_array = [1, 2, 3, 4, 5]

if fork
  # executed by the parent process (fork returns the child's PID here)
  parent_array = [6, 7, 8]
else
  # executed by the child process (fork returns nil here)
  child_array = [9, 10, 11]
end

If you are not familiar with Ruby's fork, a call to fork creates a new process. In the above example, the code inside the if block is executed by the parent process and the else block by the child process: fork returns the child's process ID in the parent and nil in the child. A snapshot of how resources will be shared in memory is shown below.

As can be seen above, shared_array is maintained as a common resource between the parent and child processes. When the child process modifies shared_array, copy-on-write kicks in: the OS gives the child a private copy of the affected memory pages, so the change is not visible to the parent. This is illustrated below.

fork_process_dirty.rb
shared_array = [1, 2, 3, 4, 5]

if fork
  # executed by the parent process
  parent_array = [6, 7, 8]
else
  # executed by the child process
  shared_array << 5   # the write forces a private copy for the child
  child_array = [9, 10, 11]
end
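
The visible effect of that private copy can be checked from Ruby. The sketch below is only an illustration: the method name child_and_parent_views and the pipe used to report back are my own additions, not part of the original scripts. It shows that the child's change never reaches the parent. It does not show the page copying itself, which happens inside the OS.

```ruby
# Sketch: a write made by the child stays in the child's private memory.
# child_and_parent_views is a hypothetical helper name used for illustration.
def child_and_parent_views
  shared_array = [1, 2, 3, 4, 5]
  reader, writer = IO.pipe

  pid = fork do
    reader.close
    shared_array << 6                 # modifies the child's copy only
    writer.puts shared_array.inspect  # report the child's view to the parent
    writer.close
  end

  writer.close
  child_view = reader.read.strip      # read until the child closes its end
  reader.close
  Process.wait(pid)

  { child: child_view, parent: shared_array.inspect }
end

views = child_and_parent_views
puts "child sees:  #{views[:child]}"
puts "parent sees: #{views[:parent]}"
```

The parent still sees the original five elements because the child's write landed on the child's own copy of the memory.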

Mark and Sweep objects

The advantages of copy-on-write cannot be leveraged between multiple Ruby processes because of the way the garbage collector works in Ruby 1.8 and 1.9. The garbage collector uses a simple mark-and-sweep algorithm to identify unused objects. In simple terms:

Each object in memory has a flag (typically a single bit) reserved for the garbage collector. Starting from the root-set, all objects that can be accessed are recursively traversed and marked as being ‘in-use’. At the end of the cycle, the GC sweeps all objects that have not been marked and restores free space for future objects.
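
The cycle described above can be sketched as a toy in plain Ruby. This is not how the interpreter implements it (the real collector is written in C); ToyObject, mark and sweep are hypothetical names chosen for illustration, and the mark flag is deliberately kept inside each object.

```ruby
# Toy heap object: the mark flag lives inside the object itself.
ToyObject = Struct.new(:name, :refs, :marked)

# Mark phase: recursively flag everything reachable from a root.
def mark(object)
  return if object.marked
  object.marked = true                      # this write changes the object
  object.refs.each { |child| mark(child) }
end

# Sweep phase: keep marked objects, drop the rest, reset flags for next cycle.
def sweep(heap)
  live = heap.select(&:marked)
  live.each { |object| object.marked = false }
  live
end

a = ToyObject.new("a", [], false)
b = ToyObject.new("b", [a], false)
c = ToyObject.new("c", [], false)           # not reachable from any root
heap = [a, b, c]
roots = [b]

roots.each { |root| mark(root) }
heap = sweep(heap)
puts heap.map(&:name).inspect
```

Note that marking writes to every reachable object, even though the program itself never changed them.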

Because of the way Ruby creates objects, the GC flag (the reserved bit) is stored in the object itself. So, as you may have guessed by now, when the GC runs in one of the processes, it modifies the GC flag in every object it marks. If those objects live in memory shared with other processes, the OS sees the pages holding them as dirty and triggers copy-on-write, so each process that runs the GC ends up with private copies of those objects in its own memory space.

This is the reason why Ruby 1.8 and 1.9 are not COW-friendly :(

GC Bit and objects should keep distance

The most sensible thing to do would be to pull the GC flag (it’s actually called the FL_MARK bit) out of the objects and maintain it separately. And this is exactly what Ruby 2.0 claims to do.

For each heap allocated by Ruby, there is a corresponding bitmap linked from the header of the heap. Each bit in the bitmap is 0 or 1, effectively replacing the GC bit that was previously stored inside each object in the heap. This technique is called bitmap marking. In effect, the mark step no longer modifies any live objects in the heap; only the bitmap changes. Hence, shared objects can stay shared until one of the processes actually modifies them.
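
As a rough illustration of the idea, here is a toy version of bitmap marking in plain Ruby. It is a sketch under my own assumptions, not Ruby's actual C implementation: Slot and mark are hypothetical names, the bitmap is a single Integer, and the slots are frozen so that any attempt to write a mark into them would raise an error.

```ruby
# Toy heap slots. They are frozen, so the mark phase cannot write to them.
Slot = Struct.new(:name, :ref_indexes)

heap = [
  Slot.new("a", []).freeze,
  Slot.new("b", [0]).freeze,   # "b" references slot 0 ("a")
  Slot.new("c", []).freeze     # not reachable from any root
].freeze

# Mark phase: set bit i in a separate bitmap for each reachable slot i.
# Returns the updated bitmap; the slots themselves are only read.
def mark(heap, index, bitmap)
  return bitmap if bitmap[index] == 1
  bitmap |= (1 << index)
  heap[index].ref_indexes.each { |child| bitmap = mark(heap, child, bitmap) }
  bitmap
end

roots = [1]
bitmap = roots.reduce(0) { |bits, root| mark(heap, root, bits) }
live = heap.each_index.select { |i| bitmap[i] == 1 }.map { |i| heap[i].name }

puts format("bitmap: %03b", bitmap)
puts "live: #{live.inspect}"
```

Only the bitmap changes during marking, which is why pages holding the objects themselves could stay shared between forked processes.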

References:

  1. Narihiro Nakamura’s patch for Bitmap Marking GC
  2. Pat Shaughnessy’s excellent blog post

3 Ways to Keep a Linux Process Alive

Assume that you have a Ruby script, data_cruncher.rb, that will run for a long time. You would like to log into a VM, run the script, and keep it running in the background after you log out. I end up writing long-running scripts like this all the time, and I have found the following ways to keep them running uninterrupted:

1. Job Control

Job control allows you to interact with background jobs, suspend foreground jobs and manage multiple jobs in a single session. And when you realize, after starting a command, that it should have been pushed to the background, job control makes that easy.

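Start the script in the foreground, press CTRL+Z to suspend it, and then type bg to resume it in the background. That’s it. The job will run in the background. Note that using the bg command later is the same as starting the process with & at the end, like this: ruby data_cruncher.rb &.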

You can verify active jobs by using the jobs command. The -l option will include the process ID.

bash.sh
$ jobs -l
[1]  + 13701 running    ruby data_cruncher.rb

Note that the number 1 at the beginning is the jobspec which is a number allocated to each job. 13701 is the process ID. The jobspec is useful when there are multiple active jobs. The significance of jobspec is illustrated below.

bash.sh
$ jobs -l
[1]  + 13701 running    ruby data_cruncher.rb
[2]  + 14312 running    tail -f data_cruncher.log
$ fg %1
[1]  + 13701 continued  ruby data_cruncher.rb

fg brings a job to the foreground. The fg and bg commands can identify specific jobs with %<jobspec>. If you are using the bash shell, before you log out you will have to run one more command, disown, to make sure the process is not killed when the terminal session ends. Plain disown removes the job from the shell's job table; with the -h option, the job stays in the table but is marked so that the shell does not send it the HUP signal. For more information on HUP, see the next section.

bash.sh
$ jobs -l
[1]  + 13701 running    ruby data_cruncher.rb
[2]  + 14312 running    tail -f data_cruncher.log
$ disown %1
$ jobs -l
[1]  + 14312 running    tail -f data_cruncher.log

What disown actually does is make sure the shell does not pass the SIGHUP signal on to the job when the shell itself receives it. To explain SIGHUP, let’s dive into the next technique.

2. nohup to the rescue

To give some context, signals are one of the ways processes communicate in Linux. When a signal is delivered, the OS interrupts the normal flow of execution of the receiving process. The signals most relevant to this blog post are described below. For an exhaustive list, see here.

SIGINT (CTRL+C) - Sent to a process from controlling terminal when user wishes to interrupt the process

SIGTSTP (CTRL+Z) - Sent to a process from controlling terminal to request it to stop temporarily

SIGHUP - Sent to a process when controlling terminal is closed
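
A process can choose how it reacts to most of these signals. As a minimal sketch (the method name survives_hangup? is hypothetical), the Ruby snippet below tells the process to ignore HUP and then sends HUP to itself. Without the trap line, the default action for HUP would terminate the process.

```ruby
# Returns true if the process is still running after a HUP is delivered.
def survives_hangup?
  previous = Signal.trap("HUP", "IGNORE")   # returns the previous handler
  Process.kill("HUP", Process.pid)           # simulate the terminal hanging up
  sleep 0.1                                  # give the signal time to arrive
  true                                       # reached only because HUP is ignored
ensure
  Signal.trap("HUP", previous) if previous   # restore the earlier disposition
end

puts "still running after HUP: #{survives_hangup?}"
```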

When a user logs out of a terminal session, the HUP signal is used to warn all dependent processes of the logout action. The nohup command runs a given command with the HUP signal ignored.

bash.sh
$ nohup ruby data_cruncher.rb &

This will run the script in the background, and it will pay no heed to the HUP signal if one is received. By default, when STDOUT is a terminal, STDOUT and STDERR are redirected to nohup.out. To name the output log file explicitly:

bash.sh
$ nohup ruby data_cruncher.rb > data_cruncher.log 2>&1 &

What does 2>&1 mean? 1 and 2 are the standard file descriptors for STDOUT and STDERR, so 2>&1 redirects STDERR to wherever STDOUT is going. Why the ampersand? 2>1 would write STDERR to a regular file named 1 rather than to the STDOUT file descriptor.
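
The same redirection can be expressed from Ruby with Process.spawn, which may help make the file descriptor idea concrete. This is a sketch only: the child here is a one-line stand-in for data_cruncher.rb, and the log is written to a temporary directory so the example cleans up after itself.

```ruby
require "tmpdir"
require "rbconfig"

log = Dir.mktmpdir do |dir|
  log_path = File.join(dir, "data_cruncher.log")
  # Stand-in for the real script: writes one line to each stream.
  child_code = '$stdout.puts "to stdout"; $stderr.puts "to stderr"'

  pid = Process.spawn(RbConfig.ruby, "-e", child_code,
                      out: log_path,         # like: > data_cruncher.log
                      err: [:child, :out])   # like: 2>&1
  Process.wait(pid)
  File.read(log_path)
end

puts log
```

Both lines end up in the same file. Their order can vary, because STDOUT is typically buffered when it points at a file while STDERR is not.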

3. Screen your commands

Screen is the most sophisticated way to run background jobs when you want to continuously monitor their activity. According to Wikipedia,

GNU Screen is a software application that can be used to multiplex several virtual consoles, allowing a user to access multiple separate terminal sessions inside a single terminal window or remote terminal session.

What follows is a crash course in Screen:

Now, this is where things get interesting. Press Ctrl+A followed by D. You will be detached from the current screen session and returned to the bash shell. Note that the session has not been killed, and all the programs that you started inside screen are still running.

To reattach, type screen -r. Press Ctrl+A followed by " and you will see that nothing has been lost. One obvious perk of screen that job control and nohup do not offer is that you always have easy and immediate access to the process that you were running in the background.

Screen is an awesome utility and this page has more valuable info.