This post aims at answering the following questions:
What is Unix copy-on-write (COW)?
Why are current versions of Ruby (1.x.x) not COW friendly?
How does the GC packaged with Ruby 2.0 fix that?
Holy COW
The fork call in Unix systems uses an optimization in which memory is shared between the parent and child processes instead of being copied up front. The memory stays shared until either the parent or one of the children writes to it. At that point, the operating system creates a private copy of the affected memory page for the writing process, so the change is not visible to the other processes. The primary advantage is that if none of the processes makes a modification, no private copy ever needs to be created. This technique is called copy-on-write (COW).
fork_process_shared.rb
shared_array = [1, 2, 3, 4, 5]

if fork
  # will be executed by parent process
  parent_array = [6, 7, 8]
else
  # will be executed by child process
  child_array = [9, 10, 11]
end
If you are not familiar with Ruby's fork, a call to fork creates a new process. In the example above, the code inside the if block is executed by the parent process and the else block by the child process (fork returns the child's process ID in the parent and nil in the child). A snapshot of how resources will be shared in memory is shown below.
As can be seen above, the shared_array is maintained as a common resource between parent and child processes. When shared_array is modified by the child process, as copy-on-write implies, a private copy is created so that it is not visible to the other process. This is illustrated below.
fork_process_dirty.rb
shared_array = [1, 2, 3, 4, 5]

if fork
  # will be executed by parent process
  parent_array = [6, 7, 8]
else
  # will be executed by child process
  shared_array << 5
  child_array = [9, 10, 11]
end
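The visible effect of the child's write can be checked from Ruby itself. The sketch below is only an illustration: the helper name views_after_child_write and the pipe used to report back are assumptions, not part of the original example. It shows that the parent's array is untouched after the child appends to it. Note that it demonstrates process isolation, which is what copy-on-write preserves; the page sharing and copying are done by the OS and cannot be observed directly from this code.

```ruby
# Returns what the parent and the child each see after the child
# appends to an array that existed before the fork.
# (Hypothetical helper, written for illustration only.)
def views_after_child_write(shared_array)
  reader, writer = IO.pipe

  pid = fork do
    reader.close
    shared_array << 6                         # changes only the child's copy
    writer.write(Marshal.dump(shared_array))  # report the child's view
    writer.close
    exit!(0)                                  # leave without running at_exit hooks
  end

  writer.close
  child_view = Marshal.load(reader.read)
  reader.close
  Process.wait(pid)

  { parent: shared_array, child: child_view }
end

views = views_after_child_write([1, 2, 3, 4, 5])
puts "parent sees: #{views[:parent].inspect}"  # parent sees: [1, 2, 3, 4, 5]
puts "child saw:   #{views[:child].inspect}"   # child saw:   [1, 2, 3, 4, 5, 6]
```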
Mark and Sweep objects
The advantages of copy-on-write are largely lost between forked Ruby processes because of the way Ruby's garbage collector works. The garbage collector uses a simple Mark and Sweep algorithm to identify unused objects. In simple terms:
Each object in memory has a flag (typically a single bit) reserved for the garbage collector. Starting from the root-set, all objects that can be accessed are recursively traversed and marked as being ‘in-use’. At the end of the cycle, the GC sweeps all objects that have not been marked and restores free space for future objects.
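To make the two phases concrete, here is a toy model of mark and sweep. It is a sketch for illustration only and not how Ruby's collector is implemented: ToyObject, mark and mark_and_sweep are hypothetical names, and the "heap" is a plain array. The point to notice is that the flag lives inside each object, so marking writes to every reachable object.

```ruby
# A toy heap object that carries its own GC flag.
class ToyObject
  attr_reader :name, :references
  attr_accessor :marked

  def initialize(name)
    @name = name
    @references = []
    @marked = false   # the per-object flag reserved for the collector
  end
end

# Mark phase: flag everything reachable from the given object.
def mark(object)
  return if object.marked   # already visited, also stops cycles

  object.marked = true      # this is a write into the object itself
  object.references.each { |child| mark(child) }
end

# Runs a full cycle and returns the objects that survive it.
def mark_and_sweep(heap, roots)
  roots.each { |root| mark(root) }
  survivors = heap.select(&:marked)                  # sweep: drop unmarked objects
  survivors.each { |object| object.marked = false }  # reset flags for the next cycle
  survivors
end

a, b, c, d = %w[a b c d].map { |name| ToyObject.new(name) }
a.references << b
b.references << a   # a and b form a cycle reachable from the root
c.references << d   # c and d are not reachable from any root

survivors = mark_and_sweep([a, b, c, d], [a])
puts survivors.map(&:name).inspect   # ["a", "b"]
```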
Because of the way Ruby creates objects, the GC flag is stored inside the object itself. So, as you may have guessed by now, when the GC runs in one of the processes, it writes the flag in every object it marks. If those objects live in memory that is still shared with other processes, the OS sees the writes as modifications and triggers copy-on-write, creating private copies of the affected memory in that process's address space.
This is the reason why Ruby 1.8 or 1.9 is not COW friendly :(
GC Bit and objects should keep distance
The most sensible thing to do would be to pull the GC flag (it's actually called the FL_MARK bit) out of the objects and maintain it separately. And this is exactly what Ruby 2.0 claims to do.
For each heap allocated by Ruby, there is a corresponding bitmap which is linked to the header of the heap. The bitmap stores a 0 or 1 for each object, effectively replacing the GC bit which was previously stored inside the objects present in the heap. This technique is called bitmap marking. So in effect, the Mark step will not modify any live objects in the heap. Only the bitmap will be changed. Hence, the shared objects can remain shared until one of the processes actually modifies them.
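The idea can be sketched with a toy model in which the mark bits live in a separate array rather than in the objects. This is an illustration under stated assumptions, not Ruby 2.0's actual implementation: Slot and mark_with_bitmap are hypothetical names, and objects refer to each other by their index in the heap. The heap and its objects are frozen to show that marking never writes to them.

```ruby
# A toy heap slot; it has no GC flag of its own.
Slot = Struct.new(:name, :reference_indexes)

# Marks every slot reachable from the roots and returns the bitmap.
# Only the bitmap is written to; the heap is left untouched.
def mark_with_bitmap(heap, root_indexes)
  bitmap = Array.new(heap.size, 0)   # one bit (0 or 1) per slot
  pending = root_indexes.dup

  until pending.empty?
    index = pending.pop
    next if bitmap[index] == 1       # already marked

    bitmap[index] = 1
    pending.concat(heap[index].reference_indexes)
  end

  bitmap
end

heap = [
  Slot.new("a", [1]).freeze,   # a -> b
  Slot.new("b", [0]).freeze,   # b -> a
  Slot.new("c", []).freeze     # unreachable from the root
].freeze

bitmap = mark_with_bitmap(heap, [0])
puts bitmap.inspect   # [1, 1, 0]
```

Since every slot is frozen, any attempt by the mark step to write into an object would raise an error, which is the property that keeps shared pages from being copied in the real collector.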
Assume that you have a Ruby script data_cruncher.rb that will run for a long time. You would like to log into a VM, run the script and keep it running in the background after you log out. I end up writing long-running scripts like these all the time, and I found the following ways to make them run uninterrupted:
1. Job Control
Job control allows you to interact with background jobs, suspend foreground jobs and manage multiple jobs in a single session. Also, when you later realize that the currently running command must be pushed to the background, job control makes it easy to do that.
Start running the process using the command ruby data_cruncher.rb
Press ctrl+z to suspend the process
Type bg to resume it in the background
That’s it. The job will run in the background. Note that using bg command later is the same as running the parent process with & at the end like this: ruby data_cruncher.rb &.
You can verify active jobs by using the jobs command. The -l option will include the process ID.
bash.sh
$ jobs -l
[1] + 13701 running ruby data_cruncher.rb
Note that the number 1 at the beginning is the jobspec which is a number allocated to each job. 13701 is the process ID. The jobspec is useful when there are multiple active jobs. The significance of jobspec is illustrated below.
fg brings a background job to the foreground. Both fg and bg can target a specific job with %<jobspec>. If you are using the bash shell, you will have to run one more command, disown -h, before you log out to make sure the process is not killed when the terminal session ends. The -h option is what shields the job from the HUP signal. For more information on HUP, see the next section.
What disown -h actually does is mark the job so that the SIGHUP signal is not sent to it when the parent shell receives one. To explain SIGHUP, let's dive into the next technique.
2. nohup to the rescue
To give some context, signals are one of the ways processes communicate in Linux. When a signal is delivered, the OS interrupts the normal flow of execution of the receiving process. The most important signals relevant to this blog post are described below. For an exhaustive list, see here
SIGINT (CTRL+C) - Sent to a process from controlling terminal when user wishes to interrupt the process
SIGTSTP (CTRL+Z) - Sent to a process from controlling terminal to request it to stop temporarily
SIGHUP - Sent to a process when controlling terminal is closed
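A Ruby process can observe these signals through Signal.trap. The sketch below is an illustration, not something the original post prescribes: it installs a handler for SIGHUP, sends that signal to itself with Process.kill, and records that it arrived instead of terminating. The received array and the two-second wait are arbitrary choices made for the example.

```ruby
received = []

# Replace the default SIGHUP action (terminate the process) with a
# handler that only records the signal.
Signal.trap("HUP") { received << "HUP" }

# Send SIGHUP to this very process.
Process.kill("HUP", Process.pid)

# Ruby runs trap handlers at a safe point, so wait briefly for it.
deadline = Time.now + 2
sleep 0.01 until received.any? || Time.now > deadline

Signal.trap("HUP", "DEFAULT")   # restore the default action
puts "signals received: #{received.inspect}"
```

Passing "IGNORE" instead of a block, as in Signal.trap("HUP", "IGNORE"), discards the signal entirely.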
When a user logs out of a terminal session, the HUP signal is used to warn all dependent processes of the logout action. The nohup command makes a specific process ignore the HUP signal.
bash.sh
$ nohup ruby data_cruncher.rb &
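This will cause the script to run in the background and pay no heed to the HUP signal, if received. By default, when STDOUT is a terminal, STDOUT and STDERR will be redirected to nohup.out. To name the output log file explicitly, redirect STDOUT to that file and add 2>&1 so that STDERR goes to the same place.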
What does 2>&1 mean? 1 and 2 are the standard file descriptors for STDOUT and STDERR. We redirect STDERR to STDOUT. Why the ampersand? 1 instead of &1 would indicate a regular file by the name 1 and not the STDOUT file descriptor.
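The same redirection can be expressed from Ruby when one script launches another. The sketch below is an assumption-laden illustration: run_with_combined_log and the log file name cruncher.log are hypothetical, and a one-line child script stands in for data_cruncher.rb. It uses the redirection options of Process.spawn, where out: with a file name plays the role of redirecting STDOUT to a file and err: [:child, :out] plays the role of 2>&1.

```ruby
require "rbconfig"
require "tmpdir"

# Runs a short child Ruby process with STDOUT sent to log_path and
# STDERR sent to wherever the child's STDOUT points.
# (Hypothetical helper, written for illustration only.)
def run_with_combined_log(log_path)
  child_code = '$stdout.puts "to stdout"; $stdout.flush; $stderr.puts "to stderr"'

  pid = Process.spawn(
    RbConfig.ruby, "-e", child_code,
    out: log_path,         # STDOUT goes to the log file
    err: [:child, :out]    # STDERR follows the child's STDOUT, like 2>&1
  )
  Process.wait(pid)

  File.read(log_path)
end

log_contents = Dir.mktmpdir do |dir|
  run_with_combined_log(File.join(dir, "cruncher.log"))  # hypothetical file name
end
puts log_contents
```

Both lines written by the child end up in the one log file, which is the behaviour 2>&1 gives on the command line.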
3. Screen your commands
Screen is the most sophisticated way to run background jobs when you want to continuously monitor their activity. According to Wikipedia,
GNU Screen is a software application that can be used to multiplex several virtual consoles, allowing a user to access multiple separate terminal sessions inside a single terminal window or remote terminal session.
What follows is a crash course in Screen:
Start screen by typing screen. You will be greeted with a welcome message.
Every program under screen runs in a virtual window. The windows are numbered from 0.
You are currently in window 0. To start another window, press ctrl+A, then C
Now you have two windows, 0 and 1. There are several ways to switch between them. We will use ctrl+A, then "
You should see all the windows that are active. Choose the one you want.
This is where things get interesting. Press ctrl+A, then D. You will be detached from the current screen session and returned to the bash shell. Note that the session has not been killed and all the programs that you started under screen are still running.
To reattach, type screen -r. Press ctrl+A, then ", and you will see that nothing has been lost. One obvious perk of screen that job control and nohup do not offer is that you always have easy and immediate access to the process that you were running in the background.
Screen is an awesome utility and this page has more valuable info.