Makeflow User's Manual

Last edited: July 2014

Makeflow is Copyright (C) 2009 The University of Notre Dame.
This software is distributed under the GNU General Public License.
See the file COPYING for details.

Overview

Makeflow is a workflow engine for distributed computing. It accepts a specification of a large amount of work to be performed, and runs it on remote machines in parallel where possible. In addition, Makeflow is fault-tolerant, so you can use it to coordinate very large tasks that may run for days or weeks in the face of failures. Makeflow is designed to be similar to Make, so if you can write a Makefile, then you can write a Makeflow.

You can run a Makeflow on your local machine to test it out. If you have a multi-core machine, then you can run multiple tasks simultaneously. If you have a Condor pool or a Sun Grid Engine batch system, then you can send your jobs there to run. If you don't already have a batch system, Makeflow comes with a system called Work Queue that will let you distribute the load across any collection of machines, large or small.

Makeflow is part of the Cooperative Computing Tools. You can download the CCTools from this web page, follow the installation instructions, and you are ready to go.

The Makeflow Language

The Makeflow language is very similar to Make. A Makeflow script consists of a set of rules. Each rule specifies a set of target files to create, a set of source files needed to create them, and a command that generates the target files from the source files.

Makeflow attempts to generate all of the target files in a script. It examines all of the rules and determines which rules must run before others. Where possible, it runs commands in parallel to reduce the execution time.

Here is a Makeflow that uses the convert utility to make an animation. It downloads an image from the web, creates four variations of the image, and then combines them back together into an animation. The first and last tasks are marked as LOCAL to force them to run on the controlling machine.

CURL=/usr/bin/curl
CONVERT=/usr/bin/convert
URL=http://ccl.cse.nd.edu/images/capitol.jpg

capitol.montage.gif: capitol.jpg capitol.90.jpg capitol.180.jpg capitol.270.jpg capitol.360.jpg
	LOCAL $CONVERT -delay 10 -loop 0 capitol.jpg capitol.90.jpg capitol.180.jpg capitol.270.jpg capitol.360.jpg capitol.270.jpg capitol.180.jpg capitol.90.jpg capitol.montage.gif

capitol.90.jpg: capitol.jpg $CONVERT
	$CONVERT -swirl 90 capitol.jpg capitol.90.jpg

capitol.180.jpg: capitol.jpg $CONVERT
	$CONVERT -swirl 180 capitol.jpg capitol.180.jpg

capitol.270.jpg: capitol.jpg $CONVERT
	$CONVERT -swirl 270 capitol.jpg capitol.270.jpg

capitol.360.jpg: capitol.jpg $CONVERT
	$CONVERT -swirl 360 capitol.jpg capitol.360.jpg

capitol.jpg: $CURL
	LOCAL $CURL -o capitol.jpg $URL

Note that Makeflow differs from Make in a few important ways. Read "The Fine Details" below to get all of the details.

Running Makeflow

To try out the example above, copy and paste it into a file named example.makeflow. To run it on your local machine:

% makeflow example.makeflow

Note that if you run it a second time, nothing will happen, because all of the files are built:

% makeflow example.makeflow
makeflow: nothing left to do

Use the -c option to clean everything up before trying it again:

% makeflow -c example.makeflow

If you have access to a batch system running SGE, then you can direct Makeflow to run your jobs there:

% makeflow -T sge example.makeflow

Or, if you have a Condor pool, then you can direct Makeflow to run your jobs there:

% makeflow -T condor example.makeflow

To submit Makeflow as a Condor job that submits more Condor jobs:

% condor_submit_makeflow example.makeflow

You will notice that a workflow can run very slowly if you submit each batch job to SGE or Condor, because it typically takes 30 seconds or so to start each batch job running. To get around this limitation, we provide the Work Queue system. This allows Makeflow to function as a master process that quickly dispatches work to remote worker processes.

To begin, let's assume that you are logged into a machine named barney.nd.edu. Start your Makeflow like this:

% makeflow -T wq example.makeflow

Then, submit 10 worker processes to Condor like this:

% condor_submit_workers barney.nd.edu 9123 10
Submitting job(s)..........
Logging submit event(s)..........
10 job(s) submitted to cluster 298.

Or, submit 10 worker processes to SGE like this:

% sge_submit_workers barney.nd.edu 9123 10

Or, you can start workers manually on any other machine you can log into:

% work_queue_worker barney.nd.edu 9123

Once the workers begin running, Makeflow will dispatch multiple tasks to each one very quickly. If a worker should fail, Makeflow will retry the work elsewhere, so it is safe to submit many workers to an unreliable system.

When the Makeflow completes, your workers will still be available, so you can either run another Makeflow with the same workers, remove them from the batch system, or wait for them to expire. If you do nothing for 15 minutes, they will automatically exit.
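
If you would like workers to wait longer before expiring, the idle timeout can be adjusted. The sketch below assumes the worker's -t option, which takes a timeout in seconds; for example, to have a manually started worker wait up to one hour for work:

% work_queue_worker -t 3600 barney.nd.edu 9123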

Note that condor_submit_workers and sge_submit_workers are simple shell scripts, so you can edit them directly if you would like to change batch options or other details.

The Fine Details

The Makeflow language is very similar to Make, but it does have a few important differences that you should be aware of.

Get the Dependencies Right

You must be careful to accurately specify all of the files that a rule requires and creates, including any custom executables. This is because Makeflow requires all of this information to construct the environment for a remote job. For example, suppose that you have written a simulation program called mysim.exe that reads calib.data and then produces an output file. The following rule won't work, because it doesn't inform Makeflow which files are needed to execute the simulation:

# This is an incorrect rule.
output.txt:
	./mysim.exe -c calib.data -o output.txt

However, the following is correct, because the rule states all of the files needed to run the simulation. Makeflow will use this information to construct a batch job that consists of mysim.exe and calib.data and uses them to produce output.txt:

# This is a correct rule.
output.txt: mysim.exe calib.data
	./mysim.exe -c calib.data -o output.txt

Note that when a directory is specified as an input dependency, it means that the command relies on the directory and all of its contents. So, if you have a large collection of input data, you can place it in a single directory, and then simply give the name of that directory.
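
For example, a single rule can depend on a whole directory of input files. This is just a sketch; the analysis script and data directory names are hypothetical:

summary.txt: analyze.sh mydata
	./analyze.sh mydata > summary.txt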

No Phony Rules

For a similar reason, you cannot have "phony" rules that don't actually create the specified files. For example, it is common practice to define a clean rule in Make that deletes all derived files. This doesn't make sense in Makeflow, because such a rule does not actually create a file named clean. Instead use the -c option as shown above.
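
For comparison, a Make-style phony rule like the following has no place in a Makeflow, because no file named clean would ever be created:

# This phony rule does not work in Makeflow.
clean:
	rm -f capitol.*.jpg capitol.montage.gif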

Just Plain Rules

Makeflow does not support all of the syntax that you find in various versions of Make. Each rule must have exactly one command to execute. If you have multiple commands, simply join them together with semicolons. Makeflow allows you to define and use variables, but it does not support pattern rules, wildcards, or special variables like $< or $@. You simply have to write out the rules longhand, or write a script in your favorite language to generate a large Makeflow.
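
For example, a rule that needs two commands can simply join them with a semicolon. This sketch uses the standard tar and gzip tools:

archive.tar.gz: file1.dat file2.dat
	tar cf archive.tar file1.dat file2.dat; gzip archive.tar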

Local Job Execution

Certain jobs don't make much sense to distribute. For example, if you have a very fast running job that consumes a large amount of data, then it should simply run on the same machine as Makeflow. To force this, simply add the word LOCAL to the beginning of the command line in the rule.
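
For example, this sketch assumes a hypothetical filter program that quickly reduces a large local data file:

filtered.dat: filter bigdata.dat
	LOCAL ./filter bigdata.dat > filtered.dat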

Rule Lexical Scope

Variables in Makeflow have global scope, that is, once defined, their value can be accessed from any rule. Sometimes it is useful to define a variable locally inside a rule, without affecting the global value. In Makeflow, this can be achieved by defining the variables after the rule's requirements, but before the rule's command, and prepending the name of the variable with @, as follows:

SOME_VARIABLE=original_value

# rule 1
target_1: source_1
	command_1

# rule 2
target_2: source_2
	@SOME_VARIABLE=local_value_for_2
	command_2

# rule 3
target_3: source_3
	command_3

In this example, SOME_VARIABLE has the value 'original_value' for rules 1 and 3, and the value 'local_value_for_2' for rule 2.

Batch Job Refinement

When executing jobs, Makeflow simply uses the default settings in your batch system. If you need to pass additional options, use the BATCH_OPTIONS variable or the -B option to Makeflow.

When using Condor, this string will be added to each submit file. For example, if you want to add Requirements and Rank lines to your Condor submit files, add this to your Makeflow:

BATCH_OPTIONS = Requirements = (Memory>1024)

When using SGE, the string will be added to the qsub options. For example, to specify that jobs should be submitted to the devel queue:

BATCH_OPTIONS = -q devel
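
The same string can also be passed on the command line with the -B option mentioned above; as a sketch of an equivalent invocation, the devel queue could be selected without editing the Makeflow:

% makeflow -T sge -B "-q devel" example.makeflow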

Remote File Renaming

With the Work Queue and Condor batch systems, Makeflow has a feature called remote file renaming. For example:

local_name->remote_name

indicates that the file local_name is called remote_name in the remote system. Consider the following example:

b.out: a.in myprog
	LOCAL myprog a.in > b.out

c.out->out: a.in->in1 b.out myprog->prog
	prog in1 b.out > out

The first rule runs locally, using the executable myprog and the local file a.in to locally create b.out. The second rule runs remotely, but the remote system expects a.in to be named in1, c.out to be named out, and so on. Note that we did not need to rename the file b.out. Without remote file renaming, we would have to create either a symbolic link or a copy of the files with the expected names.

Displaying a Makeflow

When run with the -D option, Makeflow will emit a diagram of the Makeflow in the Graphviz DOT format. If you have dot installed, then you can generate an image of your workload like this:

% makeflow -D example.makeflow | dot -T gif > example.gif

To observe how a makeflow runs over time, use makeflow_graph_log to convert a log file into a timeline that shows the number of tasks ready, running, and complete over time:

% makeflow_graph_log example.makeflowlog example.png

Supported Makeflow Drivers

The full list of supported Makeflow drivers includes:

User-defined Clusters

For clusters that are not directly supported by Makeflow, we strongly suggest using the Work Queue system and submitting workers via the cluster's normal submission mechanism.
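
For example, on a cluster whose scheduler provides a qsub command, you might wrap a worker in a small submit script (a sketch; the project name and any scheduler options are assumptions to adapt for your site):

#!/bin/sh
# worker.sh: start one Work Queue worker that finds its master by project name
work_queue_worker -a -N MyProj

and then submit as many copies of it as you want workers:

% qsub worker.sh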

For clusters using managers similar to SGE or Moab that are configured to preclude the use of Work Queue, we have the "Cluster" custom driver. To use the "Cluster" driver, the Makeflow must be run on a parallel filesystem available to the entire cluster, and the following environment variables must be set.

These will be used to construct a task submission for each makeflow rule that consists of:

% $SUBMIT_COMMAND $SUBMIT_OPTIONS "<rule name>" $CLUSTER_NAME.wrapper "<rule commandline>"

The wrapper script is a shell script that reads the command to be run as an argument and handles bookkeeping operations necessary for Makeflow.
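
As a rough sketch only, using the variable names that appear in the submission template above (the exact environment variable names and the -T cluster batch type are assumptions to check against your installation), a session might look like:

% export SUBMIT_COMMAND=qsub
% export SUBMIT_OPTIONS="-q devel"
% export CLUSTER_NAME=mycluster
% makeflow -T cluster example.makeflow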

Running Makeflow with Work Queue

With the '-T wq' option, Makeflow runs as a master process that dispatches tasks to remote worker processes using the Work Queue framework.

Selecting a Port

Makeflow listens on a port to which the remote workers connect. The default port number is 9123. Sometimes, however, that port might not be available on your system. You can change the default port via the -p <port number> option. For example, if you want the master to listen on port 9567, you can run the following command:

% makeflow -T wq -p 9567 example.makeflow

Project Names

A simpler way to match workers to masters is to use project name matching. You can give the master a project name with the -N option.

% makeflow -T wq -a -N MyProj example.makeflow

The -N option gives the master the project name 'MyProj'. The -a option enables the catalog mode of the master. Only in catalog mode will a master advertise its information, such as the project name, running status, hostname, and port number, to a catalog server. A worker can then retrieve this information from the same catalog server. The above command uses the default catalog server at Notre Dame, which runs 24/7. We will talk about how to set up your own catalog server later.

To start a worker that automatically finds MyProj's master via the default Notre Dame catalog server:

% work_queue_worker -a -N MyProj

The -a option enables the catalog mode on the worker, which tells the worker to contact a catalog server to find the hostname and port of the project specified by the -N option.

You can also give multiple -N options to a worker. The worker will ask the catalog server which of the specified projects are running and randomly select one to work for. When that project is done, the worker repeats the process. Thus, the worker can move on to a different master without being stopped and told the new master's hostname and port. An example of specifying multiple projects:

% work_queue_worker -a -N proj1 -N proj2 -N proj3

Setting a Password

We recommend that any workflow that uses a project name also set a password. To do this, select any passphrase and write it to a file called mypwfile. Then, run Makeflow and each worker with the --password option to indicate the password file:

% makeflow --password mypwfile ...
% work_queue_worker --password mypwfile ...

Catalog Server

Now let's look at how to set up your own catalog server. Say you want to run your catalog server on a machine named catalog.somewhere.edu. The default port that the catalog server listens on is 9097; you can change it via the '-p' option.

% catalog_server

Now you have a catalog server listening at catalog.somewhere.edu:9097. To make your masters and workers contact this catalog server, simply add the -C hostname:port option to both the master and the workers:

% makeflow -T wq -C catalog.somewhere.edu:9097 -N MyProj example.makeflow
% work_queue_worker -C catalog.somewhere.edu:9097 -a -N MyProj

Resources and Categories

Makeflow can automatically pass the cores, memory, and disk space requirements of jobs to the underlying batch system (currently this only works with Work Queue and Condor). Jobs are grouped into job categories, and jobs in the same category have the same cores, memory, and disk requirements.

Job categories and resources are specified with variables. Jobs are assigned to the category named in the value of the variable CATEGORY. Likewise, the values of the variables CORES, MEMORY (in MB), and DISK (in MB) describe the resource requirements for the category specified in CATEGORY.

Jobs without an explicit category are assigned to default. Jobs in the default category get their resource requirements from the value of the environment variables CORES, MEMORY, and DISK.

Consider the following example:

# Following tasks are assigned to the category preprocessing.
# MEMORY and CORES are read from the environment, if defined.
CATEGORY="preprocessing"
DISK=500

one: src
	cmd

two: src
	cmd

# Switch to category simulation. Note that now CORES, MEMORY, and DISK are specified.
CATEGORY="simulation"
CORES=1
MEMORY=400
DISK=400

three: src
	cmd

four: src
	cmd

# Another category switch. MEMORY is read from the environment.
CATEGORY="analysis"
CORES=4
DISK=600

five: src
	cmd

Now, running the workflow with MEMORY set in the environment:

export MEMORY=800
makeflow ...

Resources specified:
Category        Cores           Memory (MB)              Disk (MB)
preprocessing   (unspecified)   800 (from environment)   500
simulation      1               400                      400
analysis        4               800 (from environment)   600

Linking Workflow Dependencies

Makeflow provides a tool to collect all of the dependencies for a given workflow into one directory. By collecting all of the input files and programs contained in a workflow it is possible to run the workflow on other machines.

Currently, Makeflow copies all of the files specified as dependencies by the rules in the makeflow file, including scripts and data files. Some of the files not collected are dynamically linked libraries, executables not listed as dependencies (python, perl), and configuration files (mail.rc).

To avoid naming conflicts, files which would otherwise have an identical path are renamed when copied into the bundle:

Example usage:

% makeflow -b some_output_directory example.makeflow

Makeflow Garbage Collection

As the workflow execution progresses, Makeflow can automatically delete intermediate files that are no longer needed. In this context, an intermediate file is an input of some rule that is the target of another rule. Therefore, by default, garbage collection does not delete the original input files, nor final target files.

Which files are deleted can be tailored from the default by appending files to the Makeflow variables GC_PRESERVE_LIST and GC_COLLECT_LIST. Files added to GC_PRESERVE_LIST are never deleted, thus it is used to mark intermediate files that should not be deleted. Similarly, GC_COLLECT_LIST marks final target files that should be deleted. Makeflow is conservative, in the sense that GC_PRESERVE_LIST takes precedence over GC_COLLECT_LIST, and original input files are never deleted, even if they are listed in GC_COLLECT_LIST.
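
For example, to keep an intermediate file from the earlier animation workflow, such as the downloaded capitol.jpg, from being collected (a sketch assuming a whitespace-separated list of filenames as the variable value):

GC_PRESERVE_LIST = capitol.jpg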

Makeflow offers two modes for garbage collection: reference count and on demand. With the reference count mode, intermediate files are deleted as soon as no rule lists them as an input. The on-demand mode is similar to reference count, except that files are deleted only as needed to keep the space used on the local file system below a given threshold.

To activate reference count garbage collection:

% makeflow -gref_count

To activate on-demand garbage collection, with a threshold of 500MB:

% makeflow -gon_demand -G500000000

Makeflow Log File Format

After you have executed the example.makeflow Makeflow script, you should see a log file named example.makeflow.makeflowlog under the directory where you ran the makeflow command. The Makeflow log file records how and when every task is run by Makeflow. It exists primarily so that Makeflow can recover cleanly after a failure, but can also be used for logging and debugging.

A sample logfile might look like this:

# STARTED timestamp
1347281321284638 5 1 9206 5 1 0 0 0 6
1347281321348488 5 2 9206 5 0 1 0 0 6
1347281321348760 4 1 9207 4 1 1 0 0 6
1347281321348958 3 1 9208 3 2 1 0 0 6
1347281321629802 4 2 9207 3 1 2 0 0 6
1347281321630005 2 1 9211 2 2 2 0 0 6
1347281321635236 3 2 9208 2 1 3 0 0 6
1347281321635463 1 1 9212 1 2 3 0 0 6
1347281321742870 2 2 9211 1 1 4 0 0 6
1347281321752857 1 2 9212 1 0 5 0 0 6
1347281321753064 0 1 9215 0 1 5 0 0 6
1347281325731146 0 2 9215 0 0 6 0 0 6
# COMPLETED timestamp

Each line in the log file represents a single action taken on a single rule in the workflow. For simplicity, rules are numbered from the beginning of the Makeflow, starting with zero. Each line contains the following items:

timestamp task_id new_state job_id tasks_waiting tasks_running tasks_complete tasks_failed tasks_aborted task_id_counter

Which are defined as follows:

In addition, lines starting with a pound sign are comments and contain additional high-level information that can be safely ignored. The logfile begins with a comment to indicate the starting time, and ends with a comment indicating whether the entire workflow completed, failed, or was aborted.

When file garbage collection is enabled, the log file also records information at each collection cycle. Collection information is included in lines starting with the # GC prefix:

# GC timestamp collected time_spent dag_gc_collected

Each garbage collection line records the garbage collection statistics during a garbage collection cycle:

For More Information

For the latest information about Makeflow, please visit our web site and subscribe to our mailing list.