SOAR - System of Automatic Runs

Overview

SOAR is a framework that automates the execution of groups of jobs. These jobs form a DAG (directed acyclic graph) and are run under Condor, and SOAR assembles the results within a single directory. For information on Condor visit: http://www.cs.wisc.edu/condor

The system is flexible: it can run all data sets at once, or it can keep track of completions and run new data sets on a periodic basis. It reports on progress both by logging information and by progressively plotting the progress of the run in terms of how many jobs are running and how many are ready to run. The system is adapted and personalized for each project.

Theory

Tying a program in so that it runs under SOAR does several things for you. It maps a location for data sources to a program. It makes it easy for you either to feed in new data sources or to create a new version of the program to run against the existing data sources. The third primary location (the tie to SOAR) is a set of scripts which fetch the data sources to the location needed by each job within the DAG and scrape the expected results to the results location.

Operational Usage

SOAR is currently intended and written to be managed by one person, or possibly several through a shared account. The product is young but reliable, and the focus has been on tools to easily automate research. In time, work will be done to explore making the various pieces of software play nicely in the area of multiple users and groups.

Most projects, once adapted to SOAR, sit in a mode where additional jobs are automatically run as additional datasets are placed in that project's data area. Multiple people can run the same research simply by creating a derived project, thus establishing their own data and result locations via their own portion of the web interface. SOAR uses Condor's ability to run jobs on any sort of periodic basis needed. If the project needs a sweep for new data every 4 hours, that is trivial.

SOAR is generally managed by a single person who watches over disk consumption, adapts new research into SOAR projects, assists with special runs, and expedites software changes for projects that rapidly change the science they are doing on the data.

Web Interface

No matter how one starts the runs, the web interface allows four things:

Hey, I know how to submit clusters with one submit file, why do I want to use SOAR?

There are some distinct advantages to using SOAR, even though it is very similar to having a folder of data (each set in a distinct location) and a single submit file. The following are provided by SOAR only:

Likely Steps towards SOARing

Each project will have its own unique steps leading to automation and maintenance of the automated jobs. Accounting for the unique nature of jobs, each initial SOAR setup will follow these steps:

  1. Download and unpack.
  2. Follow the "Installation" steps below.
  3. Adapt the project to SOAR:
    possibly modify the program so that it can be automated
  4. Adapt SOAR to the project (ready everything for the project):
    copy and modify existing sample scripts and ???
    stage input data sets and executables
    set up time-based submission
  5. Run condor_submit.

Steps 3, 4, and 5 will be repeated for each new project to be automated.

Installation

The first part of installing SOAR is deciding where to place things. SOAR requires a minimum of two installation locations and one URL, and up to four locations and three URLs. This split both limits what the web interface can access and allows for the large amounts of disk consumed, both by the initial datasets for projects and by the copious data that results from a system that makes research easier than was practical before. See Configurable locations below.

NOTE: SOAR is designed to leverage condor_dagman within a Condor environment. The run location must be on local disk to avoid the file locking issues found with most shared file systems.

Software needed to run SOAR

Adapting a Project to SOAR

To run multiple datasets effectively in any system, you want your application either to accept command line arguments or to read parameters from one or more files (or both). Anything hard-coded into the program you are running cannot be easily varied.
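
As a minimal sketch of what this looks like, here is a hypothetical program that takes its input file on the command line and reads the rest of its settings from a key=value parameter file (the file names and keys are illustrative, not part of SOAR):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Usage: ./analyze.pl input.dat params.txt   (names are hypothetical)
  my ($input, $paramfile) = @ARGV;
  die "usage: $0 input paramfile\n" unless defined $paramfile;

  # Read key=value settings instead of hard-coding them.
  my %param;
  open(my $fh, '<', $paramfile) or die "cannot open $paramfile: $!";
  while (<$fh>) {
      chomp;
      next if /^\s*(#|$)/;             # skip comments and blank lines
      my ($k, $v) = split /=/, $_, 2;
      $param{$k} = $v;
  }
  close $fh;

  # ... run the science on $input using %param ...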

Configurable locations

The SOAR system utilizes and organizes various components related to a project's runs. As these components may be kept in a variety of locations, the file fsconfig identifies the component locations. Each component listed is also the directory name. Within the directory, each project may have its own subdirectory for project-specific versions of the component.

Additionally, each project can have its data location specified in this file in the format:

Project_name,Location_holding_data_in_project_name_folder

SOAR is told where to find the datasets for your jobs. These will be either folders with unique names holding the variable data for your jobs, or folders named datasetXXXXXXX which contain the unique job folders. Under sources, the code folder contains the unchanging parts of the job and the jobs folder contains the glue scripts.
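
For example, a hypothetical fsconfig entry and the data layout it implies (the project name redapple and all paths below are illustrative, matching the example used later in this document):

  redapple,/home/me/rundata

  /home/me/rundata/redapple/
      dataset0000001/
          sample_a/        (one job's variable data; names made up)
          sample_b/
      dataset0000002/
          sample_c/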

Adapting SOAR to a Project

Source Directory code

This directory receives all the sources needed to make your job run. It also receives the results of compiling, if that is something your job needs, be it Matlab or some regular computer language. For security purposes it must have an .htaccess file, or the code will not be placed where it needs to be for the job to find it. This ensures that the sources on the web are accessible only by authorized persons.

Normally all files in the code directory are copied to the submit location where the job is started. However, any file listed in the file SKIP will not be moved.
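
Assuming SKIP simply lists one file name per line (an assumption; the names below are made up), it might look like:

  build_notes.txt
  test_harness.pl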

Another file, called BLACKLIST, must exist. Each entry starts with a number, then a colon, the word blacklist, and a reason within [ ]. Here follows an example:

1000: blacklist [ condensation ]

Source Directory jobs

This directory holds the glue scripts which adapt SOAR to handling the data sets and the code of your research. The glue scripts tie the data sets to whatever processing you want to do.

Glue Scripts and Interfacing files

A basic job consists of a single node which submits to a pool, followed by another analysis job which can be run, if desired, based on the results. On a faulty start we can execute a null piece of work for the first node, and we usually make the follow-up node null as well. After all the jobs have run, there is a report node, an optional clean node, an after-the-report node which preps the collected data if we are delivering it, and a push-the-data node.
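
In Condor DAGMan terms, the shape described above might look roughly like the following (node and file names are hypothetical; SOAR generates the actual DAG file for you):

  # one work/analysis pair per data set (two shown here)
  JOB work1     work1.sub
  JOB analysis1 analysis1.sub
  JOB work2     work2.sub
  JOB analysis2 analysis2.sub
  # run-wide wrap-up nodes
  JOB report    report.sub
  JOB clean     clean.sub
  JOB afterrep  afterrep.sub
  JOB push      push.sub
  PARENT work1 CHILD analysis1
  PARENT work2 CHILD analysis2
  PARENT analysis1 analysis2 CHILD report
  PARENT report CHILD clean
  PARENT clean CHILD afterrep
  PARENT afterrep CHILD push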

There are a number of scripts that run before or after which can or should be customized:

All the template files are filled in with variable data.

There are some additional files which allow extra features.

How the report gets generated

Either your job or the postjob.pl glue script generates a file called RESULT which holds a number. When the report process runs, it looks this number up in the file RESULTVALUES in your job directory (the one which holds the glue scripts). This file has three fields separated by /. Field one is a number. Field two is either passed or failed. The last field is the message which will be used to classify that result.
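
For example, RESULTVALUES might contain entries like these (the numbers and messages are hypothetical; use whatever codes your job produces):

  0/passed/Completed normally
  2/failed/Input data unreadable
  9/failed/Ran out of memory

And a minimal postjob.pl-style sketch that records a result code into RESULT (illustrative only; your glue script will have its own logic for deciding the code):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Derive a result code from the job's output, then record it.
  my $code = -e "output.dat" ? 0 : 2;   # hypothetical check and codes
  open(my $fh, '>', 'RESULT') or die "cannot write RESULT: $!";
  print $fh "$code\n";
  close $fh;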

Using SOAR

Time activated tasks

The most convenient way to do production with SOAR is to place entries in a file whose run frequency Condor manages. Two are included (per_runs and per_plotsandreports). per_runs fires off the commands in continuous.cron once a day, and per_plotsandreports fires off the commands in checkprogress.com every 5 minutes. Setting this up is as easy as placing what you want done, following the sample files, and submitting them with condor_submit (condor_submit per_runs). The first usually contains a use of the control.pl script with --kind=new, so that the tracking of datasets already done means only new datasets are run.

The second is, as of version 0.7.5, done for you. Every run gets an entry added when the run is started, and the entry is removed when the run completes. This allows us to use the information in the report system to accurately move jobs which were running to complete, and it allows the next run to search out all currently running jobs and not start them again.
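
As a rough sketch of such a periodically fired job, assuming HTCondor's CronTab-style scheduling (the bundled per_runs sample is the authoritative template, and all names here are hypothetical), the submit file could look like:

  universe        = local
  executable      = run_continuous.sh   # wrapper that runs the entries in continuous.cron
  log             = per_runs.log
  cron_hour       = 3                   # fire once a day, at 03:00
  cron_minute     = 0
  on_exit_remove  = false               # stay queued so the job fires again the next day
  queue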

If you need to remove a run, you must use the soar_rm.pl script, which also extracts that set of reports from the recheck interval; otherwise you waste cycles on reports for a completed or removed run.

This way, once you tie a project to a particular job, you or the person doing the research only needs to worry about placing more data sets into the image location for the project and pulling results from the result location.

Automated Code Replacement

Input data for jobs, whether as individual folders or as folders within datasets, is located either in the directory specified by IMAGERUNS in control/fsconfig, or in a location specified by the project name in the same file. Let's say your data is expected to be in a directory rundata in your home directory. Then job data for project redapple would be in /home/me/rundata/redapple.

The way code replacement works is that anything placed in /home/me/rundata/redapple_objs is compared against the age of the current files for the currently requested version of your workflow code. Newer code from that location is inserted into your workflow. You have some control, though it is minimal at this point. The following four attributes can be set in /home/me/rundata/redapple_objs/objconfig. The contents of this file are only active if files are found to update.
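
The core of that age comparison amounts to something like the following sketch (the paths are the hypothetical ones from above; the real logic lives inside SOAR's control scripts):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use File::Copy qw(copy);

  my $objs = '/home/me/rundata/redapple_objs';
  my $code = 'sources/redapple/v1/code';        # currently requested version

  opendir(my $dh, $objs) or die "cannot open $objs: $!";
  for my $f (grep { -f "$objs/$_" } readdir $dh) {
      my ($src, $dst) = ("$objs/$f", "$code/$f");
      # replace the workflow copy only when the staged object is newer
      if (!-e $dst or (stat $src)[9] > (stat $dst)[9]) {
          copy($src, $dst) or die "copy $src failed: $!";
      }
  }
  closedir $dh;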

Control Scripts

Control.pl

Control.pl Examples

Each project has a master file ENVVARS in the run directory, which is used for various things. It is changed out only for --kind=nightly and --kind=new runs. However, if it gets out of sync, or to make it reflect the last "oneoff", the command is

./control.pl --project=gravitropism --update
with one of the following additional command line options:

Status.pl

The framework usually runs this script at the end of a job run.

Its command line options are

Status.pl Examples

soar_rm.pl Usage

When a run is started you are given an environment string which describes where the data for that particular run is. "soar_rm.pl" uses this both to find the actual job id to use with "condor_rm" and to remove the run-specific periodic "status.pl" runs which update the plots and reports at some interval.

How do I ......

You probably have a project running in SOAR but you want one of the following changes:

Get someone else using the configured project

Base a new project on the current project.

Let's say you have a project positioning and the best version of it is v3. Your new user is sam and you want the new project to be called sams_liquids.

Change how the application works(change the science)

Generate a new version of the code.

This is very easy to do. If you make a new version, you can name it something meaningful to reflect why you created it. The new version has access to all the project data, allowing you to change the science you are doing. If the project is positioning, the old version is v1, and you want the new version to be surfaces, then go to "sources/positioning" and enter this command: cp -r v1 surfaces.

Now simply place new binaries in "sources/positioning/surfaces/code".

NOTE: if you change the behavior and the files needed or created, you'll likely need to change the scripts "prejob.pl", "postjob.pl" and "pushdata.pl".

Web Interface

Basics

The goal of the web interface is to inspect a run of data sets done as a single Condor DAG. One gets access to the run directory for the DAG, which has a subdirectory for each job; to the plot showing the current progress of that DAG; to a report which breaks out aspects of each job that has ended so far; and to the results for each job in the DAG when it completes.

index.html

This file sits at the topmost level and has sample links to projects.php. Together these two files make it possible to create links from the file RUNS at the top of the project's run directory, providing the links and access described above.

Release Notes

Version 0.7.7

Version 0.7.6

Version 0.7.5

Wanted Features