Last edited: August 2013
resource_monitor is Copyright (C) 2013 The University of Notre Dame.
All rights reserved.
This software is distributed under the GNU General Public License.
See the file COPYING for details.
resource_monitor generates up to three log files: a summary file with the maximum values of each resource used, a time-series file that shows the resources used at given time intervals, and a list of the files that were opened during execution.
Maximum resource limits can be specified in a file, or as a string given at the command line. If one of the resources goes over its specified limit, the monitor terminates the task and reports which resources went over their limits.
On systems that support it, resource_monitor wraps some libc functions to obtain a better estimate of the resources used. In contrast, resource_monitorv disables this wrapping, which means, among other things, that it can only monitor the root process, but not its descendants.
For example:
% resource_monitor -- ls
This will generate three files describing the resource usage of the command "ls": resource-pid-PID.summary, resource-pid-PID.series, and resource-pid-PID.files, in which PID is the corresponding process id. Alternatively, we can specify the output names and the sampling interval:
% resource_monitor -O log-sleep -i 2 -- sleep 10
The previous command monitors "sleep 10" at two-second intervals, and generates the files log-sleep.summary, log-sleep.series, and log-sleep.files.
Currently, the monitor does not support interactive applications. That is, if a process issues a read call on standard input, and standard input has not been redirected, then the process tree is terminated. This is likely to change in future versions of the tool.
To monitor every rule of a Makeflow, run:
% makeflow -Mmonitor_logs Makeflow
In this case, makeflow wraps every command line rule with the monitor, and writes the resulting logs, one set per rule, in the directory monitor_logs.
From Work Queue, the calls:

q = work_queue_create(port);
work_queue_enable_monitoring(q, "some-log-file");

wrap every task with the monitor, and append all generated summary files to the file some-log-file. Currently, only summary reports are generated from Work Queue.
Consider the following Condor submit file:

universe = vanilla
executable = /bin/echo
arguments = hello condor
output = test.output
should_transfer_files = yes
when_to_transfer_output = on_exit
log = condor.test.logfile
queue
This can be rewritten, for example, as:
universe = vanilla
executable = /path/to/resource_monitor
arguments = -O echo-log -- /bin/echo hello condor
output = test.output echo-log.summary echo-log.series echo-log.files
should_transfer_files = yes
when_to_transfer_output = on_exit
log = condor.test.logfile
queue
The summary file is a list of field-value pairs with the following meanings:

command: [the command line given as an argument]
start: [seconds at the start of execution, since the epoch, float]
end: [seconds at the end of execution, since the epoch, float]
exit_type: [one of normal, signal or limit, string]
signal: [number of the signal that terminated the process.
Only present if exit_type is signal, int]
limits_exceeded: [resources over the limit. Only present if
exit_type is limit, string]
exit_status: [final status of the parent process, int]
max_concurrent_processes: [the maximum number of processes running concurrently, int]
wall_time: [seconds spent during execution, end - start, float]
cpu_time: [user + system time of the execution, in seconds, float]
virtual_memory: [maximum virtual memory across all processes, in MB, int]
resident_memory: [maximum resident size across all processes, in MB, int]
swap_memory: [maximum swap usage across all processes, in MB, int]
bytes_read: [number of bytes read from disk, int]
bytes_written: [number of bytes written to disk, int]
workdir_number_files_dirs: [total maximum number of files and directories of
all the working directories in the tree, int]
workdir_footprint: [size in MB of all working directories in the tree, int]
The time-series log has a row per time sample. For each row, the columns have the following meaning:
wall_clock [the sample time, since the epoch, in microseconds, int]
concurrent_processes [concurrent processes at the time of the sample, int]
cpu_time [accumulated user + kernel time, in microseconds, int]
virtual_memory [current virtual memory size, in MB, int]
resident_memory [current resident memory size, in MB, int]
swap_memory [current swap usage, in MB, int]
bytes_read [accumulated number of bytes read, int]
bytes_written [accumulated number of bytes written, int]
workdir_number_files_dirs [current number of files and directories, across all
working directories in the tree, int]
workdir_footprint [current size of working directories in the tree, in MB, int]
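A time-series in this shape can likewise be loaded row by row for plotting or analysis. A hedged Python sketch, assuming whitespace-separated columns in the order listed above and that any header or comment lines begin with '#' (an assumption, not something the format guarantees):

```python
# Column order as documented for the .series file.
COLUMNS = ["wall_clock", "concurrent_processes", "cpu_time",
           "virtual_memory", "resident_memory", "swap_memory",
           "bytes_read", "bytes_written",
           "workdir_number_files_dirs", "workdir_footprint"]

def parse_series(text):
    """Turn a .series file's text into a list of per-sample dicts.

    Skips blank lines and lines starting with '#'; every documented
    column is an int, so all fields are converted with int().
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        rows.append(dict(zip(COLUMNS, map(int, line.split()))))
    return rows

# A single made-up sample row, for illustration only:
example = "1377000000000000 1 5000 10 8 0 1024 0 3 1"
series = parse_series(example)
print(series[0]["resident_memory"])   # 8
```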
A limits file is a set of lines of the form:

resource: max_value
It may contain any of the following fields, in the same units as
defined for the summary file:
max_concurrent_processes,
wall_time, cpu_time,
virtual_memory, resident_memory, swap_memory,
bytes_read, bytes_written,
workdir_number_files_dirs, workdir_footprint
Thus, for example, to automatically kill a process after one hour, or if it is using 5GB of swap, we can create the following file, limits.txt:

wall_time: 3600
swap_memory: 5120
In makeflow we then specify:
% makeflow -Mmonitor_logs --monitor-limits=limits.txt
Or with condor:
universe = vanilla
executable = /path/to/resource_monitor
arguments = -O matlab-script-log --limits-file=limits.txt -- matlab < script.m
output = matlab.output matlab-script-log.summary matlab-script-log.series matlab-script-log.files
transfer_input_files = script.m, limits.txt
should_transfer_files = yes
when_to_transfer_output = on_exit
log = condor.matlab.logfile
queue
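As an aside, because the limits file shares field names and units with the summary report, checking a finished summary against a limits file after the fact reduces to a field-by-field comparison. A hypothetical Python sketch (this is not functionality of resource_monitor itself; it assumes simple "field: value" lines in both files):

```python
def read_pairs(text):
    """Read 'field: value' lines into a dict of floats (assumed format)."""
    pairs = {}
    for line in text.splitlines():
        if ':' in line:
            field, _, value = line.partition(':')
            try:
                pairs[field.strip()] = float(value)
            except ValueError:
                pass  # skip non-numeric fields such as 'command'
    return pairs

def exceeded(summary_text, limits_text):
    """Return the resources in the summary that are over their limits."""
    summary = read_pairs(summary_text)
    limits = read_pairs(limits_text)
    return [field for field, cap in limits.items()
            if field in summary and summary[field] > cap]

# Made-up contents, for illustration only:
limits = "wall_time: 3600\nswap_memory: 5120"
summary = "wall_time: 4000.5\nswap_memory: 10"
print(exceeded(summary, limits))   # ['wall_time']
```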