resource_monitor is Copyright (C) 2013 The University of Notre Dame. This software is distributed under the GNU General Public License. See the file COPYING for details.
resource_monitor generates up to three log files: a summary file with the maximum values of the resources used, a time series that shows the resources used at given time intervals, and a list of the files that were opened during execution.
Maximum resource limits can be specified in a file, or as a string given on the command line. If any resource exceeds its specified limit, the monitor terminates the task and reports which resources went over their limits.
On systems that support it, resource_monitor wraps some libc functions to obtain a better estimate of the resources used. In contrast, resource_monitorv disables this wrapping, which means, among other things, that it can monitor only the root process, not its descendants.
% resource_monitor -- ls

This generates three files describing the resource usage of the command "ls": resource-pid-PID.summary, resource-pid-PID.series, and resource-pid-PID.files, where PID is the corresponding process id. Alternatively, we can specify the output names and the sampling interval:
% resource_monitor -O log-sleep -i 2 -- sleep 10

This command monitors "sleep 10" at two-second intervals and generates the files log-sleep.summary, log-sleep.series, and log-sleep.files. Currently, the monitor does not support interactive applications: if a process issues a read call from standard input, and standard input has not been redirected, then the process tree is terminated. This is likely to change in future versions of the tool.
% makeflow -Mmonitor_logs Makeflow

In this case, makeflow wraps every command line rule with the monitor, and writes the resulting logs, one set per rule, in the directory monitor_logs.
q = work_queue_create(port);
work_queue_enable_monitoring(q, some-log-file);

This wraps every task with the monitor, and appends all generated summary files to the file some-log-file. Currently, only summary reports are generated from work queue.
universe = vanilla
executable = /bin/echo
arguments = hello condor
output = test.output
should_transfer_files = yes
when_to_transfer_output = on_exit
log = condor.test.logfile
queue

This can be rewritten, for example, as:
universe = vanilla
executable = /path/to/resource_monitor
arguments = -O echo-log -- /bin/echo hello condor
output = test.output echo-log.summary echo-log.series echo-log.files
should_transfer_files = yes
when_to_transfer_output = on_exit
log = condor.test.logfile
queue
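The rewrite above is mechanical, so it can be sketched as a small Python helper. This is only an illustration of the transformation shown in the two submit files: the function name wrap_with_monitor and the dictionary representation of the submit file are hypothetical, and the only monitor options used are the -O and -- options that appear in the example above.

```python
def wrap_with_monitor(job, monitor_path, log_prefix):
    """Return a copy of a submit description with the job wrapped by the monitor.

    `job` is a plain dict of submit-file fields (a hypothetical representation,
    not a condor API). The original dict is left unchanged.
    """
    wrapped = dict(job)
    original_command = job["executable"]
    original_args = job.get("arguments", "")

    # The monitor becomes the executable; the original command follows "--".
    wrapped["executable"] = monitor_path
    wrapped["arguments"] = (
        f"-O {log_prefix} -- {original_command} {original_args}".strip()
    )

    # Mirror the example above: list the three monitor logs after the output file.
    logs = " ".join(f"{log_prefix}.{ext}" for ext in ("summary", "series", "files"))
    wrapped["output"] = f"{job['output']} {logs}"
    return wrapped


def render_submit(job):
    """Render the fields as 'name = value' lines followed by 'queue'."""
    return "\n".join(f"{name} = {value}" for name, value in job.items()) + "\nqueue\n"
```

Applied to the first submit file with monitor_path "/path/to/resource_monitor" and log_prefix "echo-log", the helper yields the executable, arguments, and output lines of the rewritten file.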
command: [the command line given as an argument]
start: [seconds at the start of execution, since the epoch, float]
end: [seconds at the end of execution, since the epoch, float]
exit_type: [one of normal, signal or limit, string]
signal: [number of the signal that terminated the process. Only present if exit_type is signal, int]
limits_exceeded: [resources over the limit. Only present if exit_type is limit, string]
exit_status: [final status of the parent process, int]
max_concurrent_processes: [the maximum number of processes running concurrently, int]
wall_time: [seconds spent during execution, end - start, float]
cpu_time: [user + system time of the execution, in seconds, float]
virtual_memory: [maximum virtual memory across all processes, in MB, int]
resident_memory: [maximum resident size across all processes, in MB, int]
swap_memory: [maximum swap usage across all processes, in MB, int]
bytes_read: [number of bytes read from disk, int]
bytes_written: [number of bytes written to disk, int]
workdir_number_files_dirs: [total maximum number of files and directories of all the working directories in the tree, int]
workdir_footprint: [size in MB of all working directories in the tree, int]

The time-series log has a row per time sample. For each row, the columns have the following meaning:
wall_clock [the sample time, since the epoch, in microseconds, int]
concurrent_processes [concurrent processes at the time of the sample, int]
cpu_time [accumulated user + kernel time, in microseconds, int]
virtual_memory [current virtual memory size, in MB, int]
resident_memory [current resident memory size, in MB, int]
swap_memory [current swap usage, in MB, int]
bytes_read [accumulated number of bytes read, int]
bytes_written [accumulated number of bytes written, int]
workdir_number_files_dirs [current number of files and directories, across all working directories in the tree, int]
workdir_footprint [current size of working directories in the tree, in MB, int]
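As an illustration of how these two logs might be consumed, the following Python sketch parses a summary file and a time-series log into dictionaries. It rests on assumptions not stated above: that the summary file holds one "field: value" pair per line, that the time-series columns are separated by whitespace and appear in the order listed, and that lines starting with "#" are headers or comments. The function names are hypothetical.

```python
# Column order as listed above for the time-series log.
SERIES_COLUMNS = [
    "wall_clock", "concurrent_processes", "cpu_time", "virtual_memory",
    "resident_memory", "swap_memory", "bytes_read", "bytes_written",
    "workdir_number_files_dirs", "workdir_footprint",
]

# Summary fields documented above as strings and as floats.
STRING_FIELDS = {"command", "exit_type", "limits_exceeded"}
FLOAT_FIELDS = {"start", "end", "wall_time", "cpu_time"}


def _convert(name, value):
    """Convert a summary value according to the documented field type."""
    if name in STRING_FIELDS:
        return value
    if name in FLOAT_FIELDS:
        return float(value)
    try:
        return int(value)
    except ValueError:
        return value  # unknown, non-numeric field: keep the raw text


def parse_summary(text):
    """Parse 'field: value' lines (assumed layout) into a dict."""
    summary = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        # Split only once, since the command itself may contain ':'.
        name, value = line.split(":", 1)
        summary[name.strip()] = _convert(name.strip(), value.strip())
    return summary


def parse_series(text):
    """Parse whitespace-separated rows (assumed layout) into a list of dicts."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: '#' marks a header or comment line
        values = [int(v) for v in line.split()]
        rows.append(dict(zip(SERIES_COLUMNS, values)))
    return rows
```

With the rows in this form, the peak resident memory over a run is, for example, max(row["resident_memory"] for row in rows).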
resource: max_value

It may contain any of the following fields, in the same units as defined for the summary file: max_concurrent_processes, wall_time, cpu_time, virtual_memory, resident_memory, swap_memory, bytes_read, bytes_written, workdir_number_files_dirs, workdir_footprint. Thus, for example, to automatically kill a process after one hour, or if it is using 5GB of swap, we can create the following file limits.txt:
wall_time: 3600
swap_memory: 5120

In makeflow we then specify:
% makeflow -Mmonitor_logs --monitor-limits=limits.txt

Or with condor:
universe = vanilla
executable = /path/to/resource_monitor
arguments = -O matlab-script-log --limits-file=limits.txt -- matlab < script.m
output = matlab.output matlab-script-log.summary matlab-script-log.series matlab-script-log.files
transfer_input_files=script.m limits.txt
should_transfer_files = yes
when_to_transfer_output = on_exit
log = condor.matlab.logfile
queue
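A limits file can also be generated programmatically. The Python sketch below writes the "resource: max_value" format described above and, for illustration only, compares a set of measured values against the limits in the same way the text describes the monitor doing. The helper names and the example values (one hour of wall time, 2048 MB of resident memory) are hypothetical, and the comparison is not the monitor's actual implementation.

```python
# Field names accepted in a limits file, as listed above.
LIMIT_FIELDS = {
    "max_concurrent_processes", "wall_time", "cpu_time", "virtual_memory",
    "resident_memory", "swap_memory", "bytes_read", "bytes_written",
    "workdir_number_files_dirs", "workdir_footprint",
}


def format_limits(limits):
    """Render a dict of limits as 'resource: max_value' lines."""
    unknown = sorted(set(limits) - LIMIT_FIELDS)
    if unknown:
        raise ValueError(f"unknown limit fields: {', '.join(unknown)}")
    return "".join(f"{name}: {value}\n" for name, value in limits.items())


def limits_exceeded(usage, limits):
    """Return the names of resources whose measured value is above its limit.

    Resources missing from `usage` are treated as not exceeded.
    """
    return [
        name for name, limit in limits.items()
        if name in usage and usage[name] > limit
    ]


if __name__ == "__main__":
    # Hypothetical limits: wall_time in seconds, resident_memory in MB.
    limits = {"wall_time": 3600, "resident_memory": 2048}
    with open("limits.txt", "w") as handle:
        handle.write(format_limits(limits))
```

The usage dictionary passed to limits_exceeded can be any mapping that uses the summary-file field names and units.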