SWIF stati

Possible “auger_result” tag text:

  • SUCCESS
  • TIMEOUT
  • OVER_RLIMIT
  • CANCELLED
  • FAILED

Possible SWIF results from “swif status”:

  • succeeded
  • failed
  • canceled
  • problems

Possible SWIF “problems” from “swif status”:

  • SWIF-SYSTEM-ERROR
  • SWIF-USER-NON-ZERO
  • AUGER-OVER_RLIMIT
  • AUGER-FAILED
  • AUGER-TIMEOUT

Possible SWIF “problem” tag text:

  • AUGER-OVER_RLIMIT
  • AUGER-TIMEOUT
  • SWIF-SYSTEM-ERROR
  • AUGER-FAILED
  • SWIF-USER-NON-ZERO
  • AUGER-CANCELLED

For canceled jobs, this command re-submits them:

swif retry-jobs -resurrect -workflow sim1_2_1 4115978

Note that a non-zero exit code does not preclude a result of SUCCESS.

Some correlations seen for sim1_2_1:

+----------+--------+----------+-------------+
| augerId  | status | exitCode | result      |
+----------+--------+----------+-------------+
| 33231108 | DONE   |      -11 | TIMEOUT     |
| 33271824 | DONE   |      271 | FAILED      |
| 33273275 | DONE   |      271 | CANCELLED   |
| 33171548 | DONE   |        0 | SUCCESS     |
| 33177078 | DONE   |        1 | SUCCESS     |
| 33177785 | DONE   |      -10 | OVER_RLIMIT |
+----------+--------+----------+-------------+

Maui notes

From the Moab documentation:

2.3.7 PE

The concept of the processor equivalent, or PE, arose out of the need to translate multi-resource consumption requests into a scalar value. It is not an elementary resource but rather a derived resource metric. It is a measure of the actual impact of a set of requested resources by a job on the total resources available system wide. It is calculated as follows:

PE = MAX(ProcsRequestedByJob  / TotalConfiguredProcs,
         MemoryRequestedByJob / TotalConfiguredMemory,
         DiskRequestedByJob   / TotalConfiguredDisk,
         SwapRequestedByJob   / TotalConfiguredSwap) * TotalConfiguredProcs
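
To make the formula concrete with made-up numbers: on a cluster with 100 configured processors and 400 GB of configured memory (disk and swap ample), a job requesting 2 processors and 16 GB of memory has

PE = MAX(2/100, 16/400) * 100 = MAX(0.02, 0.04) * 100 = 4

i.e., its memory request makes it count as the equivalent of 4 processors.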

From the Moab documentation:

5.3.1.A FSPOLICY – Specifying the Metric of Consumption

As Moab runs, it records how available resources are used. Each iteration (RMPOLLINTERVAL seconds) it updates fairshare resource utilization statistics. Resource utilization is tracked in accordance with the FSPOLICY parameter allowing various aspects of resource consumption information to be measured. This parameter allows selection of both the types of resources to be tracked as well as the method of tracking. It provides the option of tracking usage by dedicated or consumed resources, where dedicated usage tracks what the scheduler assigns to the job and consumed usage tracks what the job actually uses.

The available metrics:

  • DEDICATEDPES: usage tracked by processor-equivalent seconds dedicated to each job, based on the total number of dedicated processor-equivalent seconds delivered in the system. Useful in dedicated and shared node environments.
  • DEDICATEDPS: usage tracked by processor seconds dedicated to each job, based on the total number of dedicated processor seconds delivered in the system. Useful in dedicated node environments.
  • DEDICATEDPS%: usage tracked by processor seconds dedicated to each job, based on the total number of dedicated processor seconds available in the system.
  • [NONE]: disables fairshare.
  • UTILIZEDPS: usage tracked by processor seconds used by each job, based on the total number of utilized processor seconds delivered in the system. Useful in shared node/SMP environments.

Example 5-5:

An example may clarify the use of the FSPOLICY parameter. Assume a 4-processor job runs a parallel /bin/sleep for 15 minutes. It will have a dedicated fairshare usage of 1 processor-hour (4 processors × 15 minutes = 60 processor-minutes), but a consumed fairshare usage of essentially nothing, since sleep consumes no appreciable CPU time. Most often, dedicated fairshare usage is used on dedicated resource platforms, while consumed tracking is used in shared SMP environments.

FSPOLICY    DEDICATEDPS%
FSINTERVAL  24:00:00
FSDEPTH     28
FSDECAY     0.75


From the Maui documentation:

FSPOLICY
  Format:  one of the following: DEDICATEDPS, DEDICATEDPES
  Default: [NONE]
  Details: specifies the unit used to track fairshare usage; DEDICATEDPS tracks dedicated processor seconds, DEDICATEDPES tracks dedicated processor-equivalent seconds
  Example:

    FSPOLICY DEDICATEDPES

  (Maui will track fairshare usage by dedicated processor-equivalent seconds)

From Jie Chen:
We are using Maui version 3.2.6p19.

ROOT, Python 2, CentOS7, ifarm1402

Building ROOT 6 on ifarm1402 (CentOS 7) against the Python 2 installation in /apps did not succeed. The linker error below shows why: the static library /apps/python/2.7.12/lib/libpython2.7.a was compiled without -fPIC, so it cannot be linked into the shared libPyROOT.so.

[ 88%] Building CXX object bindings/pyroot/CMakeFiles/PyROOT.dir/src/TTupleOfInstances.cxx.o
Linking CXX shared library ../../lib/libPyROOT.so
/usr/bin/ld: /apps/python/2.7.12/lib/libpython2.7.a(myreadline.o): relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a shared object; recompile with -fPIC
/apps/python/2.7.12/lib/libpython2.7.a: could not read symbols: Bad value
collect2: error: ld returned 1 exit status
gmake[4]: *** [lib/libPyROOT.so] Error 1
gmake[4]: Leaving directory `/w/halld-scifs1a/home/gluex/gluex_top/root/root-6.06.08/build_dir'
gmake[3]: *** [bindings/pyroot/CMakeFiles/PyROOT.dir/all] Error 2
gmake[3]: Leaving directory `/w/halld-scifs1a/home/gluex/gluex_top/root/root-6.06.08/build_dir'
gmake[2]: *** [all] Error 2
gmake[2]: Leaving directory `/w/halld-scifs1a/home/gluex/gluex_top/root/root-6.06.08/build_dir'
make[1]: *** [root-6.06.08/.build_done] Error 2
make[1]: Leaving directory `/w/halld-scifs1a/home/gluex/gluex_top/root'
make: *** [root_build] Error 2

Structure of typical job info table in jproj

MariaDB [farming]> describe dc_03_reconJob;
+-----------------+---------------+------+-----+-------------------+-----------------------------+
| Field           | Type          | Null | Key | Default           | Extra                       |
+-----------------+---------------+------+-----+-------------------+-----------------------------+
| id              | int(11)       | NO   | PRI | NULL              | auto_increment              |
| run             | int(11)       | YES  |     | NULL              |                             |
| file            | int(11)       | YES  |     | NULL              |                             |
| jobId           | int(11)       | YES  |     | NULL              |                             |
| timeChange      | timestamp     | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| username        | varchar(64)   | YES  |     | NULL              |                             |
| project         | varchar(64)   | YES  |     | NULL              |                             |
| name            | varchar(64)   | YES  |     | NULL              |                             |
| queue           | varchar(64)   | YES  |     | NULL              |                             |
| hostname        | varchar(64)   | YES  |     | NULL              |                             |
| nodeTags        | varchar(64)   | YES  |     | NULL              |                             |
| coresRequested  | int(11)       | YES  |     | NULL              |                             |
| memoryRequested | int(11)       | YES  |     | NULL              |                             |
| status          | varchar(64)   | YES  |     | NULL              |                             |
| exitCode        | int(11)       | YES  |     | NULL              |                             |
| result          | varchar(64)   | YES  |     | NULL              |                             |
| timeSubmitted   | datetime      | YES  |     | NULL              |                             |
| timeDependency  | datetime      | YES  |     | NULL              |                             |
| timePending     | datetime      | YES  |     | NULL              |                             |
| timeStagingIn   | datetime      | YES  |     | NULL              |                             |
| timeActive      | datetime      | YES  |     | NULL              |                             |
| timeStagingOut  | datetime      | YES  |     | NULL              |                             |
| timeComplete    | datetime      | YES  |     | NULL              |                             |
| walltime        | varchar(8)    | YES  |     | NULL              |                             |
| cput            | varchar(8)    | YES  |     | NULL              |                             |
| mem             | varchar(64)   | YES  |     | NULL              |                             |
| vmem            | varchar(64)   | YES  |     | NULL              |                             |
| script          | varchar(1024) | YES  |     | NULL              |                             |
| files           | varchar(1024) | YES  |     | NULL              |                             |
| error           | varchar(1024) | YES  |     | NULL              |                             |
+-----------------+---------------+------+-----+-------------------+-----------------------------+
30 rows in set (0.00 sec)
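
For reference, a query along these lines (hypothetical, but using only columns from the table above) would pull out status/exitCode/result correlations of the kind shown for sim1_2_1 earlier:

SELECT status, exitCode, result, COUNT(*) AS nJobs
  FROM dc_03_reconJob
 GROUP BY status, exitCode, result
 ORDER BY nJobs DESC;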

Workflow Status Meeting, October 21, 2014

  • presently a scaffold
  • can create workflow
    • can have phases, one phase finishes before the next begins
  • can pause workflow
    • pending jobs are cancelled
  • can cancel a workflow
  • other issues
    • ability to specify login shell
    • add output files dynamically, i.e., within the job
    • start with a clean environment, just the pbs, auger, and swif info
  • server wakes up and checks status on each workflow
    • can release jobs, will optimize tape access
    • checks for errors, errors must be cleared before proceeding
  • suggestion: update status of running jobs

Multiple output files in farm jobs

Each output file needs a:

  • staging directory
  • mss directory (doubles as the cache directory)

Each job has a one-to-many relationship with its output files.

  • adding a column for each type of file seems awkward
  • use a separate table for file status, independent of job status (see the sketch below)
    • have file types defined
    • row tracks status of a particular file
    • foreign key to jobs

First, the files need to be defined in a configurable way, not on the command line for a particular project.
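
As a starting point, a minimal sketch of what such a file-status table might look like (hypothetical table and column names, following the conventions of the jproj job table above):

CREATE TABLE dc_03_reconFile (
  id         INT(11)      NOT NULL AUTO_INCREMENT,
  jobId      INT(11),              -- foreign key to the job table
  fileType   VARCHAR(64),          -- one of the defined file types
  stagingDir VARCHAR(1024),        -- staging directory for this file
  mssDir     VARCHAR(1024),        -- mss directory, doubling as the cache directory
  status     VARCHAR(64),          -- status of this particular file
  timeChange TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
                       ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (id),
  FOREIGN KEY (jobId) REFERENCES dc_03_reconJob(id)
);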