Historical CPU usage, simulation

From an old data challenge or an early sim?.? production; the exact source was not recorded.

  • 15 hours
  • 30,000 events
  • gives 1.8 s/event

Used this number in the 2018 All-Hands Meeting talk.
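
For reference, the arithmetic behind the 1.8 s/event figure can be checked with a few lines of Python. This is only a sketch: the function name is made up here, and it assumes the 15 hours is the total CPU time for all 30,000 events.

```python
def seconds_per_event(cpu_hours, n_events):
    """Average CPU time per event, in seconds."""
    return cpu_hours * 3600.0 / n_events


# Numbers from the list above: 15 hours for 30,000 events.
rate = seconds_per_event(15, 30_000)
print(f"{rate:.1f} s/event")  # 1.8 s/event
```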



Replacing sim1_1 file in tape library

Tried to put the new file:

ifarm1102:gxproj4:sim1_1> jproj.pl sim1_1 jput
jproj.pl jput: command = cd /volatile/halld/gluex_simulations/sim1_1/rest ; jput *011454_0026* /mss/halld/gluex_simulations/sim1_1/rest/
FATAL Bad Request - A stub file already exists at /mss/halld/gluex_simulations/sim1_1/rest/dana_rest_011454_0026.hddm

jput 1 files

ls on the old file:

> ls -l /cache/halld/gluex_simulations/sim1_1/rest/dana_rest_011454_0026.hddm
-r--r--r-- 1 gxproj4 halld-1 7340032 Nov 19 18:15 /cache/halld/gluex_simulations/sim1_1/rest/dana_rest_011454_0026.hddm

ran tapeRemove on the old cache file:

jcache tapeRemove /cache/halld/gluex_simulations/sim1_1/rest/dana_rest_011454_0026.hddm

deleted /cache/halld/gluex_simulations/sim1_1/rest/dana_rest_011454_0026.hddm

redid the jput on the new file in volatile:

jproj.pl sim1_1 jput
jproj.pl jput: command = cd /volatile/halld/gluex_simulations/sim1_1/rest ; jput *011454_0026* /mss/halld/gluex_simulations/sim1_1/rest/
154867173 Staging        /lustre/expphy/volatile/halld/gluex_simulations/sim1_1/rest/dana_rest_011454_0026.hddm   /mss/halld/gluex_simulations/sim1_1/rest/dana_rest_011454_0026.hddm
154867173 Staged         /lustre/expphy/volatile/halld/gluex_simulations/sim1_1/rest/dana_rest_011454_0026.hddm   /mss/halld/gluex_simulations/sim1_1/rest/dana_rest_011454_0026.hddm
154867173 Running        /lustre/expphy/volatile/halld/gluex_simulations/sim1_1/rest/dana_rest_011454_0026.hddm   /mss/halld/gluex_simulations/sim1_1/rest/dana_rest_011454_0026.hddm
154867173 Done           /lustre/expphy/volatile/halld/gluex_simulations/sim1_1/rest/dana_rest_011454_0026.hddm   /mss/halld/gluex_simulations/sim1_1/rest/dana_rest_011454_0026.hddm
jput 1 files

new stub file is:

> cat /mss/halld/gluex_simulations/sim1_1/rest/dana_rest_011454_0026.hddm
creationTime=2016-11-19 18:41:14

jcache status gives:

Sun Nov 20 20:56:05 EST 2016
iteration = 89
get request: 4765609
user: gxproj4
status: pending
/cache/halld/gluex_simulations/sim1_1/rest/dana_rest_011454_0026.hddm -> failed
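
The sequence tried above (tapeRemove, delete the cache copy, jput again) can be written down as a small dry-run helper. This is a sketch only: the function name is hypothetical, nothing is executed, the commands are copied from the transcript rather than from any jcache/jput documentation, and how the cache copy was deleted is not recorded, so that step is left as a comment. Note that the last jcache status above still shows "failed", so these notes do not establish that this is the recommended procedure.

```python
from pathlib import PurePosixPath


def replacement_plan(rel_path):
    """Assemble, as strings, the steps used above to replace one file on tape.

    rel_path is the part of the path shared by the /cache, /mss and
    /volatile trees. Nothing is run; the caller decides what to do with
    the strings.
    """
    rel = PurePosixPath(rel_path)
    cache_file = PurePosixPath("/cache") / rel
    volatile_dir = (PurePosixPath("/volatile") / rel).parent
    mss_dir = (PurePosixPath("/mss") / rel).parent
    return [
        f"jcache tapeRemove {cache_file}",
        # The notes only say the cache copy was deleted, not how.
        f"# delete {cache_file}",
        f"cd {volatile_dir} ; jput {rel.name} {mss_dir}/",
    ]


for step in replacement_plan(
    "halld/gluex_simulations/sim1_1/rest/dana_rest_011454_0026.hddm"
):
    print(step)
```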

Sim1.1 Problem Jobs


  • why no printout of fort.15?
  • why no printout of control.in?
  • why does random number seed not make a difference in the jobs?


  • all three jobs for run 10871, file 8 bomb with seg fault from bggen
  • same for all three for same run, file 7
  • for file 7 and 8, hdgeant reports “tracking abandoned” first on the same event, 2034
  • look at a successful run: 10391, files 1-34
    • file 1: tracking abandoned on event 11366
    • file 2: event 4114
    • file 3: event 20588
    • fort.15 gets typed out
    • control.in gets typed out
  • scan .out files for event where tracking abandoned
  • The 167 failed jobs are as follows:
    • run 10727, files 1-3
    • run 10777, files 1-57
    • run 10871, files 1-49
    • run 11140, files 1-58
    • these are all of the files in each run
  • Failure mode: hdgeant runs forever when it is given the output of a bggen job that seg faulted. Evidence: extremely large event numbers in the hdgeant output.
  • Running test on run 10727, collimator comes back as “Unknown”
  • Collimator is Unknown in each of these runs.
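
For the "scan .out files" item above, a minimal Python sketch follows. It only assumes that the phrase "tracking abandoned" appears somewhere on the relevant line, as quoted in these notes; the exact hdgeant message format is not known here, so the matching line is returned whole rather than parsed for an event number. The function names and the "*.out" glob are choices made for this sketch.

```python
import re
from pathlib import Path

# Phrase quoted in the notes; case-insensitive in case the log capitalizes it.
ABANDONED = re.compile(r"tracking abandoned", re.IGNORECASE)


def first_abandoned(lines):
    """Return (line_number, text) of the first 'tracking abandoned' line, or None."""
    for number, text in enumerate(lines, start=1):
        if ABANDONED.search(text):
            return number, text.rstrip("\n")
    return None


def scan_directory(directory, pattern="*.out"):
    """Map each matching file name to its first 'tracking abandoned' line (or None)."""
    results = {}
    for path in sorted(Path(directory).glob(pattern)):
        with path.open(errors="replace") as handle:
            results[path.name] = first_abandoned(handle)
    return results
```

Calling scan_directory on the directory holding the job output gives one entry per file, which makes it easy to see whether different jobs for the same run and file stop at the same place.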

From standard error for one failed job:

cp: cannot stat `run.ffr.Unknown_coll.template': No such file or directory
cp: cannot stat `run.ffr': No such file or directory
cp: cannot stat `control.in_Unknown_coll': No such file or directory
control.in: No such file or directory.
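
The errors above come from building file names out of a collimator value that was never resolved. A guard along the following lines would stop the job before bggen and hdgeant run. This is a sketch under assumptions: the file-name patterns are inferred from the two cp messages, and the collimator labels in the usage example are placeholders, not the values the job script actually uses.

```python
def template_names(collimator, known_collimators):
    """Return the per-collimator input file names, or raise if the
    collimator value is not one of the expected labels."""
    if collimator not in known_collimators:
        raise ValueError(
            f"collimator {collimator!r} is not one of {sorted(known_collimators)}"
        )
    return {
        # Patterns inferred from the cp errors in standard error.
        "run_ffr_template": f"run.ffr.{collimator}_coll.template",
        "control_in": f"control.in_{collimator}_coll",
    }


# Placeholder labels for illustration only.
KNOWN = {"3.4mm", "5.0mm"}

try:
    template_names("Unknown", KNOWN)
except ValueError as error:
    print(f"refusing to run: {error}")
```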

Sim1.1 Notes 2

  • hd_root was bombing at the beginning. Output files were stubs.
  • Sean pointed this out.
  • Error: calling terminate after throwing an exception of type long int (or something like that)
    • Cured by removing the TS_scaler plugin
  • Next error: seg fault
    • Cured by removing the L1_online plugin
  • Paul recommends removal of the occupancy_online and EPICS_dump plugins, the latter since there are no EPICS events in MC.
  • most jobs are crashing with seg faults
  • 010438_0001 crashed quickly, rest file has 355 events
    • gets past that part when run interactively
  • standard error mentions PSPair_online a lot; drop that plugin.
  • drop the timeout option
  • drop the skim plugin

Sim1.1 Notes

  • there is a sim1.1 branch in the repository, Sean says to ignore it
  • example of awk to get collimator
    > rcnd 11555 collimator_diameter | awk '{print $1}'
    > rcnd 1 collimator_diameter | awk '{print $1}'
  • On the value of the random number seed for bggen, from http://goo.gl/YWLymG:

    No initialization is necessary if the user wants default values. Otherwise the following are available:

        CALL RLUXGO(LUX,INT,K1,K2)

    When K1=K2=0, this call initializes the RANLUX generator from one 32-bit integer INT and sets the Luxury Level.

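    Assuming a signed 32-bit integer, the maximum positive value is 2147483647 (2^31 - 1), so we need to keep the seed at or below this. A 5-digit run number concatenated with a 4-digit file number has at most 9 digits (maximum 999999999), which satisfies this.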

  • from Sean, 6/29/15: By the way, here is a list of the “golden runs” and the number of triggers in each of them: /work/halld/home/gxproj3/rp2016-02-runs_events
  • from Sean list of plugins, email from 6/14:
  • from Sean, email of 6/15, additional option to hd_root:
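
Returning to the bggen random number seed discussed above: a minimal sketch of the run-plus-file concatenation, with a check against the signed 32-bit limit. The function name is made up here, and treating the file number as zero-padded to 4 digits is an assumption based on the "4 digit file" wording.

```python
INT32_MAX = 2**31 - 1  # 2147483647, largest signed 32-bit integer


def bggen_seed(run, file_number):
    """Concatenate a run number (up to 5 digits) and a file number
    (zero-padded to 4 digits) into a single integer seed."""
    if not 0 <= run <= 99_999:
        raise ValueError(f"run {run} does not fit in 5 digits")
    if not 0 <= file_number <= 9_999:
        raise ValueError(f"file number {file_number} does not fit in 4 digits")
    seed = run * 10_000 + file_number
    # Always true for 9-digit values; kept as a safety net.
    assert seed <= INT32_MAX
    return seed


print(bggen_seed(10871, 8))  # 108710008
```

The largest value this can produce is 999999999, comfortably below 2147483647.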

Sim1 Notes, Time Estimate

Estimate of parameters for farm jobs.

  • test job: 3000 events
  • resources used by the test job: cput=03:08:06,mem=781664kb,vmem=1924160kb,walltime=02:58:51
  • for a 24 hour job, want 3000*(24/3) = 24000 events, make it 25000 events
  • ask for 30 hours = 1800 minutes
  • ask for 2.5 GB of memory
  • test job used 410 MB of disk; round up to 500 MB and scale by 8 (24 hours / 3 hours): ask for 8*500 MB = 4 GB
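
The scaling above can be cross-checked with a short Python sketch. It assumes walltime and disk usage scale linearly with the number of events, which the notes imply but do not verify; the variable and function names are made up for this example.

```python
def hms_to_seconds(hms):
    """Convert an HH:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Measured values from the 3000-event test job.
test_events = 3000
test_wall_s = hms_to_seconds("02:58:51")
test_disk_mb = 410

# Planned job size.
job_events = 25_000
scale = job_events / test_events

est_wall_h = test_wall_s * scale / 3600  # linear scaling assumed
est_disk_mb = test_disk_mb * scale       # linear scaling assumed

print(f"estimated walltime: {est_wall_h:.1f} h (requesting 30 h)")
print(f"estimated disk: {est_disk_mb:.0f} MB (requesting 4000 MB)")
```

Under these assumptions a 25000-event job needs a little under 25 hours and about 3.4 GB of disk, so the 30-hour and 4 GB requests leave some margin.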