Notes on Singularity and GlueX software

July 29, 2017

Things to do:

  • Figure out how to copy files to OASIS
  • Figure out how to copy containers to Singularity CVMFS

Useful commands:

singularity expand centos7.img
sudo /usr/local/bin/singularity shell --writable centos7.img
/usr/local/bin/singularity shell --bind /group/halld:/group/halld centos7.img

Making the MySQL shared library visible to the container:

> eval `addpath.pl -l /home/marki/lib`
> addpath.pl -l /home/marki/lib
LD_LIBRARY_PATH=/home/marki/lib:/group/halld/Software/builds/Linux_CentOS7-x86_64-gcc4.8.5/gluex_root_analysis/gluex_root_analysis-0.2^ccdb165/Linux_CentOS7-x86_64-gcc4.8.5/lib/:/group/halld/Software/builds/Linux_CentOS7-x86_64-gcc4.8.5/evio/evio-4.4.6/Linux-x86_64/lib:/group/halld/Software/builds/Linux_CentOS7-x86_64-gcc4.8.5/rcdb/rcdb_0.01/cpp/lib:/group/halld/Software/builds/Linux_CentOS7-x86_64-gcc4.8.5/ccdb/ccdb_1.06.05/lib:/group/halld/Software/builds/Linux_CentOS7-x86_64-gcc4.8.5/geant4/geant4.10.02.p02/lib64:/group/halld/Software/builds/Linux_CentOS7-x86_64-gcc4.8.5/root/root-6.08.06/lib:/group/halld/Software/builds/Linux_CentOS7-x86_64-gcc4.8.5/xerces-c/xerces-c-3.1.4/lib:/.singularity.d/libs
> ls /home/marki/lib
libmysqlclient.so.20
> cp /usr/lib64/mysql/libmysqlclient.so.20 /u/scratch/marki
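The fix above works because /home/marki/lib sits first in LD_LIBRARY_PATH, so the loader finds the copied libmysqlclient.so.20 there before it tries the other directories on that list. Below is a minimal sketch of a helper for checking which LD_LIBRARY_PATH entry would supply a given library. The function name find_in_ld_path is hypothetical, and it only models the LD_LIBRARY_PATH step of the search: RPATH/RUNPATH, /etc/ld.so.cache and the default system directories are not considered.

```shell
# find_in_ld_path LIBNAME [PATHLIST]   (hypothetical helper)
# Print the first directory in PATHLIST (default: $LD_LIBRARY_PATH) that
# contains LIBNAME, scanning left to right like the LD_LIBRARY_PATH step of
# the dynamic loader. RPATH/RUNPATH, /etc/ld.so.cache and the default system
# directories are NOT checked. Returns 1 if no entry contains LIBNAME.
find_in_ld_path() {
    lib=$1
    pathlist=${2-$LD_LIBRARY_PATH}
    saved_ifs=$IFS
    IFS=:          # split the path list on colons
    set -f         # ...without glob expansion of the pieces
    found=
    for dir in $pathlist; do
        [ -n "$dir" ] || continue      # empty entries are skipped here
        if [ -e "$dir/$lib" ]; then
            found=$dir
            break
        fi
    done
    IFS=$saved_ifs
    set +f
    [ -n "$found" ] || return 1
    printf '%s\n' "$found"
}
```

Inside the container (assuming /home/marki/lib is visible there, e.g. through the home directory or a --bind), `find_in_ld_path libmysqlclient.so.20` should print /home/marki/lib if the copied library is the one being picked up.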

Try to find where the non-standard library is coming from, on ifarm1401:

> repoquery -f /usr/lib64/mysql/libmysqlclient.so.20.3.2
mysql-community-libs-0:5.7.15-1.el7.x86_64
> repoquery -i mysql-community-libs-0:5.7.15-1.el7.x86_64

Name        : mysql-community-libs
Version     : 5.7.15
Release     : 1.el7
Architecture: x86_64
Size        : 9898444
Packager    : MySQL Release Engineering <mysql-build@oss.oracle.com>
Group       : Applications/Databases
URL         : http://www.mysql.com/
Repository  : mysql
Summary     : Shared libraries for MySQL database client applications
Source      : mysql-community-5.7.15-1.el7.src.rpm
Description :
This package contains the shared libraries for MySQL client
applications.

Tracking down which MySQL shared library is needed:

In the CentOS7 Singularity container:

> mysql --version
mysql Ver 15.1 Distrib 5.5.52-MariaDB, for Linux (x86_64) using readline 5.1

On ifarm1402:

> mysql --version
mysql Ver 14.14 Distrib 5.7.15, for Linux (x86_64) using EditLine wrapper
> ldd `which hd_root` | grep mysql
libmysqlclient.so.20 => /usr/lib64/mysql/libmysqlclient.so.20 (0x00007f670896c000)

On lorentz:

> mysql --version
mysql Ver 15.1 Distrib 5.5.52-MariaDB, for Linux (x86_64) using readline 5.1
> ldd `which hd_root` | grep mysql
libmysqlclient.so.18 => /usr/lib64/mysql/libmysqlclient.so.18 (0x00007f09a0cfd000)
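The comparison shows the mismatch: the ifarm build of hd_root is linked against libmysqlclient.so.20 from the MySQL 5.7.15 client, while the lorentz build resolves libmysqlclient.so.18 from MariaDB 5.5.52, the same client version the CentOS7 container reports. A small filter over ldd output makes this check quicker to repeat on each machine. This is a sketch: ldd_resolve is a hypothetical name, and the awk parsing assumes the usual glibc ldd line formats `soname => path (address)` and `soname => not found`.

```shell
# ldd_resolve PATTERN   (hypothetical helper)
# Filter `ldd` output read on stdin. For every dependency whose soname
# matches the awk regex PATTERN, print "<soname> <resolved path>", or
# "<soname> NOT-FOUND" when ldd reports "not found".
ldd_resolve() {
    awk -v pat="$1" '
        $2 == "=>" && $1 ~ pat {
            if ($3 == "not") print $1, "NOT-FOUND"
            else             print $1, $3
        }'
}
```

Usage: `ldd "$(which hd_root)" | ldd_resolve mysql`, run both on the host and inside the container.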

Special MySQL yum repo on ifarm:

> pushd /etc/yum.repos.d
/etc/yum.repos.d /u/scratch/marki
ifarm1402:marki:yum.repos.d> ls
core72.repo epel-testing.repo.bak mysql.repo scicomp-extras.repo
epel.repo eple.repo.bak2 salt.repo
ifarm1402:marki:yum.repos.d> cat mysql.repo
# mysql rhel7 mirror
[mysql]
name = MySQL Community
baseurl=http://sca1401/yum/centos72/mysql_c7
gpgcheck=0

GlueX and the Open Science Grid

  • at JLab
    • submit host installed for submitting jobs to the OSG
    • SciComp did the installation in consultation with OSG experts
    • authorized users log in with CUE credentials
    • JLab users submit jobs to the OSG with the installed software
  • at Collaborating Institutions
    • a meeting was held with a (different) set of OSG personnel to discuss contributing university-based clusters to the OSG infrastructure
    • this would make these nodes available for general GlueX computing
    • UConn, NU already contributing
    • prospective contributions from CMU, IU, FIU, FSU

SRM Reading List

From 2013-09-04

http://en.wikipedia.org/wiki/Storage_Resource_Manager
https://www.opensciencegrid.org/twiki/pub/Education/CourseMaterials/SRM.ppt
https://www.opensciencegrid.org/bin/view/Documentation/Release3/NavUserApplications
https://www.opensciencegrid.org/bin/view/Documentation/SrmBasics
http://www.youtube.com/watch?v=4hsL6vTc1Yg

Speeding Up the SRM

Transferring 100 files from run 9003 from UConn, with 6 other streams running on ifarm1101:

ifarm1101:marki:srm_test> date ; srmcp -copyjobfile=copyjob.txt ; date
Wed May 28 10:41:31 EDT 2014
Wed May 28 11:05:27 EDT 2014
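The two date stamps give the throughput for this test. A short sketch of the arithmetic, assuming both timestamps fall on the same day (hms_to_sec is a hypothetical helper):

```shell
# hms_to_sec HH:MM:SS  ->  seconds since midnight   (hypothetical helper)
hms_to_sec() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Timestamps copied from the `date` output above (same day assumed)
start=$(hms_to_sec 10:41:31)
end=$(hms_to_sec 11:05:27)
elapsed=$((end - start))
nfiles=100

echo "elapsed: $elapsed s"
awk -v t="$elapsed" -v n="$nfiles" 'BEGIN { printf "%.2f s per file\n", t / n }'
```

This prints `elapsed: 1436 s` (23 min 56 s) and `14.36 s per file`.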

The srmcp help message says that using multiple streams requires the server to be in active mode. Try:

srmcp -debug -server_mode=active -copyjobfile=copyjob_short.txt

where copyjob_short.txt lists three DC2 files. This fails with the following error:

“””””
copying CopyJob, source = gsiftp://stat26.phys.uconn.edu:2811/Gluex/test/dana_rest_09003_4052210.hddm destination = file:////scratch/marki/srm_test/data/dana_rest_09003_4052210.hddm
2014-05-28 15:36:24,976 [Thread-73] ERROR org.dcache.srm.util.GridftpClient – org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: Server reported transfer failure (error code 1) [Nested exception message:  Custom message: Unexpected reply: 426 Transfer failed due to unexpected exception: java.net.ConnectException: Connection timed out] [Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException:  Custom message: Unexpected reply: 426 Transfer failed due to unexpected exception: java.net.ConnectException: Connection timed out]
2014-05-28 15:36:24,979 [Thread-1] ERROR org.dcache.srm.util.GridftpClient –  transfer exception
org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: Server reported transfer failure (error code 1) [Nested exception message:  Custom message: Unexpected reply: 426 Transfer failed due to unexpected exception: java.net.ConnectException: Connection timed out]
at org.globus.ftp.exception.ServerException.embedUnexpectedReplyCodeException(ServerException.java:101) ~[gridftp-2.0.5-dcache-rc4.jar:na]
at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:182) ~[gridftp-2.0.5-dcache-rc4.jar:na]
at org.globus.ftp.vanilla.TransferMonitor.start(TransferMonitor.java:109) ~[gridftp-2.0.5-dcache-rc4.jar:na]
at org.globus.ftp.FTPClient.transferRunSingleThread(FTPClient.java:1488) ~[gridftp-2.0.5-dcache-rc4.jar:na]
at org.globus.ftp.FTPClient.get2(FTPClient.java:1751) ~[gridftp-2.0.5-dcache-rc4.jar:na]
at org.dcache.srm.util.GridftpClient$TransferThread.run(GridftpClient.java:886) ~[srm-2.2.11-SNAPSHOT.jar:2.2.11-SNAPSHOT]
at java.lang.Thread.run(Thread.java:662) [na:1.6.0_31]
copy failed with the error
org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: Server reported transfer failure (error code 1) [Nested exception message:  Custom message: Unexpected reply: 426 Transfer failed due to unexpected exception: java.net.ConnectException: Connection timed out] [Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException:  Custom message: Unexpected reply: 426 Transfer failed due to unexpected exception: java.net.ConnectException: Connection timed out]
try again
sleeping for 180000 before retrying
“””””

A similar error occurs when trying -streams_num=4, with or without -server_mode=active.

We might be suffering from the lots-of-small-files problem (see http://toolkit.globus.org/alliance/publications/papers/Pipelining.pdf); the server time-out in the error seems consistent with that theory.

A somewhat similar error is reported at http://www.lofar.org/wiki/doku.php?id=public:grid:troubleshooting ; the fix recommended there is setting streams_num=1, but that error is “no route to host”, not “connection timed out”.
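The copyjob files used in these tests hold one transfer per line. The sketch below generates such files from a list of bare file names, splitting them into batches so that a large transfer can be issued as several smaller srmcp requests. It assumes the line format is "source-URL destination-URL" separated by whitespace (worth confirming against `srmcp -help`); the base URLs are copied from the failing transfer in the error log above, and make_copyjobs, SRC_BASE and DST_BASE are hypothetical names.

```shell
# make_copyjobs PREFIX BATCH  < list-of-file-names   (hypothetical helper)
# Write PREFIX_001.txt, PREFIX_002.txt, ... each holding at most BATCH
# "source-URL destination-URL" lines, one transfer per line.
# SRC_BASE/DST_BASE are illustrative, copied from the transfer in the log above.
SRC_BASE=gsiftp://stat26.phys.uconn.edu:2811/Gluex/test
DST_BASE=file:////scratch/marki/srm_test/data

make_copyjobs() {
    awk -v src="$SRC_BASE" -v dst="$DST_BASE" -v p="$1" -v n="$2" '
        NF == 0 { next }    # ignore blank lines
        {
            out = sprintf("%s_%03d.txt", p, int(count / n) + 1)
            print src "/" $1, dst "/" $1 > out
            count++
        }'
}
```

For example, with the remote file names listed in files.txt, `make_copyjobs copyjob 20 < files.txt` followed by `for f in copyjob_*.txt; do srmcp -copyjobfile="$f"; done` would issue the transfers as a series of requests of at most 20 files each. Whether smaller requests help with the time-outs above has not been tested.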

OpenShop Certificates at JLab for the SRM

On May 14, Richard added new host certificates to the appropriate directory under /apps/osg to cover some of the SRM server hosts at UConn whose certificates were issued under “OpenShop”, his private label. Those hosts will later be switched to DigiCert certificates, which is the more standard choice.

This had been causing intermittent failures whose frequency depended on the number of files requested. If more than about 20 files were in a single SRM request, it was likely that one of the untrusted servers would be involved, and the transfer would bomb on the client side, leaving no trace on the server except a log entry indicating that the client had “canceled” the request.
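The dependence on request size is what one would expect if each file in a request has some small, independent chance p of being routed to one of the untrusted servers. Under that simplifying assumption (p is not given in these notes and is purely illustrative), the probability that a request of n files fails, and the value of p at which a 20-file request becomes more likely than not to fail, are:

```latex
P_{\text{fail}}(n) = 1 - (1 - p)^{n},
\qquad
P_{\text{fail}}(20) > \tfrac{1}{2}
\;\Longleftrightarrow\;
p > 1 - 2^{-1/20} \approx 0.034 .
```

So failures becoming likely at around 20 files would correspond to roughly one file in thirty hitting an untrusted host, while small requests would usually succeed, which fits the intermittent behavior described.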