Resolved: Hopper /scratch file system slow
August 6, 2014
Since late last week, users have reported that scripts hang when copying files to or from the /scratch file system, and that jobs running in /scratch are slower than before.
If it is convenient, move your workflow to /scratch2 temporarily and avoid any reference to /scratch. This includes your input file directory, your batch job submission directory, your executable file location, and the $TMPDIR setting (which is set to $SCRATCH by default).
This problem has been resolved. It was caused by a failover of the MSS (IO management) server to its backup server on July 22. After the failover, the /scratch file system lost its default stripe count of 2, so new files were created with a stripe count of 1. The default stripe count was set back to 2 on Aug 5. A file stripe count of 1 triggered a bug that leaves "cp -p" in a loop calling the system function ioctl(). The bug is fixed in Lustre versions 1.8.6 and later. Hopper is on track to upgrade to Lustre 2.4 soon.
To check what stripe count a file or directory has:
% lfs getstripe my_file_or_dir
The hang may still occur with files or directories that have a stripe count of 1. The stripe count can be changed to 2 with the following steps:
% mv my_dir my_dir.orig
% cp -r my_dir.orig my_dir
The new my_dir will have a stripe count of 2.
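For users with several directories to fix, the two steps above could be wrapped in a small shell function. This is only a sketch: restripe_dir is a hypothetical name, not a NERSC-provided tool. It assumes the copy runs on a file system whose default stripe count is already back to 2. It uses "cp -r" as in the steps above, so timestamps and permissions are not preserved.

```shell
#!/bin/sh
# Hypothetical helper (not a NERSC tool): re-create a directory by copying it,
# so that the new files pick up the file system's current default stripe count.
restripe_dir() {
    dir=${1%/}              # strip one trailing slash, if present
    backup="$dir.orig"

    if [ -z "$dir" ] || [ ! -d "$dir" ]; then
        echo "restripe_dir: '$1' is not a directory" >&2
        return 1
    fi
    # Refuse to clobber a backup left over from an earlier run.
    if [ -e "$backup" ]; then
        echo "restripe_dir: '$backup' already exists; not overwriting" >&2
        return 1
    fi

    # Same two steps as above: rename the original, then copy it back.
    mv "$dir" "$backup" || return 1
    if ! cp -r "$backup" "$dir"; then
        echo "restripe_dir: copy failed; original is in '$backup'" >&2
        return 1
    fi
}
```

After running "restripe_dir my_dir", check the result with "lfs getstripe my_dir" and remove my_dir.orig only once the new copy has been verified.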