Environment on Genepool
When you log into the Genepool system you will land in your $HOME directory on NERSC's "global homes" file system. The global homes file system is mounted across all NERSC computation systems with the exception of PDSF. The $HOME directory has a quota of 40GB and 1,000,000 inodes. To customize your environment by setting environment variables or aliases, you will need to modify one of the "dot" files that NERSC has created for you. You may NOT modify the .bashrc or .cshrc files. These are set to read-only on NERSC systems and specify system-specific customizations. Instead, you should modify a file called .bashrc.ext or .cshrc.ext.
Important Environment Variables
The dotfiles work by setting a number of environment variables. In particular, the following are set for you, and extreme care should be taken before adjusting these:
|$HOME||points to the location of your home directory in the filesystem|
|$BSCRATCH||points to the location of your projectb scratch directory in the filesystem|
|$SCRATCH||points to the "best" scratch directory you have access to on the current system; note this is very context dependent. In most cases, it is strongly recommended you use $BSCRATCH. Currently, on Genepool, $SCRATCH points to your projectb scratch, but $SCRATCH will point to completely different scratch directories on other NERSC systems.|
|$NERSC_HOST||identifies the NERSC system environment you are presently using|
|$TMPDIR||location of current temporary space. In an ssh-interactive session TMPDIR will point to $SCRATCH. In a batch-scheduled job, TMPDIR will point to a special job directory on the node local disk. TMPDIR should always be used when writing job-specific data on the compute nodes.|
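As a minimal sketch of the TMPDIR guidance above, a job script might create a private working directory under $TMPDIR rather than writing to a hard-coded location. The /tmp fallback (for running outside a Genepool session) and the "myjob" prefix are assumptions made for illustration, not NERSC settings.

```shell
# Sketch only: create a job-private directory under $TMPDIR.
# The /tmp fallback and the "myjob" prefix are illustrative assumptions.
scratch_root="${TMPDIR:-/tmp}"
job_dir="$(mktemp -d "${scratch_root%/}/myjob.XXXXXX")" || exit 1

# Write job-specific data there instead of using a hard-coded path.
echo "intermediate results" > "$job_dir/partial.txt"
echo "job data is in $job_dir"
```

Remove the directory (for example with rm -rf "$job_dir") once the job no longer needs it.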
Setting up your work environment with Modules
The JGI and NERSC have been collaborating to provide a large number of bioinformatics and other software packages on genepool. This software is made available to you through the modules system. Please read the general documentation on using modules at NERSC. There are a number of default modules. These should usually not be unloaded unless there is a very specific need to do so (e.g. swapping PrgEnv/gnu4.6 for PrgEnv/gnu4.8).
Genepool Default Modules:
- PrgEnv/gnu4.6: a meta-module which manages the GNU gcc environment and compatible libraries.
- gcc/4.6.3: all dynamically linked software in the NERSC supported software environment is built using this version of gcc; thus many software packages are dependent on this module. gcc is dynamically loaded by PrgEnv/gnu4.6.
- jgitools: adds the legacy /jgi/tools into your environment
- uge: sets up the proper environment for accessing the genepool GridEngine batch scheduler
- OFED: on Infiniband-enabled nodes, the OFED module will be loaded to support software which can make use of the high-speed interconnects
- usg-default-modules: A meta-module which loads some of the default modules (at present: gcc, OFED)
- nsg: NERSC system-level utilities (e.g. myquota)
- modules: the modules environment
- mysql: mysql client executables and libraries
- oracle_client: Oracle database client executables, libraries, and JGI configuration
- oracle-jdk: the latest JDK from Oracle
Common Environment Variables set by Modules
The modules system works by manipulating your environment. Modules typically insert directory paths at the beginning of each of several environment variables. The common environment variables set by genepool modules are:
- PATH: the PATH specifies the directories where commands can be found
- LD_LIBRARY_PATH: the directories where shared libraries can be located
- MANPATH: to find manual pages
- PYTHONPATH: search paths for python packages
- PERL5LIB: search paths for perl packages
- PKG_CONFIG_PATH: the pkgconfig system enables automated discovery of libraries by the autoconf/automake tools
- <MODULENAME>_DIR: if you load gcc, then GCC_DIR will usually be set to enable easy search: e.g. ls $GCC_DIR
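The effect of prepending can be seen without loading anything. The sketch below uses a stand-in variable, DEMO_PATH, and a made-up directory, /opt/demo/blast/bin (both hypothetical), to show that the most recently prepended entry is searched first.

```shell
# Sketch only: DEMO_PATH and /opt/demo/blast/bin are hypothetical stand-ins.
DEMO_PATH="/usr/bin:/bin"

# Loading a module typically prepends its directory, like this:
DEMO_PATH="/opt/demo/blast/bin:$DEMO_PATH"

# The first entry wins when the same command exists in several directories.
first_entry="${DEMO_PATH%%:*}"
echo "searched first: $first_entry"
echo "full value:     $DEMO_PATH"
```

The same ordering applies to any colon-separated search variable, which is why a later manual edit can shadow what a module set up.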
You should carefully consider the consequences before manually modifying any of the above environment variables. A couple of good "rules of thumb":
- If you are considering manually encoding a path with /global/common or /usr/common into your software or into your environment, please use the module instead. There are multiple paths to the software installed in this location, and the proper way to access it changes depending on your current context. It is best to let the modules set up your environment for you!
- If you need to refer to a path in the module, use the <MODULENAME>_DIR environment variable. You can see all the settings of a module by entering: "module show <modulename>"
Loading modules can have additional effects. The genepool modules are interconnected to ease the loading of dependencies. Frequently, when you load, swap, or remove a module, other modules may be loaded, swapped, or removed as well. You can see the effects of loading a module by using the "module show <module>" command.
Working with Modules for production-level batch scripts
When writing a batch script which you may share with another genepool user, it can be difficult to ensure that the other user's environment will be compatible with your batch script, usually because of differences in dotfile configurations and interactive-usage preferences. For this reason, if you choose to load (or unload) additional modules in your dotfiles, you should carefully consider how this will affect your jobs as well. One good practice for getting reproducible results from a batch script is to purge all modules from the environment, and manually construct the exact environment you need:
#!/bin/bash
module purge
module load usg-default-modules
module load blast+/2.2.27
blastn ...
Purging the module environment and then loading the specific versions of the needed modules is the recommended approach for batch scripts. Transmitting the full environment through qsub (using the "-V" flag) is not recommended because it is implicitly context-dependent and can lead to non-reproducible calculations.
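A batch script can also stop early when the module system is not available in the shell it runs in, rather than continuing with an unknown environment. The following is a minimal sketch under that assumption; setup_job_env is a hypothetical helper name, and the module names are the ones used in the batch script example in this section.

```shell
# Sketch only: setup_job_env is a hypothetical helper name.
setup_job_env() {
    # Stop early if the module command is not defined in this shell.
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not found; cannot build the job environment" >&2
        return 1
    fi
    # Start from a clean slate, then load exactly what the job needs.
    module purge
    module load usg-default-modules
    module load blast+/2.2.27
}

if setup_job_env; then
    env_status="ready"
else
    env_status="unavailable"
fi
echo "job environment: $env_status"
```

A real job script would likely exit when the status is "unavailable" instead of only reporting it.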
Loading modules by default in the dotfiles
For common tasks in an interactive environment it can be convenient to load certain modules by default for all interactive sessions. If this is needed, the recommended mechanism is to embed the module commands into your .bashrc.ext or .tcshrc.ext (depending on whether you are a bash or tcsh user). Each NERSC system has different modules, but your dotfiles are evaluated by all systems. For this reason, you should check that $NERSC_HOST is "genepool" before loading genepool modules.
## .bashrc.ext Example
if [ "$NERSC_HOST" == "genepool" ]; then
# make user-specific changes to PATH
export PATH=$HOME/scripts:$PATH
# then load modules
module load blast+
fi
...
## .tcshrc.ext Example
if ($NERSC_HOST == "genepool") then
# make user-specific changes to PATH
setenv PATH $HOME/scripts:$PATH
# then load modules
module load blast+
endif
If you alter one of the commonly manipulated environment variables in your dotfiles, it is critical that you take extreme care. For example, if you manually add /jgi/tools/bin to your PATH - and have the jgitools module loaded, the evaluation order of the PATH will likely be incorrect and you may experience unexpected side effects. It is recommended to make any manual modifications to PATH, LD_LIBRARY_PATH, and others earlier in the dotfiles than the module commands.
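One way to avoid adding a directory that is already present is to check the variable before changing it. The sketch below is an illustration, not a NERSC-provided tool: path_contains is a hypothetical helper name, and demo_path is an example value rather than the PATH of a real session.

```shell
# Sketch only: path_contains is a hypothetical helper name.
path_contains() {
    # $1 = colon-separated list, $2 = directory to look for
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example value, not read from a real session.
demo_path="/jgi/tools/bin:/usr/bin:/bin"

if path_contains "$demo_path" "/jgi/tools/bin"; then
    echo "/jgi/tools/bin is already present; not adding it again"
else
    demo_path="/jgi/tools/bin:$demo_path"
fi
```

Wrapping the list in colons on both sides lets the match succeed for the first, last, or a middle entry without matching partial directory names.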
Using modules in cron jobs
The cron environment on genepool does not have a complete environment set up. In particular, important environment variables like $SCRATCH, $HOME, $BSCRATCH, and $NERSC_HOST may not be set. Also, modules will not work. To get a proper environment for a cron job, you'll need to start a new login-style shell for your process to work in. For a simple job this can be done like:
# crontab
07 04 * * * bash -l -c "module load python; python /path/to/myScript.py"
If you need a more extensive environment setup, you can simply put the entire cronjob into a script, and call the script from your crontab.
*** script
#!/bin/bash -l
module load python
module load hdf5
...
*** crontab entry
07 04 * * * /path/to/myScript
The key to both of these methods is "bash -l", which ensures that a new, complete environment (including modules!) is initialized for the shell.
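Because a cron job can start with variables missing, a script might verify its environment before doing any real work. This is a minimal sketch of that idea; require_vars is a hypothetical helper name, and the variables checked are the ones this section lists as possibly unset under cron.

```shell
# Sketch only: require_vars is a hypothetical helper name.
require_vars() {
    missing=""
    for name in "$@"; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        # Unset and empty variables are both treated as missing.
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing environment variables:$missing" >&2
        return 1
    fi
    return 0
}

# Example: check the variables before the job proper begins.
if require_vars HOME SCRATCH BSCRATCH NERSC_HOST; then
    echo "environment looks complete"
else
    echo "environment incomplete; was this started with bash -l?" >&2
fi
```

A production script would probably exit in the failure branch so that the job never runs against a partial environment.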