Memory Considerations
Overview
Carver login nodes each have 48GB of physical memory. Most compute nodes have 24GB; however, 80 compute nodes have 48GB. Not all of this memory is available to user processes. Some memory is reserved for the Linux kernel. Furthermore, since Carver nodes have no disk, the "root" file system (including /tmp) is kept in memory ("ramdisk"). The kernel and root file system combined occupy about 4GB of memory. Therefore users should try to use no more than 20GB on most compute nodes, or 44GB on the large-memory compute nodes.
There is also a single "extra large" memory node that has four 8-core Intel X7550 ("Nehalem EX") 2.0 GHz processors (32 cores total) and 1TB memory. This node is available through the Magellan queue "mag_xlmem".
Memory Limits
Carver compute nodes have no disk for swapping virtual memory. If a user job tries to use more physical memory than is available, it can cause severe problems for the operating system, possibly leading to system crashes and/or hangs. Therefore, per-process memory limits are enforced on all login and compute nodes.
| Type of Node | Soft Limit | Hard Limit |
|---|---|---|
| Login Node | 2GB | 2GB |
| 24GB 8-core Compute Node | 2.5GB | 20GB |
| 48GB 12-core Compute Node | 3.5GB | 44GB |
The above compute node "soft" limits were chosen to allow typical MPI programs running fully "packed" (i.e., 8 processes per Nehalem node or 12 processes per Westmere node) to safely access the maximum amount of memory. There are cases where it is desirable to run fewer processes per node than the number of cores. These include:
- A single process (possibly multithreaded) that needs access to the entire hard limit.
- An MPI application where each process needs access to more than the default soft limit.
- A "mixed-model" application where each MPI process is multithreaded.
In the above cases, it will be necessary to override the default soft limit. This may be done with the PBS resource pvmem. This resource requires an integer value, so it is sometimes necessary to specify it in megabytes instead of gigabytes. The following table shows appropriate values for pvmem depending on the number of processes per node (PBS resource ppn):
| ppn | 24GB 8-core Node | 48GB 12-core Node |
|---|---|---|
| 1 | pvmem=20GB | pvmem=44GB |
| 2 | pvmem=10GB | pvmem=22GB |
| 3 | pvmem=6826MB | pvmem=15018MB |
| 4 | pvmem=5GB | pvmem=11GB |
| 5 | pvmem=4GB | pvmem=9011MB |
| 6 | pvmem=3413MB | pvmem=7509MB |
| 7 | pvmem=2925MB | pvmem=6436MB |
| 8 | not needed (default) | pvmem=5632MB |
| 9 | N/A | pvmem=5006MB |
| 10 | N/A | pvmem=4505MB |
| 11 | N/A | pvmem=4GB |
| 12 | N/A | not needed (default) |
Note that the product ppn*pvmem must be no greater than 20GB for 24GB nodes, or no greater than 44GB for the 48GB nodes. Jobs that specify total memory sizes greater than these values may fail to start.
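The table values can be reproduced by dividing the per-node ceiling by ppn and rounding down to a whole number of megabytes. The following is a minimal sketch of that arithmetic, assuming 1GB = 1024MB as the table entries imply; pvmem_for is a hypothetical helper name, not a PBS command:

```shell
# Sketch only: pvmem_for is a hypothetical helper, not part of PBS.
# Usage: pvmem_for <per-node ceiling in GB> <ppn>
pvmem_for() {
    # Convert the ceiling to MB (assumes 1GB = 1024MB).
    _limit_mb=$(( $1 * 1024 ))
    # Integer division rounds down, so ppn*pvmem never exceeds the ceiling.
    _per_proc_mb=$(( $_limit_mb / $2 ))
    # Report whole gigabytes when possible, otherwise megabytes.
    if [ $(( $_per_proc_mb % 1024 )) -eq 0 ]; then
        echo "pvmem=$(( $_per_proc_mb / 1024 ))GB"
    else
        echo "pvmem=${_per_proc_mb}MB"
    fi
}

pvmem_for 20 3    # 24GB 8-core node, 3 processes per node
pvmem_for 44 4    # 48GB 12-core node, 4 processes per node
```

Rounding down rather than to the nearest megabyte keeps the product ppn*pvmem at or below the 20GB or 44GB ceiling. For fully packed nodes the default soft limit already applies, so no pvmem setting is needed.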
For example, a job that requires 8 MPI processes, each with access to 10GB of memory, needs four 8-core nodes:
#PBS -l nodes=4:ppn=2
#PBS -l pvmem=10GB
#PBS -l walltime=00:30:00
cd $PBS_O_WORKDIR
mpirun -np 8 ./a.out
Interactively the command is:
qsub -I -V -q interactive -l nodes=4:ppn=2 -l pvmem=10GB -l walltime=00:30:00
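The node count in the example above follows from the per-node ceiling: at 10GB per process only two processes fit within 20GB, so 8 processes need 4 nodes. The same reasoning can be sketched as below, assuming whole-gigabyte requests; nodes_for is a hypothetical helper name used only for illustration:

```shell
# Sketch only: nodes_for is a hypothetical helper, not part of PBS.
# Usage: nodes_for <MPI processes> <GB per process> <per-node ceiling in GB> <cores per node>
nodes_for() {
    # Processes that fit on one node within the memory ceiling.
    _ppn=$(( $3 / $2 ))
    if [ "$_ppn" -lt 1 ]; then
        echo "error: one process needs more than ${3}GB" >&2
        return 1
    fi
    # Never place more processes on a node than it has cores,
    # and never more than the job actually runs.
    if [ "$_ppn" -gt "$4" ]; then _ppn=$4; fi
    if [ "$_ppn" -gt "$1" ]; then _ppn=$1; fi
    # Round up so every process gets a slot.
    _nodes=$(( ($1 + $_ppn - 1) / $_ppn ))
    echo "nodes=${_nodes}:ppn=${_ppn}"
}

nodes_for 8 10 20 8    # prints nodes=4:ppn=2
```

When the process count is not a multiple of ppn, the request contains more slots than processes; the -np value passed to mpirun should still be the number of processes the application needs.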
Requesting Large-Memory Nodes
160 Carver compute nodes have 48 GB of memory, rather than the 24 GB found on most nodes. To request these large-memory nodes, use the "bigmem" option when requesting nodes:
#PBS -l nodes=4:ppn=8:bigmem
#PBS -q regular
#PBS -l walltime=00:10:00
cd $PBS_O_WORKDIR
mpirun -np 32 ./my_big_executable
In this script, the user is requesting 4 nodes that each contain 48 GB of memory. Note that it might take longer for such a job to start, as the batch system must wait for the desired nodes to become available.
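The process count given to mpirun should match the number of slots requested (nodes times ppn). One way to avoid hard-coding it is to count the entries in the node file that the batch system provides to the job. This is a sketch that assumes a Torque-style $PBS_NODEFILE containing one line per allocated slot; count_slots is a hypothetical helper name:

```shell
# Sketch only: count_slots is a hypothetical helper.
# Assumes the node file lists one line per allocated slot (nodes x ppn lines).
count_slots() {
    # grep -c . counts non-empty lines, even if the last one lacks a newline.
    grep -c . "$1"
}

# Inside a batch job, this could replace a hard-coded -np value:
# NP=$(count_slots "$PBS_NODEFILE")
# mpirun -np "$NP" ./my_big_executable
```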
Extra-Large-Memory Node
There is a single node available that has four 8-core Nehalem EX processors (32 cores total) and 1TB memory. It is accessible through the "reg_xlmem" batch queue. Each job submitted to this queue must specify one node and no more than 8 tasks.
The batch system can schedule multiple jobs on this node, depending on available memory. Therefore, jobs submitted to this queue must specify a memory request. The amount of memory requested must be at least 4GB per task, and no more than 800GB for the entire job:
#PBS -q reg_xlmem
#PBS -l mem=300GB
#PBS -l nodes=1:ppn=4
cd $PBS_O_WORKDIR
mpirun -np 4 ./my_xlmem_executable
In the above example, each MPI process will be able to access 75GB of memory (300GB divided among 4 tasks). Note that shared-memory applications using OpenMP (or some other threading model) might be more suitable for this single node than MPI applications. Remember to use ppn=1 for "pure" shared-memory applications.
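The constraints on a request for this node (no more than 8 tasks, at least 4GB per task, no more than 800GB in total) can be checked before submitting. The following is a minimal sketch of such a check, assuming whole-gigabyte requests; check_xlmem is a hypothetical helper name and not something the batch system provides:

```shell
# Sketch only: check_xlmem is a hypothetical helper, not part of PBS.
# Usage: check_xlmem <total mem request in GB> <tasks (ppn)>
check_xlmem() {
    if [ "$2" -lt 1 ] || [ "$2" -gt 8 ]; then
        echo "error: tasks must be between 1 and 8" >&2
        return 1
    fi
    if [ "$1" -gt 800 ]; then
        echo "error: request exceeds 800GB for the entire job" >&2
        return 1
    fi
    if [ "$1" -lt $(( 4 * $2 )) ]; then
        echo "error: request is below 4GB per task" >&2
        return 1
    fi
    # Memory each task can access (integer division, rounds down).
    echo "$(( $1 / $2 ))GB per task"
}

check_xlmem 300 4    # prints 75GB per task
```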
Serial Jobs
Jobs submitted to the serial queue should specify memory requirements for efficient job scheduling. These jobs will run on 12-core, 48GB nodes. By default, a serial job will be limited to 3.5GB, but this can be increased up to 44GB through the use of the "pvmem" directive. The following job requests a single core, and 10GB of memory, for 12 hours:
#PBS -q serial
#PBS -l pvmem=10GB
#PBS -l walltime=12:00:00
cd $PBS_O_WORKDIR
./a.out
Note that the argument to pvmem must be an integer. Therefore it is sometimes necessary to use different units. For example, if you want 7.5GB, you should specify:
#PBS -l pvmem=7680MB


