
2006 User Survey Results

Hardware Resources

 

Legend:

Satisfaction (Average Score):
  Very Satisfied: 6.50 - 7.00
  Mostly Satisfied: 5.50 - 6.49
  Somewhat Satisfied: 4.50 - 5.49

Significance of Change:
  significant increase
  significant decrease
  not significant

 

Hardware Satisfaction - by Score

7=Very satisfied, 6=Mostly satisfied, 5=Somewhat satisfied, 4=Neutral, 3=Somewhat dissatisfied, 2=Mostly dissatisfied, 1=Very dissatisfied

Columns: Item; Num who rated this item as 1, 2, 3, 4, 5, 6, 7; Total Responses; Average Score; Std. Dev.; Change from 2005. (A worked example of the score statistics follows the table.)
HPSS: Reliability (data integrity)       2   22 69 93 6.70 0.59 -0.03
HPSS: Uptime (Availability)       1 2 29 62 94 6.62 0.59 -0.06
Jacquard: Uptime (Availability)       2 2 26 55 85 6.58 0.66 0.73
Network performance within NERSC (e.g. Seaborg to HPSS)     2 1 3 38 72 116 6.53 0.75 -0.08
NGF: Reliability       3   6 17 26 6.42 0.99  
NGF: File and Directory Operations       1 1 8 12 22 6.41 0.80  
Bassi: Uptime (Availability)     1 4 4 31 52 92 6.40 0.85  
HPSS: Overall satisfaction 1   2 1 5 37 58 104 6.38 0.96 -0.13
NGF: Overall     1 1 2 5 17 26 6.38 1.06  
NGF: Uptime   1   1 1 7 16 26 6.35 1.16  
GRID: Job Submission 1     1 1 2 14 19 6.32 1.53 -0.21
HPSS: Data transfer rates 1 1 2 2 5 33 52 96 6.29 1.10 -0.11
NERSC CVS server       1   2 4 7 6.29 1.11 0.08
Jacquard: overall   2   2 10 28 46 88 6.27 1.01 0.49
Bassi: overall 2   3 5 2 30 57 99 6.26 1.23  
GRID: Access and Authentication   1   2   6 14 23 6.26 1.29 -0.16
Seaborg: Uptime (Availability)   1 4 3 20 52 79 159 6.23 0.99 -0.33
GRID: Job Monitoring 1     2   4 13 20 6.20 1.54 -0.30
GRID: File Transfer     2 1 1 5 13 22 6.18 1.30 -0.10
NGF: I/O Bandwidth     1   3 9 10 23 6.17 0.98  
Seaborg: overall 1   4 7 17 75 64 168 6.10 1.00 0.18
Bassi: Disk configuration and I/O performance 1 1 1 5 2 36 30 76 6.08 1.16  
HPSS: Data access time 1 1 3 2 9 37 38 91 6.08 1.17 0.08
DaVinci: overall   2   1 3 9 15 30 6.07 1.36 0.42
Jacquard: Disk configuration and I/O performance   1 1 8 3 25 30 68 6.06 1.16 0.18
PDSF: Batch queue structure     1 3 3 20 11 38 5.97 0.97 -0.03
Jacquard: Batch queue structure   1 3 6 7 34 28 79 5.95 1.14 0.49
PDSF: Batch wait time     1 3 5 18 12 39 5.95 1.00 0.15
Seaborg: Disk configuration and I/O performance 1 1 4 13 13 55 49 136 5.92 1.19 -0.14
Bassi: Batch queue structure 1   2 9 7 38 29 86 5.92 1.16  
Remote network performance to/from NERSC (e.g. Seaborg to your home institution) 1 5 10 4 19 64 63 166 5.89 1.33 -0.24
Jacquard: Batch wait time 1   3 5 10 40 23 82 5.87 1.13 0.71
Bassi: Batch wait time     3 7 16 40 25 91 5.85 1.02  
PDSF: Uptime (availability)     4 3 6 12 17 42 5.83 1.31 -0.06
HPSS: User interface (hsi, pftp, ftp) 1 1 4 7 14 35 33 95 5.83 1.26 -0.29
PDSF: Overall satisfaction   1 3 1 4 23 11 43 5.81 1.20 -0.19
Seaborg: Batch queue structure 1 4 5 13 21 61 48 153 5.77 1.27 0.72
Jacquard: Ability to run interactively   2 2 9 7 25 23 68 5.76 1.29 0.20
Seaborg: Ability to run interactively   2 10 13 19 42 45 131 5.71 1.32 0.18
Bassi: Ability to run interactively 2 4 3 8 8 25 25 75 5.55 1.60  
PDSF: Ability to run interactively 1 1 1 4 11 17 6 41 5.39 1.30 -0.40
PDSF: Disk configuration and I/O performance 1   7 5 6 13 7 39 5.10 1.54 -0.04
Seaborg: Batch wait time 6 5 27 11 35 56 19 159 4.94 1.57 0.99
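
As a check on the table above, the Average Score and Std. Dev. columns can be reproduced from the per-rating counts. The short Python sketch below is ours (the helper name score_stats does not come from the survey); it assumes the average is the weighted mean of the 1-7 ratings and the deviation is the sample (n-1) standard deviation, reading blank count cells as zero. It reproduces the HPSS Reliability row (93 responses, 6.70, 0.59).

    from math import sqrt

    def score_stats(counts):
        # counts[i] = number of users who gave rating i + 1, for i = 0..6
        n = sum(counts)
        mean = sum((i + 1) * c for i, c in enumerate(counts)) / n
        var = sum(c * ((i + 1) - mean) ** 2 for i, c in enumerate(counts)) / (n - 1)
        return n, mean, sqrt(var)

    # HPSS: Reliability row, reading blank cells as zero: 2 fours, 22 sixes, 69 sevens
    n, mean, sd = score_stats([0, 0, 0, 2, 0, 22, 69])
    print(n, round(mean, 2), round(sd, 2))   # -> 93 6.7 0.59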

 

Hardware Satisfaction - by Platform

7=Very satisfied, 6=Mostly satisfied, 5=Somewhat satisfied, 4=Neutral, 3=Somewhat dissatisfied, 2=Mostly dissatisfied, 1=Very dissatisfied

Columns: Item; Num who rated this item as 1, 2, 3, 4, 5, 6, 7; Total Responses; Average Score; Std. Dev.; Change from 2005
IBM POWER 5 p575: Bassi
Bassi: Uptime (Availability)     1 4 4 31 52 92 6.40 0.85  
Bassi: overall 2   3 5 2 30 57 99 6.26 1.23  
Bassi: Disk configuration and I/O performance 1 1 1 5 2 36 30 76 6.08 1.16  
Bassi: Batch queue structure 1   2 9 7 38 29 86 5.92 1.16  
Bassi: Batch wait time     3 7 16 40 25 91 5.85 1.02  
Bassi: Ability to run interactively 2 4 3 8 8 25 25 75 5.55 1.60  
CVS Server
CVS server       1   2 4 7 6.29 1.11 0.08
SGI Altix: DaVinci
DaVinci: overall   2   1 3 9 15 30 6.07 1.36 0.42
Grid Services
GRID: Job Submission 1     1 1 2 14 19 6.32 1.53 -0.21
GRID: Access and Authentication   1   2   6 14 23 6.26 1.29 -0.16
GRID: Job Monitoring 1     2   4 13 20 6.20 1.54 -0.30
GRID: File Transfer     2 1 1 5 13 22 6.18 1.30 -0.10
Archival Mass Storage: HPSS
HPSS: Reliability (data integrity)       2   22 69 93 6.70 0.59 -0.03
HPSS: Uptime (Availability)       1 2 29 62 94 6.62 0.59 -0.06
HPSS: Overall satisfaction 1   2 1 5 37 58 104 6.38 0.96 -0.13
HPSS: Data transfer rates 1 1 2 2 5 33 52 96 6.29 1.10 -0.11
HPSS: Data access time 1 1 3 2 9 37 38 91 6.08 1.17 0.08
HPSS: User interface (hsi, pftp, ftp) 1 1 4 7 14 35 33 95 5.83 1.26 -0.29
Opteron/Infiniband Linux Cluster: Jacquard
Jacquard: Uptime (Availability)       2 2 26 55 85 6.58 0.66 0.73
Jacquard: overall   2   2 10 28 46 88 6.27 1.01 0.49
Jacquard: Disk configuration and I/O performance   1 1 8 3 25 30 68 6.06 1.16 0.18
Jacquard: Batch queue structure   1 3 6 7 34 28 79 5.95 1.14 0.49
Jacquard: Batch wait time 1   3 5 10 40 23 82 5.87 1.13 0.71
Jacquard: Ability to run interactively   2 2 9 7 25 23 68 5.76 1.29 0.20
NERSC Network
Network performance within NERSC (e.g. Seaborg to HPSS)     2 1 3 38 72 116 6.53 0.75 -0.08
Remote network performance to/from NERSC (e.g. Seaborg to your home institution) 1 5 10 4 19 64 63 166 5.89 1.33 -0.24
NERSC Global Filesystem
NGF: Reliability       3   6 17 26 6.42 0.99  
NGF: File and Directory Operations       1 1 8 12 22 6.41 0.80  
NGF: Overall     1 1 2 5 17 26 6.38 1.06  
NGF: Uptime   1   1 1 7 16 26 6.35 1.16  
NGF: I/O Bandwidth     1   3 9 10 23 6.17 0.98  
Linux Cluster: PDSF
PDSF: Batch queue structure     1 3 3 20 11 38 5.97 0.97 -0.03
PDSF: Batch wait time     1 3 5 18 12 39 5.95 1.00 0.15
PDSF: Uptime (availability)     4 3 6 12 17 42 5.83 1.31 -0.06
PDSF: Overall satisfaction   1 3 1 4 23 11 43 5.81 1.20 -0.19
PDSF: Ability to run interactively 1 1 1 4 11 17 6 41 5.39 1.30 -0.40
PDSF: Disk configuration and I/O performance 1   7 5 6 13 7 39 5.10 1.54 -0.04
IBM POWER 3: Seaborg
Seaborg: Uptime (Availability)   1 4 3 20 52 79 159 6.23 0.99 -0.33
Seaborg: overall 1   4 7 17 75 64 168 6.10 1.00 0.18
Seaborg: Disk configuration and I/O performance 1 1 4 13 13 55 49 136 5.92 1.19 -0.14
Seaborg: Batch queue structure 1 4 5 13 21 61 48 153 5.77 1.27 0.72
Seaborg: Ability to run interactively   2 10 13 19 42 45 131 5.71 1.32 0.18
Seaborg: Batch wait time 6 5 27 11 35 56 19 159 4.94 1.57 0.99

 

Hardware Comments:   37 responses

 

Overall Hardware Comments:   12 responses

Need more resources

Hopefully Franklin will fix the long queue wait times.

Please get more computers.

Please assign more space of hardware to its users.

Queue comments

The queue structure could use some improvement; sometimes jobs requiring many nodes make the queue slow, but I am sure that you are looking into this.

I run the CCSM model. The model runs on a relatively small number of processors for a very long time. For example, we use 248 processors on Bassi. On Seaborg, we could potentially get one model year per wallclock day. Since we usually run 130-year simulations, if we had 248 processors continuously, it would take 4.5 months to run the model. We didn't get even close to that. Our last Seaborg run took 15 months of real time, which is intolerably slow.

Bassi runs faster. On Bassi, we get roughly 10 model years per wallclock day, a nice number. So it's cheaper for us to run on Bassi, and better. But Bassi is down more frequently, and I get more machine-related errors when running on it.

On both machines, your queue structure does not give us the priority that we need to get the throughput that we have been allocated. For now it's working because Bassi isn't heavily loaded. But as others leave Seaborg behind and move onto Bassi, the number of slots we get in the queue will go down, and we'll find ourselves unable to finish model runs in a timely fashion again.

Submission of batch jobs is not well documented.

The most unsatisfactory part for me is the confusing policy for queueing submitted jobs. In an ideal world, it should be first come, first served, with some reasonable constraints. However, I frequently find my jobs waiting for days and weeks without knowing why. Other jobs of similar types, or even those with lower priority, sometimes jump ahead and run instantaneously. This makes rational planning of the project and account management almost impossible. I assume most of us are not trained as computer scientists with special skills who can find loopholes or know how to take advantage of the system. We only need our projects to proceed as planned.

Good overall

Bassi is great. The good network connectivity within NERSC and to the outside world and the reliability of HPSS make NERSC my preferred platform for post-processing very large scale runs.

Hardware resources at NERSC are the best I have used anywhere. NERSC, and in particular Dr. Horst Simon, should be congratulated for setting up and running what is certainly one of the best supercomputing facilities in the world.

Other comments

NERSC could have a clearer and fairer computing time reimbursement/refund policy. For example (Reference Number 061107-000061 for online consulting), on 11/07/2006 I had a batch job on Bassi interrupted by a node failure. LoadLeveler automatically restarted the batch job from the beginning, overwriting all the output files produced before the node failure. Later I requested a refund of the 1896 MPP hours wasted in that incident due to the Bassi node failure, but my request was denied, which I think is unfair.

I have not done extensive comparison on I/O and network performance. Hopefully, next year I'll be able to provide more useful information here.

Every NERSC head node should be running GridFTP.

Every NERSC queuing node should be running GT4 GRAM.

 

Comments by Bassi Users:   7 responses

Charge factor

The charge factor of 6 for Bassi is absolutely ridiculous compared to Jacquard; it performs only half as well as Jacquard.

We have consistently found (and NERSC consultants have confirmed) a speedup factor of 2 for Bassi relative to Seaborg on our production code. Because the charge factor is 6, and because we see a speedup of 3 on Jacquard, Bassi is currently not an attractive platform for us, except for extremely large and time-sensitive jobs.

Queue comments

I really like Bassi; however, using Bassi for multiple small jobs is difficult, since only 3 jobs from a user can run at a time. This is hard to deal with when I have many of these jobs, even when the queues are rather small.

I don't understand why Bassi has a restriction on using a large number of nodes (i.e., more than 48 nodes requires special arrangement).

Disk storage comments

Scratch space is small. My quota is 256 GB. The simulations we are currently running are on a 2048^3 grid and we solve for 3 real variables per grid point, giving a total of 96 GB per restart dataset. After 6 hours of running (the maximum walltime on Bassi), we continue from a restart dataset. But sometimes we need to do checkpointing (i.e. generate the restart files) halfway through the simulation. This means being able to hold 3 datasets (initial conditions, the halfway checkpoint, and the final state), which is not possible. Moreover, for simulations of more scientific interest we solve for 5 variables per grid point; the restart dataset in this case is 160 GB. This means that we cannot run, checkpoint, and continue. This quota also prevents fast postprocessing of the data when several realizations of the fields (many datasets) are needed to get reliable statistical results.
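
As an aside, the dataset sizes quoted in the comment above are consistent with 4-byte (single-precision) values; the precision is our assumption, not something the comment states. A quick check in Python:

    grid_points = 2048 ** 3                 # 2048^3 grid
    bytes_per_value = 4                     # assumed single precision (not stated in the comment)
    for n_vars in (3, 5):
        size_gib = grid_points * n_vars * bytes_per_value / 2 ** 30
        print(n_vars, "variables:", size_gib, "GiB")   # 3 -> 96.0 GiB, 5 -> 160.0 GiB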

I could not unzip library source code on Bassi because it limited the number of subdirectories I could create. That machine is useless to me unless I can get Boost installed. ...

Login problems

Logon behavior on Bassi can be inconsistent, with good passwords sometimes being rejected and then accepted on the next attempt. Molpro does not generally work well on multiple nodes. This is not too much of a problem on Bassi, as there are 8 processors per node, but better scaling with respect to the number of nodes is possible for this code.

 

Comments by DaVinci Users:   2 responses

I use multiple processors on DaVinci for computations with MATLAB. The multiple processors and rather fast computation are extremely useful for my research projects on climate and ice sheet dynamics. Via DaVinci NERSC has been a huge help to my research program.

... The machine I have been able to effectively use is Davinci because it has Intel compilers. NERSC support has not been helpful at all in getting my software to run on various machines.

 

Comments by Jacquard Users:   2 responses

On jacquard, it might be nice to make it easier for users who want to submit a large number of single-processor jobs as opposed to a few massively parallel jobs. This is possible but in the current configuration, the user has to manually write code to submit a batch job, ssh to all the assigned nodes, and start the jobs manually. Perhaps that is intentional, but the need does arise, for instance when it is possible to divide a task such that it can be run as 1000 separate jobs which do not need to communicate.
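
To illustrate the workaround described in the comment above, here is a minimal sketch (ours, not an official NERSC recipe) of how such a fan-out could look inside a batch job on a PBS-style scheduler, which exposes the assigned hosts through $PBS_NODEFILE; the serial_task command and input file names are hypothetical. A scheduler-level task-farming facility would of course be cleaner, which seems to be what the commenter is asking for.

    import os
    import subprocess

    # PBS-style schedulers list the assigned hosts (one entry per slot) in $PBS_NODEFILE.
    with open(os.environ["PBS_NODEFILE"]) as f:
        nodes = sorted(set(line.strip() for line in f))

    # Hypothetical independent serial tasks, one per node.
    tasks = ["./serial_task input_%04d.dat" % i for i in range(len(nodes))]

    # Launch one task per node in the background over ssh, then wait for all of them.
    procs = [subprocess.Popen(["ssh", node, cmd]) for node, cmd in zip(nodes, tasks)]
    for p in procs:
        p.wait()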

Jacquard is much harder to use than the IBM SPs...

 

Comments on Network Performance:   3 responses

We produce output files faster than we can transfer them to our home institution, even using compression techniques. This is usually not an issue, but it has been recently.

Network performance to HPSS seems a bit slower than to resources such as Jacquard. Not sure how much of a hit this actually is; just an impression.

It is quite possible that I am unaware of a better alternative, but using BBFTP to transfer files to/from Bassi from/to NSF centers I see data rates of only 30-40 MB/sec. This isn't really adequate for the volume of data that we need to move. For example, I can regularly achieve 10x this rate between major elements of the NSF Teragrid. And that isn't enough either!
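
To put those rates in perspective, here is a rough back-of-the-envelope estimate in Python; the 96 GB figure is the restart-dataset size mentioned in the Bassi comments above, and the 35 MB/s and 10x numbers are taken from this comment.

    dataset_gb = 96                        # restart dataset size quoted in the Bassi comments
    for rate_mb_s in (35, 350):            # ~35 MB/s observed via BBFTP; ~10x that on the TeraGrid
        minutes = dataset_gb * 1000 / rate_mb_s / 60
        print(rate_mb_s, "MB/s:", round(minutes), "minutes")   # ~46 min vs ~5 min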

 

Comments about Storage:   3 responses

... The HPSS hardware seems great, but the ability to access it is terrible [see comments in software section].

The largest restriction for us is usually disk and storage; we have been able to work with consulting to make special arrangements for our needs, which have been very helpful.

The low inode quota is a real pain.

 

Comments by PDSF Users:   6 responses

Disk servers at PDSF are faring reasonably well, but occasional crashes/outages occur. The move to GPFS has made the disks more reliable, but occasional crashes still occur. These sometimes mean that PDSF is unavailable for certain tasks for up to several days (depending on the severity of the crash). This should be an area of continued focus and attention.

My biggest problem with using PDSF has always been that the nodes regularly just freeze, even on something as simple as an ls command. Typically I think this is because some user is hammering a disk I am accessing. This affects the overall usability of the nodes and can be very frustrating. My officemates all use PDSF, and we regularly inform each other about the performance of PDSF to decide whether it is worth trying to connect to the system at all or if it would be better to wait until later.

Interactive use on PDSF is often too slow.

I think that the NGF and the general choice for GPFS is a great improvement over the previous NFS-based systems. I am worried that in recent months we have seen the performance of the PDSF home FS drop significantly.

The switch to GPFS from NFS on PDSF seems to be overall a good thing, but there are now occasional long delays or unavailability of the home disks that I don't like and don't understand...

PDSF has way too few SSH gateway systems, and they seem to be selected by round-robin DNS aliasing; thus it is entirely possible to end up on a host with a load already approaching 10 while there are still machines doing absolutely nothing. What I normally do nowadays is look at the Ganglia interface for PDSF and manually log in to the machine with the smallest load. There is a definite need for proper load balancing here! Also, it may make sense to separate the interactive machines into strict gateways (oriented toward minimal connection latency, with very limited number-crunching privileges) and interactive-job boxes (the opposite).
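
The manual procedure described above (check Ganglia, then log in to the least-loaded gateway) can be approximated by a small script. The sketch below is ours, the gateway hostnames are placeholders rather than real PDSF hosts, and it simply queries each candidate's one-minute load average over ssh before connecting to the lowest.

    import subprocess

    # Placeholder hostnames; these are not real PDSF gateway names.
    gateways = ["gw1.example.org", "gw2.example.org", "gw3.example.org"]

    def load_1min(host):
        # 1-minute load average as reported by /proc/loadavg on the remote host
        out = subprocess.check_output(["ssh", host, "cat", "/proc/loadavg"])
        return float(out.decode().split()[0])

    best = min(gateways, key=load_1min)
    print("least-loaded gateway:", best)
    subprocess.call(["ssh", best])          # open an interactive session on that host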

 

Comments by Seaborg Users:   3 responses

Seaborg is a little slow, but that is to be expected. The charge factors on the newer, faster machines are dauntingly high.

Our project relies primarily on our ability to submit parallel jobs to the batch queue on Seaborg. To that end, the current setup is more than adequate.

There have been persistent problems with passwords (to Seaborg) being reset or deactivated. In one case my password was deactivated but I was not informed (via email or otherwise). This may have been the result of a security breach at our home institute. Several hours were lost trying to regain access to NERSC.