
2003 User Survey Results

Hardware Resources

  • Legend
  • Satisfaction - Compute Platforms (sorted by Average Score)
  • Satisfaction - Compute Platforms (sorted by Platform)
  • Max Processors Used and Max Code Can Effectively Use
  • Satisfaction - HPSS
  • Satisfaction - Servers
  • Satisfaction - Networking
  • Summary of Hardware Comments
  • Comments on NERSC's IBM SP:   51 responses
  • Comments on NERSC's PDSF Cluster:   17 responses
  • Comments on NERSC's HPSS Storage System: 29 responses
  • Comments about NERSC's math and vis servers:   5 responses

 

Legend:

Satisfaction         Average Score
Very Satisfied       6.5 - 7
Mostly Satisfied     5.5 - 6.4
Somewhat Satisfied   4.5 - 5.4

Significance of Change
significant increase
significant decrease
not significant

 

Satisfaction - Compute Platforms

Sorted by average score

Question | No. of Responses | Average | Std. Dev. | Change from 2002 | Change from 2001
SP Overall 192 6.43 0.78 0.05 0.61
SP Uptime 191 6.42 0.83 -0.14 0.89
PDSF Overall 68 6.41 0.87 0.15 NA
PDSF Uptime 62 6.35 1.04 -0.16 NA
SP Disk Configuration and I/O Performance 156 6.15 1.03 0.18 0.48
PDSF Queue Structure 59 6.00 0.96 0.03 NA
PDSF Batch Wait Time 61 5.93 1.12 0.19 NA
PDSF Ability to Run Interactively 64 5.77 1.39 -0.41 NA
PDSF Disk Configuration and I/O Performance 59 5.69 1.15 0.06 NA
SP Queue Structure 177 5.69 1.22 -0.23 0.50
SP Ability to Run Interactively 162 5.57 1.49 0.10 0.86
SP Batch Wait Time 190 5.24 1.52 -0.17 0.32

 

Satisfaction - Compute Platforms

Sorted by Platform

Question | No. of Responses | Average | Std. Dev. | Change from 2002 | Change from 2001
SP Overall 192 6.43 0.78 0.05 0.61
SP Uptime 191 6.42 0.83 -0.14 0.89
SP Disk Configuration and I/O Performance 156 6.15 1.03 0.18 0.48
SP Queue Structure 177 5.69 1.22 -0.23 0.50
SP Ability to Run Interactively 162 5.57 1.49 0.10 0.86
SP Batch Wait Time 190 5.24 1.52 -0.17 0.32
PDSF Overall 68 6.41 0.87 0.15 NA
PDSF Uptime 62 6.35 1.04 -0.16 NA
PDSF Queue Structure 59 6.00 0.96 0.03 NA
PDSF Batch Wait Time 61 5.93 1.12 0.19 NA
PDSF Ability to Run Interactively 64 5.77 1.39 -0.41 NA
PDSF Disk Configuration and I/O Performance 59 5.69 1.15 0.06 NA

 

Max Processors Used and Max Code Can Effectively Use

Question | No. of Responses | Average | Std. Dev. | Change from 2002 | Change from 2001
SP Processors Can Use 139 609.41 1006.23 63.41 -141.59
Max SP Processors Used 161 444.84 733.31 273.84 242.84
Max PDSF Processors Used 35 13.06 43.25 -21.94 NA
PDSF Processors Can Use 34 10.26 21.37 -86.74 NA

 

Satisfaction - HPSS

Question | No. of Responses | Average | Std. Dev. | Change from 2002 | Change from 2001
Reliability 126 6.61 0.77 0.10 -0.02
Uptime 126 6.54 0.79 0.17 0.21
Performance 126 6.46 0.88 0.11 0.10
HPSS Overall 134 6.46 0.84 0.07 -0.04
User Interface 127 5.98 1.24 0.03 -0.04

 

Satisfaction - Servers

Question | No. of Responses | Average | Std. Dev. | Change from 2002 | Change from 2001
Escher 13 5.23 1.30 -0.15 0.15
Newton 15 5.20 1.37 -0.24 -0.27

 

Satisfaction - Networking

Question | No. of Responses | Average | Std. Dev. | Change from 2002 | Change from 2001
LAN 114 6.54 0.67 NA NA
WAN 100 6.12 1.02 NA NA

 

Summary of Hardware Comments

Comments on NERSC's IBM SP

[Read all 51 responses]

16   Good machine
15   Queue issues
12   Scaling comments
8   Provide more interactive and debugging resources
5   Allocation issues
4   Provide more serial resources
3   User environment issues
2   Other (down times and need faster processors)

Comments on NERSC's PDSF Cluster

[Read all 17 responses]

10   Disk, I/O and file issues
6   Batch issues
4   Good system
4   Provide more interactive and debugging resources
2   Down time issues
2   Other (slow processors, utilitarian code)

Comments on NERSC's HPSS Storage System

[Read all 29 responses]

14   Good system
7   Hard to use / user interface issues
5   Performance improvements needed
3   Authentication is difficult
2   Don't like the down times
2   Network / file transfer problems
2   Other (Grid, SRUs)

Comments about NERSC's math and vis servers

[Read all 5 responses]

3   Network connection too slow
2   Good service
1   Remote licenses

 

Comments on NERSC's IBM SP:   51 responses

Good machine:

Great! Now where is the power 4 version? But really, this is a great machine.

... I appreciate the fact the machine is rarely down.

It is an amazing machine. It has enabled us to do work that would be entirely out of reach otherwise.

This is a very useful machine for us.

... The system is so good and so well managed in just about every other way [except that there are not enough interactive resources]. On a positive note, I am very happy that NERSC opted to expand the POWER3 system rather than moving to the POWER4 or another vendor. Seaborg is very stable and, the above comment notwithstanding, very well managed. It is also large enough to do some serious work at the forefront of parallel computing. This strategy is right in line with the aims of my research group and, I believe, in line with a path that will lead to advancements in supercomputing technology. ...

Great Machine!

SP is very user friendly.

It is doing its job as expected

Nice system when everything works.

NERSC's facilities are run in impressive fashion. ...

My code uses MPI-2 one-sided primitives and I have been pleasantly surprised by the dual-plane Colony switch performance. Previously we have worked extensively on Compaq SCs with Quadrics, and while the IBM performance is not quite equal, there are far fewer penalties for running n MPI processes on n processors of a node (on the SC it is often better to run only 3 processes per 4-processor node). Although the latency of the IBM is quite high (measured to be at least 3x that of Quadrics), we can accept this penalty and still achieve good performance. ...
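
For readers who have not used them, the "MPI-2 one-sided primitives" mentioned above are remote put/get operations on an exposed memory window. The following is a minimal illustrative sketch only, written with mpi4py and NumPy for brevity (the code described in the comment presumably uses the C or Fortran MPI bindings directly):

    # Minimal MPI-2 one-sided (RMA) example: rank 0 writes into rank 1's
    # memory window without rank 1 posting a matching receive.
    # Run with, e.g., "mpiexec -n 2 python rma_sketch.py".
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    buf = np.zeros(10, dtype='i')              # memory exposed to remote access
    win = MPI.Win.Create(buf, comm=comm)       # create the RMA window

    win.Fence()                                # open an access epoch
    if rank == 0 and comm.Get_size() > 1:
        data = np.arange(10, dtype='i')
        win.Put(data, target_rank=1)           # one-sided write into rank 1
    win.Fence()                                # close the epoch; data is visible

    if rank == 1:
        print("rank 1 window contents:", buf)

    win.Free()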

Excellent! ...

The IBM SP is very powerful.

It is a very fast machine. ...

Great!

Excellent system with the most advanced performance!!!!!!!!!!

Queue issues:
long turnaround for smaller node jobs:

Batch can be very slow and seems to have gotten worse in the past year despite the increase in CPU # -- I often use the "debug" queue to get results when the job is not too big. Things seem to have worsened for users requesting 1 or 2 nodes --- perhaps because of the increasing emphasis on heavy-duty users.

Wait time for short runs with many processors sometimes takes too long.

The queue structure is highly tilted towards large jobs. While it is not as bad as at some other supercomputer centers, running jobs that only use 64-128 processors with reasonable wait times still requires extensive use of the premium queue. An 8 hour job on the regular queue using 64 processors generally requires several days of waiting time. Such low compute efficiencies are not only frustrating, they make it very difficult to use a reasonably sized allocation in a year.

Batch turnaround time is long, but this is unavoidable because of the number of users. The premium queue is necessary for running benchmark/diagnostic short-time jobs.

.... Also, the current queue structure favors many-node jobs which is a disadvantage for me. I am integrating equations of motion in time, I don't need many nodes but would like to see queue wait time go down. Currently my jobs may spend more time waiting in the queue than actually running.

The batch queues take much longer than last year (probably because there are more jobs submitted). The queue can also stall when a large, long job (>128 nodes, >8 hours) is at the head of the queue while the machine waits for enough processors to free up. Is it possible to allow small jobs to start and then checkpoint, so there are fewer wasted cycles? It would also be good if the pre_1 queue had higher priority, which could be offset by a higher cost for using it.

The job stays in the queue too long. Sometimes, I need to wait 2-3 days to run a one-hour job.

Sometimes there are so many big node jobs running that it sort of locks out everyone else.

Long turnaround / increase wall limit for "reg_1l" class:

With the old queue structure, we seldom had to wait more than a day for a job to start. With the new queue structure the wait has increased to of order a week. If this continues into FY2004, we will not be able to get our work done. ...

Waiting time for small jobs is quite frequently unreasonably long; there should also be a queue for small but non-restartable jobs with a longer-than-24-hour limit.

very long waits at low priority:

My students' complaints are usually about the waiting time in queues, especially for medium-size jobs when using priority 0.5, which is a must if we want to optimize the use of our allocation.

long waits in September:

I am very satisfied with the average batch job wait time, but I noticed that recently (as of 09/12/03) the waiting time is too long.

long wait time for large memory jobs:

... and waiting for a 64-node job on the 32 GB nodes is often not practical.

wants pre-emption capability:

... Alternatively [to providing a Linux cluster for smaller users] is there any kind of scheduling algorithm that would pre-empt long-running 4-processor jobs when the bigger users need 1000+ processors? ...

has to divide up jobs:

I have found that to get things to run, you have to break the job into smaller pieces. ...

Scaling comments:
limits to scaling:

Please inform me if the 4k MPI limit no longer exists.

... I have noticed a large increase in memory usage as the number of MPI processes is increased. This is quite a concern since very often we run in a memory-limited capacity. I would like to be able to provide further comment about the performance of the machine when using 4096 processors, but as yet I have only been able to partially run my code on this many processors. I'd like to see a focus on increasing network bandwidth and reducing latency in the next generation machine. I am not convinced that clusters of fat SMPs are an effective model unless the connecting pipes are made much larger.

I'd VERY MUCH like to see more large-memory nodes put into service, or memory upgrades for the existing nodes. Using 16 MPI tasks per node, the 1 GB/processor limit is a definite constraint for me.

... The IBM SP is capable of only coarse-grained parallelism. This limits the number of processors that it can use efficiently on our jobs. Its capabilities are now exceeded by large PC (Linux) clusters, which probably have a better price/performance ratio than the SP. Of course such comparisons are unfair, since the SP is no longer new. I should thus be comparing new clusters with Regatta class IBM machines. I am unable to do this since, on the only Regatta class machine to which I have access, I am limited to running on a single (32 processor) node.

I can't believe any one can actually make use of this machine with an efficiency level that's any more than pathetic.

general scaling comments:

We're improving our code to make use of many more processors - e.g. 300-600 within the next year.

Currently, our jobs do not yet require more than 64 processors. Our code scales quite well with number of processors up to 64.

Max. number of processors your code can effectively use per job depends on the size of the problem I am dealing with.

The max. # of processors really depends on how big a given lattice is. For next year, we may have access to 40^3x96 lattices, so I expect the max number of nodes to increase to 32, maybe more.

We could use more processors, but our current simulation size dictates that this is the most efficient use of resources. In the upcoming year we plan to use more processors, and perform some larger simulations.

too much emphasis on large number of processors:

See earlier comments on promoting super large jobs for the reasoning behind our reservations. [Although we successfully tested large jobs, I do not believe these jobs could serve our scientific goals well. I could easily see using 8 nodes of seaborg for our activation energy barrier determination jobs, but using more nodes than that would not be efficient or necessary. In fact, I see a significant threat to excellent quality supercomputing research in expanding in the direction of using more and more nodes per job. I suspect that for a good fraction of the work pursued at seaborg, although excellent, one cannot expect linear scaling to very many nodes because of the very nature of the problems handled. We believe that the 25% of resources devoted to these super large jobs is already too much.]

Please, see my comments regarding the scaling initiative. [... Although we are using state-of-the-art parallel linear algebra software (SuperLU_DIST), scaling to increase speed for a given problem has limits. Furthermore, when solving initial-value problems, the time-dimension is completely excluded from any 'domain' decomposition. Our computations typically scale well to 100-200 processors. This leaves us in a middle ground, where the problems are too large for local Linux clusters and too small to qualify for the NERSC queues that have decent throughput. My opinion is that it is unfair for NERSC to have the new priority initiative apply to all nodes of the very large flagship machine, since it weights the "trivially parallel" applications above the more challenging computations, which have required a more serious approach to achieve parallelism.]

Provide more interactive and debugging resources:

It is virtually impossible to do any interactive work on seaborg. This is a major shortcoming of its configuration, not of the IBM SP architecture. With a system of seaborg's size, it should be straightforward and not ultimately detrimental to system throughput to allow more interactive work. I usually debug interactively on a Linux cluster prior to moving an application to the SP. Often in moving it over, there are IBM-specific issues that I need to address prior to submitting a long run. Being forced to do this final bit of debugging by submitting a succession of batch jobs is not a good use of my time nor optimal for job throughput on seaborg itself. PLEASE....fix this.

I commented in the 2002 survey that interactive access to seaborg was terrible. It still is. YOU NEED TO HAVE DEDICATED _SHARED_ACCESS_ NODES TO SOLVE THIS PROBLEM!!!!! I am sick to death of trying to run Totalview and being DENIED because of "lack of available resources" error messages. GET WITH IT ALREADY!!

NERSC response: Please see the "-retry" and "-retrycount" flags to poe (man poe). These flags make your interactive job retry its submission automatically, so that you won't have to resubmit manually after a "lack of available resources" error.
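
For example (the values here are illustrative, not a NERSC recommendation): adding "-retry 30 -retrycount 20" to a poe command line asks poe to retry the node allocation every 30 seconds, up to 20 times, before giving up; see man poe for the exact semantics of these flags on seaborg.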

We are developers, so interactive and benchmarking time, even for very large node configurations, is often important.

Ability to run interactively after-hours is sometimes < desirable. Because of the dedicated nodes, it is generally satisfactory during office hours, but grad students are not restricted to office hours... :-)

... (Except the ability for interactive runs)

For interactive jobs, 30 min limit is too short, I feel. Would you increase this limit to 1 or 2 hours?

Should reduce the waiting time for the debug queue.

... Debug class should be given priority during the weekends and the nights.

NERSC response: One of the major constraints of the batch queue system on seaborg is the speed with which resources can be shifted from one use to another. This impacts how quickly we can take the system down and very directly impacts the speed at which we can provide interactive/debug resources. Resource demands for debug and interactive tend to be very spiky and our approach thus far has been to try to estimate demand based on past usage. In order to best meet future demand for debug and interactive we monitor the utilization of the resources currently devoted to the debug and interactive classes.

We also allow debug and interactive work to run anywhere in the machine if such an opportunity arises. This is somewhat at odds with the fact that utilization in the main batch pool is very high, but every little bit helps and we do what we can within the given constraints.

The hours prior to system downtimes are an excellent time to do debug cycles. The cycles lost during the hours over which the machine is drained before a downtime can be partly recovered by debug and interactive work that is allowed to proceed during that time. This is a very limited time frame, but it may be useful for users who can schedule a time to do development/debug work on their parallel codes.

Batch queue policies and system issues are often discussed at the NERSC Users Group meeting. If you feel debug/interactive classes are not working, we encourage you to participate in NUG's work to improve the system. Your ideas and suggestions for how to work within the constraints inherent in the machine are welcome.

Provide more serial resources:

I am still partly on the stage of porting the codes to the AIX system, and it would be good to have some nodes available as single processors.

For some time a number of users have been asking for you to set aside some processors for serial jobs that run for longer than 30 minutes. I also would find this useful for calculations that are not readily parallelizable and need the software environment provided by NERSC. Given the large number of processors that your system now has, this would seem to have a very minimal impact on your overall throughput. You need to be responsive to this need in your queue structure.

We need to be able to run a parallel code (e.g., an MPI code) on one processor, and to be charged for only one processor. The problem is many programs do not have serial versions, but nevertheless need to be run for small systems (on a single processor) from time to time. One solution is to have a special queue running on a few nodes (1-2) in a time-sharing fashion. Then an MPI job can always be run (no waiting time), the performance for a single-processor job will be okay, and the charge is based on a single processor.

I'd like to see a high priority queue with very long CPU time limit (days or weeks) [this project only ran 1 processor jobs]

User environment issues:

The IBM SP is not natively 64-bit, and this has created headaches which were nonexistent on Cray hardware.

My comment concerns the output, writing to a file. It happened that my program exceeded time limits. It was terminated (as it should be) and it returned no data in the output files. I lost many hours of valuable computing. Is there any way to do something about it? It would be useful to have an output even when the program exceeds the time limit. Other than that I am very satisfied with I/O performance.
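
One general workaround for this situation (a minimal sketch only; the file name and the computation are placeholders, and whether the batch system delivers a catchable warning signal before the hard kill is site-dependent) is to write and flush partial results as they are produced, so that a job terminated at the time limit still leaves usable output on disk:

    # Flush results incrementally so output survives termination at the
    # wall-clock limit; optionally stop cleanly if a warning signal arrives.
    import signal

    stop_requested = False

    def handle_term(signum, frame):
        # Some schedulers send SIGTERM (or similar) shortly before the
        # hard kill; this is a site-dependent assumption.
        global stop_requested
        stop_requested = True

    signal.signal(signal.SIGTERM, handle_term)

    with open("results.dat", "w") as out:
        for step in range(1000000):
            value = step * step                  # placeholder computation
            out.write("%d %d\n" % (step, value))
            if step % 100 == 0:
                out.flush()                      # keep the on-disk file current
            if stop_requested:
                break                            # exit with the data we have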

... Finally, for such a large system, there is frighteningly little disk space for checkpoint files, graphics dump files, etc. I have been in a situation where a job of mine is going to generate more output than $SCRATCH can accommodate. This leaves me to sit there and quickly offload files interactively to HPSS. I would like to see this rectified in the current system, if possible. Additionally, I would like this problem to be kept in mind when budgeting is done for follow-on systems.

Other:

... But I don't understand the regular down time due to maintenance. Is it possible to shutdown only the troubled nodes and keep the rest of the machine running?

Processors are getting older. You need to get new, faster systems

 

Comments on NERSC's PDSF Cluster:   17 responses

Disk, I/O and file issues:

I've only had a little trouble with the disks; I've had jobs die with uninterruptible-sleep status that is probably due to disk access problems. Also the "effective" number of CPUs I can use for a job (well, a set of jobs) is limited by disk access.

... One of the main issues I have faced as a user is disk I/O for lots of jobs running against a common dataset. The dvio resource tool helps keep things running smoothly but it has a complicated syntax/interface and limits the number of active jobs, thus slowing down the performance of my work.

... Disk vaults seem to crash quite often and it takes a long time for them to come back. Since all my data is sitting on one vault, this delays my analysis.

There is still work needed on the overall data management, coupled with I/O performance issues.

Problems with the disk to which I was writing put constraints on the number of processors that I could use at a time to run my jobs.

... situation with NFS on the datavaults not very satisfactory ...

The commodity disk vaults are not as reliable as we would like them to be. The reliability has been markedly better this past year than previously, but we still lose a few days a year to disk vault failures. In addition, the simultaneous load caps for the disk vaults are starting to impact us as we scale up our processing. ...

I think the individual machines could have larger swap space. Also the disk vault system seems to be a bit unreliable; perhaps it is NFS.

... Slow IO

get file list could be implemented

Batch issues:

It sometimes takes over a day for a job to start. ...

... a new batch queue between short and medium, say 8 hours of CPU, would be nice.

... We would also like an intermediate queue on PDSF in between short and medium. The jump between 1-hour and 24-hours is a large jump and we have a number of jobs of a few hours or less but more than 1 hour which would benefit from such an intermediate queue.

... I also understand that it is not good to allow jobs to run for too long.

... The only reason I didn't put "Very" satisfied in some of my answers above is that the LSF software has some "features" which I don't like that well (you have to write a script if you want to selectively kill a large number of jobs, for example), but I don't think it's NERSC's fault. It's a pretty good batching system overall. ...

LSF sometimes terminates my job when the network bandwidth gets slow. This is unpredictable, because it depends not only on my jobs, but other people's jobs that share the network. I'm not sure if there's a fix for this, but I'd like to know about it if there is.

Good system:

Very nice system, well maintained and running smoothly. System admins try and maintain the system mostly 'behind the scenes', which is a great relief compared to some other large-scale clusters. The smooth running of PDSF was essential to achieving our scientific goals.

A well-oiled operation!

great system. ...

pdsf is well maintained and very useful. ...

Provide more interactive and debugging resources:

We need better interactive response. ...

... Oftentimes when I am running interactively, the wait is very, very long to do anything (even if I just type "ls").

- interactive nodes overloaded - when someone is using HSI on an interactive node, the node is basically unusable ...

No problems except: - No working debugger for STAR software - Slow IO

Down time issues:

Recently I got the impression that a lot of nodes were down and therefore eating up my jobs.

... We also would like to decrease the down time.

Other:

The machines are somewhat too slow. I understand that there are cost considerations. ...

My own code is usually rather utilitarian and not run many times; KamLAND data processing is excluded from this statement.

 

Comments on NERSC's HPSS Storage System:   29 responses

Good system:

HPSS is the best mass storage I've ever used or heard of.

Better than it's ever been.

HPSS is very useful. ...

Just like with the PDSF system - very well run, with little visible interference. The Storage Group has gone out of their way to help us out doing our science. Staff has contacted us on various occasions to help optimize our usage.

Very fast and very useful.

I am very happy with the "unix-like" interface.

very good

Fast, efficient, and simple.

This system is great. Of all of the mass storage systems I have used at multiple sites around the world, this is the best.

It could be that I've been around long enough to have experienced the old storage system and thus have very low expectations, but I think this is a great system.

I have been impressed with the relative ease of use and efficiency of HPSS.

HPSS is great, ...

I like it.

I couldn't get much done without this.

Hard to use / user interface issues:

The web documentation is overwhelming for a first-time user of HPSS and HSI.

My only real substantive comment is that I find the hsi interface to be unnecessarily user-unfriendly. Why can it not be endowed with some basic UNIX shell functionality (for example: (1) a history mechanism, (2) command-line editing, (3) recall of previous commands with emacs and vi editing capability, (4) more mnemonic command names paralleling UNIX)? None of these improvements is rocket science; they could easily be implemented to make everybody's life easier.

HSI is a truly godawful tool, and useless for manipulation of data on mass storage by programs. Thus I am restricted to FTP for any work involving data on HPSS. Thank goodness there's a Perl module.
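
A minimal sketch of the scripted-FTP approach described above, using Python's standard ftplib instead of the Perl module (the host name, credentials, and paths are placeholders; NERSC's HPSS FTP interface may require site-specific host names, ports, and authentication):

    # Retrieve one file from an HPSS archive over FTP from a program.
    from ftplib import FTP

    ftp = FTP("archive.example.gov")           # placeholder HPSS FTP host
    ftp.login("username", "password")          # placeholder credentials
    ftp.cwd("/home/username/runs")             # placeholder archive directory
    with open("output.tar", "wb") as f:
        ftp.retrbinary("RETR output.tar", f.write)
    ftp.quit()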

You can't backspace when you are inside hpss, which means I either have to be very slow and precise in how I type my commands (which is a pain for navigating through multiple directory levels) or I end up typing them more than once. Can this be fixed? ...

A transparent user interface, i.e. direct use of ls & cp, would be a nice feature.

I did not use this system for quite a while. I remember the user interface was not very good before. It is not easy to get multiple files using one command.

... I haven't gotten to spend enough time with HSI 2.8 to see if it offers improvements with its scheduling.

Performance improvements needed:

... It is a little slow in fetching data, but I guess that is to be expected given the amount of data stored there.

... Otherwise, I find hpss useful and even though transfer rates are slower than my patience would like, I understand the time limitations of transferring from tape to disk and can work around the speed issue.

... the only negative is the occasional long wait to access something I have put on there a long time previously, but this is understandable. ...

... Faster file retrieval would always be nice. ...

KamLAND has worked with the HPSS people to improve throughput by taking advantage of the fact that we read out data in big chunks, but that could still be utilized further.

Authentication is difficult:

Very difficult to understand. I stopped using it because every time I have to use it again, I forget the whole Jazz of the login/password. For example, today I couldn't use it! Why not make it simpler?

Please think about the initial password setup; maybe it is possible to make this easier for the user. Once it is set up, it works really well.

I have never used it because it sounds so complicated to use it for the first time.

Don't like the down times:

HPSS's weekly maintenance occurs in the middle of the work day for both the East and the West Coast. To my eyes it would make more sense to take advantage of time zones and inconvenience fewer users. (And yes, I know this comes off as self-serving since the most likely alternative --- afternoon on the West Coast --- is a great help for those of us in the East. However, I still think the idea is sound.)

The Tuesday maintenance is always irritating, but I understand that if it needs to be done, it's better to have it scheduled during the day when people will notice, rather than at night when people's jobs will fail because they'll forget. ...

Network / file transfer problems:

I was having problems with not getting complete file transfers from HPSS to Seaborg sometimes, I think. I would have to reload from HPSS to get complete large files.
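
One way to catch truncated transfers like this immediately (a minimal sketch only; the file names are placeholders, and it assumes a checksum was recorded when the file was archived):

    # Compare the retrieved file's MD5 checksum against the value recorded
    # at archive time; a mismatch means the transfer was incomplete or corrupted.
    import hashlib

    def md5sum(path, chunk=1 << 20):
        h = hashlib.md5()
        with open(path, "rb") as f:
            while True:
                block = f.read(chunk)
                if not block:
                    break
                h.update(block)
        return h.hexdigest()

    expected = open("output.tar.md5").read().split()[0]   # recorded earlier
    if md5sum("output.tar") != expected:
        raise SystemExit("transfer incomplete or corrupted; fetch the file again")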

I transferred about 800Mb of data between the HPSS at NERSC and ORNL. Apart from some bugs in hsi which were eventually resolved, it worked well, except the link between NERSC and ORNL would die every few hours, which meant more intensive baby-sitting on my part. I don't know what would cause the link to die. It may just have been random hiccups.

Other:

Looking forward to Grid access tools.

... Further, the accounting for HPSS, i.e. SRU units, is a bit strange in that prior year information stored counts so much (Gb x 4).

 

Comments about NERSC's math and vis servers:   5 responses

Network connection too slow:

The only problem is that the network connection from Germany makes interactive visualization impractical. But I'm not sure you can do anything about this.

From my point of view as a remote user, the visualization resources are not always convenient to use or well projected to those of us in the outside world (network response times are too slow). ...

It is too slow, so I was not able to use them in the last year.

Good service:

The math server is very helpful for me.

I don't use escher as much as I'd like, but it has certainly been nice and easy to use when I have done anything on it.

Remote licenses:

... You need to better develop the floating license approach and make it easier to use for remote users.