
2007/2008 User Survey Results

Hardware Resources

  • Legend
  • Hardware Satisfaction - by Score
  • Hardware Satisfaction - by Platform
  • Hardware Comments

 

Legend:

Satisfaction               Average Score
Very Satisfied             6.50 - 7.00
Mostly Satisfied - High    6.00 - 6.49
Mostly Satisfied - Low     5.50 - 5.99
Somewhat Satisfied         4.50 - 5.49
Neutral                    3.50 - 4.49
Significance of Change
significant increase
significant decrease
not significant

 

Hardware Satisfaction - by Score

7=Very satisfied, 6=Mostly satisfied, 5=Somewhat satisfied, 4=Neutral, 3=Somewhat dissatisfied, 2=Mostly dissatisfied, 1=Very dissatisfied

Item    Num who rated this item as: 1  2  3  4  5  6  7    Total Responses    Average Score    Std. Dev.    Change from 2006
(Blank rating cells indicate zero responses; a blank "Change from 2006" entry means the item was not rated on the 2006 survey.)
NGF: Reliability       1 1 16 47 65 6.68 0.59 0.25
NGF: Uptime       1 1 17 47 66 6.67 0.59 0.32
HPSS: Reliability (data integrity)     1 3 4 29 111 148 6.66 0.70 -0.04
Network performance within NERSC (e.g. Seaborg to HPSS)       4 4 49 111 168 6.59 0.66 0.06
Jacquard: Uptime (Availability)       4 6 34 82 126 6.54 0.73 -0.04
HPSS: Uptime (Availability)     1 3 6 45 96 151 6.54 0.73 -0.08
Seaborg: Uptime (Availability)       4 7 38 89 138 6.54 0.73 0.30
Bassi: Uptime (Availability) 1   1 5 6 48 122 183 6.54 0.84 0.13
HPSS: Overall satisfaction 1   2 5 7 45 103 163 6.46 0.92 0.08
NGF: Overall       3 5 20 41 69 6.43 0.81 0.05
HPSS: Data transfer rates     2 4 16 39 88 149 6.39 0.88 0.10
Seaborg: overall       5 12 55 71 143 6.34 0.78 0.25
GRID: Access and Authentication     2 1 1 15 23 42 6.33 1.00 0.07
HPSS: Data access time   1 2 4 13 52 77 149 6.31 0.92 0.23
GRID: Job Submission     1 2 2 12 20 37 6.30 1.00 -0.02
Jacquard: overall 1   2 4 12 49 66 134 6.26 0.98 -0.01
NGF: File and Directory Operations       7 1 23 29 60 6.23 0.96 -0.18
PDSF: Uptime (availability)     2 6 8 25 36 77 6.13 1.06 0.30
GRID: File Transfer 1   2 1 1 18 19 42 6.12 1.27 -0.06
PDSF: Overall satisfaction     2 7 8 27 36 80 6.10 1.06 0.29
GRID: Job Monitoring     2 2 4 12 17 37 6.08 1.14 -0.12
NGF: I/O Bandwidth     3 5 3 28 26 65 6.06 1.09 -0.11
Bassi: Disk configuration and I/O performance     2 18 15 61 67 163 6.06 1.03 -0.02
Remote network performance to/from NERSC (e.g. Seaborg to your home institution)   5 11 9 18 69 101 213 6.06 1.25 0.17
Franklin: Batch queue structure 1 2 4 18 28 101 96 250 6.03 1.08  
Seaborg: Batch queue structure 1 3 2 8 13 56 51 134 5.99 1.19 0.22
Seaborg: Disk configuration and I/O performance     2 15 15 40 50 122 5.99 1.09 0.07
Jacquard: Disk configuration and I/O performance   1 3 11 10 46 43 114 5.98 1.11 -0.08
Bassi: overall 1 1 7 7 30 74 67 187 5.96 1.11 -0.30
HPSS: User interface (hsi, pftp, ftp) 2 1 10 8 14 47 68 150 5.96 1.35 0.13
Seaborg: Ability to run interactively   1 4 15 8 37 48 113 5.95 1.22 0.24
Jacquard: Ability to run interactively     2 12 15 28 38 95 5.93 1.12 0.16
Jacquard: Batch queue structure 1   4 9 18 49 43 124 5.92 1.13 -0.03
PDSF: Batch queue structure     1 12 7 22 26 68 5.88 1.15 -0.09
Franklin: Batch wait time 2 1 14 20 30 104 87 258 5.85 1.22  
PDSF: Batch wait time     4 12 6 28 21 71 5.70 1.22 -0.24
Franklin: overall 1 8 14 14 47 102 76 262 5.70 1.29  
Bassi: Ability to run interactively 4 2 6 18 23 44 50 147 5.63 1.46 0.08
Franklin: Ability to run interactively 1 4 10 31 36 51 65 198 5.58 1.37  
Bassi: Batch queue structure 2 2 9 26 25 66 46 176 5.57 1.32 -0.35
PDSF: Ability to run interactively   1 7 10 8 24 21 71 5.55 1.38 0.16
PDSF: Disk configuration and I/O performance 1 1 3 9 13 25 17 69 5.54 1.32 0.43
Seaborg: Batch wait time 2 2 8 12 33 47 34 138 5.53 1.32 0.59
Jacquard: Batch wait time 2 3 13 6 28 40 34 126 5.47 1.46 -0.40
Franklin: Disk configuration and I/O performance 8 11 20 36 34 73 51 233 5.15 1.63  
Franklin: Uptime (Availability) 7 11 48 12 46 86 47 257 5.04 1.64  
Bassi: Batch wait time 11 19 36 16 33 46 22 183 4.46 1.80 -1.39
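
As a reading aid, the summary columns above can be reproduced from the per-rating counts. The short Python sketch below checks the "NGF: Reliability" row, whose 65 responses are 1, 1, 16, and 47 at ratings 4 through 7. It assumes the reported Std. Dev. is the sample (n-1) standard deviation, which is the convention that reproduces the published 0.59 for this row.

    # Reproduce the Average Score and Std. Dev. columns from the rating counts,
    # using the "NGF: Reliability" row as a check (blank cells are zeros).
    from statistics import mean, stdev

    counts = {4: 1, 5: 1, 6: 16, 7: 47}           # rating -> number of responses

    scores = [r for r, n in counts.items() for _ in range(n)]

    print(len(scores))                            # 65   (Total Responses)
    print(round(mean(scores), 2))                 # 6.68 (Average Score)
    print(round(stdev(scores), 2))                # 0.59 (Std. Dev., sample n-1)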

 

Hardware Satisfaction - by Platform

7=Very satisfied, 6=Mostly satisfied, 5=Somewhat satisfied, 4=Neutral, 3=Somewhat dissatisfied, 2=Mostly dissatisfied, 1=Very dissatisfied

Item    Num who rated this item as: 1  2  3  4  5  6  7    Total Responses    Average Score    Std. Dev.    Change from 2006
(Blank rating cells indicate zero responses; a blank "Change from 2006" entry means the item was not rated on the 2006 survey.)
Bassi - IBM POWER5 p575
Bassi: Uptime (Availability) 1   1 5 6 48 122 183 6.54 0.84 0.13
Bassi: Disk configuration and I/O performance     2 18 15 61 67 163 6.06 1.03 -0.02
Bassi: overall 1 1 7 7 30 74 67 187 5.96 1.11 -0.30
Bassi: Ability to run interactively 4 2 6 18 23 44 50 147 5.63 1.46 0.08
Bassi: Batch queue structure 2 2 9 26 25 66 46 176 5.57 1.32 -0.35
Bassi: Batch wait time 11 19 36 16 33 46 22 183 4.46 1.80 -1.39
Franklin - Cray XT4
Franklin: Batch queue structure 1 2 4 18 28 101 96 250 6.03 1.08  
Franklin: Batch wait time 2 1 14 20 30 104 87 258 5.85 1.22  
Franklin: overall 1 8 14 14 47 102 76 262 5.70 1.29  
Franklin: Ability to run interactively 1 4 10 31 36 51 65 198 5.58 1.37  
Franklin: Disk configuration and I/O performance 8 11 20 36 34 73 51 233 5.15 1.63  
Franklin: Uptime (Availability) 7 11 48 12 46 86 47 257 5.04 1.64  
Grid Services
GRID: Access and Authentication     2 1 1 15 23 42 6.33 1.00 0.07
GRID: Job Submission     1 2 2 12 20 37 6.30 1.00 -0.02
GRID: File Transfer 1   2 1 1 18 19 42 6.12 1.27 -0.06
GRID: Job Monitoring     2 2 4 12 17 37 6.08 1.14 -0.12
HPSS - Mass Storage System
HPSS: Reliability (data integrity)     1 3 4 29 111 148 6.66 0.70 -0.04
HPSS: Uptime (Availability)     1 3 6 45 96 151 6.54 0.73 -0.08
HPSS: Overall satisfaction 1   2 5 7 45 103 163 6.46 0.92 0.08
HPSS: Data transfer rates     2 4 16 39 88 149 6.39 0.88 0.10
HPSS: Data access time   1 2 4 13 52 77 149 6.31 0.92 0.23
HPSS: User interface (hsi, pftp, ftp) 2 1 10 8 14 47 68 150 5.96 1.35 0.13
Jacquard - Opteron/Infiniband Linux Cluster
Jacquard: Uptime (Availability)       4 6 34 82 126 6.54 0.73 -0.04
Jacquard: overall 1   2 4 12 49 66 134 6.26 0.98 -0.01
Jacquard: Disk configuration and I/O performance   1 3 11 10 46 43 114 5.98 1.11 -0.08
Jacquard: Ability to run interactively     2 12 15 28 38 95 5.93 1.12 0.16
Jacquard: Batch queue structure 1   4 9 18 49 43 124 5.92 1.13 -0.03
Jacquard: Batch wait time 2 3 13 6 28 40 34 126 5.47 1.46 -0.40
NERSC Network
Network performance within NERSC (e.g. Seaborg to HPSS)       4 4 49 111 168 6.59 0.66 0.06
Remote network performance to/from NERSC (e.g. Seaborg to your home institution)   5 11 9 18 69 101 213 6.06 1.25 0.17
NGF - NERSC Global Filesystem
NGF: Reliability       1 1 16 47 65 6.68 0.59 0.25
NGF: Uptime       1 1 17 47 66 6.67 0.59 0.32
NGF: Overall       3 5 20 41 69 6.43 0.81 0.05
NGF: File and Directory Operations       7 1 23 29 60 6.23 0.96 -0.18
NGF: I/O Bandwidth     3 5 3 28 26 65 6.06 1.09 -0.11
PDSF - Linux Cluster (Parallel Distributed Systems Facility)
PDSF: Uptime (availability)     2 6 8 25 36 77 6.13 1.06 0.30
PDSF: Overall satisfaction     2 7 8 27 36 80 6.10 1.06 0.29
PDSF: Batch queue structure     1 12 7 22 26 68 5.88 1.15 -0.09
PDSF: Batch wait time     4 12 6 28 21 71 5.70 1.22 -0.24
PDSF: Ability to run interactively   1 7 10 8 24 21 71 5.55 1.38 0.16
PDSF: Disk configuration and I/O performance 1 1 3 9 13 25 17 69 5.54 1.32 0.43
Seaborg - IBM POWER3 (Decommissioned January 2008)
Seaborg: Uptime (Availability)       4 7 38 89 138 6.54 0.73 0.30
Seaborg: overall       5 12 55 71 143 6.34 0.78 0.25
Seaborg: Batch queue structure 1 3 2 8 13 56 51 134 5.99 1.19 0.22
Seaborg: Disk configuration and I/O performance     2 15 15 40 50 122 5.99 1.09 0.07
Seaborg: Ability to run interactively   1 4 15 8 37 48 113 5.95 1.22 0.24
Seaborg: Batch wait time 2 2 8 12 33 47 34 138 5.53 1.32 0.59

 

Hardware Comments:   48 responses

  • 13 Franklin Stability comments
  • 11 Franklin Performance comments
  • 7 Happy with Franklin
  • 5 Franklin User Environment comments
  • 5 comments about HPSS
  • 4 comments about scheduled downs
  • 3 comments by Bassi Users
  • 2 comments about the NERSC Global Filesystem / shared file storage
  • 2 queue configuration and wait comments
  • 2 comments on Network Performance
  • 2 comments by PDSF Users
  • 1 comment by a DaVinci user
  • 2 other comments

 

Franklin stability comments:   13 responses

Franklin is not very stable. Now it is better, but any future upgrade might make it unstable again for a long time.

Aside from the Franklin unreliability January-April, the rest of the systems are excellent. The Franklin problem appears to have been solved, overall, but the Lustre file system access can vary by a factor of 10, it seems. If something could be done to reduce the periods of inaccessible /scratch on Franklin, it would be very appreciated.

Things I would like to see improved: ... 2) Franklin going down for "unscheduled maintenance" often, although I am happy to see its up-time has been much improved recently.

The unscheduled downtimes for Franklin are seriously impacting our development. We need to be pushing our codes to 4000+ processors and Franklin is our only resource for this work. I also don't like being told I need to use my cycles in a timely fashion, while also being told that the machine is often down, and when it comes up the queues are deeply loaded.

Jacquard looks much more stable than Franklin

To date, Franklin, the Cray XT4, has not achieved the level of reliability I have come to expect from other NERSC systems. Although I am not certain of the hardware details, the problem seems to be a result of the inadequacy of the LUSTRE file system to deal with the volume of output from many simultaneously running massively parallel executables. This has resulted in slowdowns (on both interactive and batch nodes), unexpected and unrepeatable code failures, corruption of output, and far too many unscheduled system outages. This is especially disappointing given that Franklin is the only operating NERSC machine with the number of nodes necessary to accommodate many MPP users at once.

... Franklin is often nearly unusable. It's only tolerable when I remember that the alternative is ORNL's machine! ...

compared to the other NERSC systems (Seaborg, Bassi, and Jacquard), I am somewhat dissatisfied by the uptime of Franklin, but compared to the other Cray XT3/4 system I have access to (at ORNL), I am quite satisfied with the uptime of Franklin. ...

NERSC remains a great resource for scientific computing, and our group relies on it very much for computational research. However, the Franklin system has been a step backwards in terms of reliability. This has contributed to very long queues on Bassi, which is probably serving more types of computations than what was intended when purchased.

I put "mostly dissatisfied" for the I/O performance on Franklin due to regular failures that we experienced when writing large restart files when running our code. However, when it does work, the I/O speeds are very high.

Franklin is fantastic when it's working.

I have had some data integrity problems on Franklin's /scratch. I also had a problem where I worked for four hours editing code on Franklin, wrote the file many times to disk (in my home directory), and then Franklin went down. When it came up, the file was empty, zero bytes. Very frustrating, but nothing anyone could do for me afterwards.

franklin has a lot of promise to be the new NERSC workhorse (especially after the quadcore upgrade). Hopefully the filesystem will catch up to the rest of the machine soon :-)

 

Franklin performance comments:   11 responses

The Franklin login nodes are painfully slow at times. This is a real show stopper, as a simple ls can take 3-4 minutes at times.

Franklin's login nodes are not very responsive / slow most of the time.

The command to view the queue on Franklin takes a ridiculous amount of time to execute. Sometimes it's 10 seconds or so. Also the file system access time seems pretty slow on Franklin. I don't know if these two things are related.

Things I would like to see improved: 1)Franklin frequently has slow command line responsiveness. ...

the lustre file system on franklin is problematic. sometimes i have to wait to vi a small text file or do an ls. i have also seen problems where a running job cannot find a file and simply resubmitting the job with no changes works. the login nodes on franklin are often very slow.

Sometimes response is very slow. For example, the ls command takes a few seconds. This is frustrating.

for the codes I use the Franklin charge factor is twice what it should be in relation to both Bassi and Jacquard.

Our model runs on Franklin and uses a standard NetCDF library for I/O and frequently times out when opening (or, more rarely, closing) a NetCDF file. This causes the model run to terminate. It is very frustrating trying to do production work on Franklin for this reason.

... I am somewhat unhappy with the sometimes large fluctuations in performance on Franklin, in particular in sections of the code that involve significant MPI communication. I suspect this is related to network activity (MPI activity) on the system and/or by the physical location of the compute nodes relative to each other.

I run at NERSC infrequently, and only at the request of some projects with which I work. I found that performance variability on Franklin was quite high the few times that I have run there. I find that this is partly a characteristic of the Cray XT4 and how jobs are scheduled, and also occurs on XT systems elsewhere. My brief experience on Franklin indicated to me that the problem was worse there than elsewhere, possibly due to the nature of the NERSC workload. However, I have not been on the Franklin for a number of months, and do not know whether these issues have been resolved. If/when my projects require that I run there again, I will run my performance variability benchmarks again and report what I see.

Disappointed by the performance and cost of Franklin. It didn't offer a refuge from Seaborg for codes that like a high processor to node ratio. All of those users got pushed onto Bassi. Hopefully the quad processor upgrade will improve this, but only if the OS software too improves.

 

Happy with Franklin:   7 responses

franklin is excellent.

Over all I'm very happy with Franklin. It has allowed our research group to complete scientific projects that were not possible before. ...

... You've done a great job with franklin.

Franklin is fantastic when it's working.

I've been very happy with Franklin which, despite the usual new hardware start-up bumps, has been a great machine for my applications.

franklin has a lot of promise to be the new NERSC workhorse (especially after the quadcore upgrade). ...

I'm very happy with the number of processors and power available on Franklin, but have also found the MPP charge large enough that I've limited my runs more than I'd like in order not to blow through my allocation.

 

Franklin User Environment comments:   5 responses

1) would like to write data from large franklin job to /project
2) would like more disk space
3) want old (>month) data automatically shuffled off to hpss

My problem with Franklin is that the nodes don't have local scratch disk, as they do on Jacquard.

Our code relies extensively on Python. Not having shared libraries on the compute nodes of Franklin is a big problem for us, and has been preventing us from porting our code to Franklin so far.

Lack of dynamic loading on Franklin nodes has prevented our use (and that of others) -- ideally this should be part of the evaluation before machine acquisition.

As ever, we need better performance tools that *really* work when used on 1000s of processors, jobs where we aren't sure where the problems are, where we really need the answers, and when we are *already* in a bad mood. Totalview and ddt are close on the debugging side, but the performance tools are lacking. Make the provider produce a "screencast" showing the performance analysis of a 2000 processor job with lots of mpi. The current vendor tools still have huge problems with the data deluge. The supposedly performing academic tools aren't set up/installed/tested/trusted.

 

HPSS Comments:   5 responses

it would be nice to have a graphical interface to the storage filesystem

Would be nice if hsi had a "copy if file doesn't exist, don't overwrite if it does" flag. Maybe it does and I just can't figure it out.

Our use of NERSC is primarily focused on the storage system. I would like to see a few more tools implemented to assist in archiving.
Specifically I would like to have the md5sum and/or sha1 computed and stored/cached at NERSC and displayable through hsi commands like 'ls' or 'find'. Checking file sizes and timestamps is not sufficient for assuring data integrity, especially since many data files are identical in size, but not in content.
It would also be very helpful to have a rsync-like command implemented in hsi so that changes to a directory can be easily mirrored to NERSC.
Lastly, and somewhat less importantly, to save space I would like to see htar seamlessly implement bzip2 and/or gzip to the resulting tar files. The idx files should point to the correct spot in the compressed file archive. Many forms of data pack better into a single tar file that is compressed instead of compressing each file individually and then tarring them up.
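
Pending the built-in checksum support requested above, one local workaround is to build a checksum manifest before archiving and keep it with the data, so retrieved files can be verified by content rather than by size and timestamp. The Python sketch below illustrates that idea only; nothing in it is an HPSS or hsi feature, and the directory name in the usage line is hypothetical.

    # Compute a checksum manifest for a directory tree prior to archiving it.
    # The manifest itself would be stored alongside the data (e.g. via hsi/htar).
    import hashlib
    from pathlib import Path

    def checksum_manifest(directory, algorithm="md5"):
        """Return {relative_path: hex_digest} for every file under directory."""
        manifest = {}
        root = Path(directory)
        for path in sorted(p for p in root.rglob("*") if p.is_file()):
            digest = hashlib.new(algorithm)
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):   # 1 MiB chunks
                    digest.update(chunk)
            manifest[str(path.relative_to(root))] = digest.hexdigest()
        return manifest

    if __name__ == "__main__":
        # "run_output" is a hypothetical directory about to be archived.
        for name, digest in checksum_manifest("run_output").items():
            print(f"{digest}  {name}")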

htar was a great addition to the hsi interface. ...

HPSS needs a better user interface, such as C, python, perl libraries. When a system can store petabytes of data, it isn't very practical to rely upon tools originally written for semi-interactive use. Users shouldn't have to write parsing wrappers to hsi output just to be able to do an "ls" from within their pipeline code.
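
For illustration, here is a minimal sketch of the kind of parsing wrapper this comment describes; it is a workaround, not an official HPSS interface. It assumes hsi is on the PATH, that hsi "ls -l <path>" produces a Unix-style long listing, and that the listing may be written to stderr (a common hsi behavior); the exact output format varies by hsi version and site, and the path in the usage line is hypothetical.

    # Sketch of a wrapper that shells out to hsi and parses its "ls -l" output.
    # The hard-coded field positions assume a Unix-style long listing, which is
    # exactly the fragility the comment above objects to.
    import subprocess

    def hpss_ls(path):
        """Return [(name, size_in_bytes), ...] for regular files under an HPSS path."""
        result = subprocess.run(["hsi", f"ls -l {path}"],
                                capture_output=True, text=True, check=True)
        entries = []
        # hsi often writes listings to stderr, so scan both streams.
        for line in (result.stdout + result.stderr).splitlines():
            fields = line.split()
            if len(fields) >= 9 and fields[0].startswith("-"):   # regular files only
                entries.append((fields[-1], int(fields[4])))     # name, size
        return entries

    if __name__ == "__main__":
        for name, size in hpss_ls("/home/someuser/archive"):     # hypothetical path
            print(f"{size:>15}  {name}")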

 

Comments on scheduled downs:   4 responses

I'd prefer maintenance to occur OUTside of 'normal business hours' (e.g. M-F 8-5)

... I would prefer to have a single morning a week when all the systems are down. For example, scheduled maintenance on bassi, franklin and hpss happens at different times, and I often lose productive morning hours to having one of these systems down. If I knew that there was no point in logging in on Wednesday morning because bassi, then hpss, and franklin were going to be down, I would simply move my work time to later in the day. ...

On one hand, I am very satisfied with the performance and reliability of the hpss systems. On the other hand, what the heck happens during the Tuesday morning close downs...every Tuesday? It always hits at a bad time, and I always wonder if it really needs work that day. If so, that's fine. Just curious.

I think that when maintenance is going on, they should keep at least a few login nodes available to retrieve files. One can block job submission and the like, but it should still be possible to access files in the scratch and home directories. If there is a need to block file access on scratch and home directories as well, they can always send us a notice.

 

Comments by Bassi Users:   3 responses

overall, very pleased

In general very satisfactory uptime and performance.

Time limit for premium class jobs should be increased.

 

Comments about the NERSC Global Filesystem / shared file storage:   2 responses

... I have tried to do my data processing on davinci. To do this, I need to upload data from bassi or franklin to hpss, then download to davinci. This is more time consuming than simply running my data processing scripts on the supers. I don't know if you could have a shared disk between the three machines, but it might be worth discussing.

1) would like to write data from large franklin job to /project ... 3) want old (>month) data automatically shuffled off to hpss

 

Queue configuration and wait comments:   2 responses

Would like more availability for moderate-size jobs (100-500 processor cores). [Franklin/Jacquard user]

CCSM utilizes a small number of processors, but needs to run more or less continuously. We are grateful for the boost in priority that you have given us, but at busy times, your computers are still overloaded, and wait times become unacceptably long. ... [Franklin/Bassi user]

 

Comments on Network Performance:   2 responses

I mentioned in a previous section concern about rate of file transfer. The destinations were often NCSA, FNAL or TACC.

To elaborate on sftp/scp from my home institution to NERSC: it is slow, never more than 450 Kb/sec, which tends to be a factor of 10 slower than the majority of my connections.

 

Comments by PDSF Users:   2 responses

Unfortunately, the PDSF disk arrays appeared not to be very stable during 2007

The integration of new compute hardware in PDSF and its batch systems has been a slow process. Although part of this may (perhaps) be ascribed to the change of lead in 2007, the integration of new hardware needs to be a high priority for NERSC staff.
Sakrejda has been key in particular to maintaining PDSF as a viable system for science in the transition period from one lead to the next. This well exceeded her high dedication level during normal PDSF operations periods. Perhaps a way can be found to recognize this.

 

Comment by a DaVinci user:   1 response

... finally, we've been working Davinci pretty hard lately, and I am extremely thankful for access. If I had to do all this data processing on Franklin, .... well, it wouldn't get done.

 

Other comments:   2 responses

About "Access and authentication", it is not a good way that 3 incorrect inputs of password will fail. Please change the rule. For example, after one fail, reopen the access page only.

NERSC systems are managed much better than the systems I use at other (DoD) sites [Franklin / Jacquard / Bassi user]