
2005 User Survey Results

Hardware Resources

 

  • Legend
  • Hardware Satisfaction - by Score
  • Hardware Satisfaction - by Platform
  • Max Processors Effectively Used on Seaborg
  • Hardware Comments

 

 

Legend:

Satisfaction           Average Score
Very Satisfied         6.50 - 7.00
Mostly Satisfied       5.50 - 6.49
Somewhat Satisfied     4.50 - 5.49
Neutral                3.50 - 4.49

Significance of Change (for the "Change from 2004" column): significant increase / significant decrease / not significant

 

Hardware Satisfaction - by Score

7=Very satisfied, 6=Mostly satisfied, 5=Somewhat satisfied, 4=Neutral, 3=Somewhat dissatisfied, 2=Mostly dissatisfied, 1=Very dissatisfied

Columns: Item; number of respondents who rated the item 1, 2, 3, 4, 5, 6, or 7 (a blank entry means no respondents chose that rating); Total Responses; Average Score; Std. Dev.; Change from 2004.
HPSS: Reliability (data integrity)       1 1 19 68 89 6.73 0.54 -0.01
HPSS: Uptime (Availability)       2 1 21 65 89 6.67 0.62 0.01
Network performance within NERSC (e.g. Seaborg to HPSS)     1 1 2 31 71 106 6.60 0.67 0.14
Seaborg: Uptime (Availability)       3 2 48 85 138 6.56 0.64 0.30
HPSS: Overall satisfaction   1 1   2 34 58 96 6.51 0.79 -0.05
HPSS: Data transfer rates 1     3 4 31 51 90 6.40 0.93  
NERSC CVS server         2 5 4 11 6.18 0.75 0.85
Remote network performance to/from NERSC (e.g. Seaborg to your home institution)   3 6 3 9 47 61 129 6.12 1.19 0.01
HPSS: User interface (hsi, pftp, ftp) 1 1 5 2 7 27 46 89 6.12 1.29 -0.01
Seaborg: Disk configuration and I/O performance   1 3 14 6 40 54 118 6.06 1.16 0.12
HPSS: Data access time 1 2 3 4 9 29 39 87 6.00 1.31 -0.25
PDSF: Overall satisfaction       3 4 22 10 39 6.00 0.83 -0.52
PDSF: Batch queue structure       6 2 14 14 36 6.00 1.07 -0.31
Seaborg: overall   3 7 2 19 69 44 144 5.92 1.13 0.15
PDSF: Uptime (availability)     1 5 3 16 12 37 5.89 1.10 -0.51
Jacquard: Disk configuration and I/O performance     2 11 5 16 26 60 5.88 1.25  
Jacquard: Uptime (Availability)   2 4 4 14 19 31 74 5.85 1.32  
PDSF: Batch wait time     1 6 3 14 11 35 5.80 1.16 -0.07
PDSF: Ability to run interactively     2 5 5 13 13 38 5.79 1.21 0.11
Jacquard: overall   2 4 5 12 31 25 79 5.78 1.25  
DaVinci: overall   1 1 1 5 5 7 20 5.65 1.42  
Jacquard: Ability to run interactively   1 4 8 6 25 13 57 5.56 1.28  
Seaborg: Ability to run interactively 3 1 4 18 20 42 31 119 5.53 1.38 0.19
Jacquard: Batch queue structure 2 1 4 10 3 37 12 69 5.46 1.42  
Jacquard: Batch wait time 2 1 10 8 12 24 13 70 5.16 1.54  
PDSF: Disk configuration and I/O performance   1 2 8 10 8 6 35 5.14 1.29 -0.45
Seaborg: Batch queue structure 6 3 14 17 17 53 16 126 5.06 1.58 0.39
Seaborg: Batch wait time 17 15 28 13 33 27 5 138 3.95 1.76 0.10
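
The survey pages do not state exactly how the Average Score and Std. Dev. columns are computed. A minimal sketch in Python, assuming the score is the count-weighted mean of the 1-7 ratings and the deviation is the sample standard deviation (the population form also rounds to the published values for the rows checked):

```python
# Sketch (assumed method): reproduce Average Score and Std. Dev. from the
# per-rating counts in the table above.
from math import sqrt

def score_stats(counts):
    """counts[i] = number of respondents who gave rating i + 1, for i = 0..6."""
    n = sum(counts)
    mean = sum((i + 1) * c for i, c in enumerate(counts)) / n
    ss = sum(c * ((i + 1) - mean) ** 2 for i, c in enumerate(counts))
    std = sqrt(ss / (n - 1))  # sample standard deviation (assumption)
    return n, mean, std

# "HPSS: Reliability (data integrity)": one 4, one 5, nineteen 6s, sixty-eight 7s
print(*[round(v, 2) for v in score_stats([0, 0, 0, 1, 1, 19, 68])])     # 89 6.73 0.54

# "Seaborg: Batch wait time": ratings spread across the whole scale
print(*[round(v, 2) for v in score_stats([17, 15, 28, 13, 33, 27, 5])])  # 138 3.95 1.76
```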

 

Hardware Satisfaction - by Platform

7=Very satisfied, 6=Mostly satisfied, 5=Somewhat satisfied, 4=Neutral, 3=Somewhat dissatisfied, 2=Mostly dissatisfied, 1=Very dissatisfied

Columns: Item; number of respondents who rated the item 1, 2, 3, 4, 5, 6, or 7 (a blank entry means no respondents chose that rating); Total Responses; Average Score; Std. Dev.; Change from 2004.
CVS server         2 5 4 11 6.18 0.75 0.85
DaVinci Analytics Server   1 1 1 5 5 7 20 5.65 1.42  
HPSS: Reliability (data integrity)       1 1 19 68 89 6.73 0.54 -0.01
HPSS: Uptime (Availability)       2 1 21 65 89 6.67 0.62 0.01
HPSS: Overall satisfaction   1 1   2 34 58 96 6.51 0.79 -0.05
HPSS: Data transfer rates 1     3 4 31 51 90 6.40 0.93  
HPSS: User interface (hsi, pftp, ftp) 1 1 5 2 7 27 46 89 6.12 1.29 -0.01
HPSS: Data access time 1 2 3 4 9 29 39 87 6.00 1.31 -0.25
Jacquard: Disk configuration and I/O performance     2 11 5 16 26 60 5.88 1.25  
Jacquard: Uptime (Availability)   2 4 4 14 19 31 74 5.85 1.32  
Jacquard: overall   2 4 5 12 31 25 79 5.78 1.25  
Jacquard: Ability to run interactively   1 4 8 6 25 13 57 5.56 1.28  
Jacquard: Batch queue structure 2 1 4 10 3 37 12 69 5.46 1.42  
Jacquard: Batch wait time 2 1 10 8 12 24 13 70 5.16 1.54  
Network performance within NERSC (e.g. Seaborg to HPSS)     1 1 2 31 71 106 6.60 0.67 0.14
Remote network performance to/from NERSC (e.g. Seaborg to your home institution)   3 6 3 9 47 61 129 6.12 1.19 0.01
PDSF: Overall satisfaction       3 4 22 10 39 6.00 0.83 -0.52
PDSF: Batch queue structure       6 2 14 14 36 6.00 1.07 -0.31
PDSF: Uptime (availability)     1 5 3 16 12 37 5.89 1.10 -0.51
PDSF: Batch wait time     1 6 3 14 11 35 5.80 1.16 -0.07
PDSF: Ability to run interactively     2 5 5 13 13 38 5.79 1.21 0.11
PDSF: Disk configuration and I/O performance   1 2 8 10 8 6 35 5.14 1.29 -0.45
Seaborg: Uptime (Availability)       3 2 48 85 138 6.56 0.64 0.30
Seaborg: Disk configuration and I/O performance   1 3 14 6 40 54 118 6.06 1.16 0.12
Seaborg: overall   3 7 2 19 69 44 144 5.92 1.13 0.15
Seaborg: Ability to run interactively 3 1 4 18 20 42 31 119 5.53 1.38 0.19
Seaborg: Batch queue structure 6 3 14 17 17 53 16 126 5.06 1.58 0.39
Seaborg: Batch wait time 17 15 28 13 33 27 5 138 3.95 1.76 0.10

 

What is the maximum number of processors your code can effectively use for parallel computations on Seaborg?   51 responses

Columns: Processor Count; Number of Responses; Number of respondents who actually ran codes on this number of processors; Percent.
4,560 - 6,000 3 2 1.4%
4,096 1 6 4.3%
2,016 - 3,074 1 10 7.2%
1,008 - 1,728 1 16 11.6%
512 - 768 7 25 18.1%
256 - 400 11 14 10.1%
112 - 192 8 22 15.9%
64 - 96 5 25 18.1%
32 - 48 8 9 6.5%
≤16 6 9 6.5%
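
A note on the Percent column: it appears to be each row's count of respondents who actually ran at that concurrency divided by the total of that column (138), rather than by the 51 responses to the maximum-concurrency question. A quick check of that reading:

```python
# Assumed reading of the Percent column: row count / column total (138).
actually_ran = [2, 6, 10, 16, 25, 14, 22, 25, 9, 9]   # table rows, top to bottom
total = sum(actually_ran)                              # 138
print([f"{100 * count / total:.1f}%" for count in actually_ran])
# ['1.4%', '4.3%', '7.2%', '11.6%', '18.1%', '10.1%', '15.9%', '18.1%', '6.5%', '6.5%']
```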

 

Hardware Comments:   38 responses

  • 5 overall hardware comments
  • 2 comments by Bassi Users
  • 8 comments by Jacquard Users
    3   Need more scratch space
    3   Queue issues
  • 3 comments by HPSS Users
  • 3 comments on networking performance
  • 5 comments by PDSF Users
  • 16 comments by Seaborg Users
    11   Turnaround too slow
    3   Queue/job mix policies should be adjusted

 

Overall Hardware Comments:   5 responses

It appears that the load on the resources implies a need for much larger and/or faster computers.

Cannot run interactive jobs at night. This should be fixed, considering there are many people who are willing to work at night.

I have generally been quite satisfied with NERSC's hardware resources, though of course faster machines are always helpful, and I look forward to seeing what Bassi can do. Copying large datasets to my home machines for visualization can be a bottleneck, as scp seems to be limited to about 600 kB/s.

I feel that NERSC should focus on providing computing resources for real world parallel applications rather than focusing on machines with high theoretical performance and poor performance with realistic parallel applications which use domain decomposition.

I am a long-time NERSC user, but only recently began work with a large parallel application: it is too early to have an opinion on many questions in this survey, and thus you find them unanswered.

 

Comments by Bassi Users:   2 responses

The IBM-SP5 should be expanded as soon as possible, since it is an order of magnitude faster than the IBM-SP3.

I could comment here on Bassi. I am mostly satisfied with Bassi. I run on 48 processors. The present queue structure is a pain with its 8 hour time limits, but I understand that this will almost certainly change when it goes into production next week.

 

Comments by Jacquard Users:   8 responses

Need more scratch space

Availability of scratch disk on Jacquard is a major restriction on the usefulness of that system. The default 50 GB scratch is not enough for many runs. While temporary extensions are useful and have been granted, permanent increases would be a great improvement to the usefulness of the system. ...

Increasing the available scratch space on Jacquard would significantly improve the environment from my perspective.

... My major issue with jacquard arises from insufficient /scratch file system space to store even one year of inputs and outputs for our minimal-resolution model configuration. I am working around this currently by downloading half-year outputs as they are produced, but this requires me to offload outputs before submitting the next run segment to batch. This cramped file system also makes jacquard impractical for higher-resolution simulations that we will need in the future. I expect that other jacquard users are facing similar limitations. Is additional storage for jacquard /scratch prohibitively expensive?

NERSC response: Users requiring large amounts of scratch space should consider using the NERSC Global Filesystem.

Queue issues

... I hope that Jacquard will be more user friendly for our applications that require 10-100 processors.

The only complaint I have is the wait time to run short large jobs (128 or 256 nodes) on Jacquard. It seems like it takes about a week to get these jobs through, where other systems can get them through the queue in a day or two.

My biggest complaints are over-allocation leading to long queue wait times, and the inability of PBS on Jacquard to run larger than average jobs. PBS should be replaced, or most of the nodes should be exclusively allocated to queues with a minimum job size of 16 nodes.

NERSC response: NERSC is investigating alternatives to PBS Pro for Jacquard scheduling, including using the Maui scheduler.

Other

The lack of any compiler option on Jacquard except for the PathScale compilers makes this machine useless to me. PathScale is just not up to the standard of the freely available compilers from Intel or commercial compilers like NAG. I don't understand why Intel compilers can't be installed on Jacquard.

Still working on best setup for jacquard.

 

Comments by HPSS Users:   3 responses

... htar is terrible -- random crashes without returning error codes, etc. hsi would be significantly improved by allowing standard command line editing. HPSS is ok, but the interfaces to it are poor.

Though the HPSS mass storage is very good already, I found a system installed at the supercomputer in Juelich easier to use. There, the data migration is done by the system software and the migrated data can be accessed the same way as a regular file on disk. This data storage system is very convenient. ...

When data migrates to tape on HPSS, it sure takes a long time to retrieve it.

 

 

Comments on Network Performance:   3 responses

... NERSC to LBL bandwidth needs improvement for X-based applications to be usable (including XEmacs) ...

... Network performance appears to be limited by LBL networking, not NERSC, so the neutral answer reflects the fact that I cannot really evaluate NERSC performance.

Accessing PDSF for interactive work from Fermilab and BNL is slow, probably due to latency. Not sure whether this is intrinsic (distance) or a problem in the network.

 

Comments by PDSF Users:   5 responses

Some of the PDSF disks which store STAR data become unavailable quite often due to users running with un-optimised dvio requirements. This makes it impossible for other users to do their work. It would be great if we could somehow come up with a system for ensuring that users cannot run jobs irresponsibly - i.e. perhaps setting a safe minimum dvio requirement that everyone has to adhere to? Or perhaps developing a system tool which users can use to benchmark their code and better judge what dvio requirement to run with? These comments obviously refer mainly to users, not to PDSF hardware. In general, I have no problems with the hardware itself - it's how we use it that is sometimes not optimal!

It seems that even "normal" system usage (on PDSF where I do all of my work) is too much for the system to handle. I can always expect some choppiness in responsiveness, even when doing simple tasks like text editing. In periods of high usage, the system may even stop responding for seconds at a time, which can be very frustrating if I'm just trying to do an "ls" or quickly edit a file. Is the home directory one giant disk in which everyone is working? If so, I would suggest dividing users' home directories among multiple disks to ease the situation.

... PDSF diskvaults are very unreliable, but they are upgrading to GPFS which works much better. I am dissatisfied with the current situation but quite happy about the future plans. ...

NERSC response: NERSC has made a substantial investment by providing GPFS licenses for PDSF. This has allowed us to consolidate and convert the many NFS filesystems to GPFS. NERSC expects GPFS to be a more robust and reliable filesystem.

The PDSF cluster has been instrumental for our data analysis needs and overall we are very satisfied with the system. Because the system is built out of commodity hardware, older hardware that may no longer meet the computing requirements, but is otherwise in good condition, is reused for other tasks. One such task is providing cheap disk servers, which is very valuable for our data-intensive applications.

Refresh time just running emacs on pdsf is slow. Network?? Recently there have been times when one could not effectively do anything, even at the command line. Surprisingly, it took days to find and fix.

 

Comments by Seaborg Users:   16 responses

Turnaround too slow

If we have to wait for more than 2 weeks to get a batch job run on Seaborg, it diminishes the usefulness of the NERSC facility.

Batch jobs had to wait too long to really get some work done.

Of course I would prefer shorter queue times, but having seen the load distribution on Seaborg via the NERSC website, that seems to be a function of demand, not mismanagement. One feature I would find useful (perhaps it is available and I am unaware of it) is email notification when a job has begun.

Seaborg is user-friendly but somewhat old and vastly oversubscribed. As a result, I find it necessary to use premium queue for routine production jobs, or see my jobs spend several days waiting to do 6-hour runs. I'm glad to learn of the arrival of bassi.nersc.gov and would appreciate hearing more about it.

Your queueing structure is optimized for the wrong thing. It shouldn't be maximum CPU utilization. It should be maximum scientist productivity. If the wait time in the queue is long, then it really doesn't matter how fast the machine is; it's equivalent to being able to use a much slower machine right away. The total turnaround time is what matters for a scientist being able to get work done, and wait time in the queue is a huge part of that.
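
A rough illustration of this argument, with assumed figures loosely based on the 6-hour runs and multi-day waits described in nearby comments:

```python
# Illustrative only: effective slowdown when turnaround is dominated by queue wait.
run_hours = 6.0            # assumed run time of one job
wait_hours = 3 * 24.0      # assumed queue wait of three days
slowdown = (wait_hours + run_hours) / run_hours
print(f"effective slowdown: {slowdown:.0f}x")   # 13x: turnaround is as if the
                                                # machine were 13x slower with no wait
```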

... The turnaround time for large Seaborg jobs is very long. I would sometimes appreciate it if my production jobs could not prevent my debug jobs from starting. As it is now, a couple of large jobs block my queue for weeks, even for the small test jobs. It would be better to restrict the number of jobs in each class, so that debug jobs can still run while I am waiting for production jobs to start.

Seaborg is a great resource, but it is heavily utilized and turnaround times are sometimes quite slow. If at all possible, it would be a real asset to scientific computing if NERSC could expand their facilities.

INCITE-driven wait times (and their seasonal fluctuations) are killing us. Occasional super-priority has been a huge boon and is much appreciated.

During most of last year queue wait times on seaborg were very bad. This has changed dramatically in the last few weeks, and wait times are now very good.

The Seaborg wait times of 2 weeks make the machine useless or hard to use. Some very cycle-hungry GYRO users in my group still use it, but I didn't use it much last year. ...

My primary issue with seaborg is that batch wait times can vary wildly with little notice (for example: in October 2005, a "reg_1" queue turnaround time of overnight abruptly increased to more than a week, and even "premium" jobs were subject to waits of more than two days). I am not certain if this is an unavoidable issue with shared supercomputer resources, but is it possible to provide some warning of batch wait times to be expected by a job? ...

NERSC response: NERSC has made several changes based on the over-allocation that occurred in 2005 and the resulting long turnaround times:

  • Time for new systems will never be pre-allocated; instead, it will be allocated shortly before the system enters full production.
  • Time has not been over-allocated in 2006.
  • NERSC is investigating options with DOE to under-allocate resources in order to improve turnaround times while still meeting the DOE mission.
  • NERSC has suggested to DOE that the allocation process align allocations more closely with the mission and metrics DOE has for NERSC.

However, many things that affect turnaround time are outside of NERSC's control, such as large numbers of projects that have similar deadlines.

 

Queue/job mix policies should be adjusted

Seaborg is getting more and more difficult to use for our kind of needs, where the inter-processor communication is large. This limits our use of the resources to about 100 processors. These kinds of jobs are heavily penalized with very low priority compared to larger jobs using thousands of processors. I understand that this is the policy of NERSC. I hope that Jacquard will be more user friendly for our applications that require 10-100 processors.

We can't use very many processors at one time, but we need processors for a significant portion of every day to get reasonable turnaround on our model runs. Queues on Seaborg have always been a terrible headache, but this last year they were abysmal. It takes just under 24 hours to execute one model year. Our model is getting only one run slot per real-time week in the regular queue this fall. At this speed, it will take two years to finish the run. Because of restrictions on the queue, there is no way for us to use the time allocated to us, except by running our jobs in the premium queue all the time.

Jobs using 64-128 processors in the seaborg batch queue seem to be unfairly penalized as compared with larger jobs.

NERSC response: Users running jobs on fewer than 512 processors should consider using the Jacquard Linux cluster or the Bassi Power5 system.

Other

increase memory

It seems that the processors do not run as fast as anticipated. Compared with other supercomputer centers, the same parallel programs run up to two times slower on Seaborg! I think more upgrades are needed.