NERSC 1998 User Survey Results
Question 7. In what areas should NERSC improve?
These are the complete responses to this question, with a few minor edits.
Some responses have been repeated (to group them under multiple categories).
Provide more cycles / more machines:
- More j90/c90/(t90?) hardware.
- More Cray T3Es
- Increase duty cycle of machines.
- more computer power (obviously)
- enabling truly large-scale computing
- New computer and more computational power
Better software support:
- Keeping up with, and especially correcting, CRAY FORTRAN bugs is not done very well.
- I use the TotalView debugger. It is very slow. I think a line
debugger would be better for remote users since it is faster.
- Increase the variety of supported software.
- It would be useful if "dead" languages like Fortran 77 were kept
available so that old programs can be run without being modernized.
- Add the resources mentioned below. [Multithreaded Perl, GNU tcsh, ssh, CVS]
- I would like to have tcshell access to killeen. I have used this
before and found it very nice.
- I would like to see a return to the old
NERSC policy that if user software needed enhancements or
additions, the NERSC personnel would write the needed software,
rather than waiting for the vendor to provide it.
- Improve ease of use for large-scale GCA and SPP projects. [collaboratory tools, visualization
environments, ...]
Better / different batch management:
- Longer queue times on the T3E would be very helpful.
- have longer running time for batch jobs on T3E
- T3E queue policy needs improvement. We need longer runs.
- T3E scheduling (see below) [Having 4 hour time limits is extremely inconvenient when trying to
perform large-scale simulations...]
- I often encounter long (>8 hour) waits on even very small batch
jobs. This is very detrimental to my work since these small jobs
are usually of a debugging nature so the delay hinders the
development cycle. [T3E user]
- The 4 hour maximum queue structure on the T3E is ill-suited to
time dependent calculations. (You cannot "parallelize" over
time without violating causality.) The important problems for
the fusion program involve slowly evolving time dependent physics
and require immense amounts of spatial resolution. They can
easily require 50-100 hours of wall clock time even when using
hundreds of processors. The 4 hour time limit makes these jobs
almost impossible to run, even with "checkpointing". (It
recently took 48 hours between submission of a 4 hour job and
its execution. At 4 hours every 2 days, an 80 hour run will
require 40 days to complete!) The vaunted capabilities of
the T3E are thus completely negated. We don't see this as an
improvement in performance.
- you have to figure out some way that it is possible
to do medium sized development work in an acceptable
time frame. some bugs show up only if the job is a
little larger - and in the present set-up with about
2 jobs / day ... it takes forever to find anything. [T3E user]
- give more priority to large jobs which cannot be done locally [PVP user]
- C90 queue structure...
- Increase the capacity of simultaneous jobs running on CRAY queues (especially C-90)
- The batch queuing process language is less than transparent
- Transparency of queuing scheme and time estimates for execution of batch jobs.
- Simplify things such as running in batch or the allocation process
- The queueing system is not straightforward. The qstat command does not display job IDs.
- Batch file submission and the associated folklore is always a problem.
- Regarding software, a better accounting program and better queuing
program will make NERSC even better for users.
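The turnaround arithmetic in the fusion response above (a 4 hour job limit, one job executed roughly every 2 days, an 80 hour run) can be checked with a short sketch. The function name and parameters are illustrative only, and the sketch assumes every job uses its full time slice, waits the full 2 days, and loses nothing to checkpoint and restart overhead.

```python
import math

def days_to_finish(total_hours, hours_per_job, days_between_jobs):
    """Return (number of jobs, calendar days) for a run split into queue-limited jobs.

    Assumes one job completes per waiting period and ignores checkpoint overhead.
    """
    jobs = math.ceil(total_hours / hours_per_job)
    return jobs, jobs * days_between_jobs

jobs, days = days_to_finish(total_hours=80, hours_per_job=4, days_between_jobs=2)
print(jobs, days)  # 20 40
```

Under the same assumptions, the 50-100 hour runs mentioned in that response would take about 26 to 50 days.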
Better/different allocations process:
- Allow easier access to temporary increases in allocation
and/or group account facilities.
- notification of account allocations
- Make allocations more easily available to local users
- Encouraging users to use their allocated time more evenly throughout the year.
- Simplify things such as running in batch or the allocation process
- provide a one time allocation, that can be renewed during the year, if needed.
- The ERCAP software needs replacing -- it is very user unfriendly for those
of us used to using powerful editors and TeX/LaTeX/html for
producing our documentation. The earlier deadline for the most
recent ERCAP was a major inconvenience.
- We requested 8000 SU on T3E in our proposal, and we were given
5000 SU. During our data analysis, our allocation was used up,
and we requested another 10000 SU. But our request was denied.
We understood that the resource was very tight at NERSC. At the
beginning, we thought about 20% of the allocation would be left
to be used by startup project. We really wish we could get more
allocation so that we can finish our analysis. At the end,
our usage was about 5000 SU more than our initial assignment.
Improve documentation/training:
- I do not find it easy to locate technical info on the website.
- Is there a Unix tutorial on the web site? i.e., unix-for-dummies?
- There should be additional "deeper" tutorials on
effective use of multiple CPUs (SMP's and MPP's, multithreading, etc. etc.)
- I would like to have a good manual available with tutorials in hand
More storage, better storage interfaces, better file management, new file
services:
- Increase storage capacity.
- CFS interface remains brain dead.
- HPSS as it stands is woefully inadequate as a replacement for cfs
in every regard, with the possible exception of capacity and needs major work.
- Let the users know what will happen when HPSS replaces
CFS (where do the files go? how to access them? Does everyone
need to get a new password?...)
- Offline file storage - I prefer the old practice of maintaining
a few megabytes per user permanently on-line.
- Disk space management is a real problem. Migration & home
directory system would be fine if the migration system weren't
ridiculously unreliable and if the home directory policy didn't
keep changing. The migration + quota system has made me waste
a significant amount of time shuffling files around using
unreliable storage systems. Quite a problem.
Better accounting/charging/account management procedures:
- i am occasionally confused by the accounting scheme.
- Account creation is much too slow.
- Account and queuing policies and establishment of group accounts,
to allow multiple users on the same project to be able to
work in a coherent fashion.
- On one occasion it took a long time to set up an HPSS password for an
additional user; that may have been an anomaly and not typical.
- Regarding software, a better accounting program and better queuing
program will make NERSC even better for users.
- My BIG GRIPE with NERSC involves your accounting system. My time was provided by [PI name
deleted] at 20% of his total allocation. This was (or should have been) established upon creation
of my account on the T3E. Somehow, no boundary or cutoff was activated and so I inadvertently
burned up the total allocation (my 20% PLUS the remaining 80% which was not mine). Now I have an
incomplete project and no more time! FIX THIS!
Keep users better informed; better interactions with users:
- updating users about changes in the system
- be kind and compassionate to those of us who view computers as
useful black boxes to solve physics problems using FORTRAN 77,
and who barely know UNIX, let alone other things like MPI, PVM,
HPF, etc.
- Get the information about improvements out to the users. I
think it would be good to provide some of this information
via e-mail (for those who don't check the web pages regularly)
to the users, and not just the PI's. A case in point is (again)
HPSS: letting the users know what will happen when HPSS replaces
CFS (where do the files go? how to access them? Does everyone
need to get a new password?...). These are minor complaints.
- To be more user oriented rather than computer keepers
Improve the T3E / other T3E issues:
- more flexibility in allocation of memory and inodes on the T3E
- Some work needs to be done to resolve T3E performance issues.
Faster networks / better response time:
- faster network access
- improve ftp speed on Killeen
- I would like to have a dedicated high speed network between NERSC's T3Es and my workstations
at PNNL. Just kidding!
- Increased network access on the C90 cluster is desirable.
My text editor (Emacs) runs prohibitively slow when the
cluster has a lot of users logged on. (Much faster than
before, though.)
- I assume that it is something over which you have no control,
but I would like speedier access via telnet. Sometimes I get
pretty twitchy waiting for responses.
Less down time:
- less down time,
- Regarding hardware, I am wondering whether the maintenance time can
be reduced while the good condition of the machines is maintained.
- Twice-weekly mid-day T3E downtimes / HPSS downtimes were a nuisance.
Provide a workstation farm; better PDSF support:
- A lot of cpu
power is needed to analyze huge data sets in a timely
fashion. There NERSC does not help us so much. The reasons are
the following:
1. one has to understand what sort of software we are running
2. one has to understand the nature of our problem:
a huge amount of data (several TB), segmented in files of
500-800 MB, needs to be analyzed; for each input file an output
file on the order of 200 MB is produced, which we need to put
back into mass storage.
The natural way to do this is: have a UNIX workstation farm,
each workstation processes one file after another, it's a kind
of coarse-grain parallelism. Compilers available there are
able to digest our partially very old FORTRAN code, which
often contains VMS extensions. The analysis software has been
developed on a single workstation and the idea is to run
multiple instances of the same executable on different
workstations.
That's why NERSC should push facilities like PDSF!!!
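The coarse-grain pattern this response describes (independent input files, each processed by the same program, one output file per input) can be sketched as below. This is only an illustration: `analyze` and `run_farm` are hypothetical names, `analyze` stands in for the respondent's analysis executable, and threads on one machine stand in for the separate workstations of a real farm.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import tempfile

def analyze(in_path: Path, out_dir: Path) -> Path:
    """Hypothetical stand-in for the analysis program: one input file in, one output file out."""
    out_path = out_dir / (in_path.stem + ".out")
    data = in_path.read_bytes()
    out_path.write_text(f"{in_path.name}: {len(data)} bytes\n")
    return out_path

def run_farm(inputs, out_dir, workers=4):
    """Process independent files with a pool of workers; each worker takes one file after another."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands files to idle workers and returns results in input order.
        return list(pool.map(lambda p: analyze(p, out_dir), inputs))

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        inputs = []
        for i in range(6):
            p = tmp_dir / f"run{i}.dat"
            p.write_bytes(b"x" * (i + 1))  # tiny placeholder for a 500-800 MB data file
            inputs.append(p)
        outputs = run_farm(inputs, tmp_dir, workers=3)
        print([o.name for o in outputs])
```

Because no file depends on another, throughput in this scheme scales with the number of workers until storage or network bandwidth becomes the limit.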
Visualization support/software:
- Graphics support to the average user could be improved.
Increase research / scientific computing support:
- Additional support for computational science research would be nice.
NERSC is excellent at helping the average/beginning user. A possible
improvement would be to work with users at a higher level more often.
- I think more people are needed to help out in areas of scientific
computing support.
No changes needed / no opinion:
- No strong opinions.
- I don't know.
Miscellaneous:
- interoperability with other systems (workstations, etc.)
- X server