NERSC 1998 User Survey Results
Question 28. Comments about NERSC's large scale computers, suggested improvements, future needs
These are the complete responses to this question, with a few minor edits.
T3E Queue Related Comments:
- it would be nice to have longer t3e queues for smaller PE amounts, such as 64 PEs
- The queue structure on the MPP needs to allow longer runs at 64 PEs.
- One 'very large' queue could be available for large jobs
- The queue structure on the T3Es makes them practically unusable
for large time-dependent problems. The promise of massively
parallel supercomputing has not materialized for this type of
physics problem.
- see #7. [I often encounter long (>8 hour) waits on even very small batch
jobs. This is very detrimental to my work since these small jobs
are usually of a debugging nature so the delay hinders the development cycle.]
- The batch system on the T3E is my only major complaint. Having 4 hour time limits is
extremely inconvenient when trying to perform large-scale simulations (our simulations are as
large as grand challenge projects, even if we don't have that designation).
I also don't like the first fit job selection. Small jobs that can probably run on workstations
block large-scale problems for a long period of time. Some type of weighting (taking into
account how long a job has been queued) must be implemented, even if it means that some
processors must be left idle.
- On several occasions I have watched the T3E queues (with qstat) and usage (with tstat).
I get the impression that the job scheduler needs to be improved. I'll see large chunks
of the machine sit idle for an extended period of time.
- My major complaint about the queue structure on the T3e is the
rather short times. (The addition of the gc128, gc256 queues was
a big improvement.) I understand that making a structure that is
fair to everyone is hard. I think increasing the 128 queue limit
to 15000 (same as the 64) would be an improvement: The current
set-up seems to assume that one wants to run the same problem
on, for example, 64 and 128 nodes, rather than recognizing that the
128 node problem is really twice as large. (The same holds for 256).
The current queue structure makes it harder to do the large
problems that the machine was designed for.
I don't know what the fairest queue structure is; I do know that
I checkpoint my codes more frequently (with the related i/o cost)
just because I need to ensure that I can run up to the last
possible moment of the queue limit.
- About the queues on the T3E
approximately how long it would take for my batch jobs to be
completed. Even when the priority number is high, it may take
a long time to execute if the concerned queue is running a low
number of jobs. In addition, the number of jobs running per
queue varies, and not always the longest queue runs the most
jobs. Finally, sometimes when I submit a job, it takes quite
a time (up to one or two days) before it appears in the queue,
even if my limit is not reached. This makes it a little
difficult to plan ahead of time if the result of one simulation
is needed for the next one.
Other T3E Comments:
- I/O access time is superb!
- Add more nodes to the T3E, or try to acquire another parallel
machine. My particular work could benefit from a much larger
T3E (e.g., I'm currently using the 1024 node Paragon at Oak
Ridge, and get about .95 efficiency on 1024 nodes).
- Increase the quota of I-nodes per user.
- I/O could be improved by looking at increasing the number of files
that can be simultaneously opened on the T3E.
- memory - I run out of it real quick. Maybe due to no virtual memory or a poor malloc?
- The T3E clearly is some of the best hardware available, with
fast processors and great interprocessor communications. However,
I have not been able to get several codes to scale up to the full
512 processors efficiently, and have seen poor efficiency for
even smaller runs, all due to the fact that the T3E system does
not allow you to map a problem onto a given arrangement of nodes.
Codes that have local interactions can take advantage of this
fact to do only local communications on most MPP systems, but
on the T3E local regions may not be mapped to neighboring nodes
which causes congestion that greatly affects the ability of
codes that scale well on other MPPs (nCube, Paragon) to scale
well on the T3E. This is something that NERSC should be aware
of when testing out computers for the next 'big' MPP purchase,
especially if the machine is to be used for Grand Challenge
problems that are supposed to be run on the entire machine at times.
- My only other worry is Cray's commitment to improving the T3E
now that it seems a development dead-end. There are many ways in
which NERSC's efforts are hamstrung if Cray doesn't actively
address the issues reported by NERSC or continue to update
and improve software.
- MPP: T3E needs better support for 32 bit programming. As of now
performance for 32 bit codes is far below what it should be. The
Memorial Day slowdown needs to be resolved.
- Would like to use T3E as I've used J/C 90s. I.E., minorly.
PVP Comments:
- very slow batch when running large (64mw) jobs!
- some more disk space seems to be an evident necessity!
- Give more weights to large memory jobs
- The J90s are too slow.
- I will, hopefully, be soon running a new version of one of our codes
and hope that when I no longer have the C90 that the J90s will have sufficient
CPU, memory and disk space.
- Get more J90's and more CPU's for the interactive machine - the
current number on Killeen is inadequate for some program development.
- The queue structure on the J90's needs to be reconsidered. Having
4 machines with separate queues is confusing. One never knows if
the turnaround time can be improved by resubmitting to another machine.
- PVP: C-90 with memory at least as great as the J-90's.
Comments about future machines:
- MORE MEMORY, MORE MEMORY, MORE MEMORY!!! The SV1 needs to have
A LOT more memory than the C90 currently has!
- More memory -- 1 TB+ for the next machine would be good.
Performance sort of meaningless since sustained performance is
so poor compared to peak for present generation MPPs/SMPs and
likely to get worse (sigh).
- Of course bigger/better/faster will always apply to CPU,
memory, I/O, and disk space. For my needs more memory/node
on the T3E (or its successor) are the highest priority.
- I/O and total memory are fundamental research areas for both
nersc and the community at large.
- i only use nersc for parallel computing. i can do everything
else i need to do at my own shop.
- One of the requirements for future MPP systems should be that single PE
performance for codes written in high level languages such as
fortran should be of order 50% of peak or better as it is for
PVP machines. The T3E's best feature is its fast communications,
and new machines should perform as well in this regard. Somehow
the hardware/firmware caching of the current generation of
general purpose microprocessors seems to impede performance.
- The areas of dissatisfaction I think are mostly out of your control. The only
real solution would be to buy more, faster computers. The type of simulation that we do tends to
be limited more by CPU speed rather than memory or IO.
No opinion: