NERSC: Powering Scientific Discovery Since 1974

2002 User Survey Results

Hardware Resources

  • Legend
  • Satisfaction - Compute Platforms (sorted by Average Score)
  • Satisfaction - Compute Platforms (sorted by Platform)
  • Max Processors Used and Max Code Can Effectively Use
  • Satisfaction - HPSS
  • Satisfaction - Servers
  • Summary of Hardware Comments
  • Comments on NERSC's IBM SP:   54 responses
  • Comments on NERSC's Cray T3E:   18 responses
  • Comments on NERSC's Cray PVP Cluster:   14 responses
  • Comments on NERSC's PDSF Cluster:   9 responses
  • Comments on NERSC's HPSS Storage System: 31 responses
  • Comments about NERSC's auxiliary servers:   3 responses

 

Legend:

Satisfaction          Average Score
Very Satisfied        6.5 - 7
Mostly Satisfied      5.5 - 6.4
Somewhat Satisfied    4.5 - 5.4

Significance of Change: each change from 2001 is classified as a significant increase, a significant decrease, or not significant.
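The tables below flag each year-to-year change in average score as significant or not. The survey does not state its statistical test, but from the reported means, standard deviations, and response counts such a check can be sketched with a Welch two-sample t statistic. A minimal sketch; the 2001 standard deviation and response count in the example are invented purely for illustration:

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and degrees of freedom from summary statistics."""
    se1, se2 = sd1**2 / n1, sd2**2 / n2
    t = (mean1 - mean2) / math.sqrt(se1 + se2)
    df = (se1 + se2)**2 / (se1**2 / (n1 - 1) + se2**2 / (n2 - 1))
    return t, df

# Example: SP Uptime scored 6.56 in 2002 with a reported change of +1.03,
# implying roughly 5.53 in 2001.  The 2001 SD (1.20) and N (150) are
# assumptions for illustration only.
t, df = welch_t(6.56, 0.81, 176, 5.53, 1.20, 150)
print(round(t, 2), round(df, 1))
```

A large t against the Welch degrees of freedom would mark the change significant; with real 2001 summary statistics the same three-line computation reproduces the legend's classification.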

Satisfaction - Compute Platforms (sorted by Average Score):

Topic                                          No. of Resp.   Avg. Score   Std. Dev.   Change from 2001
SP Uptime                                           176           6.56        0.81           1.03
PDSF Uptime                                          41           6.51        0.78            -
T3E Uptime                                           62           6.48        0.84           0.26
SP Overall                                          187           6.38        0.87           0.56
PVP Uptime                                           39           6.31        0.89          -0.14
PDSF Overall                                         46           6.26        0.93            -
PDSF Ability to Run Interactively                    39           6.18        0.97            -
T3E Overall                                          72           6.11        1.03          -0.12
PVP Overall                                          47           6.06        0.94          -0.08
PVP Disk Configuration and I/O Performance           27           6.00        1.11           0.00
PDSF Queue Structure                                 39           5.97        1.14            -
SP Disk Configuration and I/O Performance           143           5.97        1.25           0.30
SP Queue Structure                                  165           5.92        1.07           0.73
PVP Ability to Run Interactively                     33           5.82        1.26          -0.16
T3E Queue Structure                                  55           5.76        1.09           0.40
PDSF Batch Wait Time                                 39           5.74        1.21            -
T3E Ability to Run Interactively                     56           5.73        1.37           0.09
PVP Queue Structure                                  36           5.69        1.09           0.28
T3E Disk Configuration and I/O Performance           47           5.68        1.12           0.08
PDSF Disk Configuration and I/O Performance          38           5.63        1.30            -
SP Ability to Run Interactively                     150           5.47        1.58           0.76
SP Batch Wait Time                                  175           5.41        1.47           0.49
T3E Batch Wait Time                                  61           5.23        1.52           0.26
PVP Batch Wait Time                                  35           4.77        1.44           0.21

(A dash means no 2001 comparison is available.)

 

 

 

Satisfaction - Compute Platforms (sorted by Platform):

Topic                                          No. of Resp.   Avg. Score   Std. Dev.   Change from 2001
SP Uptime                                           176           6.56        0.81           1.03
SP Overall                                          187           6.38        0.87           0.56
SP Disk Configuration and I/O Performance           143           5.97        1.25           0.30
SP Queue Structure                                  165           5.92        1.07           0.73
SP Ability to Run Interactively                     150           5.47        1.58           0.76
SP Batch Wait Time                                  175           5.41        1.47           0.49

PDSF Uptime                                          41           6.51        0.78            -
PDSF Overall                                         46           6.26        0.93            -
PDSF Ability to Run Interactively                    39           6.18        0.97            -
PDSF Queue Structure                                 39           5.97        1.14            -
PDSF Batch Wait Time                                 39           5.74        1.21            -
PDSF Disk Configuration and I/O Performance          38           5.63        1.30            -

T3E Uptime                                           62           6.48        0.84           0.26
T3E Overall                                          72           6.11        1.03          -0.12
T3E Queue Structure                                  55           5.76        1.09           0.40
T3E Ability to Run Interactively                     56           5.73        1.37           0.09
T3E Disk Configuration and I/O Performance           47           5.68        1.12           0.08
T3E Batch Wait Time                                  61           5.23        1.52           0.26

PVP Uptime                                           39           6.31        0.89          -0.14
PVP Overall                                          47           6.06        0.94          -0.08
PVP Disk Configuration and I/O Performance           27           6.00        1.11           0.00
PVP Ability to Run Interactively                     33           5.82        1.26          -0.16
PVP Queue Structure                                  36           5.69        1.09           0.28
PVP Batch Wait Time                                  35           4.77        1.44           0.21

 

 

 

Max Processors Used and Max Code Can Effectively Use:

Topic                          No. of Resp.   Average   Std. Dev.   Change from 2001
SP Processors Can Use               115        546.25     948.16        -204.75
T3E Processors Can Use               36        520.00    1179.09         164.00
Max SP Processors Used              151        171.38     272.76         -30.62
Max T3E Processors Used              48        149.35     160.08          16.35
PDSF Processors Can Use              22         97.50     241.58            -
Max PDSF Processors Used             22         35.14      81.22            -
PVP Processors Can Use               19          6.95       7.90         -23.05
Max PVP Processors Used              22          5.45       5.18          -4.55

 

 

 

Satisfaction - HPSS:

Topic             No. of Resp.   Average   Std. Dev.   Change from 2001
Reliability            127         6.51       0.94          -0.12
HPSS Overall           142         6.39       0.91          -0.11
Uptime                 125         6.37       1.04           0.04
Performance            126         6.35       1.06          -0.01
User Interface         130         5.95       1.30          -0.07

 

 

 

Satisfaction - Servers:

Server            No. of Resp.   Average   Std. Dev.   Change from 2001
Escher (viz)             9          5.44       1.13           0.36
Newton (math)            8          5.38       1.30          -0.09

 


Summary of Hardware Comments

Comments on NERSC's IBM SP:   54 responses

  21  Good machine
  15  Queue issues
   9  Needs more resources / too slow
   7  Provide more interactive services
   6  Hard to use / would like additional features
   2  Stability issues
   2  Disk issues
   1  Need cluster computing at NERSC

Comments on NERSC's Cray T3E:   18 responses

  11  Good machine / sorry to see it go
   2  Mixed evaluation
   2  Provide better interactive services
   1  Queue issues

Comments on NERSC's Cray PVP Cluster:   14 responses

  11  Good machine / sorry to see it go / need PVP resource
   2  Improve batch turnaround time

Comments on NERSC's PDSF Cluster:   9 responses

   4  Good system
   2  Queue and priority issues
   2  Disk issues
   1  Would like new functionality
   1  Needs more resources

Comments on NERSC's HPSS Storage System:   31 responses

  16  Good system
   6  Don't like the down times / downs need to be handled more gracefully
   4  Performance improvements needed
   4  Would like new functionality
   1  Hard to use
   1  Authentication issues

Comments about NERSC's auxiliary servers:   3 responses

 


Comments on NERSC's IBM SP:   54 responses

Good machine:

Excellent platform for efficient parallel computing. Among the best managed supercomputers, if not the best, we have pursued our work on!

Excellent support. We've gotten some custom mods to the system for our use, which have been very helpful. Consultants are always available and helpful. Excellent collaboration.

A truly great machine. Extremely well run. ... Worldwide the BEST machine my group uses.

It is a very good machine. But too many people are using it.

Very good machine as set up; my research relies heavily on it.

This has been a very productive machine for us and the support of our efforts has been excellent.

Always has been up when I have wanted to test my code. I like that I can get jobs in for debugging purposes easily.

I've been incredibly happy with the SP. Batch queue turnaround times are very quick, and one can usually get away with the low queue on weekends. We've investigated efficiency quite extensively and found that we can run on 2-4 nodes fairly effectively and have run on up to 16 nodes (16 processors per node, in all cases).

We are rewriting our code to effectively use 64+ processors and then we will see if we are able to get our jobs through in a timely manner. So far, using one node, we have been happy.

I am very happy with using IBM SP

... The system had fantastic uptime, I got a lot of jobs done. The system was very stable and had high production quality unlike some other systems, in particular clusters. The maximum number of processors I used on seaborg is fairly low, since I did mostly parameter studies of smaller runs and other members of the group did large simulations. The code has been running on more than 1500 processors already.

I think the machine is great. I plan to do more performance analysis in the future.

I think seaborg provides a very friendly user interface.

Great machine - many processors and sufficient memory for each.

Everything is good ...

The best experience I have had with any system! ...

It works very well. ...

Great machine. ...

The system is excellent. ...

Perfect for my needs (testing scalability of numerical algorithms).

very efficient system with well provided service

Queue issues:

... Also, it would be great to have a slightly faster queue time for regular_long jobs (current wait is about 3 days).

... though I did put in a request under the Big Splash initiative to get another regular_long job to go through (2 at a time) and it hasn't been carried out yet.

My code parallelizes well for many processors, but I have only used up to 80 processors in order to decrease the waiting time.

... (1) one really long queue would be handy (*) ...

A 48-hr queue would be desirable for some large jobs.

Job running is great, but the walltime hard limit is too "hard". I do not know if there are techniques to flush memory data to disk when jobs are killed. That's very important to my time-consuming project. ...

The 8 hour limit on the regular queue is too short.

Queue waits for jobs larger than 32 procs have been horrible (up to 7 days for a 128 processor job). ...

The queues have gotten really bad, especially if you have something on low priority. ...

Allocation of processors in chunks smaller than 16 would be useful. More and longer regular_long time should be allocated.

I was very impressed by the short times I spent on the queue, but the short maximum run-time limits really limit the applicability of the SP for my systems of interest.

Checkpoint restart is needed if the system goes down. Longer time limits are needed. I have been trying to use Gaussian 98, which typically runs for 4 days, so the 24 hr limit is not enough.

My jobs run most efficiently on 32 processors (or 16) over several days rather than short periods of time on a large number of processors. When the job is interrupted data is lost, so when I restart I lose information. It would be most efficient if I could access the nodes for extended periods, with a low number of CPUs.

It would be nice to have a queue for serial jobs, where they could share a node without being charged for the entire node.

... It is a handicap to be charged for all 16 processors even if you use only 1.

Needs more resources / too slow:

The individual processors are really slow - slow compared to every other chip I use, Athlon, P4, etc. This machine is a real dog.

The system is excellent. However, I wish that NERSC had a much larger and more powerful computer. The problems I would most like to attack require two to three orders of magnitude more computing power than is currently available at NERSC. (In responding to the last question, I indicated the maximum number of processors my code can use per job on Seaborg. On a more powerful machine it could effectively use thousands of processors.)

... Individual processor speed is relatively slow compared with other parallel systems I use (mostly PC based LINUX clusters with 2.2 GHz procs.) However, the stability of the system is better than most computers that I have used.

Processor by processor, it is much slower than Intel P4.

... But the processors are getting slow -- my code runs 30% faster on a 1.4 GHz Athlon.

CPU performance is fine. Communication performance restricts the number of nodes our codes can use effectively. At the number of processors we use for production, the codes typically spend 70% of their time communicating and only 30% of their time calculating.

I have a code which uses a lot of FFTs and thus has a decent amount of global communication. The only problem that I have with the IBM SP is that communication between the nodes is too slow. Thus, it takes the same amount of time to run my code on the Cray T3E as it does to run on the IBM SP.

Try to get more memory

... For future improvements, please increase I/O speed, it's a limiter in many of my jobs (or increase memory to 16GB/CPU, which is obviously too expensive). ...

Provide more interactive services:

Interactive jobs are effectively impossible to run, even small serial jobs will not launch during the day.

Debugging code is currently VERY FRUSTRATING because of LACK OF INTERACTIVE ACCESS. You can't run totalview unless the job clears the loadleveler, which has become dramatically more difficult in the last couple months.

I find it very difficult to run interactively and debug. There seems to be only a few hours per day when I can get any interactive work done.

I wish it was easier to get a node when running interactively. I realize that most of the nodes are running batch jobs, but it might make sense to allocate more nodes for interactive use.

... (4) sometimes the interactive jobs are rejected... why? can some rationale be given to the user other than the blanket error message? (*) ...

A few more processors for interactive use would be helpful.

Need to run (debug) interactive jobs on > 16 processors. ...

Hard to use / would like additional features:

I just don't like IBM.

Need a mechanism to inquire on the remaining wall clock time for a job. When jobs are terminated by the system for exceeding wall clock time, a signal should be sent to the job with some time remaining so it can prepare.

... I would like the default shell to be C shell.

i need to understand the POE better in order to optimize my code better. ...

... Nice machine overall, but I miss some of the software that was on the T3E (Cray Totalview and the superior f90 compiler).

Great machine. It would be nice to have (not in order of importance; * means more important than the rest): (1) one really long queue would be handy (*) (2) a program that tries to estimate when a queue job will run (very hard I realize but still useful) (3) when submitting a job, the llsubmit tells you how many units it will use (max) so you know what you're getting into (4) sometimes the interactive jobs are rejected... why? can some rationale be given to the user other than the blanket error message? (*) (5) PLEASE REFORM THE ACCOUNTING SYSTEM OF MPP HOURS WITH ALL THE CRAZY FACTORS (2.5 is it or not?) (**)
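One request above is for a signal delivered shortly before a job is killed for exceeding its wall-clock limit, so the code can prepare. The job-side half of such a mechanism is a signal handler that writes a checkpoint. A minimal sketch; the choice of SIGTERM and the existence of a grace period are scheduler-dependent assumptions, not documented SP behavior:

```python
import json
import signal

state = {"step": 0}

def checkpoint(signum, frame):
    # Dump whatever is needed to restart, then exit cleanly so the
    # scheduler does not have to hard-kill the job.
    with open("checkpoint.json", "w") as f:
        json.dump(state, f)
    raise SystemExit(0)

# Assumption: the batch system sends SIGTERM with some grace period
# before the hard kill; check the site's batch documentation.
signal.signal(signal.SIGTERM, checkpoint)

while True:                      # main compute loop
    state["step"] += 1
    if state["step"] >= 1000:    # stand-in for real work
        break
```

On restart, the job would read `checkpoint.json` (a hypothetical file name) and resume from the saved step instead of losing the run.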

Stability issues:

We have had some problems with jobs with high I/O demands crashing; I don't know if the stability of these operations could be improved or not. ...

When can we get the new version of the operating system installed in order to run MPI 32 without hanging the code on a large number of processors?

Disk issues:

... The big drawback is that there is no backup for user home directories.

The number of inodes per user was too small. ...

Need cluster computing at NERSC:

My main complaint is that this 2000+ processor supercomputer is being used in a broad user-based time-share environment as 10-20 clusters. The fractional use with > 512 (or even > 256) is too small. We are paying dearly for unused connectivity and the system is not being used properly. For the same hardware price, NERSC could be a "cluster center" offering 3-5x more FLOPs per year. The users need a computer center with more capacity (not capability). If we had a "cluster center", a machine like seaborg could be freed up for its best use ... codes that need and can use > 1024 (or 2048) processors. The expansion ratio (turn-around time/actual run time) has much improved this year (generally below 2; it used to be often > 10); but next year seaborg is going to be overloaded again.

Other:

Max processors depends on the job. [comment about the survey question]

I would welcome examples on how to improve application performance on the SP with increasing numbers of processors.

Just starting / don't use:

Usage limited so far, not much to say.

Have not used it yet.

We are still working on putting MPI into our model. Once we have that completed we can use 32, maybe 64, processors to do our model runs.

 


Comments on NERSC's Cray T3E:   18 responses

Good machine / sorry to see it go:

I understand that it has to go, but I am sad about it.

Better balance of communication and computation performance than the IBM SP.

Good machine. Shame to see it go.

Though cache limitations were a problem, overall an excellent machine -- sorry to see it go.

Saying goodbye to an excellent resource.

Sad to see it go. (Especially now that everyone has switched over to seaborg).

I like the T3E.

With the demise of the T3E seaborg loading is going to get worse.

I wish it had more memory per node, faster nodes and lived forever.

Operating environment is easier than SP.

good system, more memory is needed.

Mixed evaluation:

Single-processor speeds on this machine have fallen behind the curve; we have ceased development on this platform. Communication speeds are excellent, however.

Not competitive with the SP; otherwise fine for parallel computing.

Provide better interactive services:

Too little time is available for interactive mode.

Even though interactive jobs have top priority, MCurie's list of started jobs would get filled with long batch jobs, and interactive jobs couldn't get in. In the end I gave up and moved to Seaborg.

Queue issues:

128 processor queue can be very slow -- up to 1 week.

Don't use:

I do not use T3E nowadays.

Other:

I wish the default shell would be C shell

 


Comments on NERSC's Cray PVP Cluster: 14 responses

Good machine / sorry to see it go / need PVP resource:

It is unfortunate that no viable replacement for the vector machines is planned. By a viable replacement I mean a machine which can run straight fortran or C codes without having to resort to something like MPI to use it effectively. The current PVP cluster is far too slow, which has effectively slowed down those projects that we have assigned to it.

Not competitive with the SP; otherwise fine for parallel computing.

interactive use is very important

Saying goodbye to a useful and excellent resource.

I'm sorry to see the Crays go. One of my codes is optimized for these vector machines and runs quite efficiently on them. Now we will have to run it on the IBM SP, where it will not run efficiently at all unless parallelized.

I wish that there were going to be a replacement for the PVP cluster.

Wish it were around a bit longer

Hope it can continue to exist in nersc.

My codes are legacy codes from ancient NERSC (MFENET) days. They now run on local UNIX machines but the larger mesh jobs run only on the NERSC PVP cluster.

Overall I like everything about the PVP Cluster ...

With the demise of the PVP, the loading of seaborg is going to get worse ... there is now no place to run the vector codes.

Improve turnaround time:

Batch turnaround can be slow. Killeen turnaround is usually good.

Overall I like everything about the PVP Cluster except for the long wait times in the queue, and the bad presentation of queue status information so I can gauge how long my wait will be.

Don't use:

I do not use

Never used it.

 


Comments on NERSC's PDSF Cluster:   9 responses

Good system:

runs great. ...

Generally excellent support and service. What issues do arise are dealt with promptly and professionally.

Keep up the good work

beautiful! but sometimes misused/stalled by infinitely stupid users.

Queue and priority issues:

*** Obscure priority setting within STAR - Not clear why individual people get higher priority. - Not clear what is going on in starofl. Embedding gets always top priority over any kind of analysis made by individual user. + intervention of LBL staff in setting embedding/simulation priority should be stopped. ...

NERSC response: Rules for calculating dynamic priorities are explained in the STAR tutorial   (05/03/02). They are based on the user's group share, the number of jobs the user currently has in execution and the total time used by the running jobs. Shares for all the STAR users (with the exception of starofl and kaneta) are equal (see bhpart -r). starofl is designated to run production that is used by the whole experiment and kaneta does DST filtering. The justification is that no analysis could be completed without embedding and many users run on pico-DST's produced by Masashi. STAR users should direct their comments regarding share settings within STAR to its computing leader (Jerome Laurent - jeromel@bnl.gov). NERSC staff does not set policies on the subdivision of shares within the STAR group.

The short queue is 1 hour, the medium queue is 24 hours, and the long queue is 120 hours. There's only a factor of five difference between the long and medium queues, while there's a factor of 24 between the short and medium queues. An intermediate queue of 2 or 3 hours would be useful, as it is short enough to be completed in a day but can encompass jobs that take a bit more than an hour.

NERSC response: To answer that question we have to ask another one: how and under what circumstances would users benefit from the introduction of this additional queue? The guess is that the user hopes for a shorter waiting time if his or her job were submitted to such a queue.

The PDSF cluster works under fair share settings on the cluster level. This model allows groups of varying size and "wealth" to share the facility while minimizing the amount of unhappiness among the users. In this model each user is assigned a dynamic priority based on the user's group share, the subdivision of shares within a group (decided by the groups), the number of jobs the user currently has executing, and the total time used by the user's running jobs. Jobs go into execution based on that dynamic priority, and only if two users have identical dynamic priority is the queue priority taken into account. So the queue priority is of secondary importance in this model unless a pool of nodes is dedicated to running a given queue.

We use the queue length to manage the frequency with which job slots open. The short queue runs exclusively on 30 CPU's (as well as on all the other CPU's sharing with medium and low). This means that on average a slot for a short job opens every 2 minutes. These settings provide for a reasonable debugging frequency at the expense of those 30 nodes being idle when there are no short jobs.

We created a medium queue based on an analysis of average job length and in order to provide a reasonable waiting time for the "High Bandwidth" nodes, which only run the short and the medium queues. We have 84 such CPUs. So on average (if those nodes were running only medium and no short jobs) a slot opens every 15 minutes. In practice a fair number of those nodes run short jobs too, so the frequency is even better. But then again, in the absence of short and medium jobs, those nodes idle even if we have a long "long" queue.

Introducing one more queue would have a real effect only if we allocated a group of nodes that would run that semi-medium or short queue exclusively. That would only further increase resource fragmentation and encourage users to game the system by subdividing jobs, which only increases the LSF overhead and wastes resources. We closely monitor the cluster load and job length, and if a need shows up we will adjust queue lengths and node assignments, but we do not plan on adding more queues.
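The fair-share model NERSC describes derives each user's dynamic priority from three inputs: the group share, the number of jobs the user has running, and the accumulated run time. A toy sketch of how such inputs might interact; the real LSF formula differs, and the weights here are invented for illustration:

```python
def dynamic_priority(group_share, running_jobs, run_seconds,
                     w_jobs=1.0, w_time=1.0 / 3600.0):
    """Toy fair-share priority: starts at the group share and decays as
    the user consumes job slots and CPU time.  Not the actual LSF
    calculation; it only shows the direction each input pushes."""
    return group_share / (1.0 + w_jobs * running_jobs
                          + w_time * run_seconds)

# A user with many running jobs ranks below an idle user of equal share,
# so the idle user's next job dispatches first.
idle = dynamic_priority(10.0, 0, 0)
busy = dynamic_priority(10.0, 5, 7200)
print(idle > busy)  # prints True
```

This is why queue priority is secondary in the model above: dispatch order is decided by this per-user value, and the queue only breaks ties.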

Disk issues:

... Not clear how the disk space is managed ***

NERSC response: Disk vaults are assigned to experiments based on their financial contribution. STAR subdivided their disk space between the physics working groups and Doug Olson (DLOlson@lbl.gov) should be contacted for more details. A list of disk vaults with their current assignments is available at: http://pdsf.nersc.gov/hardware/machines/pdsfdv.html. PDSF staff (by request from the experiments) does not interfere with how the disk space is subdivided between the users but if experiments wish we can run a cleanup script (details set by the requester). Currently this is in place on pdsfdv15.

Data vaults are not practical to use => IO requirements are just a patch, not a true solution.

NERSC response: Indeed, disk vaults have poor performance while serving multiple clients. Very recently we converted them from software to hardware RAID, which improved their bandwidth. We also brought in an alternative solution for testing, the so-called "beta" system. PDSF users loved the performance (instead of a couple of tens, it could serve a couple of hundred clients without loss in performance), but such systems are much more expensive (currently a factor of 4 at least) and in the end the experiments decided to go with a cheaper hardware RAID solution. We are watching the market all the time and bring in various solutions for testing (like "beta" and "Blue Arc"), and if anything that the experiments can afford comes by, we will purchase it.

*** High bandwidth node usage limited to very specific tasks.

NERSC response: High Bandwidth nodes (compute nodes with a large local disks - 280GB in addition to 10GB /scratch) are all in the medium queue and access is governed by the same set of rules as for any other compute node in the medium queue. The only restriction is the allocation of the 280GB disk where only the starofl account has write privileges. That is necessitated by the type of analysis STAR does and the experiment financed purchase of this disk space. If you are from STAR and do not agree with this policy, please contact Doug Olson (DLOlson@lbl.gov).

Need more disk space on AFS. AFS should be more reliable. I find AFS is very reliable from my laptop. It is much less so from PDSF.

NERSC response: PDSF AFS problems result from AFS cache corruption. It is much easier to preserve cache integrity if there is only one user (as on a laptop) than tens of users (as on PDSF). Heavy system use exposes obscure bugs not touched upon during single-user access. To improve AFS performance on PDSF we are moving away from the knfs gateway model for the interactive nodes. The AFS software for Linux has matured enough that we just recently (the week of 10/14/02) installed local AFS clients on the individual pdsfint nodes. This should greatly reduce AFS problems and boost its performance.

Would like new functionality:

... however, using the new Intel Fortran compiler might be useful, since it increases speed by a factor of 2 (or more), at least on Intel CPUs.

NERSC response: The two largest PDSF user groups require the pgf compiler, so we can look at an Intel Fortran license as an addition and not a replacement. Additionally, Intel Fortran does not work with the Totalview debugger, currently the only decent option for debugging jobs that are a mixture of Fortran and C++ on Linux. Also, Intel licenses are pricey, but we will check what kind of user base there is for this compiler and see whether this is something we can afford.

Needs more resources:

Buy more hardware ! when STAR is running a large number of jobs (which is almost all the time!) it's a pain for other users..

NERSC response: PDSF is a form of cooperative. NERSC helps to finance it (~15%). All the groups get access that is proportional to their financial contribution. These "shares" can be checked by issuing a bhpart command on any of the interactive nodes. STAR is the most significant contributor, thus it gets a high share. However, the system is not quite full all the time - please check our record for the past year at: http://www-pdsf.nersc.gov/stats/showgraph.shtml?merged-grpadmin.gif .
We did purchase 60 compute nodes (120 CPUs) recently and we are introducing them into production right now (a step up on the magenta line in http://www-pdsf.nersc.gov/stats/showgraph.shtml?lsfstats.gif ).
Also, it helps to look at this issue in a different way. In times of high demand everybody is getting what they paid for (their share), and when STAR is not running, other groups can use the resource "for free".

Don't use:

What is it?

NERSC response: PDSF is a networked distributed computing environment used to meet the detector simulation and data analysis requirements of large-scale High Energy Physics (HEP) and Nuclear Science (NS) investigations. For updated information about the facility, check out the PDSF Home Page.


Comments on NERSC's HPSS Storage System:   31 responses

Good system:

Generous space, quick network.

I really like hsi. ...

Gets the job done.

Flawless. Extremely useful.

Excellent performance for the most part, keep it up!

Not very useful to me at present, but it works just fine.

NERSC has one of the more stable HPSS systems, compared to LANL/ORNL/NCAR

I really love being able to use hsi, extremely user friendly.

This is a big improvement over the old system. I'm impressed with its speed.

Really nice and fast. ...

HPSS is terrific

Easy to use and very efficient. We went over our allotted HPSS allocation, but thank you for being flexible about this. We plan to use this storage for all our simulations for years to come.

Everything is great, ...

It works well.

After the PVP cluster disappears the HPSS will still be my primary storage and archive resource.

very useful system with high reliability

Don't like the down times / downs need to be handled more gracefully:

... Downtime is at an inconvenient time of day, at least in EST.

The weekly downtime on Tuesdays from 9-noon PST is an annoyance as it occurs during the workday for users throughout the US. It would seem to make more sense to schedule it for a time that takes advantage of time differences --- e.g., afternoon on the west coast --- to minimize the disruptions to users.

... except that it goes down for 3 hours right in the middle of my Eastern Time day every Tuesday. How annoying.

It is unfortunate HPSS requires weekly maintenance while other systems are up. This comment is not specific to NERSC.

I don't like the downtimes during working hours.

The storage system does not always respond. This is fatal when the data for batch jobs is too large to fit on the local disk. I had several of my batch jobs hang while trying to copy the data from the HPSS system to temporary disk space.

Performance improvements needed:

cput for recursively moving directories needs to be improved both in speed and in reliability for large directories.

scp can be very slow for large files

Commands that do not require file access or transfer are pretty slow, e.g. listing files or moving them to a different directory.

A large file (~10 GB) is hard to get from a local desktop.

Would like new functionality:

It would be very helpful if HPSS worked in the background, buffered by a huge hard disk.

would like to try srb as combined interface to hpss and metadata catalog

... What would be nice is a command that updates entire directory structures by comparing file times and writes the newer ones to disk (like a synchronization). Currently I use the cput command, but it doesn't quite do this. Having such a command would be a great help (maybe it can already be done with a fancy option which I don't know about).

It would be nice to navigate in the HPSS file system directly from PDSF via NFS (of course not to copy files but to look at the directory structure). This is done at CERN.
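The synchronization behavior requested above (write only files newer than the stored copy) reduces to a timestamp comparison per file. A minimal local-disk sketch of that logic, with ordinary directories standing in for the HPSS archive; real archive access would go through hsi or pftp, whose interfaces are not modeled here:

```python
import os
import shutil

def sync_newer(src_dir, dst_dir):
    """Copy only files that are missing from dst_dir or whose
    modification time is newer than the copy there.  Local directories
    stand in for the HPSS archive in this sketch."""
    copied = []
    for name in os.listdir(src_dir):
        src = os.path.join(src_dir, name)
        dst = os.path.join(dst_dir, name)
        if not os.path.isfile(src):
            continue
        if not os.path.exists(dst) or os.path.getmtime(src) > os.path.getmtime(dst):
            shutil.copy2(src, dst)   # copy2 preserves the timestamp
            copied.append(name)
    return copied
```

Because `copy2` preserves timestamps, a second pass over an unchanged tree copies nothing, which is exactly the behavior the commenter wants from cput.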

Hard to use:

too cumbersome to use effectively

Authentication issues:

The new authentication system for HPSS seems to be incompatible with some Windows OS secure shell software. Since the change was made I have not been able to connect using my laptop. I am still trying to get this fixed with the help of the support people here at LLNL, but no good news so far.

Other:

When questions arise and NERSC is contacted for guidance the consultants always come across as condescending. Is this intentional and for what purpose?

Don't use / don't need:

We are not using this at this time.

 


Comments about NERSC's auxiliary servers:   3 responses

Is there a quick way on Escher to convert a PowerPoint picture into a PS file without using "xv"?

I have never used these servers; would I need a separate account on them?

Never used.