DOE and NERSC Scaling Initiatives
|Very Useful||2.5 - 3.0|
|Somewhat Useful||1.5 - 2.4|
How useful were the following scaling initiatives?
|Initiative||No. "don't use" Responses||No. of other Responses||Average||Std. Dev.||Change from 2002||Change from 2001|
|large scale job reimbursement program||9||24||2.75||0.61||NA||NA|
|consulting scaling services||66||32||2.59||0.50||NA||NA|
|new Seaborg batch class structure||48||115||2.17||0.72||NA||NA|
Comments on NERSC's scaling initiatives:
[Read all 39 comments]
|20||Good scaling initiatives|
|11||Wrong approach|
|4||Didn't help / not ready to use / not interested|
|2||Startup projects weren't eligible|
|2||Users need more technical info|
|1||Sometimes hurts smaller jobs / sometimes OK|
Did you participate in the Large Job Reimbursement Project?
|No. of Responses|
Have you or your project submitted information to the Applications Performance Matrix?
|No. of Responses|
Have you used poe+?
|No. of Responses|
Do you plan to submit information to the Applications Performance Matrix in the coming year?
|No. of Responses|
If you don't plan on submitting to the Matrix next year, why not?
[Read all 36 responses]
|13||Don't know about the Matrix / not my role to do this|
|7||Codes don't scale well / don't have large codes|
|6||Question doesn't apply to PDSF users|
|5||It's inappropriate / don't have enough time|
|2||Have already submitted and no changes expected next year|
|3||Other|
Comments on NERSC's scaling initiatives: 39 responses
- Good scaling initiatives: 20 responses
Please push this project as much as you can. This type of consulting is very important if one goes to the limit of a system in terms of #processors and sustained performance.
I think, it's a good idea. It promotes more efficient use of the NERSC resources.
We were not quite ready to take advantage of the initiatives, but they are a good idea.
- Ultimately, I think this is a good idea and will lead to better architectures in the future, as well as allowing us to make optimal use of the systems we have today.
- I don't think anything prompts a systems vendor to fix issues better than having a clear characterization of those issues.
always good to think about that.
The restructuring of the LoadLeveler classes on Seaborg has provided us a leap in our progress in high resolution climate simulations.
Thought the program was very successful and was very beneficial to our group.
These services are very great to get our work done quickly.
It provides more incentive to improve the scalability of our codes.
I favor the shift to large-CPU, large-RAM jobs.
the initiative is great!!! ...
My research particularly benefited from the Large Job Reimbursement Project, which helped us to: 1) test our code using a large number of processors, 2) run very long simulations at no cost.
Interesting, I appreciate it.
... That having been said, I am glad that the queues are becoming more friendly to large jobs.
It is great, ...
I think it is appropriate since NERSC is not in the workstation business, it is for serious users of computer power.
It is the right thing to do. Usage efficiency and proficiency are important.
Eventually, I will want to run jobs on more nodes than at present. I expect these initiatives to be very helpful. As a user of fewer nodes, I don't really notice a difference, positive or negative, in my productivity due to these initiatives.
It is a very important step in the development of scientific computing, since parallelization is the trend, especially for massive computing techniques.
NERSC response: Thanks for the feedback. The NERSC scaling initiative has thus far been positive for many projects.
- Wrong approach: 11 responses
I believe that they are totally misguided. The emphasis should be on maximizing the SCIENTIFIC output from NERSC. If the best way to do this is for the user to run 100 1-node jobs at a time rather than 1 100-node job, every effort should be made to accommodate him/her. Even for codes which do run well on large numbers of nodes, it makes little sense for a user to use 1/8 or more of the system, unless he/she has an allocation of 1/8 or more of the available CPU time. Even then it might be more useful to run for example 4 jobs using 1/32 of the machine at a time rather than 1 job using 1/8 of the machine. The IBM SP is not designed for fine-grained parallelism requiring communication every 10-1000 floating-point operations. In the final analysis, it should be up to the users to decide how they use their allocations. Most, if not all of us, will choose a usage pattern which maximizes our scientific output. Remember that most of us are in computational science, not in computer science. We are interested in advancing our own fields of research, not in obtaining Gordon Bell awards. ...
If it were not for the fact that our FY2003 allocations are nearly exhausted, we would be complaining loudly because with the new Class (Queue) structure which favours "large" jobs, we can't get our work done.
Don't freeze out the small-to-moderate user --- the science/CPU hour is often higher for the moderate user
Although we successfully tested large jobs, I do not believe these jobs could serve our scientific goals well. I could easily see using 8 nodes of Seaborg for our activation energy barriers determination jobs, but using more nodes than that would not be efficient or necessary. In fact, I see a significant threat to excellent quality supercomputing research by expanding in the direction of using more and more nodes per job. I suspect that for a good fraction of the work pursued at Seaborg, although excellent, one cannot expect linear scaling to very many nodes because of the very nature of the problems handled. We believe that 25% of the resources devoted to these super large jobs is already too much.
I have a negative view of NERSC's scaling initiatives. I understand that NERSC needs to propagate itself and to justify the purchase of a new, larger machine. But in my opinion not all good science is done on thousands of processors and I feel penalized both in time available and priority in the queues because I use hundreds of processors and not thousands.
There is always a tension between massive users and those who want to run smaller jobs. While many researchers use a single node (16 processors), I think it would not be cost effective for DOE to pay them to run on their own machines. ...
The new classes mean that jobs not using a lot of processors but still doing useful state-of-the-art physics calculations are not getting through as much as before. In fact, requesting fat nodes may mean a wait of a few weeks. Not all good physics is done by using 6000 processors.
NERSC response: The queue policies for how nodes are selected were changed this summer (2003) in an effort to improve access to large memory ("fat") nodes. The larger memory nodes are now the last to be selected for batch work unless the job specifically requests the large memory resource. If your work requires the 64 gigabyte memory nodes we encourage you to contact NERSC consultants in order to discuss your specific requirements.
For our work, this initiative has been counterproductive. We perform very challenging time-dependent computations for stiff systems of nonlinear PDEs, which require the solution of ill-conditioned matrices at every time-step. Although we are using state-of-the-art parallel linear algebra software (SuperLU_DIST), scaling to increase speed for a given problem has limits. Furthermore, when solving initial-value problems, the time-dimension is completely excluded from any 'domain' decomposition. Our computations typically scale well to 100-200 processors. This leaves us in a middle ground, where the problems are too large for local Linux clusters and too small to qualify for the NERSC queues that have decent throughput. My opinion is that it is unfair for NERSC to have the new priority initiative apply to all nodes of the very large flagship machine, since it weights the "trivially parallel" applications above the more challenging computations, which have required a more serious effort to achieve parallelism.
Excessive emphasis is being placed on this.
It is detrimental to research groups like ours. We need NERSC resources to run 100-1000 jobs or so at a time (in serial mode with one processor per job) or 10-20 jobs with 2-3 nodes per job. There are no other resources available to us that would enable us to do this. On the other hand, our jobs are no longer as favored in the queue since they are smaller scale jobs.
I am unsatisfied with the scaling initiatives. Quite a lot of my calculations require a small number of nodes with long wallclock time (due to the code I used), which is slow in the queue. Good science often comes from small, routine calculations, not from massively parallel jobs.
First, I understand the reasons for the initiatives. But it may not be the most cost-effective way to use the computing resources. For example, we only need 32 nodes for eight hours to complete a simulation. Using 32 processors is most efficient because we simulate the collision of two beams and each beam is on a single node (16 processors). But we need to repeat the same simulation with thousands of different parameters to help optimize the performance of the collider. In this situation, if we are forced to use more processors for each simulation, it actually wastes more resources.
NERSC response: The focus of the scaling initiative is not meant to judge what is or is not good science. NERSC realizes that scaling up the parallelism in scientific codes is not always possible, can require significant effort, and is not in every case productive. Where it is possible and productive we stand ready to assist and promote highly parallel computations which make full use of the resources we provide. Providing opportunities which encourage researchers to explore the parallel scaling of their codes has been useful to many but not all projects and codes.
It is the inevitable outcome of advances in computing that computations which were once difficult should become easier. As the domain of what is considered a workstation sized problem expands, so should the realm of capability computing. The best matching of compute resources to scientific problems is a moving target and we try our best at providing resources which meet NERSC users' needs.
- Didn't help / not ready to use / not interested: 4 responses
Doesn't help us, we would need much faster I/O to scale up to more processors at good efficiency. That may change in the future with new codes.
we're not yet ready to use these, but we're gearing up.
I am more interested in getting the results than exploring the scalability. ...
... perhaps another person in my group used this but I did not. I plan to do this in the future, but currently my runs which scale to such large processor numbers take an extremely long time to run, hence we have not performed these long runs. There is the possibility that further performance and scaling work on our code could increase our ability to use larger processor numbers.
- Startup projects weren't eligible: 2 responses
As a startup account, we run jobs on 2000 CPUs but could not be part of the Reimbursement Project.
I am currently trying to scale my code to 4096 P, however the sheer cost of start-up alone means that my small ERCAP allocation was exhausted rapidly when I began testing on 4096 P. It would be useful to have funds available at all times for large scale testing.
- Users need more technical info: 2 responses
... I would like consultants to know more about the software available at NERSC, e.g. compilers, parallelization libraries, and mathematical libraries.
... the minus is some lack of information (detailed surveys) about performance and scaling characteristics of currently available chemistry codes
- Sometimes hurts smaller jobs / sometimes OK: 1 response
This month with everyone trying to use up their computer time the turn around for smaller node jobs is now on the order of a number of days while those using many nodes get reasonable turn around. The rest of the year letting them have much higher turn around probably doesn't hurt those of us who can't use 32 or 64 nodes.
If you don't plan on submitting to the Matrix next year, why not? 36 responses
- Don't know about the Matrix / not my role to do this: 13 responses
I am new to using NERSC and I do not know what it is about.
I don't really understand what it is.
Not familiar with the Applications Performance Matrix
I'm not one of the lead PI's on this project. They would have done this ... not me.
My repo may well do so. I'm not certain of our plans in this regard.
I don't understand what it is
I don't know what this is about. I do use poe but have no idea what poe+ is.
More exactly, probably not. The reason is that I just have not looked at Application Performance Matrix and don't know much about it.
I was not aware of that service
It would be more appropriate for others in my group to do this.
I'm unaware of what it is.
I don't know how to do it.
NERSC response: Submitting data to the Applications Performance Matrix is done via a submission form on the web.
- Codes don't scale well / don't have large codes: 7 responses
our code scales badly using more than 16 nodes
We were somewhat surprised by the results of our benchmarks and would like help identifying and eliminating bottlenecks.
I have to learn more about the APM before I can give an answer. In any case, I believe my applications are too small.
I do not currently conduct large-scale simulations, just diagnostics on the output of previous simulations conducted by others (escher). Large scale simulations would require a shift in our funding paradigm.
I used hpmcount since I used a serial or one node job using 16 processors. I will try to do a combined OMP/MPI this coming year. We also have some molecular dynamics calculations to run that apparently scale well in MPI.
POE isn't necessary for optimizing performance for my codes.
Applications I am running right now are not large enough to make this relevant.
- Question doesn't apply to PDSF users: 6 responses
I wasn't aware that such a system existed. Also, I have the feeling that these 'measurements' have a fundamental flaw: often things like 'efficiency' of the application are measured by comparing the CPU time to the wall clock time. This makes codes that calculate theoretical models, which often do very little IO, appear very efficient, while codes that reconstruct experimental data and do much IO appear less efficient. Within experimental reconstruction software there is also a wide range of codes: some large experiments for instance have to do tracking on almost all events and require a lot of CPU, while other experiments need to scan through many background events without doing very much except for IO in order to find a signal event.
Primary processing of our data is done; I am not involved in the microDST construction for individual analyses, so I am not running large jobs at the moment.
I don't think it applies to me personally - would be better suited to a study of the STAR simulations/reconstruction usage probably.
I think it applies to MPP and not to PDSF style computing.
Our applications are not large enough
The ATLAS code doesn't support true parallelization at this point, only breaking the jobs into pieces and submitting the pieces to multiple machines.
- It's inappropriate / don't have enough time: 5 responses
We are benchmarking ourselves. The performance of our application could be made to vary from very bad to very good. Such results are not representative.
I would like to, but have very limited time available for this type of activity.
1. Life is too short 2. As a small-to-moderate user, I just do not have the time to do this --- my NERSC computing remains a small (<33%) fraction of my overall scientific responsibilities and I do not want to set aside the hours-to-days-to-weeks to get involved with this
I will submit using the Applications Performance Matrix because the application form seems to require it. However, I believe it is basically useless as a measure of code performance. I use an AMR code, the result of which is that the performance is highly dependent on the geometry of the problem, the size of the problem, and what physics is being simulated. Depending on the problem and the physics my code may scale to anywhere from 16 processors (for smooth collapse over a large dynamic range using radiation, gravity, and hydrodynamics) to hundreds of processors (for pure hydrodynamics with a fairly small dynamic range). A single measure like poe+ is not useful in this case.
lack of manpower
- Have already submitted and no changes expected next year: 2 responses
I have already benchmarked the codes I intend to run in the coming year, or at least codes which should perform very similarly. If I gain access to a new machine with a large number of nodes in the coming year or there are major hardware/software improvements on those machines on which I plan to run, perhaps I will perform new benchmarks and submit the results to the Applications Performance Matrix. Similarly if we find new algorithms, or ways to improve the performance of existing codes then I might submit new results. However, I do not have any such plans, a priori.
No expected change in performance.
- Other: 3 responses
Assuming a refereed publication describing the application is available.
If asked (and I assume we will be), our project will submit data.
I will not be at LBL the coming year