ERSUG Meeting Minutes - January 8-9, 1996
Lawrence Berkeley Laboratory
January 8-9, 1996
Kirby Fong (NERSC, LLNL), Dave Stevens (LBL), Ricky A. Kendall (PNNL, ExERSUG Vice-Chair)
Opening Remarks From the Berkeley Laboratory
This ERSUG meeting, originally planned for Princeton, New Jersey, was moved to the Lawrence Berkeley Laboratory to facilitate a user visit to the recently announced new home of the National Energy Research Supercomputer Center. Due to blizzard conditions closing many airports in the eastern U.S., DOE staff and Berkeley Laboratory director Charles Shank (who was in the east) were unable to attend the ERSUG meeting. In Charles Shank's place, Pier Oddone, the Berkeley Laboratory Deputy Director, welcomed attendees and summarized the three main points that Director Shank wanted to present. The first point was that the Berkeley Laboratory is committed to serving the national community and to establishing production services at LBL as soon as possible. As evidence of past production success, he recalled that LBL had provided nationwide computing on its CDC 7600 previously and is currently operating the Advanced Light Source as a national facility. The second point was that LBL planned to enhance NERSC so that users and DOE program managers regard it as an integral and essential component in energy research science. LBL currently has three major program areas: high energy and accelerator physics, energy sciences (e.g. materials, chemistry, earth sciences, and environmental sciences), and life sciences (e.g. human genome center). NERSC would complement existing institutional computing at LBL to form a fourth major program. The physical proximity of NERSC to so many ER scientists would bring NERSC into closer and more frequent contact with prototypical NERSC users. The third point was that LBL wanted to measure and prove the success of NERSC. Simple measures like uptime and throughput would not be enough; the real measure of success would be large scale science that could not have been accomplished without supercomputing.
While a large computational resource can be sliced up to serve a lot of medium scale computing needs, this latter pattern of use is not sufficient to sustain a successful supercomputing center. Significant, large scale computational science breakthroughs will be needed every year or two to make NERSC a continuing success.
Following these opening remarks, several users made comments which we summarize briefly. Bruce Cohen cautioned LBL against giving the appearance that NERSC was becoming an LBL institutional facility as opposed to a national facility. The lack of integration of NERSC with the rest of LLNL at least reassured users that NERSC was truly a national facility, not partial to LLNL. Jack Byers doubted that the DOE program managers really understood the value of NERSC to science. He said EXERSUG had been unable to solve that problem and hoped that LBL could. Pier Oddone assured Jack that he and Director Shank would be active advocates and missionaries on NERSC's behalf at DOE. Maureen McCarthy felt a successful NERSC also needed both effective user participation and integration of NERSC resources into the computing resources at user sites. Pier Oddone agreed there could be many factors in measuring success, not just headlines about the science accomplished using NERSC resources. Victor Decyk offered to list additional aspects LBL should use in its metric. Bas Braams asked about other computational science projects at LBL and how they would be combined with NERSC. Pier Oddone assured him that these projects all had their own funding and would not be drawing any monetary resources from NERSC.
News and Plans From NERSC
After a break, Bill McCurdy, LBL's new Associate Laboratory Director for Computer Sciences, spoke about the past and future models of supercomputing centers and how he intended to organize and integrate NERSC at the Berkeley Laboratory. The previous model of supercomputing centers used by NERSC and the NSF centers was and continues to be good, but it is no longer sufficient to satisfy sponsors. NERSC needs to be aware of emerging technologies and to do the long range planning to make the one or two key decisions that lead it to accomplishing the magnitude of computational science that will please program managers or sponsors. In explaining the integration of NERSC into LBL's technical computing, technical information, and administrative computing activities, he drew a graph which we reorganize here in outline form in order to include explanatory comments.
- Information and Computing Sciences Division. This division retains the library and computer research activities that are already at LBL. It is led by Stewart Loken.
- Center for Computational Science and Engineering. This center, while separately funded from NERSC, has been associated with NERSC at LLNL. Its staff specializes in applied mathematics and numerical analysis. John Bell is the leader.
- Mathematics Department. This is an existing LBL research department.
- Networking and Telecommunications.
  - Internal. These are all the functions that are currently being performed to support local area networking and connectivity to outside networks. This is Bob Fink's group.
  - External. These are the wide area network functions that have been performed by ESnet at LLNL.
- National Energy Research Supercomputer Center.
  - LBL institutional. These are the institutional computing services already being funded and performed at LBL under the leadership of Harvard Holmes.
  - National. These are primarily the reorganized supercomputing services identified with NERSC when it was at LLNL.
    - User Services. This will contain the consulting, documentation, and third party software support.
    - Systems. These people operate and administer not only the supercomputers and shared file systems but also the workstations used by supercomputing access staff.
    - File Storage. This is the permanent or archival file storage.
    - Scientific Applications. These are predominantly experts in parallel programming. Scientific visualization is also in this group.
    - Distributed Computing and Computer Science. This is a new, small group responsible for long range planning. It is expected to interact extensively not only with other groups in NERSC but with universities and other computer centers sharing interests with NERSC.
Mike McCoy assured users that though he would be leaving NERSC to run the computer center at LLNL, the services currently provided by NERSC would, except for accommodating the new MPP machine and the above reorganization, continue at LBL as evidenced by the job descriptions he had written. The NERSC job openings are posted on the LBL Web server for anyone to inspect. Except for some adjustments for budgetary constraints and some shifting from third party applications code support into parallel computing support, all the services are preserved.
Tammy Welcome reinforced the theme of continuity by listing the existing user services and their new directions that users could expect.
- Consulting both by telephone and electronic mail.
- Documentation delivered primarily through Web pages.
- Training via classes and workshops and in the future as on-line tutorials to be played back on demand.
- Collaboration with scientists and researchers to develop large scale (principally massively parallel) scientific applications.
Several members of the audience spoke up on behalf of training, especially the need to capture and deliver information about massively parallel computing. Suggestions included capturing and digitizing some university courses such as the one Jim Demmel teaches each spring on parallel programming at U.C. Berkeley, contracting with Cray Research to teach classes, and having NERSC staff visit user sites to provide training. Tammy concluded with a warning that the parallel programming environment was not complete; NERSC would continue to seek new ideas, software, and tools from other sources.
Moe Jette explained that the MPP procurement (described later in these minutes) opened the door to more extensive use of distributed computing software since it would bring not only a Cray T3E massively parallel processor but four Cray J90 symmetric multi-processor (SMP) machines. The Open Software Foundation's Distributed Computing Environment (DCE), already being tested at NERSC, would become more prominent. DCE includes the Distributed File System (DFS) which will be the successor to the Andrew File System (AFS). DCE also includes the Kerberos 5 authentication facility, support for multiple threads, remote procedure calls, and global naming of files and services. This will enable NERSC to support global home directories, provide more secure logins, restore single authentication to all NERSC machines and services, provide more flexible file access through Access Control Lists, and enable distributed applications to be developed more easily. NERSC is currently running the AFS client on its Cray C-90 and SAS machines and the AFS servers on Sun workstations. The servers can be converted into DFS servers while still serving AFS clients through the use of a protocol translator. NERSC will maintain a separate AFS/DFS cell from the rest of LBNL. Maureen McCarthy said Battelle Pacific Northwest Laboratory is eliminating its firewall, and before they permit their AFS cell to trust NERSC's AFS cell, they need to understand to whom NERSC is granting access to its own cell. If PNL suffers a security incident via NERSC, that would set back their distributed computing progress. Steve Louis added that CFS will not be integrated with DFS; the High Performance Storage System that succeeds CFS will be integrated with DFS. In the realm of batch processing, NQX will succeed NQS as the job queuing and scheduling system.
The moving of equipment from LLNL to LBL is dominated by the relocation of CFS and the C-90. CFS could be down as long as 21 days, but that could be cut to 14 by renting some used silos and loading the tape cartridges into them initially. The C-90 should be moved while CFS is down since it would not be able to read input from or write output to CFS anyway. The pair of Cray-2s will be left at LLNL and will be shut down during the summer as the new J90s arrive to take over their load. The first J90 is to become available for use in May. The J90s will have much greater disk capacity, so the big file batch queue should enjoy faster turnaround. One of the J90s could be tuned to favor very large batch jobs.
Long Range Planning
Bill McCurdy initiated an audience participation discussion of long range planning by introducing Jim Demmel from the University of California at Berkeley Computer Science Department. He is part of a separately funded effort that is working on many aspects of parallel computing including the ScaLAPACK parallel linear algebra project. Part of the research work is to build software in layers, trying to identify the right abstractions for each layer that will allow easier porting of parallel applications to future architectures. Other research at UCB includes GLUNIX, an operating system for a network of workstations (NOW). This investigates an alternate approach to massively parallel computing by clustering several high end SMPs together. He warned the audience of a future hardware discontinuity; researchers are currently working on building the CPU and memory on the same chip. This could drastically change the memory model for MPP programmers.
After this point, audience members commented on a variety of subjects, sometimes backtracking to earlier ones. Rick Kendall led off by saying his group was funded to be at the leading edge of computational chemistry. They certainly were interested in new tools that assist in writing codes for new architectures. He emphasized that codes really have to be rewritten, not ported, in order to realize the benefits of new architectures. DOE needs to be aware of and be prepared to pay for the cost of conversions or rewrites. Jim Demmel responded that they were trying to define the layers of abstraction for parallel programming that would cope with more levels of memory hierarchy, thereby reducing the reprogramming costs. Rick Kendall was still concerned that there are major vector or superscalar applications that aren't being rewritten at all.
Dale Koehling mentioned the e-mail exchange prior to the ERSUG meeting questioning the merits of MPP machines versus clusters of SMPs. He said the MPP procurement was a fait accompli. NERSC and its users should therefore concentrate their immediate planning efforts on making the MPP machine do large scale computational science.
Phil Colella made two points. One is that users ought to have a procedural mechanism to identify the computing abstractions they need NERSC to support. The second is how to foster the evolution of research software into production quality software.
Bas Braams endorsed NERSC's concentration on applications. It should be the role of universities and vendors to research new hardware and software tools that NERSC subsequently acquires.
Maureen McCarthy felt that users, not just computer scientists, need to be involved in NERSC's long range planning. Users should identify the new large scale computational science that needs to be done, after which the computer scientists can evaluate new hardware and software for their suitability.
Bas Braams asked if it wasn't sufficient for experts in the NERSC applications group to identify what various applications needed in order to become efficient parallel codes. Jim Demmel said he believed overall it would be more efficient to have an effort dedicated to providing the right infrastructure for applications development than for each application to be parallelized ad hoc.
Bill Johnston brought up new areas that had been mentioned in the Berkeley Laboratory's proposal to DOE. These were to branch out to researchers in experimental science, data analysis, and instrument control. He emphasized the importance of scalability of architecture so that any prototype system that proved to work could actually be scaled up to solve bigger problems or serve more users. Furthermore, a scalable computing environment would have to be built out of mass production components in order to be affordable. Ever more expensive MPP machines will not be budgetarily viable. Collaboration between NERSC, LBL, and UCB could prototype the computational capabilities of the future, and those results could have major influence on the course of the supercomputing market.
While attendees were eating their box lunches, Pier Oddone shifted the discussion in another direction by asking how users felt the Berkeley Laboratory could work more effectively for NERSC. Jack Byers asked why LBL had not used NERSC more. Pier said they had found people were sizing their problems to fit their workstations because they like the control they have over their own workstations. The closer collaboration with NERSC should change LBNL scientists' behavior; furthermore, the collaboration would be needed to solve the bigger problems that scientists were heretofore avoiding. Phil Colella tried to address the original question by saying that LBL (and also UCB) were good places for trying to expand the constituency for ER supercomputing. NERSC really is a valuable ER resource, but it fails to register as such in the perception of some of its sponsors. LBL can help raise NERSC's visibility. Victor Decyk considered NERSC to be a unique resource, but larger constituencies would not necessarily accomplish large scale computational science. Bill McCurdy agreed that having more small users is no advantage; university computer centers used to have a lot of small users, and nearly every one of those centers is gone. For any supercomputer center to survive, its customers must generate significant results that convince sponsors that supercomputing is worthwhile.
Jack Byers brought up the role and relationship of the Supercomputing Access Committee (SAC) with the user community. He felt SAC should have made greater attempts to solicit and incorporate user input in the recompetition process. Dale Koehling, based on his experience as a detailee at headquarters, said that's not how SAC works. Yes, SAC has been stymied by lack of input, but the way to provide input is for principal investigators to interact with SAC members and program managers on a daily or weekly basis to let them know what's working and what's not. Bill McCurdy said ER-1 really needs NERSC to survive the relocation, so the next few months while they're paying attention to NERSC is an opportune time for EXERSUG to discuss with them how EXERSUG is organized and how it relates to DOE.
Mike McCoy said he was as much or more concerned about the relation between programs and MICS as between programs and SAC. SAC hands out annual computer time allocations as a commodity. The new MPP machine is a capability as well as a commodity and needs to be recognized as such in the allocation process. Alice Koniges added that the process of assigning a programmer along with the computer time allocations on the existing T3D machine has been very effective for making the best use of the T3D time. She confirmed what Rick Kendall said earlier, that one must create a parallel code; porting existing codes is not really effective. Gary Kerbel concurred that it was necessary to make the right initial investment in a parallel code as he had done. He said he did not have to do much work in moving his code from a CM-5 to a T3D. According to Maureen McCarthy, PNL continues to have internal discussions on whether to seek MPP allocations or additional workstations. Alex Friedman lauded NERSC for at least starting out with a balanced hardware configuration that included capacity in the form of SMP machines as well as an MPP. He thought these intermediate class machines served an important research need and did not want NERSC or anyone else to lapse into bimodal thinking in which the only choices for scientific computing were assumed to be MPPs or workstations.
Gary Kerbel asked for elaboration upon the experimental and data analysis computing that Bill Johnston described. Bill McCurdy said the Berkeley Laboratory's proposal did call for expansion of NERSC into those areas, but there's currently no funding for it. Stewart Loken will work on writing proposals to obtain funding for these areas.
Bruce Cohen asked whether NERSC and Livermore Computing will be able to sustain any synergy from dealing with common problems and solutions in supercomputing. Mike McCoy said as head of LC he would be very favorable toward establishing high bandwidth networking and collaborations on projects of mutual interest. LC will be experimenting with clusters of SMPs which are of interest to NERSC.
Winding up the discussion period, Bill McCurdy listed three action items. First is that he and his staff would lay out the various computer capabilities at NERSC that could be allocated. Second is that he would think about how to include outside participation in NERSC long range planning and not rely solely on full time employees at NERSC. Third is that users should offer ideas on how to reformulate ERSUG and the allocation process. Jack Byers wanted to see MICS and SAC take more initiative in learning about user needs, perhaps by having a couple of SAC members attend ERSUG meetings. Pier Oddone remarked that LBL is an ER laboratory and already has contacts with the various DOE program managers in ER. LBL should be able to help NERSC and its customers improve their connections with the program sponsors. In response to Jack Byers' question why ERSUG couldn't be as effective as the ESnet Steering Committee in representing users to program sponsors, Sandy Merola said ESSC does not advise DOE; ESSC advises NERSC and ESnet. ESSC succeeds only in activities where everyone gains and in which users tell their program managers what benefits ESnet has provided them.
Sandy Merola gave a progress report on the relocation of NERSC. There is a transition team of about 20 people planning and organizing the move. The four principles to which the team is trying to adhere are to keep production computing working, to keep all stake-holders informed, to be cognizant of the balance between cost, time, and quality, and to make sure the inevitable mistakes in the transition are fixable. The transition team was initially preoccupied with preparation of the machine room and office space and was just turning its attention to staffing. For various reasons, most jobs have to be posted; current job holders cannot be automatically transferred. The status on January 5 was:
Transition Hiring Status:
Organization                    FY97 Budget   Posted   Filled
HPCAC (supercomputing access)          65.0       42        0
ESNET                                  18.5       15        2
CCSE                                    6.2        9        2
Comp. Sci. Org. (directorate)           5.0        2        3
Advertising for these jobs has appeared or will appear in the San Jose Mercury, HPCwire, the New York Times, and IEEE Spectrum. Postings also appear on the LBL Web page. There have also been direct presentations to LLNL employees about the opportunities at the Berkeley Laboratory. Given the very preliminary state of staffing, it is impossible to predict when it will be completed. Office space preparation is well under way, so it is possible to predict that all of it will be ready by April. Similarly, site preparation is in progress and lists of equipment to be moved are being compiled. The site should be ready April 1 at which point the equipment will start moving. Installation is tentatively scheduled to be complete by April 15. The actual functions of ESnet and CCSE should move from LLNL to LBL by March. The function of high performance computing will probably move in mid or late April. A basic rule will be that functions do not move until LBL has sufficient, trained staff to assume the function. If LBL has to hire novices, they can be sent to LLNL for training. Al Geller from Cray Research said their company might be able to advance the delivery date of the first J90 from May to April so there would be something besides the Cray-2s to use while the C-90 was being moved. Mike McCoy said that while there was a written agreement whereby current NERSC staff would not be allowed to transfer to other positions at LLNL until their functions could be assumed by LBL, he would try to be as accommodating as possible on a case by case basis and use all the resources at his disposal in LC to assure that someone performs the NERSC functions as long as the functions are still at LLNL.
Massively Parallel Processor Procurement
After almost three years, the MPP procurement has finally been completed. Bill McCurdy announced that Cray Research was the winning vendor. It is a two-phase procurement. In phase one during 1996, a pilot early production T3E and four J90 SMPs will be delivered. If various production status requirements are met by December 15, NERSC would proceed with an upgrade of the T3E to a fully configured machine. The initial T3E will have 128 processing elements (PEs), 256MB of memory per PE, and 320GB of disk space, consisting of 40 DD-318 disks. The fully configured machine would be a 512 PE T3E with 256MB of memory per PE and 1.5 TB of disk space. The first J90 machine will be a J932 32-8192 Classic with 234GB of disk space. It will be followed by the T3ELC 128-256 in July. A J932se-24-4096 and a J932se-24-8192 with 522 GB of DA-308 disks should arrive in September. The last J90, a J932se-24-4096, will come in November. If all goes well, the T3E upgrade will be performed in January 1997.
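As a rough check on the T3E figures quoted above, the aggregate memory and the capacity per disk follow directly from the stated numbers. The short Python sketch below is illustrative arithmetic only. The function and variable names are invented for this note, and it assumes binary units (1 GB = 1024 MB), which the minutes do not specify.

```python
def aggregate_memory_gb(num_pes, mb_per_pe):
    """Total memory across all processing elements, in GB (assumes 1 GB = 1024 MB)."""
    return num_pes * mb_per_pe / 1024


# Initial (phase one) T3E: 128 PEs with 256 MB each.
initial_memory_gb = aggregate_memory_gb(128, 256)   # 32.0

# Fully configured T3E: 512 PEs with 256 MB each.
full_memory_gb = aggregate_memory_gb(512, 256)      # 128.0

# Initial disk: 320 GB spread over 40 DD-318 disks.
gb_per_disk = 320 / 40                              # 8.0

print(initial_memory_gb, full_memory_gb, gb_per_disk)
```

Under these assumptions the upgrade quadruples aggregate memory, from 32 GB to 128 GB, and each of the 40 initial disks contributes about 8 GB.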
The C-90 will probably be released in the summer of 1997 but could be kept a little longer if necessary. The Cray-2s will have to be phased out some time before the end of FY96. It is conceivable NERSC could get another SMP or two in 1998 to take up some of the load from the C-90. The next major procurement would not occur until 1999. If that is to become reality, the user community will have to contribute to a requirements document to be written by 1997. Any machine, including the T3E, is presumed to have a technological lifetime of three years before it is superseded by the next generation of hardware. For that reason, if the T3E can meet production requirements, NERSC wants the upgrade to happen as soon as possible to maximize the use of the fully configured machine while it is still a state-of-the-art machine.
Tammy Welcome outlined some of the production status requirements, all of which were still being negotiated with Cray Research. The requirements are based on the promises in the vendor's bid about what they could deliver and when. Briefly, there are requirements for a space sharing scheduler, message passing support, a data parallel programming model, a shared memory programming model, an interactive debugger, a profiling tool, parallel I/O, checkpoint and restart capability, and benchmark performance on the fully configured machine.
ExERSUG and Allocations
The January 9 session began with a report on the Executive Committee of ERSUG (EXERSUG) meeting of the previous night. The discussion focused on how to handle allocations on the new MPP machine to maximize the machine's utility. This was combined with ideas on how to reconstitute the relationship of EXERSUG to DOE. Rick Kendall presented a transparency outlining the main points of a proposal for reorganizing EXERSUG.
- EXERSUG should have input to SAC about the allocation of MPP time to help ensure that people with allocations are ready to make effective use of the system since it will be a scarce and valuable resource.
- EXERSUG members should be appointed by SAC, and one of their responsibilities will be to review the requests for the largest blocks of MPP time. EXERSUG however would not be the sole reviewers of large MPP time requests. There should also be provision for one NERSC staff member to be on EXERSUG.
- Flux in the high performance computing market causes massive changes in user requirements. EXERSUG should be the conduit for communicating such changes to headquarters.
- EXERSUG will provide periodic reports on scientific accomplishments achieved with supercomputers to the program managers.
- There could be a conflict of interest when a reviewer is a requester.
- There is a skewing of attention toward the large requests.
- There is no anonymity in the review process.
- Since SAC must be involved in the ERDP review process, it will automatically come into greater contact with EXERSUG.
- DOE could offer a staff member to EXERSUG to act as its political adviser.
- ExERSUG members from different programs will be "cross pollinated" and learn about algorithms and technology in other programs when they review the large requests.
- This provides a more formal process for determining the mix of hardware and software required by the user community.
- This follows the "user facility" model of NSF centers and NCAR where advisory boards are involved in the allocation policy.
After Rick Kendall's presentation to ERSUG, Dave Nelson and Tom Kitchens joined the meeting via telephone from the east coast. In his brief introductory remarks, Dave Nelson assured the audience that the FY96 budget for MICS was firm. FY97 was not clear, but the expected reduction was one of the factors prompting the recompetition of NERSC. He felt the move of NERSC to LBL was right because LBL could provide more for the same amount of money, because there would be greater chance for collaboration with ER programs at LBL, and because LBL was in a better position to put NERSC to use in supporting co-laboratory projects. Rick Kendall then described for Dave Nelson and Tom Kitchens the proposal he had just reviewed for ERSUG.
In response to further audience comments about support for supercomputing, Dave Nelson responded that it's the users' relations with their program managers that really count when program managers need to make funding decisions. ESnet users had already contacted all their program managers to speak on behalf of ESnet so that the program managers were already in a favorable frame of mind by the time Dave asked them for ESnet funding. Supercomputing users should build up a similar rapport with their program managers to win support for supercomputing. Also with respect to influence, Tom Kitchens pointed out that EXERSUG was intentionally set up to be more independent of DOE so that it could be more influential, but EXERSUG has never really wielded the influence it was meant to have.
Rick Kendall, reiterating the point he had made before the conference call, said that EXERSUG felt it should somehow become involved in the allocation process to help ensure that MPP allocations were given only to researchers who were ready to use them. For perspective, Tom Kitchens reviewed the past allocation practices. In the past, everything except the Special Parallel Processing time was assigned first to the program offices in proportion to their budgetary contribution. This assured that the subsequent allocations would correspond closely with budgetary proportions and that each program could manage its own allocations. The SAC likely would be hesitant to modify the review process in ways that permitted people in one program to pass judgment on the technical merits of the computational science of another program. Bill McCurdy explained that EXERSUG was not proposing to allocate MPP time but seeking to provide SAC with comments on whether specific projects were ready to benefit from large amounts of MPP time. Tom Kitchens allowed that the model of SPP allocations might be appropriate for MPP allocations. Here, NERSC assumed the role of certifying that codes were ready to benefit from the SPP time. Bill McCurdy said the much greater magnitude of MPP time versus SPP time meant NERSC itself did not have the resources to assess the readiness of codes. Dave Nelson said it might be okay for EXERSUG to perform an assessment of a code's readiness for MPP time but not a review of the science itself. Tom Kitchens made it quite clear that the program managers consider it their prerogative to interpret DOE's mission and decide what science to support. What EXERSUG is proposing is to provide input to SAC, but it is not really a peer review since there is no anonymity. Dave Nelson advised EXERSUG to prepare its proposal as a package with two parts. Part one is to state clearly what the problem is in allocating MPP time. 
Part two is to offer ExERSUG's services or participation in dealing with the problem. That way part one persists for SAC to deal with even if SAC does not desire ExERSUG's participation.
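The past practice Tom Kitchens described, assigning time to the program offices in proportion to their budgetary contribution, amounts to a simple pro-rata split. A minimal Python sketch follows. The program names, contribution figures, and totals are hypothetical placeholders, not data from the meeting, and the SPP time is modeled simply as an amount removed from the pool before the split, which is an assumption about how the set-aside worked.

```python
def prorata_allocation(total_units, contributions, set_aside=0.0):
    """Split (total_units - set_aside) among programs in proportion to contributions.

    total_units   -- total computer resource units available (hypothetical figure)
    contributions -- dict mapping program name to budgetary contribution
    set_aside     -- units withheld before the split (e.g. a special set-aside)
    """
    if set_aside < 0 or set_aside > total_units:
        raise ValueError("set_aside must be between 0 and total_units")
    total_contribution = sum(contributions.values())
    if total_contribution <= 0:
        raise ValueError("contributions must sum to a positive amount")
    pool = total_units - set_aside
    return {
        program: pool * amount / total_contribution
        for program, amount in contributions.items()
    }


# Hypothetical example: three programs funding in a 5:3:2 ratio,
# 1000 units in total with 100 units set aside.
example = prorata_allocation(
    1000.0,
    {"program_a": 5.0, "program_b": 3.0, "program_c": 2.0},
    set_aside=100.0,
)
print(example)
```

A split like this keeps each program's share tied to its funding. It takes no account of whether a given project is ready to use an MPP, which is the gap the EXERSUG proposal was trying to address.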
Tom Kitchens then brought up the grand challenge program and its review process which could also hand out MPP time. The grand challenge projects enjoy the support of the program managers. For this program, MICS appoints the review panel, and, depending on various requirements, the review panel may or may not include participation from EXERSUG. Tom Kitchens asked EXERSUG not to write a proposal that made it sound like EXERSUG would be a second review panel.
Rick Kendall announced that Brian Hingerty would be stepping down as chairman of EXERSUG and that Rick would become the next chairman.
Bas Braams asked how the declining fusion budget would affect allocations of computer time for fusion. Tom Kitchens and Dave Nelson collectively explained that historically, fusion was the sole supporter of NERSC and had invested more heavily in central computing than workstations compared to other programs. In recognition of its historic support, fusion allocations as measured in computer resource units were being maintained at approximately current levels until the capacity of the NERSC resources expanded to the point where current allocation levels matched the funding levels. Of course, if this takes too many years, fusion allocations might decline sooner. This would also depend on the level of support within fusion for theory and its computational needs.
Bill McCurdy asked about plans for the High Performance Computing Research Centers. Dave Nelson replied that MICS is still working on that issue. It is not a simple question of whether or not to recompete the centers. MICS has to find the best solution for supporting grand challenge programs.
Discussion Following the Break
The meeting took a break after the conference call, then resumed with a short discussion period. Here are some of the comments.
- We should take this discussion to any SAC members and program officers we happen to know.
- We must make it clear that EXERSUG is not interested in overseeing science.
- EXERSUG should collect measures to keep track of which scientific programs are making effective use of NERSC.
- Some fraction of the allocations should be convertible for use across different machine types at NERSC.
- The Grand Challenge program should serve as a reminder to EXERSUG of the complexity of setting MPP allocations.
- The SPP model won't work for MPP because the SPP system (the Cray C-90), unlike the MPP, was a mature programming environment from the start.
- Priority bidding (as in the CTSS system) might protect against over-allocation to people who aren't ready to use the MPP, by letting users who are ready run on it ahead of them.
- Allow the major programs to swap or broker allocations.
- The SPP set-aside was successful at getting the big computational codes the resources they needed. Perhaps some sort of MPP set-aside could work.
- NERSC has a responsibility to ensure that there is a future for supercomputing. How about giving NERSC an MPP allocation to use in fostering future supercomputing applications?
- NERSC hasn't had its own allocations in the past. This may not yet be the right time to try it.
- We could have EXERSUG manage such a NERSC allocation.
- Allocation brokering might work better at the program manager level, especially if we can give the program managers better and more timely information about how their principal investigators are currently using NERSC resources.
Moe Jette solicited ERSUG for its opinions on CUB, the central user bank. CUB has passed through five different programmers. It relies on the Oracle database management system and requires two FTEs to support. Because of the difficulty and expense of maintaining CUB, Moe wanted to know whether users really need it. CUB performs two distinct but related functions. One is to regulate the dispensing of time allocations, and the other is to collect usage information and generate accounting reports. On the accounting side, one possibility is to use whatever accounting information a computer's operating system generates and merge the data from several computers into a single report. On the allocation side, Cray Research has a "fair share" scheduler, but it deals only with a single machine. The fair share scheduler is not incompatible with CUB; it simply operates on the domain of a single machine. Moe wanted to know whether global, as opposed to individual-machine, allocations were necessary. How precise do the allocations have to be? Are allocations by group rather than user sufficient? How critical is the integration of accounting and allocation systems? Moe raised the CUB issue now because CUB is expensive to maintain and the existing experts might not accept jobs at LBL. He wanted to know if users thought CUB was still necessary. It would take about two months to move the CUB resource allocation management daemons from the T3D (which is at LLNL under a Cooperative Research and Development Agreement) to the new T3E.
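The accounting alternative Moe described, merging per-machine operating system accounting data into a single report, can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the record layout (group, user, CPU hours), the function name merge_usage, and the machine names in the usage note are hypothetical and do not describe CUB or any actual NERSC accounting format. The "total" column is a raw sum of hours and is not weighted into CRUs.

```python
from collections import defaultdict


def merge_usage(per_machine_records):
    """Combine per-machine usage records into one report keyed by (group, user).

    per_machine_records maps a machine name to a list of
    (group, user, cpu_hours) tuples. This layout is a hypothetical
    stand-in for whatever each operating system's accounting log provides.
    """
    # report[(group, user)][machine] accumulates hours used on that machine.
    report = defaultdict(lambda: defaultdict(float))
    for machine, records in per_machine_records.items():
        for group, user, cpu_hours in records:
            report[(group, user)][machine] += cpu_hours

    # Convert to plain dicts and add an unweighted cross-machine total.
    merged = {}
    for key, by_machine in report.items():
        row = dict(by_machine)
        row["total"] = sum(by_machine.values())
        merged[key] = row
    return merged
```

For example, calling merge_usage({"c90": [("fusion", "alice", 10.0)], "j90": [("fusion", "alice", 2.5)]}) yields one row for ("fusion", "alice") with a per-machine breakdown and a total. A sketch like this covers only the reporting half of what CUB does; it does nothing to regulate the dispensing of time.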
Rick Kendall said PNNL definitely needs to be able to move time allocations around. Bas Braams' site also uses this capability. CUB may be invisible to most users, but it is very helpful to account managers. Alex Friedman asked how one establishes fair exchange rates for moving CRUs from one machine to another and pointed out the need for managing allocations over multiple machines to support distributed applications that use a front end machine and an MPP simultaneously. Bruce Cohen conceded that large amounts of T3E time should not be exchangeable for C90 or J90 time since this could seriously perturb the C90 and J90s; however, he does use CUB at least to dole out additional time to users who have legitimately used up their individual allocations. As it became clear that something resembling CUB was still necessary, Dale Koehling and Bill McCurdy suggested that an EXERSUG subcommittee would be appropriate for advising Moe on what was needed and, if necessary, for defining an interim policy on how users should behave on new machines before CUB can be brought up there. Maureen McCarthy asked that there be some provision for scheduling special capability time on the J90s like the SPP time on the C-90.
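The two concerns raised here, a fair exchange rate between machines and a limit on how much time can flow into the C90 or J90s, could be combined in a transfer rule along the following lines. This is only a sketch under stated assumptions: the function transfer_allocation, the rate table, and the per-exchange cap are hypothetical, no actual rates or limits were proposed at the meeting, and this is not how CUB implements transfers.

```python
def transfer_allocation(balances, src, dst, amount, rates, max_inflow):
    """Move `amount` CRUs of `src` time into `dst` time for one account.

    balances:   dict machine -> remaining CRUs
    rates:      dict (src, dst) -> dst CRUs granted per src CRU
                (hypothetical values; no real rates are implied)
    max_inflow: dict machine -> largest number of CRUs a single
                exchange may add to that machine (hypothetical cap)
    Returns a new balances dict; raises ValueError if the transfer
    is not allowed. The input dict is left unchanged.
    """
    if amount <= 0:
        raise ValueError("amount must be positive")
    if amount > balances.get(src, 0.0):
        raise ValueError("insufficient balance on " + src)
    if (src, dst) not in rates:
        raise ValueError("no exchange rate defined for this pair")

    credited = amount * rates[(src, dst)]
    # Refuse exchanges large enough to perturb the destination machine.
    if credited > max_inflow.get(dst, float("inf")):
        raise ValueError("transfer exceeds inflow cap on " + dst)

    updated = dict(balances)
    updated[src] -= amount
    updated[dst] = updated.get(dst, 0.0) + credited
    return updated
```

With a made-up rate of 0.5 C90 CRUs per T3E CRU and a cap of 25 CRUs per exchange, moving 40 T3E CRUs would credit 20 CRUs on the C90, while an attempt to move 100 would be refused. Choosing the rates and caps is the policy question the proposed EXERSUG subcommittee would have to settle.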
Rick Kendall asked all major programs to send him any information they wanted in the "green book," which he was in the final stage of drafting. The green book is a formal document describing the scientific computing accomplishments at NERSC and outlining the future hardware and software requirements. Versions of this document are needed every three years as it is the formal input from NERSC users to the program managers on what they achieved with NERSC and what they need NERSC to have or to become in order for it to remain an essential scientific resource. Send information to [email protected]. Rick plans to put the first draft on the Web and will announce its availability and location through the ERSUG mailing list. Major users are urged to get on that mailing list if they are not on already.
NSF Call for Proposals
Bill McCurdy told the audience that the National Science Foundation had recently issued a call for pre-proposals for its Partnerships for Advanced Computational Infrastructure program, which will succeed its current supercomputer center program (URL http://www.cise.nsf.gov/cise/ASC/ASCHome.html). Proposals are expected to come from universities for providing leading-edge computational facilities to support academic science and engineering research. One difference from the current program is that successful proposals must involve multiple partners who can be, but are not limited to, universities, NSF-funded centers, research consortia, regional computing centers, private sector organizations, and national laboratories. As a consequence of the last item, NERSC has already received one inquiry about its interest in becoming a partner in a proposal and might receive other feelers soon. Bill wanted to know how users felt about NERSC collaborating in a proposal.
Alex Friedman and Bill Herrmannsfeldt cautioned against accepting operating funds from a second sponsor; that allows each sponsor to tell NERSC to ask the other sponsor for the balance of the operating budget. Phil Colella and Rick Kendall cited ESnet as something NERSC could lend to a partnership. Phil and Bill McCurdy both observed that DOE in general does not get the recognition it deserves for its accomplishments in science and computing and that NERSC participation in a partnership would be a step in rectifying that situation; however, it might be better for NERSC to emphasize the scientific rather than the computing benefits of its participation. Maureen McCarthy could see NERSC helping in the development of co-laboratories since DOE appeared to be interested in that, but she did not want the NERSC computers to become loaded with NSF users unless NSF provided a commensurate increase in computing capacity. Various other people mentioned projects or activities that might be of mutual interest to NERSC and other supercomputer centers. The discussion was inconclusive, but the tone was cautiously favorable.
The next ERSUG meeting will be in the Washington area to enable more DOE people to attend since one of the major items will be the reconstitution of EXERSUG and its relationship to DOE.