Updating the Testbed Configuration for FY 2004

The testbed system has provided us with a useful facility for developing the benchmark methodology and special benchmark codes for the GUPFS project. It has also helped to establish the credibility of GUPFS with technology vendors and to build relationships with them. However, with the inexorable progression of technical advancements, it became apparent that the testbed was too small to conduct the types and levels of technology evaluations needed for the GUPFS project in FY 2004.

Technological advancements over the last year have outstripped the ability of the existing testbed to incorporate them. Experience with the testbed and attempts to integrate new storage and fabric technology into it demonstrated that more nodes were needed to allow emerging advanced technologies to be integrated for evaluation, even in the near term. Given the technologies that the GUPFS project plans to begin evaluating in FY 2004, it was clear that the existing testbed could not accommodate their inclusion.

In addition to the testbed being limited in accommodating new technologies, practical experience during FY 2003 indicated that the testbed could be improved in a number of ways that would increase the speed at which file system evaluations could be conducted in conjunction with specific combinations of fabrics and high-performance storage. Foremost of these was that the testbed should be able to conduct multiple, simultaneous, independent evaluations with node sets of various sizes. Next was that the fabric and storage connections should be much easier to reconfigure. Other important improvements were to greatly increase the Gigabit Ethernet connectivity for iSCSI and cross-fabric testing, and to use only a single node type in testing.
Based on these needs, we designed a testbed upgrade to facilitate more simultaneous evaluations and much more rapid and easy reconfiguration. The design considerations for the upgrade are discussed in the following section. We completed the improved testbed system design at the end of the third quarter of FY 2003. We then identified, tested, and procured the components during the fourth quarter. These were assembled and integrated into the existing testbed system at the end of the fourth quarter of FY 2003 in preparation for the planned FY 2004 activities. The configuration of the expanded testbed system is discussed in detail later in this section.

1. Updated Testbed Design Considerations

The design of the upgrade to the GUPFS project testbed system was predicated on a number of factors. These included (1) the lessons learned and (2) the limitations encountered from using the FY 2003 testbed to integrate and test new technologies, and (3) the new and emerging technologies expected to be investigated in the next year. These are discussed below, along with (4) a brief review of other factors affecting the upgrade.

Lessons Learned

The FY 2003 testbed included many design features that improved its usability and utility over the FY 2002 testbed. However, a number of lessons were learned through the use of the updated FY 2003 testbed and the evaluations conducted on it. These directly impacted the design of the testbed upgrade for FY 2004. The major lessons learned are: ·
Multiple evaluations need to
be done simultaneously. An evaluation of a technology component is complex, and time consuming
to both set up and conduct. The reality of the situation is that a technology
component cannot be evaluated in isolation. In general, one of the GUPFS
component technology types can only be evaluated in conjunction with one or more
of the other component types. For example, to evaluate a file system, it is
necessary to test it in conjunction with storage and some fabric or
interconnect connecting the clients to the storage. In addition, each component
usually has required versions of software, necessitating client systems being
configured with specific OS and driver versions and specific fabric
connectivity. Setting up the required configurations to conduct specific
evaluations is time consuming, often on the order of several weeks or a month.
Often, a single evaluation takes several months and requires dedicated
resources. Because of the number of important component technology issues that
need to be investigated, particularly in the file system arena and
multi-cluster configurations, and because of the rapidly changing component
technologies, it is vital to have enough resources to be able to have extended
unrelated evaluations in progress simultaneously. ·
Only a single type of
compute node should be used. The FY 2003 testbed contained two types of compute
nodes — the 18 dual Pentium-4 compute nodes and the 5 legacy dual Pentium-3
compute nodes. Because of the limited number of available compute nodes and the
high volume of items to evaluate or investigate, it was necessary to use both
types of nodes for evaluations. This predictably caused problems in a number of
areas. First, the different node architectures
required distinct software configurations to be built, installed, and tested for
each type, thereby increasing the administrative burden for the testbed.
Secondly, the hardware components and performance differences between the types
of nodes made it very difficult to compare results obtained from evaluations,
and made evaluations using both types of nodes at the same time too difficult
to understand. The solution to this difficulty is to standardize on a single
node type for use as compute nodes in evaluations, and to obtain adequate
numbers of them to conduct the necessary number of evaluations simultaneously
at adequate scale. ·
A separate interactive node
is needed. The FY 2003 testbed used a single dual
Pentium-3 node as both a management node for running and maintaining the
testbed environment and as an interactive node for the project members to log
into and from which to conduct tests and evaluations. With the increased size
and complexity of the testbed and the increased number of evaluations in
progress at one time, it became apparent to us that a single node of this type
could not perform both functions without impacting one or the other. In
addition to periodic failures and substantial delays caused by overloading the
management node, the launching of long-running benchmarks frequently prevented
timely maintenance activities, such as rebooting the management node to clear
software problems or activate patches. Another reason to separate the
management and interactive functionality was to reduce the possibility of the
management node being accidentally destroyed while project members set up
evaluations and ran benchmarks, both activities that frequently
required running with elevated privileges and had led to a number of close calls.
Management node functionality is much more difficult and time consuming to
configure and install than interactive node functionality. The necessity of
separating the management and interactive node functionality contributed to the
decision to stop using any of the Pentium-3 nodes as compute nodes, and to
assign them to testbed support roles.

Limitations

In addition to lessons learned, the design of the FY 2004 testbed was also influenced by certain limitations encountered while using the FY 2003 testbed. These limitations include: ·
The EMC CX600 did not meet
expectations. The EMC CX600 storage
device did not live up to expectations regarding performance scalability. This
was partly the result of unexpected architectural limitations and partly the
result of configuration constraints. As a consequence, neither the expected
aggregate bandwidth nor the desired scalability was achieved. This made the
GUPFS project dependent on other scalable storage that was being evaluated,
such as the Yotta Yotta NetStorager and the 3PARdata Inserv. With the
completion of the 3PARdata evaluation and the Yotta Yotta extended beta test at
the beginning of the 4th quarter of FY 2003, the GUPFS project faced the
prospect of not having storage that was of high enough performance or
sufficiently scalable to use for the file system evaluation planned for the end
of FY 2003 and for all of FY 2004. ·
The number of Gigabit
Ethernet switch ports was inadequate. The GUPFS testbed’s Gigabit Ethernet fabric quickly became limited by
the 48 available switch ports. The available ports were quickly consumed by the
testbed systems, inter-switch links between the two Gigabit switches, original
iSCSI router and Intel iSCSI HBA connections, and the InfiniCon InfiniBand to
Gigabit Ethernet bridges. With the introduction of additional equipment
requiring Gigabit Ethernet connections, the number of available ports was
oversubscribed by at least a factor of two. This resulted in the serialization
of evaluations and made it necessary to disconnect various equipment in order
to connect and test other equipment. The additional equipment included Topspin
InfiniBand to Gigabit Ethernet fabric bridges, Panasas storage devices, Adaptec
iSCSI HBAs and TOE cards, and the inter-switch links to the Alvarez management
Ethernet switch. Based on the need to maintain adequate inter-switch bandwidth,
a 3x expansion of the Gigabit switch ports was needed. ·
Reconfiguring the physical
connections between storage, fabric elements, and client systems became
extremely difficult. In order to conduct evaluations
of various file systems, fabric components and bridges, and storage
combinations, and to conduct evaluations of various loaner equipment, it was
necessary to change the physical fiber optical connections of the equipment
connected to the three 16 port Fibre Channel switches in order to connect the
correct set of components. This was made extremely difficult by the rigidity of
the bundles of fiber cables and the fragility of the connectors. Moving a fiber
between one switch and another often required major efforts to obtain adequate
slack to permit the connector to plug into another switch. This was
particularly a problem when we were evaluating new equipment such as Fibre
Channel switches, which might be physically mounted in a different cabinet.
Making such connections required that substantial time be devoted to rebundling
fibers or stringing new ones. In addition, replugging the connectors exposed
them to mechanical failure (the 2 Gb/s SFPs are especially fragile), and to
contamination of the optics with dust. Another problem with making changes to
the fiber configuration was that it soon became very difficult to determine
what was connected to what and which fibers were active. A fiber patch panel is
needed to resolve these problems. ·
A single dedicated metadata
server node is not enough. It became apparent that a
single special-purpose node acting as a dedicated metadata server was
inadequate. Nearly all of the shared file systems being tested required either
a metadata server or a centralized lock manager. As a consequence, testing of
these file systems became serialized because of the single metadata/lock
server. Frequently, more than one of these file systems was in some stage of
the installation and evaluation cycle; sometimes a file system would be
undergoing several tests at once, each instance having different hardware
and/or software configurations. This required either very careful alternating
of test segments, or the suborning of other nodes to become additional metadata
servers. Using other testbed nodes as secondary metadata servers impacted other
activities by limiting the number of nodes available to them. Similar
constraints applied to testing configurations supporting metadata/lock server
redundant operation and failover. At least one more special purpose Pentium-4
node with an identical configuration needs to be dedicated to the metadata/lock
server role. ·
A full complement of eight
4U Pentium-4 nodes is needed. The GUPFS testbed only had six 4U Pentium-4 nodes.
Most new technologies are initially implemented on standard height (4U)
PCI/PCI-X cards, and only in the second or third generation of the technology
do low-profile (2U) cards become available. The initial Gigabit Ethernet, 1x InfiniBand, 4x InfiniBand, Myrinet 2000, iSCSI HBA,
Gigabit Ethernet TOE cards, and 1 Gb/s Fibre Channel cards were all standard
height cards, requiring 4U cases. Most fabrics provide switching capabilities
in powers of 2, frequently with 8 ports as a minimum, leading to a standard
purchase of an 8-port switch and 8 host interface cards, as 4 host cards
provide little insight into scalability. Earlier constraints
limited the testbed to six 4U Pentium-4 nodes, preventing eight-way evaluations
when standard height interface cards were required. Adding two more 4U
Pentium-4 nodes would enable eight-way evaluations requiring standard height
interface cards to be conducted, permitting more direct comparisons with other
evaluations.

New and Emerging Technologies

The design changes for the FY 2004 testbed were influenced by the new and emerging technologies impacting the GUPFS solution that are expected to be available for evaluation during the coming year. In this regard, several important issues need to be investigated in the near term, including: ·
Conducting cross-platform
file system tests. The GUPFS project plans to
conduct cross-platform file system tests to explore functionality and
deployment issues in a heterogeneous environment that involves multiple
hardware and different OS architectures, which is designed to mimic the NERSC
environment in which GUPFS will be deployed. These tests will require either
incorporation of additional systems into the testbed, or opening up the testbed
to other NERSC systems, both of which require additional fabric-switching
capabilities. ·
Conducting multiple cluster
file system tests. The GUPFS project plans to
conduct file system tests involving multiple clusters accessing the same file
system and storage simultaneously, as is expected in the NERSC environment at
deployment. PDSF, Alvarez, and Dev2 are likely candidate peer systems. This
will require opening up and connecting the testbed to these other NERSC
systems. ·
Evaluating 4 Gb/s and 10
Gb/s Fibre Channel. Both 4 Gb/s and 10 Gb/s
production quality Fibre Channel equipment will become available in the
time frame of the initial phase of GUPFS deployment. Because of the anticipated
aggregate performance needs for production use — and as it is likely that the
backend storage controllers, if not the storage itself, for most shared file
system solutions will be Fibre Channel connected — these technologies are
likely to be important to a successful GUPFS deployment. As such, they need to
be evaluated and understood. ·
Evaluating 10 Gb/s Ethernet.
The 10 Gb/s Ethernet technology is expected to be
deployed during FY 2004. Because it is most likely that PDSF will be accessing
the GUPFS file system over the Ethernet, 10 Gb/s Ethernet is a likely component
of the deployed GUPFS solution and needs to be understood in a storage fabric
context. ·
Evaluating Panasas file
system and storage. The Panasas ActiveScale File
System is a very interesting object-based file system implemented over the
Ethernet. Architecturally, it is quite similar to Lustre, but is more standards
based, being implemented with a variant of iSCSI. The Panasas file system
offering is integrated with Ethernet-attached storage devices specific to the
file system, and can be accessed either through integrated NFS and CIFS
gateways, or as part of a shared file system through the DirectFlow client
software. The Panasas file system should be accessible over any IP-based fabric
that can bridge to the Ethernet. This is a promising candidate file system and
needs to be evaluated for GUPFS. ·
Evaluating the IBRIX file
system. The IBRIX file system is an interesting
potential GUPFS file system solution that is based on federating the individual
file systems of storage engines (SEs). The IBRIX file system is distributed
over IP networks. It utilizes back-end SAN based storage. The IBRIX file system
was originally scheduled to be available in preproduction versions for
evaluation in FY 2003. This schedule has slipped into FY 2004. ·
Evaluating the IBM TotalStorage SANFS (StorageTank) file system. The IBM StorageTank
file system was renamed the TotalStorage SANFS file system at the end of FY
2003. SANFS is expected to become available as a product in the first half of
FY 2004. It targets very large numbers of client systems, and supports multiple
hardware architectures and operating systems. It uses metadata servers in
conjunction with block storage accessed via iSCSI over IP networks (making it largely fabric agnostic), or
accessed directly by Fibre Channel. The ability to access remote tanks over the
WAN is being developed. SANFS is an extremely promising GUPFS candidate,
although quite young, and needs to be investigated thoroughly. ·
Further iSCSI
investigations. The iSCSI protocol is
making an appearance in several promising shared file systems. It also provides
a cheap mechanism for accessing block storage over inexpensive fabrics,
although at the expense of higher processor overhead. With the ability of
most fabrics and interconnects to perform IP transfers, and with the ability of
most fabrics to bridge to the Ethernet, iSCSI may facilitate the implementation
of heterogeneous fabrics directly tied into cluster interconnects. However, it
needs much more investigation as its availability and use expand.

Additional Considerations

Other considerations that affected the design of the FY 2004 testbed include: ·
InfiniBand technology
refresh needed. The testbed InfiniBand technology needs to be refreshed. Current
second-generation 4x InfiniBand equipment, particularly HCAs and Fibre Channel
and Gigabit Ethernet gateways, needs to be acquired. The original 1x IB equipment and the
loaner first-generation 4x IB equipment are no longer supported. ·
Multiple management nodes
needed. The testbed needs at least two management
nodes. The management node is currently a single central point of failure in
the testbed, and is extremely difficult and complex to configure. A second
management node is needed to ensure the testbed and GUPFS project evaluations
and investigations can continue if the existing management node fails. In
addition, the availability of a second management node would allow the
management node software versions and configurations to be upgraded one at a
time without disruption. ·
Gigabit Ethernet emerging as
the standard fabric-to-others bridge. Gigabit Ethernet is emerging as the common fabric to which all other fabrics and interconnects
bridge. Because of this, a large number of Gigabit Ethernet switch ports are
needed in the testbed, particularly in conjunction with the iSCSI,
cross-platform, and multi-cluster file system testing planned for FY 2004. ·
The Myrinet to Gigabit
Ethernet bridge would expand the file systems that can be tested with Alvarez. Myricom announced a Gigabit Ethernet bridge blade for their Myrinet
switches, with 8 Gigabit Ethernet ports. With such a fabric bridge, the number of shared file
systems that could be tested on and in conjunction with the LBNL Alvarez Linux
cluster increases substantially. IP-based file systems such as Panasas and
IBRIX could be evaluated for scalability. Block-based file systems, such as
StorNext, could be tested for scalability using iSCSI bridged to storage in the
GUPFS testbed. In addition, a Myrinet upgrade to the Rev D card in late FY 2003
allowed low-profile PCI-X Myrinet 2000 cards to be installed, enabling the
Myrinet network to be installed in 2U nodes, thus freeing up the 4U nodes for
other uses and enabling the full use of the Myrinet switch with 8 hosts.

2. Updated Testbed Configuration for FY 2004

Design of the updated testbed configuration was completed at the end of the third quarter of FY 2003. This design was based on all of the considerations presented in the previous section. The central aims of the updated configuration were to increase the number of simultaneous evaluations that could be conducted, increase the maximum scale of these evaluations, and simplify the process of physically reconfiguring the connectivity from testbed nodes to storage devices through various fabric components.

The updated configuration expanded the total number of Pentium-4 nodes from 22 to 36. As in FY 2003, four of these Pentium-4 nodes were retained as dedicated special-purpose nodes, although there were some changes in assigned functions. The remaining 32 Pentium-4 nodes were assigned as compute nodes dedicated to running benchmarks and conducting other investigations.

To facilitate conducting multiple simultaneous independent evaluations, the 32 compute nodes were logically partitioned into four sets of eight. This logical partitioning allows up to four independent 8-way investigations to be conducted simultaneously, or tests in various size combinations, such as one 32-way, two 16-way, or one 16-way and two 8-way tests.

While the partitioning of the compute nodes into groups was at a logical level, some physical characteristics were related to the partitioning. Each group of eight compute nodes was connected to a separate Dell Power Connect Gigabit Ethernet switch, which allowed the nodes in the group maximum communication performance among themselves. The Dell switches for each of the groups were then connected by four-way trunks to a central Extreme 7i switch. This allowed nodes in different groups to communicate with each other, but with reduced aggregate bandwidth and increased latency.
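The logical partitioning described above can be illustrated with a short sketch. It splits 32 compute nodes into four groups of eight and lists the ways whole groups can be combined into simultaneous tests. The node names (`cn01` to `cn32`), the function names, and the restriction of a test to 1, 2, or 4 whole groups are assumptions made for illustration; the report only gives the example combinations.

```python
# Hypothetical sketch: names and the power-of-two restriction are assumptions.
GROUP_SIZE = 8    # compute nodes per logical group (from the text)
NUM_GROUPS = 4    # four groups of eight = 32 compute nodes


def group_nodes(nodes, group_size=GROUP_SIZE):
    """Split an ordered node list into consecutive groups of group_size."""
    if len(nodes) % group_size != 0:
        raise ValueError("node count must be a multiple of the group size")
    return [nodes[i:i + group_size] for i in range(0, len(nodes), group_size)]


def evaluation_layouts(num_groups=NUM_GROUPS, group_size=GROUP_SIZE,
                       allowed_parts=(4, 2, 1)):
    """List the ways to use every group, as node counts per simultaneous test.

    allowed_parts gives how many whole groups one test may span; limiting it
    to powers of two is an assumption, not something the report requires.
    """
    layouts = []

    def extend(remaining, largest, parts):
        if remaining == 0:
            layouts.append([p * group_size for p in parts])
            return
        for part in allowed_parts:
            # Keeping parts non-increasing avoids listing a layout twice.
            if part <= largest and part <= remaining:
                extend(remaining - part, part, parts + [part])

    extend(num_groups, num_groups, [])
    return layouts


nodes = [f"cn{i:02d}" for i in range(1, 33)]   # hypothetical node names
groups = group_nodes(nodes)
layouts = evaluation_layouts()
for layout in layouts:
    print(" + ".join(f"{n}-way" for n in layout))
```

Under these assumptions the sketch lists one 32-way test, two 16-way tests, one 16-way plus two 8-way tests, and four 8-way tests, which matches the combinations named in the text.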
Another physical characteristic related to the partitioning of the nodes was the additional PCI-X fabric interface cards in each node. For a variety of reasons, the GUPFS project conducts evaluations of fabric interfaces/interconnects with a minimum of eight hosts for each fabric. In addition to Fibre Channel interfaces, present on all nodes except the management nodes, the GUPFS testbed contains three other sets of high-performance fabrics, each of which is connected to eight compute nodes. These additional fabrics are a Myrinet 2000 interconnect and, after the testbed upgrade, two 4x InfiniBand fabrics from different vendors. The three sets of nodes with extra fabric connections are each put into logically separate groups to facilitate the independent testing of these extra fabrics.

An additional element of the updated configuration was the disposition of the original six Pentium-3 nodes. One of these has always been used as the testbed management node. The remaining five were used as compute nodes in FY 2002 and FY 2003, although in an auxiliary role during FY 2003. With the addition of more Pentium-4 nodes in the updated configuration, it was possible to stop using the Pentium-3 nodes as compute nodes and assign them to other supporting duties. One was to become a second testbed management node, providing redundancy and simplifying upgrades. Another was to become a dedicated interactive node, offloading this function from the management nodes for the reasons discussed earlier. The remaining three Pentium-3 nodes became dedicated development nodes for benchmark and analysis code development, with possible auxiliary roles in HPSS integration investigations.

Another part of the upgraded configuration was the installation of a fiber-optic patch panel, allowing all fiber-optic ports to be centrally connected in a static configuration and then cross-connected as necessary using easily movable fiber-optic patch cords.
All fiber-optic Gigabit Ethernet, Myrinet, and Fibre Channel host adapters, switch ports, and device connections were hardwired into the central patch panel to simplify physical reconfiguration of the fabrics and connections.

The other major elements of the updated configuration were the purchase of the Yotta Yotta NetStorager as the standard high-performance storage to be used in evaluations in lieu of the disappointing CX600, the previously mentioned expansion of Gigabit Ethernet switching capacity, additional Fibre Channel switching capacity, and a 4x InfiniBand technology refresh from two vendors.

A front view of the updated testbed for FY 2004 appears in Figure 1. A rear view of the testbed, showing the nodes and cable connections, appears in Figure 2. The updated FY 2004 testbed configuration is shown in Figure 3. The 240-port fiber-optic patch panel is shown in Figure 4.
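The patch panel arrangement (every fiber port hardwired once to a panel position, with short patch cords making the changeable cross-connections) can be modeled as two small maps. The sketch below is only an illustration of that idea; the class, its methods, the port numbering, and the endpoint labels are all hypothetical, and only the 240-port size comes from the report.

```python
class PatchPanel:
    """Toy model of a fiber patch panel: fixed wiring plus movable cords."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.wired = {}   # panel port -> endpoint label (static cabling)
        self.cords = {}   # panel port -> panel port (stored in both directions)

    def _check(self, port):
        if not 1 <= port <= self.num_ports:
            raise ValueError(f"port {port} is outside 1..{self.num_ports}")

    def wire(self, port, endpoint):
        """Record the permanent cable from a device port to a panel port."""
        self._check(port)
        if port in self.wired:
            raise ValueError(f"port {port} is already wired")
        self.wired[port] = endpoint

    def connect(self, a, b):
        """Add a patch cord between two panel ports."""
        if a == b:
            raise ValueError("a patch cord needs two different ports")
        for port in (a, b):
            self._check(port)
            if port in self.cords:
                raise ValueError(f"port {port} already has a patch cord")
        self.cords[a] = b
        self.cords[b] = a

    def disconnect(self, port):
        """Remove the patch cord attached to a panel port."""
        other = self.cords.pop(port)
        del self.cords[other]

    def peer_of(self, endpoint):
        """Return the endpoint currently patched to the given endpoint."""
        for port, label in self.wired.items():
            if label == endpoint:
                other = self.cords.get(port)
                return self.wired.get(other) if other is not None else None
        raise KeyError(endpoint)


panel = PatchPanel(240)                  # panel size from the report
panel.wire(1, "node01:fc0")              # hypothetical endpoint labels
panel.wire(2, "fc-switch-a:port0")
panel.wire(3, "fc-switch-b:port0")

panel.connect(1, 2)
first_peer = panel.peer_of("node01:fc0")

# Moving the node to another switch only touches the patch cord.
panel.disconnect(1)
panel.connect(1, 3)
second_peer = panel.peer_of("node01:fc0")
print(first_peer, "->", second_peer)
```

Keeping the cross-connections in one table also addresses the record-keeping problem noted earlier, since what is connected to what can be read from the table instead of traced through fiber bundles.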
Figure 1. The FY 2004 testbed, with the refurbished NetStorager in front.

The following major components were added to the testbed as part of its technology upgrade for FY 2004:
· Fourteen additional dual Pentium-4 nodes: 12 in 2U cases and 2 in 4U cases (these nodes were identical to those already in the testbed)
· One 64-port 2 Gb/s Fibre Channel Qlogic SANbox2-64 switch with 48 ports
· Five 24-port Dell Power Connect 5224 Gigabit Ethernet switches
· One Yotta Yotta NetStorager GSX 2400 Disk Storage Subsystem
· InfiniCon ISIS InfinIO 7000 switch and fabric bridge 4x InfiniBand upgrade
· One Topspin TS90 4x InfiniBand switch and Fibre Channel gateway
· One Myrinet 2000 MS-SW16-8E switch line card with 8 Gigabit Ethernet ports
· A fiber-optic patch panel and cables for Gigabit Ethernet, Myrinet, and Fibre Channel
Figure 2. Rear view of the FY 2004 testbed.

The new Pentium-4 nodes obtained as part of the upgrade were as identical as possible to those obtained the previous year. They were configured with the same motherboard, the same quantity and performance grade of memory, the same 2.2 GHz Xeon CPU, and the same Intel Gigabit Ethernet and Qlogic Fibre Channel PCI-X cards. All of these components differed from the originals only in revision numbers. The new nodes were also equipped with U160 SCSI disks of the same speed (10,000 RPM) and capacity (36 GB), but from a different manufacturer.

This effort to match the earlier nodes was made to ensure uniformity of performance, so that results obtained from both sets would be comparable and the nodes could be intermixed without affecting evaluation results. The motherboard proved to be the most difficult component to obtain, as it was being phased out. However, a newer revision of the motherboard was acquired that showed nearly identical performance, which allowed the new and existing Pentium-4 nodes to be intermixed with negligible impact on the benchmark results.
Figure 3. Updated GUPFS testbed configuration for FY 2004.

The major components of the updated GUPFS testbed for FY 2004 included:

System nodes
· 36 dual Pentium-4 nodes: 28 in 2U cases and 8 in 4U cases; 32 for use as compute nodes and 4 for use as special-purpose nodes
· 6 dual Pentium-3 nodes in 4U cases, used as management, interactive, and auxiliary testbed support nodes

Fabric
· One 240-connector fiber-optic patch panel with SC connectors and patch cables (see Figure 4)
· Ethernet
  o One 32-port Extreme 7i Gigabit Ethernet switch
  o One 16-port Extreme 5i Gigabit Ethernet switch
  o Five 24-port Dell Power Connect 5224 Gigabit Ethernet switches
  o Two 10/100 Ethernet switches for system management
· Fibre Channel
  o One 48-port 2 Gb/s Fibre Channel switch (SANbox2-64)
  o One 16-port 2 Gb/s Fibre Channel switch (Brocade 3800)
  o One 16-port 1 Gb/s Fibre Channel switch (Brocade 2800)
  o One Cisco SN5428 iSCSI router fabric bridge to Ethernet
· Myrinet
  o One Myrinet 2000 8-port switch with 8 Revision D host interface cards
  o One Myrinet 2000 MS-SW16-8E switch card with 8 Gigabit Ethernet ports for bridging between Myrinet and Gigabit Ethernet
· InfiniBand
  o One InfiniCon ISIS InfinIO 7000 4x InfiniBand switch, 8 4x HCA host adapters, and single fabric bridge modules for Fibre Channel and Gigabit Ethernet
  o One Topspin TS90 4x InfiniBand switch, 8 4x HCA host adapters, and single fabric bridge modules for Fibre Channel and Gigabit Ethernet

Storage
· Yotta Yotta NetStorager GSX 2400
· EMC CLARiiON CX600 disk subsystem
· Dot Hill 7124 RAID disk subsystem
· Silicon Gear Mercury II RAID subsystem
· Chaparral A8526 RAID subsystem with attached storage

The expanded testbed, with its increased scale, updated and new technologies, and features to support easier reconfiguration, will facilitate the evaluations to be conducted during FY 2004. The increased scale of the testbed will enable both more simultaneous small-scale initial evaluations and single larger-scale evaluations. The testbed’s multiple fabrics and the bridges between them will allow the issues involving heterogeneous fabric environments, such as those expected at NERSC, to be investigated and understood.
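As a rough cross-check of the switching capacity in the component list, the sketch below totals the Gigabit Ethernet and Fibre Channel switch ports and compares the Gigabit Ethernet total with the 3x expansion of the earlier 48 ports that the limitations discussion called for. The dictionary layout and names are illustrative assumptions. The count covers only the switches listed; it ignores ports on fabric bridge modules and on the Myrinet line card, and it does not subtract ports consumed by inter-switch trunks.

```python
# Switch inventory transcribed from the component list: (units, ports per unit).
GIGE_SWITCHES = {
    "Extreme 7i": (1, 32),
    "Extreme 5i": (1, 16),
    "Dell Power Connect 5224": (5, 24),
}
FC_SWITCHES = {
    "Qlogic SANbox2-64": (1, 48),
    "Brocade 3800": (1, 16),
    "Brocade 2800": (1, 16),
}

PREVIOUS_GIGE_PORTS = 48   # ports available before the upgrade (from the text)
EXPANSION_TARGET = 3       # the "3x expansion" called for in the text


def total_ports(switches):
    """Sum units * ports-per-unit over an inventory dictionary."""
    return sum(units * ports for units, ports in switches.values())


gige_total = total_ports(GIGE_SWITCHES)                  # 32 + 16 + 5 * 24 = 168
fc_total = total_ports(FC_SWITCHES)                      # 48 + 16 + 16 = 80
gige_target = PREVIOUS_GIGE_PORTS * EXPANSION_TARGET     # 144

print(f"Gigabit Ethernet ports: {gige_total} (target {gige_target})")
print(f"Fibre Channel ports:    {fc_total}")
print(f"Target met: {gige_total >= gige_target}")
```

By this simple count the listed Gigabit Ethernet switches provide 168 ports, which exceeds the 144-port target, although the usable number is lower once trunk ports are taken into account.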
Figure 4. 240-port fiber-optic patch panel.

The increased number of testbed nodes will also facilitate cross-platform and multiple-OS tests of promising file systems that support such capabilities. A great deal of important information and experience stands to be gained through these tests, which will explore the issues associated with deployment in a heterogeneous environment, as is expected at NERSC. The increased Gigabit Ethernet fabric switching capacity and the additional, improved fabric bridging capabilities will further facilitate this testing and will enable multiple-cluster testing to be conducted with both the Alvarez cluster and PDSF systems. This will allow phased deployment to be simulated, so that we can begin addressing the networking issues associated with deployment.
Page last modified: Tue, 22 Jun 2004 22:56:22 GMT. Page URL: http://www.nersc.gov/projects/GUPFS/testbed/GUPFS_testbed04.php