National Energy Research Scientific Computing Center 2004 Annual Report

Science-Driven Analytics
Simulations and experiments are generating data faster than it can be analyzed and understood. Addressing this bottleneck in the scientific discovery process is the emerging discipline of analytics, which has the simple goal of understanding data.
The term analytics refers to a set of interrelated technologies and intellectual disciplines that combine to produce insight and understanding from large, complex, disparate, and sometimes conflicting datasets. These technologies and disciplines include data management, visualization, analysis, and discourse aimed at producing specific types of understanding. These in turn rely on the computational infrastructure, expertise in using that infrastructure, and close cooperation between domain scientists, computational scientists, and computer scientists.
More specifically, visual analytics is the science of analytic reasoning facilitated by interactive visual interfaces. Its objective is to enable analysis of overwhelming amounts of information, and it requires human judgment to make the best possible evaluation of incomplete, inconsistent, and potentially erroneous information.
NERSC’s analytics strategy builds on two of the Center’s existing strengths: (1) proven expertise in effectively managing large, complex computing, infrastructure, and data storage systems to solve scientific problems of scale; and (2) exemplary user services, consulting, and domain scientific knowledge that help the NERSC user community effectively employ the Center’s resources to solve challenging scientific problems. On this foundation, NERSC’s analytics strategy adds an increased emphasis on facilities, infrastructure, expertise, and alliances that can be used to realize analytics solutions.
With the establishment of its new Analytics Team, NERSC is realigning its resources to support analytics activities. The NERSC Center’s infrastructure is being broadened to include elements such as database deployment and support, with an increased focus on data analysis and scientific data management to support analytics. The existing visualization program is being expanded to include information visualization and integrated data management, analysis, and distributed computing. The goal is a well-rounded service and technology portfolio that is responsive to the analytics needs of NERSC’s user community.
NERSC’s analytics strategy includes five elements:
1. Taking a proactive role in deploying emerging technologies. NERSC will increasingly become a conduit for prototype technologies that emerge from the DOE computer science research community. Analytics will require adapting and deploying technologies from several different areas—data management, analysis, visualization, dissemination—into a unified workflow that functions effectively in a time-critical production environment. The role of NERSC staff will include deploying new system and support software, helping applications software engineers effectively use NERSC resources, and playing a proactive role in providing feedback to the original computer science researchers and developers to address security or performance concerns.
2. Enhancing NERSC’s data management infrastructure. The NERSC Global Filesystem offers increased performance for all applications, including data-intensive analytics tasks. It also helps streamline distributed workflows and provides high I/O rates, which are important for large datasets. NERSC also plans to increase its archival storage to nearly 40 PB over the next five years. In the near term, NERSC will evaluate and deploy software that provides distributed, file-level data management.
3. Expanding NERSC’s visualization and analysis capabilities. One of the most significant activities performed by the NERSC visualization staff is providing in-depth, one-on-one consulting services, such as those provided to INCITE and other large projects. These activities typically involve finding or engineering solutions where none exist off the shelf. In addition, visualization staff will evaluate new visualization hardware and software technologies to determine which are beneficial to the user community. These technologies may include information visualization, which differs from the better-known scientific visualization in that the underlying data does not readily lend itself to spatial mapping; one example is comparing the results of genome alignment across multiple species. As data size and complexity grow, it will become increasingly crucial to use analysis technologies to reduce the processing load through the computational and visualization pipelines, as well as to reduce the “scientific processing load” on the humans who must interpret and understand the results. A portfolio of commercial, production, open-source, and research-grade technologies is expected to be most effective in meeting users’ scientific needs.
4. Enhancing NERSC’s distributed computing infrastructure. NERSC’s strategy for supporting distributed computing will be tailored to provide services that have the broadest possible benefit and that conform to security requirements. In addition to providing low-level infrastructure such as the Open Grid Services Architecture (OGSA) and similar technologies that provide authentication and secure data movement across the network, NERSC will investigate and deploy higher-level applications and services that emerge from research and applications communities like the Particle Physics Data Grid and the Earth System Grid. Both of those projects rely on standard services for brokering access to data and tools that serve large, distributed user communities. NERSC will work closely with the user community to provide the documentation and assistance they need to construct analytics workflows.
5. Understanding the analytics needs of the user community. To be effective, NERSC’s new program focus on Science-Driven Analytics will require additional information from the user community. To that end, the entire user community was surveyed in early 2006 to identify its most pressing analytics needs. The findings from this survey have been instrumental in shaping and prioritizing the emerging analytics effort. NERSC will continue soliciting input from users as well as tracking analytics trends in the larger scientific community.