NERSC: Powering Scientific Discovery Since 1974

Costas Maranas

Case Study Worksheet

Project Information - Metabolic engineering, protein design, and bioinformatics

Document Prepared By Costas Maranas
Project Title Metabolic engineering, protein design, and bioinformatics
Principal Investigator Costas Maranas
Participating Organizations Penn State University
Science Category Climate Environmental Science Biological Sciences
Funding Agencies DOE SC DOE NSA NSF NOAA NIH Other:

Project Summary (Scientific Objectives)

Please give a brief description of your project and its scientific objectives for the next 3-5 years.

At the genome-scale metabolic model level (Aim 1), we intend to develop computational methods that will simultaneously bring to bear multiple types of analyses and data (i.e., network connectivity, gene essentiality experiments, and metabolomic and transcriptomic data) to automatically assess the quality of genome-scale reconstructions and generate hypotheses for their correction.

At the isotope mapping model construction front (Aim 2), using the genome-scale metabolic reconstructions of Aim 1 as input, we propose to develop computational techniques that will enable the largely automated tracking of labeled atoms. This will provide comprehensive isotope tracking maps to support metabolic flux elucidation through metabolic flux analysis (MFA).

At the flux elucidation level (Aim 3), we plan to use the isotope maps developed in Aim 2 to estimate metabolic flux values consistent with experimental data and to pinpoint what additional measurements are needed to fully resolve all fluxes in genome-scale metabolic models. To this end, we will develop customized global optimization procedures to address the nonlinearities in the isotope balance equations, extend the algorithmic base to handle isotopically nonstationary data using dynamic optimization concepts, and quantify the impact of measurement error.

Finally, on the strain optimization front (Aim 4), we propose to develop computational tools that will make direct use of flux information from Aim 3 as well as regulatory, thermodynamic, or even kinetic information whenever available. The key concept here is that, instead of looking for specific engineering strategies one at a time, we seek to classify all fluxes in the metabolic model according to whether they must increase, decrease, or become equal to zero to meet a pre-specified overproduction target. Additionally, once we identify these pathways, we can explore whether overexpression alone can achieve the necessary activity or whether we need to computationally engineer substrate specificities and catalytic activity through protein design.
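
The flux classification idea in Aim 4 can be illustrated with a minimal sketch. It assumes that a minimum and maximum flux are already available for each reaction in a reference strain and under the overproduction target, and that a change is "required" only when the two ranges do not overlap. The function name, labels, tolerance, and example numbers are hypothetical and are not taken from the project's codes.

```python
from typing import Dict, Tuple

FluxRange = Tuple[float, float]  # (minimum flux, maximum flux), signed values


def classify_flux(reference: FluxRange, target: FluxRange, tol: float = 1e-9) -> str:
    """Label a reaction by comparing its reference and target flux ranges."""
    ref_min, ref_max = reference
    tgt_min, tgt_max = target
    # If the ranges overlap, some reference flux value already meets the target.
    disjoint = tgt_min > ref_max + tol or tgt_max < ref_min - tol
    if not disjoint:
        return "no required change"
    # A target range collapsed onto zero means the reaction must be shut off.
    if abs(tgt_min) <= tol and abs(tgt_max) <= tol:
        return "must be zero"
    return "must increase" if tgt_min > ref_max else "must decrease"


# Hypothetical (reference range, target range) pairs for four reactions.
ranges: Dict[str, Tuple[FluxRange, FluxRange]] = {
    "R1": ((1.0, 2.0), (3.0, 4.0)),
    "R2": ((5.0, 6.0), (1.0, 2.0)),
    "R3": ((0.5, 1.5), (0.0, 0.0)),
    "R4": ((0.0, 3.0), (1.0, 4.0)),
}

labels = {name: classify_flux(ref, tgt) for name, (ref, tgt) in ranges.items()}
for name, label in labels.items():
    print(name, label)
```

Every reaction receives a label in a single pass, which reflects the stated goal of classifying all fluxes at once rather than testing engineering strategies one at a time.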

Current HPC Usage and Methods

Facilities Used NERSC NCCS ACLF NSF Centers Other: Lion XJ cluster at High Performance Computing center at PSU
Architectures Used Cray XT IBM Power BlueGene Linux Cluster Other:
Total Computational Hours Used per Year 500,000 hrs/year Core-Hours NERSC Hours Used per Year 0 Core-Hours
Number of Cores Used in Typical Production Run 4 to 200 Wallclock Hours of Single Typical Production Run 200 to 400
Total Memory Used per Run 5 to 10 GB Minimum Memory Required per Core 1 GB
Total Data Read & Written per Run Less than 1 GB Size of Checkpoint File(s) Between 1 to 50 GB
Amount of Data Moved In/Out of NERSC GB How Often Never
On-Line File Storage Required (Directly Accessible from a Running Job) Less than 1 GB ~100s Files
Off-Line Archival Storage Required Less than 1 GB ~100s Files

Please list any required or important software, services, or infrastructure (beyond supercomputing and standard storage infrastructure) provided by HPC centers or system vendors.

CPLEX, CONOPT, CHARMM, Gaussian03

Please list your current primary codes and their main mathematical methods and/or algorithms. Include quantities that characterize the size or scale of your simulations or numerical experiments; e.g., size of grid, number of particles, basis sets, etc. Also indicate how parallelism is expressed (e.g., MPI, OpenMP, MPI/OpenMP hybrid)

Our current primary codes use mathematical optimization formulations (i.e., MILP and NLP) and combinatorial graph analyses. The optimization problems typically involve thousands of binary variables, and the graphs are similarly large, comprising thousands of nodes. Parallelism is currently handled by manually segregating computing tasks across different compute nodes.
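
As a rough illustration of what segregating independent tasks across nodes involves, the sketch below deals a list of tasks into per-node batches in round-robin order. The task names, the node count, and the `partition_tasks` helper are hypothetical; this is not the group's actual workflow.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def partition_tasks(tasks: Sequence[T], n_nodes: int) -> List[List[T]]:
    """Deal independent tasks into n_nodes batches in round-robin order."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    batches: List[List[T]] = [[] for _ in range(n_nodes)]
    for index, task in enumerate(tasks):
        batches[index % n_nodes].append(task)
    return batches


# Hypothetical workload: one independent optimization problem per reaction.
tasks = [f"reaction_{i}" for i in range(10)]
batches = partition_tasks(tasks, 4)
for node, batch in enumerate(batches):
    print(f"node {node}: {batch}")
```

Round-robin assignment balances the number of tasks per node but not their run times, which can vary widely for MILP problems; a work queue would balance load better at the cost of extra coordination.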
 
Further, our in-house protein design algorithms use optimization formulations (i.e., MILP) as well as molecular mechanics energetic calculations (e.g., binding energy, molecular dynamics) on systems on the order of tens of thousands of atoms. We use quantum mechanics calculations (e.g., DFT, MP2) to estimate the ground- and transition-state energetics of proteins interacting with their substrates. For these calculations, we use the MPI library for communication between processors.

Please list the known limitations/obstacles/bottlenecks of resources on currently available HPC systems, and in particular, those at NERSC.

Key limitations include computational time due to the NP-hard nature of the underlying mathematical problems. In some cases, memory usage due to the combinatorial explosion of branch-and-bound trees becomes limiting.
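
To make the branch-and-bound growth concrete, here is a toy depth-first branch-and-bound for a 0/1 knapsack problem that counts the tree nodes it visits. It is a sketch only: the knapsack instance is hypothetical, and the algorithm is far simpler than what a commercial MILP solver such as CPLEX does. Without pruning, the tree for n binary variables can contain up to 2^(n+1) - 1 nodes, and solvers that keep many open nodes in memory face the memory pressure described above.

```python
from typing import Sequence, Tuple


def knapsack_branch_and_bound(
    values: Sequence[float], weights: Sequence[float], capacity: float
) -> Tuple[float, int]:
    """Return (best total value, number of tree nodes explored).

    Assumes all weights are strictly positive.
    """
    # Sort by value density so the greedy fractional fill is a valid upper bound.
    order = sorted(range(len(values)), key=lambda i: values[i] / weights[i], reverse=True)
    v = [values[i] for i in order]
    w = [weights[i] for i in order]
    n = len(v)
    best = 0.0
    explored = 0

    def upper_bound(k: int, value: float, room: float) -> float:
        # LP relaxation: take whole items while they fit, then a fraction of the next.
        for i in range(k, n):
            if w[i] <= room:
                room -= w[i]
                value += v[i]
            else:
                return value + v[i] * room / w[i]
        return value

    def branch(k: int, value: float, room: float) -> None:
        nonlocal best, explored
        explored += 1
        best = max(best, value)
        # Prune when no items remain or the relaxation cannot beat the incumbent.
        if k == n or upper_bound(k, value, room) <= best:
            return
        if w[k] <= room:
            branch(k + 1, value + v[k], room - w[k])  # binary variable fixed to 1
        branch(k + 1, value, room)  # binary variable fixed to 0

    branch(0, 0.0, capacity)
    return best, explored


best_value, nodes_explored = knapsack_branch_and_bound([60, 100, 120], [10, 20, 30], 50)
print(best_value, nodes_explored)
```

The quality of the bound determines how much of the tree is pruned; with thousands of binary variables, even a small fraction of unpruned nodes can exhaust time or memory.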

HPC Usage and Methods for the Next 3-5 Years

Anticipated changes to codes, mathematical methods and/or algorithms needed to achieve this project's scientific objectives.

One objective is to develop techniques that solve for flux ranges in an automated and fully parallelized manner. Additionally, the nonlinear flux elucidation problem for large-scale models can benefit from decomposition and from a computationally efficient representation of the isotope mappings.
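
One common way to pose the flux range problem is as a pair of linear programs per reaction, written below. This formulation and its symbols are assumptions for illustration and are not stated in the worksheet: S denotes the stoichiometric matrix, v the flux vector, and v^lb and v^ub the lower and upper flux bounds.

```latex
\begin{aligned}
v_j^{\min} &= \min_{v} \; v_j, \qquad v_j^{\max} = \max_{v} \; v_j, \qquad j = 1, \dots, n, \\
\text{subject to} \quad & S\,v = 0, \\
& v^{\mathrm{lb}} \le v \le v^{\mathrm{ub}} .
\end{aligned}
```

Under this formulation, the 2n linear programs share the same constraints and differ only in their objective, so they are independent of one another and can be distributed across cores without communication.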
 
Our in-house protein redesign algorithm, Iterative Protein Redesign and Optimization (IPRO), currently has time-consuming steps whose run time can be greatly reduced through parallelization. Simultaneous access to more nodes than we currently use would allow us to reliably obtain efficient protein redesigns in significantly less time. IPRO also currently requires manual selection of the initial starting candidate. We anticipate that massively parallelizing IPRO will allow us to computationally determine redesigns for a large number of candidate structures.
 
Determining which measurements to make in order to elucidate fluxes using an incidence matrix will require large amounts of memory, even when using sparse matrices. We anticipate scaling will require increased memory usage and efficient branching algorithms.  
 
We are also currently limited in the size of the models for which we can solve global optimization problems using the branch-and-bound technique. Access to longer-running jobs and more memory would allow us to determine more physiologically relevant models.

Computational Hours Required per Year 800,000
Anticipated Number of Cores to be Used in a Typical Production Run
Anticipated Wallclock to be Used in a Typical Production Run Using the Number of Cores Given Above
Anticipated Total Memory Used per Run GB
Anticipated Minimum Memory Required per Core GB
Anticipated total data read & written per run GB
Anticipated size of checkpoint file(s) GB
Anticipated On-Line File Storage Required (Directly Accessible from a Running Job) GB Files
Anticipated Off-Line Archival Storage Required GB Files

Known or Anticipated architectural requirements (e.g., 2 GB memory/core).

4 GB memory/core 
very high-bandwidth, ultra low-latency network interconnect

Please list any additional required or important software, services, or infrastructure beyond those listed in the previous section.

Accelrys Discovery Studio

It is believed that the dominant HPC architecture in the next 3-5 years will incorporate processing elements composed of 10s-1,000s of individual cores. It is unlikely that a programming model based solely on MPI will be effective, or even supported, on these machines. Do you have a strategy for computing in such an environment? If so, please briefly describe it.

We do not currently have such a strategy.

What Do You Need from NERSC?

Please tell us what you need from NERSC to meet your project's computing needs over the next 3-5 years. Also please feel free to make any general comments.