1999
Annual Report
Table of Contents Year in Review Science Highlights  

Science Highlights:
Advanced Scientific Computing Research and Other Projects
Sparse Linear Algebra Algorithms
and Applications for MPPs
Director's
Perspective
Year in Review
Computational Science
Shared Memories:
Reflections on
NERSC's 25th
Anniversary
Researchers Solve a Fundamental Problem of Quantum Physics
User Satisfaction Continues to Grow
New Computing
Technologies
NERSC-3 Procurement Team Recognized for
Successful Effort
Oakland Scientific Facility Under Construction
Towards a DOE
Science Grid
----------------
Grand Challenge Retrospective
----------------
Science Highlights
Basic Energy Sciences
Biological and Environmental Research
Fusion Energy Sciences
High Energy and Nuclear Physics
Advanced Scientific Computing Research and Other Projects


Horst Simon, Chris Ding, Parry Husbands, Osni Marques,
and Kesheng John Wu, NERSC, Lawrence Berkeley National Laboratory
Alan Edelman, Massachusetts Institute of Technology
Hongyuan Zha, Pennsylvania State University
Zhenyue Zhang, Zhejiang University, Hangzhou, China


Research Objectives

The goal of this research is to use high performance computing to help understand and improve subspace-based techniques for information retrieval, such as latent semantic indexing (LSI). In the past, LSI has been used only on very small data sets. We are developing an environment and scalable linear algebra algorithms with which LSI-based information retrieval can be applied to matrices representing millions of documents and hundreds of thousands of key words.

To improve the accuracy and efficiency of the retrieval algorithms, we hope to: (1) Further develop our subspace-based model and gain deeper understanding of the effectiveness of LSI. (2) Develop more robust and computationally efficient statistical tests for the determination of the optimal latent-concept subspace dimensions. (3) Further explore the low-rank-plus-shift structures of the term-document matrices and develop fast and memory-efficient numerical algorithms for the computation of low-rank matrix approximations. We will also investigate linear-time partial singular value decomposition (SVD) algorithms based on term and document sampling. (4) Investigate a variety of statistical methods such as canonical correlation analysis and a generalized linear model for translingual text retrieval.


Computational Approach

We currently plan to explore the use of SVD. In the future, other low-rank approximations to the document structure will be considered. For large-scale linear algebra operations we make extensive use of the ScaLAPACK, LAPACK, PARPACK, TRLAN, PETSc, and Aztec libraries.

Accomplishments

We successfully ported the parallel computing extension MATLAB*P to NERSC's Cray T3E. This tool provides the ability to manipulate dense and sparse matrices stored on the T3E using the popular MATLAB environment. The goal is to reduce the programming overhead in setting up our numerical experiments.

Preliminary experiments on a large collection with over 500,000 documents have demonstrated the feasibility of our software. We have been able to perform LSI on the complete collection, whereas previous attempts have had to restrict their attention to a smaller subset of the data.

MATLAB*P provides a familiar, transparent interface for bridging the gap between workstations and supercomputers. It allows manipulation of dense and sparse matrices on MPP systems with little or no modification of scripts written for workstations.


Significance

This project is one of the first to compute decompositions of complete, large, term-document matrices. The algorithms developed here will be useful not only in text retrieval, but also in more complicated settings such extraction of images with desired features from large collections. Combining effective search and classification algorithms for image problems with the computing and storage capabilities of future NERSC systems will position Berkeley Lab in a leadership position when it comes to the development of algorithmic techniques for the data and visualization corridors of the next decade.


Publications

Parry Husbands, Charles Isbell, and Alan Edelman, "MATLAB*P: A tool for interactive supercomputing," in Proceedings of the 9th SIAM Conference on Parallel Processing for Scientific Computing (1999).

Zhenyue Zhang, Hongyuan Zha, and Horst Simon, "Low-rank approximations with sparse factors I: Basic algorithms and error analysis; II: Penalized methods with discrete newton-like iterations," SIAM Journal of Matrix Analysis (submitted); LBNL-44003 and LBNL-44217 (1999).

Chris H. Q. Ding, "A similarity-based probability model for latent semantic indexing," presented at the 22nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, CA, December 1999; LBNL-43202 (1999).


< Table of Contents Top ^ Next >