|
|||||||
Science Highlights: Advanced Scientific Computing Research and Other Projects |
Sparse
Linear Algebra Algorithms and Applications for MPPs |
|||||||
|
The goal of this research is to use high performance computing to help understand and improve subspace-based techniques for information retrieval, such as latent semantic indexing (LSI). In the past, LSI has been used only on very small data sets. We are developing an environment and scalable linear algebra algorithms with which LSI-based information retrieval can be applied to matrices representing millions of documents and hundreds of thousands of key words. To improve the accuracy and efficiency of the retrieval algorithms, we hope to: (1) Further develop our subspace-based model and gain deeper understanding of the effectiveness of LSI. (2) Develop more robust and computationally efficient statistical tests for the determination of the optimal latent-concept subspace dimensions. (3) Further explore the low-rank-plus-shift structures of the term-document matrices and develop fast and memory-efficient numerical algorithms for the computation of low-rank matrix approximations. We will also investigate linear-time partial singular value decomposition (SVD) algorithms based on term and document sampling. (4) Investigate a variety of statistical methods such as canonical correlation analysis and a generalized linear model for translingual text retrieval.
We currently plan
to explore the use of SVD. In the future, other low-rank approximations
to the document structure will be considered. For large-scale linear algebra
operations we make extensive use of the ScaLAPACK, LAPACK, PARPACK, TRLAN,
PETSc, and Aztec libraries. Accomplishments We successfully ported the parallel computing extension MATLAB*P to NERSC's Cray T3E. This tool provides the ability to manipulate dense and sparse matrices stored on the T3E using the popular MATLAB environment. The goal is to reduce the programming overhead in setting up our numerical experiments. Preliminary experiments on a large collection with over 500,000 documents have demonstrated the feasibility of our software. We have been able to perform LSI on the complete collection, whereas previous attempts have had to restrict their attention to a smaller subset of the data.
This project is one
of the first to compute decompositions of complete, large, term-document
matrices. The algorithms developed here will be useful not only in text
retrieval, but also in more complicated settings such extraction of images
with desired features from large collections. Combining effective search
and classification algorithms for image problems with the computing and
storage capabilities of future NERSC systems will position Berkeley Lab
in a leadership position when it comes to the development of algorithmic
techniques for the data and visualization corridors of the next decade.
Publications Zhenyue Zhang, Hongyuan Zha, and Horst Simon, "Low-rank approximations with sparse factors I: Basic algorithms and error analysis; II: Penalized methods with discrete newton-like iterations," SIAM Journal of Matrix Analysis (submitted); LBNL-44003 and LBNL-44217 (1999). Chris H. Q. Ding, "A similarity-based probability model for latent semantic indexing," presented at the 22nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, CA, December 1999; LBNL-43202 (1999). |
||||||||
|
||||||||