Mining the biobibliome
The National Library of Medicine's MEDLINE citation database, the world's largest searchable collection of biomedical literature, has accumulated more than 11 million titles and abstracts since 1966 from articles published in over 4,000 relevant journals. As a tool for retrieving information about a particular gene or protein, it is unsurpassed. As a tool for discovering new connections between particular genes and biological processes, however, it is arguably the world's most underexploited in silico repository of data. A group led by Eivind Hovig (of The Norwegian Radium Hospital, Oslo, Norway) has now outlined a computational approach to extracting new information from this massive archive (Nature Genetics, Vol. 28, No. 1, 01 May 2001). Conducted on a large scale across the entire database, the analysis generates networks of related genes that reveal previously unknown aspects of biology.
A similar, if small-scale, approach was published last year by Benjamin Stapley and Gerald Benoit (of the University of Kentucky), who coined the term "biobibliometrics." The basic assumption is that genes that are mentioned in the same abstract are likely to have a biological relationship. By analogy to global approaches to understanding the genome, transcriptome and proteome, Hovig and colleagues have now searched the titles and abstracts of over 10 million MEDLINE citations (the 'biobibliome') to produce a "gene-to-gene co-citation network" for 13,712 known human genes. By annotating this network with biological attributes such as medical subject heading (MeSH) terms, the authors have identified meaningful biological relationships between sets of genes that, though subsequently validated by experiment, had not been predicted. The computational tools to carry out these analyses have been deposited in a publicly available database called PubGene (www.PubGene.org). PubGene provides an opportunity to harvest at least some of the collective wisdom, as yet unrealized, that has been produced by thousands of scientists over the last 35 years.
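The co-citation idea described above can be illustrated with a minimal Python sketch. It is not the PubGene implementation: the function names, the whole-word matching rule, and the toy abstracts are all assumptions made for illustration. Each pair of gene symbols found in the same text becomes an edge, and the edge weight counts the number of texts in which the pair co-occurs.

```python
import re
from collections import Counter
from itertools import combinations


def genes_in_text(text, gene_symbols):
    """Return the set of gene symbols that occur as whole words in text.

    Assumption: a symbol counts as mentioned only when it is not embedded
    in a longer alphanumeric token (so "TP53" does not match "TP53BP1").
    """
    found = set()
    for symbol in gene_symbols:
        pattern = r"(?<![A-Za-z0-9])" + re.escape(symbol) + r"(?![A-Za-z0-9])"
        if re.search(pattern, text):
            found.add(symbol)
    return found


def cocitation_network(abstracts, gene_symbols):
    """Count, for each pair of genes, how many abstracts mention both."""
    edges = Counter()
    for text in abstracts:
        genes = sorted(genes_in_text(text, gene_symbols))
        # Every unordered pair of co-mentioned genes gains one co-citation.
        for gene_a, gene_b in combinations(genes, 2):
            edges[(gene_a, gene_b)] += 1
    return edges


# Hypothetical toy data, invented for this example (not MEDLINE records).
toy_abstracts = [
    "TP53 and MDM2 interact in the DNA damage response.",
    "MDM2 amplification is observed alongside TP53 mutation.",
    "BRCA1 cooperates with TP53 in tumour suppression.",
    "No gene symbols appear in this text.",
]
toy_symbols = ["TP53", "MDM2", "BRCA1"]

network = cocitation_network(toy_abstracts, toy_symbols)
for (gene_a, gene_b), weight in sorted(network.items()):
    print(gene_a, gene_b, weight)
```

A real analysis would also have to resolve gene synonyms and ambiguous symbols, which this sketch ignores. That is one of the nomenclature problems the release itself raises.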
Though powerful, the method of Hovig and colleagues is limited by the difficulty of dealing rationally and systematically with the flood of information entering the literature. Many of these problems are long-standing concerns, including inconsistencies in nomenclature, the inaccessibility of the full text of most published articles, and the sheer complexity of biology itself. These issues are discussed in an accompanying News & Views article by Daniel Masys (of the University of California, San Diego), and in this month's Nature Genetics editorial.
Dr. Eivind Hovig
The Norwegian Radium Hospital
Oslo - Norway
Telephone: +47 2293-5416
Fax: +47 2252-2421
Dr. Daniel Masys
University of California San Diego
La Jolla, California - USA
Telephone: +1 858-534-6573
(C) Nature Genetics press release.
Message posted by: Trevor M. D'Souza