

Creating a Knowledge Base of Biological Research Papers*
Carole D. Hafner, Kenneth Baclawski, Robert P. Futrelle, Natalya Fridman, Shobana Sampath
College of Computer Science, Northeastern University, Boston, MA 02115
{hafner, kenb, futrelle, natasha, shobanas}@ccs.neu.edu
Tel. 617-373-2462 FAX 617-373-5121
Keywords: knowledge representation, natural language, text retrieval, semantic nets, taxonomy, frames, parsing, object
oriented databases.
Abstract

Intelligent text-oriented tools for representing and
searching the biological research literature are being
developed, which combine object-oriented databases
with artificial intelligence techniques to create a richly
structured knowledge base of Materials and Methods
sections of biological research papers. A knowledge
model of experimental processes, biological and
chemical substances, and analytical techniques is
described, based on the representation techniques of
taxonomic semantic nets and knowledge frames. Two
approaches to populating the knowledge base with the
contents of biological research papers are described:
natural language processing and an interactive
knowledge definition tool.

1. Introduction

Biological data and research results are rapidly
becoming electronically accessible on CD-ROM or through
computer networks such as the Internet. Since published papers
represent the primary output of biological research - about
600,000 are published each year - the prospect of a "digital
library" presents an opportunity for computer scientists and
biologists to move beyond exact reproduction of hard-copy
resources to create intelligent text-oriented tools for
representing and searching the biological research
literature.

Our project is investigating the potential for using
artificial intelligence techniques in combination with
object-oriented databases to create a richly structured knowledge
base of biological research papers. Several electronic text
and knowledge resources are being utilized:

a. A corpus of 132 papers in Bacterial Chemotaxis,
annotated using the Standard Generalized Markup
Language [Bryan 1988]. This is the primary corpus
around which we are building our prototype tools and
knowledge base.

b. The Unified Medical Language System, a large
taxonomy of medical concepts created by the National
Library of Medicine [UMLS 1993]. The UMLS provides
a valuable point of comparison for our knowledge
model.

Initially we are dealing only with papers in the field of
bacterial chemotaxis, but the techniques and tools we
develop will be applicable to other branches of molecular
biology. We are focusing on the Materials and Methods
sections of these papers, as being both typical of texts in
experimental biology and sufficiently narrow and patterned
to be amenable to knowledge engineering techniques.

This report describes research aimed at creating a
knowledge base of the Materials and Methods sections of
the 132 bacterial chemotaxis papers, including both the text
and associated knowledge frames in an integrated
object-oriented structure. This knowledge base will be used to
create a prototype of an intelligent retrieval system for
biological research, and to experiment with a variety of
information retrieval techniques.

The major challenges we face are: first, to create a
knowledge model capable of expressing a significant range
of biological concepts (Section 2); and second, to
overcome the "knowledge bottleneck" by creating
automated or semi-automated tools to populate the
knowledge base with frames for a corpus of papers
(Section 3). A corpus of 132 documents is very small and
might be represented without automated tools (though
this would still be a non-trivial effort), but the aim of our
research is to develop techniques and tools that will help us
"scale up" to larger knowledge bases in the future.

We are also investigating concept-based retrieval
algorithms for large document collections [Baclawski
1994] and developing an interactive query system for the
knowledge base described in this report [Baclawski 1993b].
Software is being developed on the Apple Macintosh
computer, using the WOOD object-oriented database
system [St. Clair 1993].

2. Knowledge Model

Intelligent processing of language requires background
knowledge, which permits an agent (whether computer or
human) to make connections between a current input and
other objects and events that have been or are being
observed. In the sample text (Figure 1) [Kuo 1986], an
instance of a complex method called Immunoblots is
described, and details are provided for a large number of
specific sub-processes, as indicated in the following
quotations:

Electrophoretic transfer of proteins from the gel to
nitrocellulose
Intelligent Systems in Molecular Biology. ISMB '94, pgs. 147-155

