Creating a Knowledge Base of Biological Research Papers*

Carole D. Hafner, Kenneth Baclawski, Robert P. Futrelle, Natalya Fridman, Shobana Sampath

College of Computer Science, Northeastern University, Boston, MA 02115

{hafner, kenb, futrelle, natasha, shobanas}@ccs.neu.edu

Tel. 617-373-2462 FAX 617-373-5121

Keywords: knowledge representation, natural language, text retrieval, semantic nets, taxonomy, frames, parsing, object-oriented databases.

Abstract

Intelligent text-oriented tools for representing and searching the biological research literature are being developed that combine object-oriented databases with artificial intelligence techniques to create a richly structured knowledge base of Materials and Methods sections of biological research papers. A knowledge model of experimental processes, biological and chemical substances, and analytical techniques is described, based on the representation techniques of taxonomic semantic nets and knowledge frames. Two approaches to populating the knowledge base with the contents of biological research papers are described: natural language processing and an interactive knowledge definition tool.

1. Introduction

Biological data and research results are rapidly becoming electronically accessible on CD-ROM or through computer networks such as the Internet. Since published papers represent the primary output of biological research (about 600,000 are published each year), the prospect of a "digital library" presents an opportunity for computer scientists and biologists to move beyond exact reproduction of hard-copy resources to create intelligent text-oriented tools for representing and searching the biological research literature.

Our project is investigating the potential for using artificial intelligence techniques in combination with object-oriented databases to create a richly structured knowledge base of biological research papers. Several electronic text and knowledge resources are being utilized:

a. A corpus of 132 papers in Bacterial Chemotaxis, annotated using the Standard Generalized Markup Language [Bryan 1988]. This is the primary corpus around which we are building our prototype tools and knowledge base.

b. The Unified Medical Language System, a large taxonomy of medical concepts created by the National Library of Medicine [UMLS 1993]. The UMLS provides a valuable point of comparison for our knowledge model.

Initially we are dealing only with papers in the field of bacterial chemotaxis, but the techniques and tools we develop will be applicable to other branches of molecular biology. We are focusing on the Materials and Methods sections of these papers, as being both typical of texts in experimental biology and sufficiently narrow and patterned to be amenable to knowledge engineering techniques.

This report describes research aimed at creating a knowledge base of the Materials and Methods sections of the 132 bacterial chemotaxis papers, including both the text and associated knowledge frames in an integrated object-oriented structure. This knowledge base will be used to create a prototype of an intelligent retrieval system for biological research, and to experiment with a variety of information retrieval techniques.

The major challenges we face are: first, to create a knowledge model capable of expressing a significant range of biological concepts (Section 2); and second, to overcome the "knowledge bottleneck" by creating automated or semi-automated tools to populate the knowledge base with frames for a corpus of papers (Section 3). A corpus of 132 documents is very small and might be represented without automated tools, though even that would be a non-trivial effort; the aim of our research is to develop techniques and tools that will help us "scale up" to larger knowledge bases in the future.

We are also investigating concept-based retrieval algorithms for large document collections [Baclawski 1994] and developing an interactive query system for the knowledge base described in this report [Baclawski 1993b]. Software is being developed on the Apple Macintosh computer, using the WOOD object-oriented database system [St. Clair 1993].

2. Knowledge Model

Intelligent processing of language requires background knowledge, which permits an agent (whether computer or human) to make connections between a current input and other objects and events that have been or are being observed. In the sample text (Figure 1) [Kuo 1986], an instance of a complex method called Immunoblots is described, and details are provided for a large number of specific sub-processes, as indicated in the following quotations:

Electrophoretic transfer of proteins from the gel to nitrocellulose

Intelligent Systems in Molecular Biology. ISMB '94, pgs. 147-155
