PhD thesis: Playing hide and seek on the genomic playground

The field of natural language processing for biomolecular texts (BioNLP) aims at large-scale text mining in support of life science research. Its primary motivation is the enormous amount of available scientific literature, which makes it essentially impossible to rapidly gain an overview of prior research results other than in a very narrow domain of interest. Among the typical use cases for BioNLP applications are support for database curation, linking experimental data with relevant literature, and hypothesis generation.

This thesis discusses the extraction of information about known associations between genes and proteins to support such use cases.

Protein-protein interactions

Due to the intrinsic complexity of natural language, accurately extracting information from text is challenging. As one of the first problems addressed by the BioNLP community, the extraction of protein-protein interactions (PPIs) has been widely studied, and many different predictive frameworks have been proposed. A literature review of these methods made clear that the field is still struggling with a heterogeneous collection of datasets, data formats and evaluation methods. Several fundamental evaluation problems are discussed, including their influence on the reported performance rates. A set of practical guidelines is also proposed to ensure a meaningful evaluation.

Further, a novel machine learning framework was developed for PPI extraction from text. This framework analyses both the lexical and syntactic information in sentences and combines it into rich feature vectors. We present the first extensive analysis of fully automated feature selection in this domain, obtaining more cost-effective models. Finally, our PPI extraction technique was evaluated in several novel cross-dataset experiments, offering a more realistic view of model performance.

Event extraction

Recognizing that the extraction of undirected binary relations such as PPIs does not provide sufficient detail for representing complex biomolecular interactions, the focus has shifted towards a more detailed analysis of the textual statements. This approach was formalized as an event extraction task and greatly popularized in the series of BioNLP Shared Tasks on Event Extraction. The detection of biomolecular events from text includes various physical events such as phosphorylation and gene expression, as well as recursively defined regulatory events. Their extraction includes additional vital information such as the type and polarity of the relationship, the semantic roles of the participating entities, and whether the event was stated in a speculative or affirmative context.

A detailed account of the extension of our machine-learning framework is presented, employing a set of type-specific classifiers run in parallel for event extraction. Our work focuses mainly on filtering false positives, creating a high-precision extraction method. Various techniques were tested, such as different SVM kernels, feature selection and filters for data pre- and post-processing. To detect negation and speculation in text, a rule-based system was implemented; simple in design, but effective in performance. Our framework ranked 5th out of 24 international teams in the BioNLP Shared Task of 2009, achieving 33.41% recall, 51.55% precision and 40.54% F-score.
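The F-score quoted above is consistent with the harmonic mean of precision and recall. The following minimal sketch recomputes it from the stated figures; the function name `f_score` is illustrative, and the balanced (F1) variant is assumed.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted above for the BioNLP Shared Task of 2009, in percent.
f_2009 = f_score(precision=51.55, recall=33.41)
print(round(f_2009, 2))  # 40.54
```

Because the harmonic mean is dominated by the smaller of its two inputs, the high-precision, lower-recall profile described above yields an F-score closer to the recall figure than to the precision figure.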

Follow-up studies further improved the method, and a relative performance gain of 10% was obtained, resulting in 37.43% recall, 54.81% precision and 44.48% F-score. Black-box behavior of machine learning systems currently limits understanding of the true nature of the predictions. However, feature selection is capable of identifying the most relevant features in any supervised learning setting, providing insight into the specific properties of the classification algorithm. This allows building more accurate classifiers while at the same time bridging the gap between the black-box behavior and the end-user who has to interpret the results. The novel method of ensemble feature selection is applied to the event extraction challenge, discarding a large fraction of machine-generated features and improving classification performance. Furthermore, we present numerous examples of highly discriminative features that model either biological reality or common linguistic constructs, illustrating how feature selection can be used to gain understanding of automatically generated predictions. Finally, we discuss a number of insights from these analyses that may help improve current text mining tools.
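As an illustration of the general idea, one common form of ensemble feature selection ranks features on several bootstrap samples of the training data and aggregates the rankings. The sketch below makes several assumptions that do not come from the thesis: the scoring criterion (absolute difference between class means), the aggregation rule (sum of ranks), and all function names and toy data are hypothetical stand-ins.

```python
import random
from typing import List, Sequence


def score_features(X: Sequence[Sequence[float]], y: Sequence[int]) -> List[float]:
    """Score each feature by the absolute difference between class means.

    This simple criterion is an assumed stand-in for a real feature ranker.
    """
    n_features = len(X[0])
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    if not pos or not neg:  # a bootstrap sample may contain only one class
        return [0.0] * n_features
    scores = []
    for j in range(n_features):
        mean_pos = sum(row[j] for row in pos) / len(pos)
        mean_neg = sum(row[j] for row in neg) / len(neg)
        scores.append(abs(mean_pos - mean_neg))
    return scores


def ranks_from_scores(scores: Sequence[float]) -> List[int]:
    """Convert scores to ranks, where rank 0 is the highest-scoring feature."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0] * len(scores)
    for rank, j in enumerate(order):
        ranks[j] = rank
    return ranks


def ensemble_feature_selection(X, y, n_keep, n_bootstraps=25, seed=0) -> List[int]:
    """Aggregate feature ranks over bootstrap samples; return kept feature indices."""
    rng = random.Random(seed)
    n_samples = len(X)
    total_rank = [0] * len(X[0])
    for _ in range(n_bootstraps):
        # Resample the training set with replacement.
        idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        sample_X = [X[i] for i in idx]
        sample_y = [y[i] for i in idx]
        for j, rank in enumerate(ranks_from_scores(score_features(sample_X, sample_y))):
            total_rank[j] += rank
    best_first = sorted(range(len(total_rank)), key=lambda j: total_rank[j])
    return sorted(best_first[:n_keep])


# Toy data: feature 0 equals the label, feature 1 is constant, feature 2 is noise.
y_toy = [1, 1, 1, 1, 0, 0, 0, 0]
X_toy = [[float(label), 0.5, float(i % 2)] for i, label in enumerate(y_toy)]
print(ensemble_feature_selection(X_toy, y_toy, n_keep=1))
```

Aggregating over resampled data is intended to make the selected feature set less sensitive to small changes in the training set than a single ranking would be.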

Entity relations

One of the supporting tasks of the BioNLP Shared Task, designed to provide more fine-grained text predictions, is the extraction of non-causal or ‘entity’ relations. Such entity relations between genes and domain terms identify the relations between genes, promoters, complexes and various other molecular entities found in text, enabling an enhanced representation of the biological processes underlying textual statements. We have implemented an extraction system for such non-causal relations between genes and domain terms, applying semantic spaces, machine learning and feature selection techniques. Our system ranked second in the official results of the BioNLP Shared Task of 2011, achieving 37.04% precision, 47.48% recall and 41.62% F-score.

Further, our framework is compared with the system ranking first, developed by the University of Turku (57.7% F-score). We investigate the performance discrepancy by analysing the influence of predicted domain terms, using a related and more extensive dataset. Additionally, a hybrid system is constructed, combining the two frameworks and experimenting with intersection and union combinations for high-precision and high-recall predictions, respectively. Finally, extremely high-performance results (F-score above 90%) are highlighted, representing a specific subclass of embedded entity relations that are essential for integration of text mining predictions with database facts.
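The intersection and union combinations mentioned above can be sketched with plain set operations. This is a minimal illustration under stated assumptions: the relation identifiers, the toy predictions and the function names are hypothetical and are not taken from either system.

```python
from typing import Set, Tuple


def combine(system_a: Set[str], system_b: Set[str], mode: str) -> Set[str]:
    """Combine two prediction sets by intersection or union."""
    if mode == "intersection":
        return system_a & system_b  # keep only relations both systems predict
    if mode == "union":
        return system_a | system_b  # keep relations either system predicts
    raise ValueError(f"unknown mode: {mode}")


def precision_recall(predicted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Precision and recall of a prediction set against a gold standard."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: r1..r4 are correct relations, r5 and r6 are errors.
gold = {"r1", "r2", "r3", "r4"}
predictions_a = {"r1", "r2", "r5"}
predictions_b = {"r2", "r3", "r6"}

print(precision_recall(combine(predictions_a, predictions_b, "intersection"), gold))
print(precision_recall(combine(predictions_a, predictions_b, "union"), gold))
```

The union can never have lower recall than either input system, since it contains all of their predictions. The intersection tends to raise precision when the two systems make different errors, although that is not guaranteed in general.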

Finally, we present the first study of applying entity relations to enhance event extraction performance. While obtaining promising results, we argue that an event extraction framework benefits most from this new data when intrinsic differences between the various event types are taken into account.

EVEX: a large-scale text mining resource

To enable full integration of textual data with existing biomolecular databases, it is crucial that text mining tools scale up to millions of articles and that their results can be unambiguously linked to data records from authoritative resources such as NCBI, UniProt, KEGG and BioGRID.

We present the first bibliome-wide study that combines automated extraction of complex biomolecular events with a gene normalization system that maps ambiguous gene mentions in text to unique gene identifiers. This pipeline, consisting of state-of-the-art components that were thoroughly evaluated on two highly relevant community-wide challenges, was applied to all 21 million PubMed abstracts and all 372 thousand PubMed Central open-access full-text articles. The resulting dataset, called EVEX, contains more than 34 million biomolecular events among 67 million gene mentions that could be linked to more than 120 thousand distinct genes from over 4800 species covering the full taxonomic tree, including viruses, bacteria, fungi, plants and animals.

The data was further enriched with gene family data, providing interesting opportunities for homology-based hypothesis generation, and with abstract generalizations of the events that account for lexical variants and synonymy. The originally extracted event occurrences, as well as their generalized variants, are publicly available as a MySQL database.

Further, an intuitive web application was developed, allowing exploratory browsing of the EVEX text mining results without prior knowledge of BioNLP. This web application allows for knowledge summarization on any given gene as well as retrieval of indirect associations between two genes, such as co-regulation.
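As a rough illustration of what retrieving an indirect association such as co-regulation might involve, the sketch below looks for regulators shared by two genes in a list of regulation events. The event representation, the gene names and the function name are hypothetical and do not reflect the actual EVEX schema or web application.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def shared_regulators(
    events: Iterable[Tuple[str, str]], gene_a: str, gene_b: str
) -> Set[str]:
    """Return the regulators that target both gene_a and gene_b.

    Each event is an assumed (regulator, target) pair.
    """
    regulators_of: Dict[str, Set[str]] = defaultdict(set)
    for regulator, target in events:
        regulators_of[target].add(regulator)
    return regulators_of.get(gene_a, set()) & regulators_of.get(gene_b, set())


# Hypothetical regulation events: geneR regulates both geneA and geneB.
EVENTS = [
    ("geneR", "geneA"),
    ("geneR", "geneB"),
    ("geneS", "geneA"),
    ("geneT", "geneC"),
]

print(shared_regulators(EVENTS, "geneA", "geneB"))
```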

Real world applications

Finally, we discuss the applicability of event-based text mining tools for database and pathway curation. These opportunities are illustrated on a specific use case involving NADP(H) metabolism in E. coli. The analyses show promising results and highlight interesting future prospects.

→  Full PDF: UGent

→  Slides: Speakerdeck

→  Promotors: Yves Van de Peer, Bernard De Baets, Yvan Saeys