Sofie’s Blog

Blog Overview

  • Once a Maintainer: Sofie Van Landeghem

    Allison Pike interviewed me about my work as a core maintainer of spaCy and Typer. We ended up talking about how I got into NLP in the first place, the disconnect between research and industry applications, and how to manage an open-source project.

  • How to uncover and avoid structural biases in your ML projects

    Even with today’s impressive zero-shot LLM capabilities, the success of any NLP project can be predicted by the quality of the data it’s built on. First and foremost, you need a representative evaluation data set to measure progress and performance throughout the development of your NLP pipeline.

  • From quick prototyping with LLMs to more reliable and efficient NLP solutions

    At AstraZeneca’s NLP Community of Practice, I talked about how to use LLMs for fast prototyping in NLP applications, with a specific focus on mining clinical trials.

  • How to get the most out of your open-source contributions?

    As I’m celebrating 10 years of contributing open-source code on GitHub, I share some tips & tricks on how to collaborate efficiently on a code base.

  • Integrating Large Language Models into structured NLP pipelines

    In this talk presented at the Belgian NLP meetup, I showcase how to build such a structured pipeline with the open-source NLP toolbox spaCy, and its recent extension ‘spacy-llm’.

  • spaCy: A customizable NLP toolkit designed for developers

    At ODSC Europe 2023, I presented the open-source NLP toolbox spaCy, and demonstrated how Large Language Models (LLMs) can be integrated into your NLP pipelines.

  • How to maximize probability of success for your Machine Learning solution?

    This LinkedIn post offers some tips and tricks from personal experience, helping you get the most out of your ML/NLP projects. It’s all about data and iteration!

  • Multi hash embeddings in spaCy

    In this technical report we lay out a bit of history and introduce the embedding methods in spaCy in detail. Further, we critically evaluate the hash embedding architecture with multi-embeddings on Named Entity Recognition datasets from a variety of domains and languages.

  • End-to-end Neural Coreference Resolution in spaCy

    This blog post introduces a novel coreference implementation for spaCy. We’ve based our implementation on a recent incarnation of the neural paradigm described in the paper “Word-Level Coreference Resolution” by Vladimir Dobrovolskii, published at EMNLP 2021.

  • Spancat: a new approach for span labeling

    The SpanCategorizer is a new spaCy component that answers the NLP community’s need to have structured annotation for a wide variety of labeled spans, including long phrases, non-named entities, or overlapping annotations.

  • Implementing a custom trainable component for relation extraction

    This video shows how to apply the new spaCy v3 features as we work our way through implementing a new custom component from scratch. The specific challenge we are setting ourselves here is implementing a custom component to predict relationships between named entities, also called relation extraction.

  • spaCy v3.0

    spaCy v3.0 is a huge release! It features new transformer-based pipelines that get spaCy’s accuracy right up to the current state-of-the-art, and a new workflow system to help you take projects from prototype to production.

  • Training a custom Entity Linking model with spaCy 

    This video tutorial shows how to use spaCy to implement and train a custom Entity Linking model that disambiguates textual mentions to unique identifiers.

  • Building customizable NLP pipelines with spaCy

    At a Turku.AI meetup, I gave a talk explaining spaCy’s typical usage as well as upcoming features for the v3 release.

  • Entity linking for spaCy: Grounding textual mentions

    In this talk, I present my recent work on adding Entity Linking functionality to spaCy, which allows grounding information from text into “real world” unique identifiers defined by a knowledge base (KB).

  • Entity linking functionality in spaCy

    At spaCy IRL 2019, I presented our ongoing work on the Entity Linking functionality in spaCy, including our efforts to work with a fast and efficient in-memory KB, a neural network architecture that encodes both the entity information and the sentence context, and the ability to add more features as we’re experimenting.

  • Studying the dynamic rewiring of molecular networks 

    At this Women in Tech event in Antwerp, I gave a talk showing how methods from Natural Language Processing, Data Integration and Graph Analysis all contribute to studying the dynamic rewiring of molecular networks.

  • Increasing tokenization speed across spaCy’s core languages

    One of my first core contributions to the spaCy open-source NLP library: this PR increases tokenization speed by a factor of 2-3 across all languages.

  • Developing a cocktail generator that optimizes your drink to your taste

    In the past few months working at Foodpairing in Gent, I’ve been working on a revamped version of our cocktail generator. Testing the algorithms means tasting the cocktails they produce, resulting in a delicious afternoon at work!

  • TEDxBigData: Machine Learning & Natural Language Processing

    I had the opportunity to present at a TEDxBigData event organized by J&J. In this talk, I shared my enthusiasm and love for Natural Language Processing by giving some examples of ambiguities in natural language, as well as showing some amazing examples of what NLP can do in concrete business cases today.

  • Cell line name recognition in support of the identification of synthetic lethality in cancer from text

    In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources. We further introduce two text collections manually annotated for cell line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus.

  • Diffany: an ontology-driven framework to infer, visualise and analyse differential molecular networks

    In this work, we present a generic, ontology-driven framework to infer, visualise and analyse an arbitrary set of condition-specific responses against one reference network. We propose an integrative framework called “Diffany” to standardize differential networks and promote comparability between differential network studies.

  • The KnownLeaf literature curation system

    We constructed an annotation framework for the curation of the scientific literature studying the molecular mechanisms that control leaf growth and development in Arabidopsis thaliana (Arabidopsis). A total of 283 primary research articles were curated by a community of annotators, yielding 9,947 relations monitored for consistency and over 12,500 references to Arabidopsis genes.

  • EVEX in ST’13: Application of a large-scale text mining resource to event extraction and network construction

    In this paper, we describe our participation in the latest BioNLP Shared Task using the large-scale text mining resource EVEX. In the Genia Event Extraction (GE) task, we implemented a re-ranking approach that resulted in the first rank of the official Shared Task results.

  • Large-scale event extraction from literature with multi-level gene normalization

    We have combined two state-of-the-art text mining components to perform normalization and event extraction on all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This resource is available as the EVEX database under the CC BY-SA license.

  • The Potential of Text Mining in Data Integration and Network Biology for Plant Research: A Case Study on Arabidopsis

    In this study published in Plant Cell, we assess the potential of large-scale text mining for plant biology research in general and for network biology in particular using a state-of-the-art text mining system applied to all PubMed abstracts and PubMed Central full texts.

  • PhD thesis: Playing hide and seek on the genomic playground

    My PhD thesis focuses on the field of natural language processing for biomolecular texts, or “BioNLP”. It discusses novel approaches to event extraction and the construction of a large-scale text mining resource called “EVEX”.

  • EVEX: a PubMed-scale resource for homology-based generalization of text mining predictions

    Accounting for lexical variation of gene symbols, we have implemented a disambiguation algorithm that uniquely links the arguments of 11.2 million biomolecular events to well-defined gene families, providing interesting opportunities for query expansion and hypothesis generation. The resulting MySQL database, including all 19.2 million original events as well as their homology-based variants, is publicly available.

  • Discriminative and informative features for biomolecular text mining with ensemble feature selection

    In this study published in Bioinformatics, we show how feature selection can be used to improve state-of-the-art text mining algorithms, while at the same time providing insight into the specific properties of the original classification algorithm.

  • Analyzing text in search of bio-molecular events

    We participated in the BioNLP Shared Task on Event Extraction with an SVM-based implementation to extract biomolecular events from text. Out of 24 participating NLP teams worldwide, we ranked 5th.

  • Master thesis: Automated information retrieval and data mining for bio-informatics

    For my Master’s thesis, I developed GeneFetch, a system that enables biologists to easily find and browse vital information on any gene or group of genes. To this end, the system uses text mining, data mining and semantic web techniques.
