Introduction


To remain competitive in the digital world of the 21st century, it is pivotal to be able to mine textual resources at scale – both in-house sources and publicly available documents. A data-driven company exploits Machine Learning (ML) and Natural Language Processing (NLP) techniques to inform decision making and to gain efficiencies across its operations.

OxyKodit helps you unlock the power of your texts and data by implementing tailored solutions. As an expert in these domains, I design and implement custom NLP algorithms fitted to your business use-case, serving all your data-driven needs – whether you are a small startup in the EU or a multinational in the US.

Email me to start a conversation about how we can achieve your data-driven goals together. I look forward to discussing your project!

Sofie Van Landeghem

Who am I?

I am a machine learning and NLP engineer who firmly believes in the power of data to transform decision making in industry. I have a Master's degree in Computer Science (software engineering) and a PhD in Sciences (Bioinformatics) – my PhD topic focused on novel ML algorithms for the field of BioNLP, extracting biomolecular events from millions of PubMed articles. I've been working in NLP and ML since 2006 and have built up further experience by working on industrial cases in the pharmaceutical industry and the food industry. Since 2019, I have been a core maintainer of spaCy, a popular open-source NLP library created by Explosion.

My freelance work in my one-woman company OxyKodit is project-centric, customized to your specific needs and requirements. Throughout my implementations, I am passionate about quality assurance and testing, introducing proper levels of abstraction, and ensuring code robustness and modularity.

What can I do for you?

The following flow chart presents the different steps to successfully implement a data-driven algorithm.
The blue circles roughly correspond to the different types of projects that OxyKodit offers. Each can be set up as a separate project, depending on the current phase and requirements of your specific use-case:

  1. NLP: Text mining techniques transform free text to structured knowledge relevant to your business
  2. Data integration: Align the semantics of heterogeneous resources
  3. Proof-of-concept: Determine the feasibility of a proposed data-driven algorithm
  4. ML solution: Implement a full-fledged data-driven solution
  5. Iterate and improve: Review, improve and extend the solution

0. Requirements analysis

  • Perform a thorough analysis of a specific business case, including performance requirements.
  • Determine the correct approach for data annotation, NLP algorithms, data integration etc.
  • Write up a report with detailed advice on the NLP/ML strategy and/or data annotation guidelines.
  • Example projects I have worked on:
    • Analysed a business case involving the extraction of information from news articles, and wrote up a detailed NLP strategy outlining the various NLP components that would be required and how they would interact.
    • Wrote up extensive annotation guidelines to recognize entities in text, covering both named entities as well as more freely defined phrases. The guidelines included numerous examples and clear rules to allow for an efficient and consistent annotation process.
    • Implemented the data model and helped design the graphical user interface for an annotation framework focused on capturing the molecular mechanisms of leaf growth and development in the Arabidopsis plant.

1. NLP to transform free text into structured information

  • Work with up to millions of documents that contain free-text information.
  • Design NLP algorithms that can process the raw documents and output structured information.
  • Example sources: news, research articles, patents, contracts, clinical trials, physician notes and EMRs, social media, customer requests, …
  • Example NLP components:
    • Word embeddings trained on a domain-specific corpus.
    • Generative AI models including transformers and Large Language Models (LLMs) for text summarization, question answering, chatbots, and various other applications.
    • Recognizing named entities (NER) and relevant spans from text.
    • Entity normalization or linking (EL) to resolve ambiguous textual mentions to unique identifiers.
    • Relation extraction and event extraction.
    • Coreference resolution linking named entities to pronouns or other phrases referring to the same person or object.
    • Template generation by analysing common text in historical documents.
    • Multi-linguality and Machine Translation.
    • Document classification (textcat) & (re)ranking.
    • Sentiment analysis (usually cast as a binary decision task) or emotion detection (multi-label text categorization).
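As a minimal illustration of what "free text to structured knowledge" means in practice, the sketch below extracts entity records with simple regular-expression patterns. The labels and patterns are invented for this example; a real project would use trained statistical or neural models rather than hand-written rules.

```python
import re

# Illustrative patterns only - a production NER component would be a trained model.
PATTERNS = {
    "DATE": re.compile(
        r"\b\d{1,2} (January|February|March|April|May|June|July|August"
        r"|September|October|November|December) \d{4}\b"
    ),
    "ORG": re.compile(r"\b(OxyKodit|Explosion|AstraZeneca)\b"),
}

def extract_entities(text):
    """Return structured (label, text, start, end) records from raw text."""
    records = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            records.append({
                "label": label,
                "text": match.group(0),
                "start": match.start(),
                "end": match.end(),
            })
    # Sort by position so the output reads in document order.
    return sorted(records, key=lambda r: r["start"])

doc = "OxyKodit was founded before 24 January 2024 by a core maintainer at Explosion."
for entity in extract_entities(doc):
    print(entity["label"], entity["text"])
```

The output is a list of structured records rather than raw text, which is exactly the hand-over point to downstream components such as entity linking or relation extraction.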

2. Data integration

  • Unify heterogeneous data sources according to the semantics of each individual resource.
  • May include text mining data, data lakes and structured knowledge bases as input.
  • Example projects I have worked on:
    • A large-scale data integration and evaluation study performed on the model plant Arabidopsis thaliana showed that 75% of all protein-protein interactions (PPIs) extracted from text were factually correct, though only 35% of them could be found in structured PPI databases. This demonstrates the need to include (curated) knowledge from text into integrative studies to obtain a more complete picture of available domain knowledge.
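The core of such an integration step can be sketched in a few lines: records from each source are first mapped onto shared canonical identifiers, and only then compared or merged. All names and identifiers below are invented for illustration; a real study would use curated synonym dictionaries and database cross-references.

```python
# Hypothetical synonym table mapping surface names to canonical gene IDs.
SYNONYMS = {
    "tp53": "GENE:7157",
    "p53": "GENE:7157",
    "brca1": "GENE:672",
}

def normalize(name):
    """Map a surface name onto its canonical identifier (None if unknown)."""
    return SYNONYMS.get(name.lower())

# Two heterogeneous sources reporting the same interaction under different names.
text_mined_pairs = [("p53", "BRCA1")]   # interactions extracted from text
database_pairs = [("TP53", "BRCA1")]    # interactions from a curated database

def canonical(pairs):
    """Normalize both partners and sort them, so pair order does not matter."""
    return {tuple(sorted(normalize(name) for name in pair)) for pair in pairs}

# After alignment, the two sources agree on the same canonical interaction.
overlap = canonical(text_mined_pairs) & canonical(database_pairs)
print(overlap)
```

Without the normalization step, the two sources would appear disjoint even though they describe the same interaction – which is why semantic alignment comes before any integrative analysis.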

3. Proof-of-concept (POC) to determine project feasibility

  • Analyse the available data sources in terms of quantity, quality and predictive signal.
  • Implement an annotation framework to produce, in collaboration with the business partners, a realistic training and evaluation dataset.
  • Determine the feasibility of a proposed project / business question by implementing baseline ML and NLP methods.
  • Provide estimates with respect to performance and timelines for a follow-up project.

4. Implement a full-fledged machine learning solution

  • Typically building on top of the results of a POC, in this phase I implement a comprehensive data-driven solution for your specific use-case.
  • The final result may include descriptive data mining results, predictive machine learning models, or even a prescription engine to help guide future decisions.
  • Example projects I have worked on:
    • Implemented a large-scale text mining framework called EVEX to identify biomolecular events in millions of research articles.
    • Created a novel framework Diffany to analyse the rewiring of biomolecular interactions under stress conditions such as plant drought or human cancer.
    • Analysed a set of legal documents to identify similar paragraphs and sentences, and used NLP and clustering techniques to implement a template generator that can significantly reduce required editing time for a new document.
    • Implemented an optimization framework to create hyper-personalized cocktails and mocktails according to user preferences.
    • Designed and implemented an NLP strategy to mine diagnostic reports, identify relevant information and summarize patient characteristics through named entity recognition and relation extraction techniques.
    • Performed a critical evaluation of existing tools to recognize mentions of cell lines in text, and developed two new annotated datasets to further boost development of NLP algorithms in this domain.

5. Iterate and improve

  • Perform rigorous testing and assess the current quality of both the data and the code base.
  • Identify structural errors (if any) in the dataset and/or annotation guidelines.
  • Make the code more robust and more performant in terms of speed and memory usage.
  • Tune the algorithms in terms of predictive performance to make them more accurate and reliable.
  • Iterate on both data and the ML models.
  • Extend the code base with new functionality according to specific feature requests.
  • Example projects I have worked on:
    • Increased the speed of the French parser in the NLP library spaCy by 30%.
    • Analysed an NER dataset from a customer and identified structural ambiguities and conflicts. Refined the annotation guidelines accordingly and trained new models on the curated dataset, obtaining much more robust and accurate results.
    • Ran various hyperparameter tuning experiments as well as trials with different architectures to optimize the F-score of the ML models.
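A toy version of such a tuning experiment is sketched below: sweep a single hyperparameter (here a decision threshold) and keep the value that maximizes F-score on held-out data. The scores and labels are invented for illustration; real tuning sweeps training settings and architectures in the same outer-loop fashion.

```python
def f_score(pred, gold):
    """Balanced F-score (harmonic mean of precision and recall)."""
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(not p and g for p, g in zip(pred, gold))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented held-out model scores and gold labels.
scores = [0.9, 0.8, 0.65, 0.4, 0.3, 0.1]
gold = [True, True, True, False, True, False]

# Sweep the threshold hyperparameter and keep the best F-score.
best = max(
    ((t, f_score([s >= t for s in scores], gold)) for t in [0.2, 0.5, 0.7]),
    key=lambda pair: pair[1],
)
print(f"best threshold {best[0]}, F-score {best[1]:.2f}")
```

The lowest threshold wins here because the extra recall outweighs the single false positive – exactly the kind of precision/recall trade-off that tuning makes explicit.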

Highlighted Blogs


For a full overview of all blog posts, go here.


24 January 2024

Slide showing how to use built-in zero-shot NER with spacy-llm

From quick prototyping with LLMs to more reliable and efficient NLP solutions

At AstraZeneca’s NLP Community of Practice, I talked about how to use Large Language Models for fast prototyping, with a specific focus on mining clinical trials.

20 December 2022

A schema representing an iterative cycle with 5 steps: from product vision to accuracy estimate, to training & evaluation, to labelled data, to annotation scheme, and back to product vision.

How to maximize probability of success for your Machine Learning solution?

This LinkedIn post offers some tips and tricks from personal experience, helping you get the most out of your ML/NLP projects. It’s all about data and iteration!

7 May 2020

Video banner for YouTube

Training a custom Entity Linking model

This video tutorial shows how to use spaCy to implement and train a custom Entity Linking model that disambiguates textual mentions to unique identifiers.
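A stripped-down version of the entity-linking idea looks like this: candidates for a mention are looked up in an alias table, then disambiguated by lexical overlap between the sentence and each candidate's description. The knowledge-base entries below are invented; a trained entity linker, such as the spaCy model in the tutorial, would use embeddings and prior probabilities instead of raw word overlap.

```python
# A toy knowledge base: two candidates share the alias "apple".
KB = {
    "Q312": {"name": "Apple Inc.", "description": "technology company iphone mac"},
    "Q89": {"name": "apple", "description": "fruit tree orchard food"},
}
ALIASES = {"apple": ["Q312", "Q89"]}

def link(mention, context):
    """Pick the candidate whose description overlaps most with the context."""
    context_tokens = set(context.lower().split())
    candidates = ALIASES.get(mention.lower(), [])
    return max(
        candidates,
        key=lambda qid: len(context_tokens & set(KB[qid]["description"].split())),
        default=None,
    )

print(link("Apple", "She picked an apple from the tree in the orchard"))
```

The same mention resolves to a different identifier in a tech-oriented sentence, which is the disambiguation behaviour the tutorial trains a statistical model to learn from data.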