Types of services
As a Natural Language Processing and Machine Learning expert, I implement tailored data-driven solutions for your business needs, including (but certainly not limited to) scanning documents for facts, determining key influencing factors, and training predictive models on top of your in-house data.
The following flow chart presents the different steps needed to successfully implement a data-driven algorithm:
The blue circles correspond to the different types of services that OxyKodit offers. Each can be set up as a separate project, depending on the current phase and requirements of your specific use-case:
- NLP: Implement Natural Language Processing techniques to transform free text into structured knowledge
- Data integration: Align the structure and semantics of heterogeneous resources
- POC: Run a quick proof-of-concept to determine the feasibility of a proposed data-driven algorithm
- ML: Implement a full-fledged machine learning solution to your specific business case
- Continuous refinements: Review, improve and extend an existing data mining code base
These types of projects are fully aligned with my experience and expertise: I have worked for more than 12 years in the domain of text mining and machine learning. OxyKodit provides tailored solutions to all your data-driven needs, whether you are a small startup in the EU or a multinational in the US.
NLP
- Work with up to millions of documents that contain free-text information
- Design NLP algorithms that can process the raw documents and output structured information
Example sources: news, research articles, patents, contracts, clinical trials, physician notes and EMRs, social media, customer requests, …
Example cases: entity recognition and normalization, relation/event extraction, template generation, text summarization, multi-linguality, sentiment/emotion detection, document classification & ranking, …
Natural Language Processing algorithms, run on millions of biomedical research articles, can identify detailed relations between important biological entities (top). These findings help construct the p53 signaling pathway (right), a crucial instrument for understanding the development of human cancers. For more details, see Van Landeghem et al., PLoS ONE, 2013.
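Entity recognition and normalization, one of the example cases listed above, can be illustrated with a minimal dictionary-based sketch. This is only an illustration, not the pipeline used in the cited study: the lexicon, the identifier scheme (`GENE:...`) and the function name `extract_entities` are all hypothetical, and a real project would typically rely on trained models rather than a hand-written word list.

```python
import re

# Hypothetical lexicon: lowercase surface form -> normalized identifier.
# Both the entries and the "GENE:" identifier scheme are made up for this sketch.
LEXICON = {
    "p53": "GENE:TP53",
    "tp53": "GENE:TP53",
    "mdm2": "GENE:MDM2",
}

# One alternation pattern; longer surface forms are listed first so they win.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(s) for s in sorted(LEXICON, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def extract_entities(text):
    """Return (start, end, surface_form, normalized_id) for each lexicon hit."""
    return [
        (m.start(), m.end(), m.group(0), LEXICON[m.group(0).lower()])
        for m in _PATTERN.finditer(text)
    ]


if __name__ == "__main__":
    for entity in extract_entities("MDM2 binds p53 and inhibits TP53 activity."):
        print(entity)
```

Different spellings of the same entity ("p53", "TP53") map to one identifier, which is what turns free text into structured records that can be counted, linked and queried.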
Data integration
- Unify heterogeneous data sources according to the semantics of each individual resource
- May include text mining data, data lakes and structured knowledge bases as input
As an illustration, a large-scale data integration and evaluation study performed on the model plant A. thaliana showed that 75% of all extracted protein-protein interactions (PPIs) were factually correct, though only 35% of them could be found in structured PPI databases. This result shows that, while text mining algorithms are not perfect, integrative studies need to include (curated) knowledge from text to obtain a more complete picture. More details can be found in Van Landeghem et al., The Plant Cell, 2013.
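The two kinds of figures quoted above can be computed along the following lines. This is a sketch on toy data, not the evaluation code of the cited study: the protein names are placeholders, the function name `evaluate` is hypothetical, and it assumes that the database overlap is measured over the interactions judged correct, which the text above does not specify.

```python
def undirected(pair):
    """Treat an interaction as undirected, so (A, B) equals (B, A)."""
    return frozenset(pair)


def evaluate(extracted, judged_correct, database):
    """Return (precision, fraction of correct interactions found in the database).

    extracted:      pairs produced by text mining
    judged_correct: pairs confirmed by manual evaluation
    database:       pairs present in structured databases
    """
    extracted = {undirected(p) for p in extracted}
    correct = {undirected(p) for p in judged_correct} & extracted
    database = {undirected(p) for p in database}
    if not extracted or not correct:
        raise ValueError("need at least one extracted and one correct interaction")
    precision = len(correct) / len(extracted)
    in_database = len(correct & database) / len(correct)
    return precision, in_database


if __name__ == "__main__":
    # Toy data only; these are not the numbers from the study.
    extracted = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5"), ("P5", "P6")]
    judged_correct = [("P2", "P1"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5")]
    database = [("P1", "P2"), ("P8", "P9")]
    print(evaluate(extracted, judged_correct, database))
```

A high precision combined with a low database overlap is the pattern described above: most extracted facts are right, yet most of them are missing from the structured resources.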
POC
- Determine the feasibility of a proposed project / business question
- Analyse the available data sources in terms of quantity, quality and predictive signal
- May include developing new text mining components
- Provide estimates with respect to performance and timelines for a follow-up project
As an example of a quick proof-of-concept, I have implemented an image recognition demo using deep learning algorithms with TensorFlow.
ML
- Implement a comprehensive data-driven solution for your specific use-case
- Typically builds on top of the results and estimates of a previous POC
- The final result may include descriptive data mining results, predictive machine learning models, or even a prescription engine to help guide future decisions
Example cases: image recognition, temporal trend analysis, graph/network analyses, time series, customer segmentation, key performance indicators, marketing effectiveness, survey analysis, …
Example projects that I have worked on during the past 12 years:
- Implemented a large-scale text mining pipeline, ran it on millions of articles, stored the results in a database, and enabled retrieval through browser and API requests: Van Landeghem et al., Advances in Bioinformatics, 2012
- Created a novel framework to analyse the rewiring of biomolecular interactions under stress conditions such as plant drought or human cancer. Made the algorithms available as an open-source package, a command-line interface and a Cytoscape plugin: Diffany code on GitHub
- Implemented various novel Machine Learning algorithms to address challenges across the pharmaceutical value chain
- Analysed a set of legal documents to identify similar paragraphs and sentences, and used NLP and clustering techniques to implement a template generator that can significantly reduce the editing time required for a new document
- Implemented an optimization framework to create hyper-personalized cocktails and mocktails according to user preferences.
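To give a flavour of the legal-document project in the list above, the following sketch groups near-duplicate sentences so that each group could serve as a candidate template. It is a simplified stand-in, not the method used in that project: it swaps in character-level similarity from Python's standard `difflib` module and a greedy grouping pass, and the function names, example sentences and the 0.8 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_similar(sentences, threshold=0.8):
    """Greedy grouping: a sentence joins the first group whose first member
    is at least `threshold` similar; otherwise it starts a new group."""
    groups = []
    for sentence in sentences:
        for group in groups:
            if similarity(sentence, group[0]) >= threshold:
                group.append(sentence)
                break
        else:
            groups.append([sentence])
    return groups


if __name__ == "__main__":
    clauses = [
        "This agreement is governed by the laws of Belgium.",
        "This agreement is governed by the laws of France.",
        "Either party may terminate this agreement with 30 days notice.",
    ]
    for group in group_similar(clauses):
        print(group)
```

Sentences that land in the same group differ only in a few words, and those differing spans are natural candidates for the fill-in slots of a template.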
Continuous refinements
- Perform rigorous testing and assess the current quality of both the data and the code base
- Make the code more robust and faster
- Tune the algorithms in terms of predictive performance
- Extend the code base with new functionality according to specific feature requests
As an example case, I have worked on the open-source NLP library spaCy to analyse the performance of its French parser, and was able to speed it up by 30%: Pull request 3046 on spaCy's GitHub