Even with today’s impressive zero-shot LLM capabilities, the success of any NLP project can be predicted by the quality of the data it’s built on. First and foremost, you need a representative evaluation data set to measure progress and performance throughout the development of your NLP pipeline. Further, once you get to production, you’ll need to balance accuracy against a whole range of other performance measures, such as reliability, reproducibility, interpretability, inference speed and compute power – to name just a few. You might just find that training a smaller, specialized supervised model is more cost-efficient and robust than running a black-box LLM.
Either way: data should be front and center throughout the development of your NLP pipeline. But as we all know, bias can creep in. I urge any data scientist not just to look at the performance numbers, but to actually dig into the data itself: understand how it was selected for processing in the first place, and whether the data you’ve collected is actually representative of the use cases you want to apply your model to. This may seem like a given, but in many cases there’s a mismatch between the two, and assumptions about data and data processing pipelines often prove false halfway through the project.
Bias can also creep into your evaluation. You might be artificially boosting your numbers by accidentally leaking information from training to test – this happens in more ways than you can imagine! Setting up an “extrinsic” evaluation will also help you better understand the final target of your NLP pipeline and how it should integrate with downstream requirements. This allows you to step back from the nitty-gritty details of training your ML/NLP model, and keep the bigger picture in mind while iterating over your data model, NLP algorithms and overall solution.
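As an illustration of that leakage point: if several examples come from the same document, splitting at the example level can quietly leak information across the train/test boundary. Below is a minimal sketch of a deterministic, document-level split (one of the recommendations further down), assuming each document has a stable unique ID; the `assign_split` helper and the example IDs are hypothetical.

```python
import hashlib

def assign_split(doc_id: str, dev_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministically assign a document to train/dev/test based on its ID."""
    # Hash the stable document ID into a bucket in [0, 100). Because the bucket
    # depends only on the ID, the split never changes between reruns, and every
    # sentence/annotation from the same document lands in the same split.
    bucket = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"

# Hypothetical usage: `documents` stands in for your own corpus of document IDs.
documents = ["doc-001", "doc-002", "doc-003", "doc-004"]
print({doc_id: assign_split(doc_id) for doc_id in documents})
```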
At PyData London, I talked through some use cases, inspired by various consulting projects from the last decade, and compiled a list of recommendations for running an ML project – focusing mostly on data and evaluation.
Here’s my (biased) list of recommendations, taken from 15+ years of experience in the field:
– Avoid selection bias by formalizing the selection procedure
– Create deterministic, document-level train/dev/test splits
– Carefully design the data model / label scheme
– Write up detailed data guidelines
– Set up a meaningful extrinsic evaluation
– Look at inter-annotator agreement stats (see the sketch after this list) and plot a learning curve
– Apply a preliminary model back to the training data (also sketched after this list)
– Manually inspect gold annotations and incorrect predictions
– Make sure you’re climbing the right hill
– Data quality should be front and center!
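To make the agreement check concrete, here is a minimal sketch using Cohen’s kappa via scikit-learn’s `cohen_kappa_score`; the annotator label lists are hypothetical placeholders for your own annotation exports.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators on the same six examples.
annotator_a = ["POS", "NEG", "NEG", "POS", "NEU", "POS"]
annotator_b = ["POS", "NEG", "POS", "POS", "NEU", "NEG"]

# Cohen's kappa corrects raw agreement for chance: 1.0 is perfect agreement,
# 0.0 is no better than chance. Low kappa usually points at unclear guidelines.
print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```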
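In the same spirit, here is a sketch of applying a preliminary model back to its own training data to surface suspicious gold labels; the toy texts, labels and the TF-IDF/logistic-regression pipeline are stand-ins for whatever model and corpus you’re actually working with.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set; in practice this is your real annotated corpus.
train_texts = [
    "great product, works perfectly",
    "terrible quality, broke after a day",
    "absolutely love it",
    "waste of money",
    "does exactly what it promises",
    "would not recommend",  # suppose this one was accidentally labelled POS
]
train_labels = ["POS", "NEG", "POS", "NEG", "POS", "POS"]

# Train a quick preliminary model and apply it back to its own training data.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

probs = model.predict_proba(train_texts)
preds = model.classes_[probs.argmax(axis=1)]

# Disagreements between the model and the gold labels are often annotation
# errors or genuinely ambiguous cases that deserve a manual review.
for text, gold, pred, conf in zip(train_texts, train_labels, preds, probs.max(axis=1)):
    if pred != gold:
        print(f"conf={conf:.2f}  gold={gold}  pred={pred}  {text}")
```

On a real corpus you’d typically sort the disagreements by model confidence and review the most confident ones first.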
Happy data mining!
→ Venue: PyData London
→ Video: Recording of presentation
→ Slides: Speakerdeck