Practical Methods for Cleaning Up Your Machine Learning Data

April 28, 2022

Jeff Baumes, Aashish Chaudhary and Dženan Zukić

Garbage in, garbage out. It’s a mantra used so frequently that many consider it hardly worth bringing up. We apply it to our habits and experiences, the media we consume, and the reports we write. But do we apply it often enough to machine learning models, our favorite problem-solving companions? Including invalid or inappropriate data in training or validation could lead to bias and loss of accuracy or precision in your resulting model. In many situations, a quality assurance pass can increase the potential value of your data significantly, enough to justify allocating effort to it. In certain fields, such as medical applications, this can be absolutely critical.

Even if you know it is valuable, sifting through large masses of video, images, or text in search of outliers or invalid data can be incredibly time-consuming, costly, and, let’s face it, tedious and boring. In this article we suggest five approaches you can take to weed out the bad and keep the good in your data without breaking the bank, the clock, or your sanity.

Read everything available about your dataset

If you are curating a new dataset from scratch, this idea won’t help. But if you are gathering your data from other sources, thoroughly read through all the related publications, datasheets, and readme files you can find. Dataset datasheets will hopefully become more common and standardized as the community is beginning to recommend them more earnestly. Are there sections on data selection and preparation? Are there documented examples demonstrating the quality range? If you are confident in the quality guarantees stated, you may be able to cut your QA work down drastically.

Manually sample and make a rough tally of issues

Sometimes you just need to dig in and see what you’re up against. Retrieve a small random set of data samples and look at each one, asking:

Is this data item simply a dud? These are usually the easiest to distinguish. This could be empty text, a blank image, or something that just clearly does not belong.
Is this a real example of the type of data your model is expected to work with? Your training data distribution should be as close as reasonably possible to the target usage distribution. Otherwise you will be aiming at the wrong target from the start.
Are there artifacts in the data that you are not trying to account for in your model and will likely just confuse it?

After you’ve cataloged the various problems with the data, decide which issues are concerning enough to address and which are not worth fixing, starting with the most frequent and highest-impact problems. Make careful note of the criteria you intend to use for what’s “in” and “out” of your dataset. When you share or publish your data, this information will be invaluable to others who want to use your dataset. What do you wish were included in the readme files of datasets you’ve used? Use that question as a guide as you document your criteria.
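As a rough illustration of the sampling and tallying described above, here is a minimal Python sketch. The item ids, the issue labels ("blank", "artifact", "off_distribution"), and the review notes are all made up for the example; in practice the notes would come from whoever looks at each sampled item.

```python
import random
from collections import Counter


def sample_items(item_ids, k, seed=0):
    """Return k item ids drawn uniformly at random without replacement."""
    rng = random.Random(seed)  # fixed seed so the same sample can be redrawn
    ids = list(item_ids)
    return rng.sample(ids, min(k, len(ids)))


def tally_issues(notes):
    """Count how many reviewed items show each issue label.

    `notes` maps item id -> list of issue labels (empty list = no issue).
    """
    counts = Counter()
    for labels in notes.values():
        counts.update(set(labels))  # count each issue at most once per item
    return counts


# Hypothetical dataset of 1,000 item ids; review 5 of them by hand.
item_ids = [f"item_{i:04d}" for i in range(1000)]
sample = sample_items(item_ids, k=5)

# Hand-entered review notes (invented for illustration).
notes = {
    sample[0]: ["blank"],
    sample[1]: [],
    sample[2]: ["artifact", "off_distribution"],
    sample[3]: ["artifact"],
    sample[4]: [],
}

issue_counts = tally_issues(notes)
for issue, count in issue_counts.most_common():
    print(f"{issue}: {count}/{len(notes)} sampled items")
```

Sorting the tally by frequency gives a first cut at which problems to tackle first. Impact still has to be judged by a person.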

Utilize machine learning

You may be thinking, “My dataset is supposed to be used for creating ML models, not the other way around!” It turns out that machine learning can in fact be a reasonable way to help you build or curate your data in some situations. Search the research literature in your domain for AI/ML models built for quality assurance. Once you discover one, try it out and see how well it performs on your dataset. If the model works reasonably well, you can run it on your full dataset, then split the data into groups based on detected quality issues. You can then rapidly sift through large batches of data. It’s easier to make decisions when you look at data that is uniformly likely to have problems (or not have problems) than when jumping between random elements in your dataset.
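The grouping step might look something like the following sketch. It assumes a hypothetical QA model has already produced, for each item, a score between 0 and 1 estimating how likely the item is to have a quality problem. The thresholds and bucket names are arbitrary choices for illustration, not recommendations.

```python
from collections import defaultdict


def bucket_by_score(scores, thresholds=(0.2, 0.8)):
    """Group item ids by how likely a QA model thinks they are to be bad.

    `scores` maps item id -> estimated probability of a quality problem.
    Items within each bucket are sorted from highest to lowest score.
    """
    low, high = thresholds
    buckets = defaultdict(list)
    for item_id, score in scores.items():
        if score < low:
            name = "likely_ok"
        elif score < high:
            name = "uncertain"
        else:
            name = "likely_bad"
        buckets[name].append(item_id)
    for ids in buckets.values():
        ids.sort(key=lambda item_id: scores[item_id], reverse=True)
    return dict(buckets)


# Made-up scores standing in for the output of a real QA model.
scores = {"a": 0.05, "b": 0.95, "c": 0.50, "d": 0.85, "e": 0.10}
buckets = bucket_by_score(scores)
for name, ids in buckets.items():
    print(name, ids)
```

A reviewer can then work through one bucket at a time, skimming "likely_ok" and "likely_bad" quickly and saving closer attention for "uncertain".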

If you are feeling extra motivated or are an active learning fan, you can do this even without an existing model: use the judgments you make early in the assessment to train a model that speeds up decision making as you get further through your data.
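To make the active learning idea concrete, here is a deliberately tiny sketch using only the Python standard library. It swaps in a nearest-centroid scorer for whatever model you would really train, and it assumes each item has already been reduced to a small numeric feature vector (the two features here are hypothetical). The point is the loop: fit on early judgments, then send the most uncertain items to a human next.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fit(judged):
    """Build a (good_centroid, bad_centroid) model from manual judgments.

    `judged` is a list of (feature_vector, is_bad) pairs.
    """
    good = [x for x, is_bad in judged if not is_bad]
    bad = [x for x, is_bad in judged if is_bad]
    if not good or not bad:
        raise ValueError("need at least one good and one bad example")
    return centroid(good), centroid(bad)


def bad_score(model, x):
    """Score in [0, 1]: 0 at the good centroid, 1 at the bad centroid."""
    good_c, bad_c = model
    d_good = math.dist(x, good_c)
    d_bad = math.dist(x, bad_c)
    total = d_good + d_bad
    return 0.5 if total == 0 else d_good / total


def next_to_review(model, unreviewed, k=1):
    """Uncertainty sampling: pick the k items whose score is nearest 0.5."""
    ranked = sorted(
        unreviewed,
        key=lambda item_id: abs(bad_score(model, unreviewed[item_id]) - 0.5),
    )
    return ranked[:k]


# Early manual judgments on hypothetical two-feature items.
judged = [
    ([0.9, 0.1], False),
    ([0.8, 0.2], False),
    ([0.1, 0.9], True),
    ([0.2, 0.7], True),
]
model = fit(judged)

unreviewed = {"u1": [0.85, 0.15], "u2": [0.5, 0.5], "u3": [0.15, 0.8]}
print(next_to_review(model, unreviewed, k=1))
```

After each new batch of human judgments, the model would be refit and the ranking recomputed, so reviewer time goes to the items the model is least sure about.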

Repurpose an existing annotation tool for QA

When you need to start looking through a massive amount of data and don’t want people opening individual files on disk and painstakingly marking their answers in an error-prone spreadsheet, you need to start thinking about using a specialized user interface to capture assessments and organize data.

One interface approach is to look for open annotation tools that you can adapt for your purposes. At face value, the task of annotating data is similar to quality assurance. Quality assurance can be thought of as a special form of annotation, where you are tagging your data based on whether it is acceptable. If you want to run through your data with a simple interface, you may want to look into tools such as Label Studio, CVAT, or DIVE. You may be beholden to their existing feature set unless you partner with the tool creators to optimize it for your data and use case. But in many instances, these tools may be good enough as is for your purposes.

If you can, find a tool that is the closest match for your needs. For example, MIQA is a fully open source system (built by us at Kitware in collaboration with Stanford University, University of Iowa, and KnowledgeVis) made explicitly for quality assurance of medical images, including AI suggestions and double-tier review.

Build a specialized tool

If all else fails and you have a lot of data to assess, a specialized application may be an option to consider. Of course, you need to weigh the cost of building a custom tool against the added cost of performing the quality assurance in an ad hoc, manual manner. A specialized web system can be hosted centrally and accessed by the entire assessment team, and when done right, may greatly increase your throughput and improve the accuracy of judgments about the data. If you make your system open source, it has the added benefit of potentially enhancing the quality of data curated in your entire research field. Here are some practical design choices we’ve found useful when building an efficient QA application:

Start with a simple landing page to quickly see your current tranche of data to review so you can get started right away
Present a single, uncluttered view of each item one at a time, be it image, text, video, or other record, so the item can be viewed in its entirety at a glance without additional interaction
Use AI that runs quickly or in real-time to suggest possible deficiencies and speed decision-making
Implement keyboard shortcuts for all interactions (e.g., panning, zooming, scrubbing)
Once a decision has been registered in the system, the app should immediately advance to the next item without additional interaction
Use built-in notifications so that other reviewers are signaled right away if further review is needed
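A few of these choices (keyboard shortcuts, auto-advance after a decision, and notifying other reviewers) can be sketched as a small piece of session logic, independent of any particular UI framework. The class name, key bindings, and decision labels below are hypothetical and are not taken from MIQA or any other tool.

```python
from dataclasses import dataclass, field

# Hypothetical key bindings; a real tool would choose its own.
KEY_TO_DECISION = {"g": "good", "b": "bad", "u": "needs_second_review"}


@dataclass
class ReviewSession:
    queue: list                       # item ids in the reviewer's tranche
    position: int = 0                 # index of the item currently shown
    decisions: dict = field(default_factory=dict)
    notifications: list = field(default_factory=list)

    def current(self):
        """Return the item being shown, or None when the queue is done."""
        if self.position < len(self.queue):
            return self.queue[self.position]
        return None

    def press(self, key):
        """Record the decision bound to `key` and advance to the next item.

        Unknown keys are ignored. Returns the item now being shown.
        """
        item = self.current()
        decision = KEY_TO_DECISION.get(key)
        if item is None or decision is None:
            return item
        self.decisions[item] = decision
        if decision == "needs_second_review":
            # Stand-in for signaling another reviewer right away.
            self.notifications.append(f"{item} needs a second reviewer")
        self.position += 1  # auto-advance, no extra interaction needed
        return self.current()


session = ReviewSession(queue=["scan_001", "scan_002", "scan_003"])
print(session.press("g"))  # decision recorded, next item shown
print(session.press("u"))  # flagged for a second reviewer
print(session.notifications)
```

Keeping this logic separate from rendering makes it easy to test and to rebind keys later without touching the review flow.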

Are you looking for someone to talk to about curating your data for maximum utility?

Unless you are a developer or have software engineers at hand in your organization, you will need to find the right development team to work alongside you if you decide to adapt or construct an accelerated workflow app. Fortunately, that’s just the kind of thing organizations like Kitware like to jump in and help put together.

Kitware’s Data and Analytics team specializes in applying and building solutions for machine learning workflows, including adaptable quality assurance tools such as Medical Image Quality Assurance (MIQA), which has built-in AI smarts and a streamlined UI, and DIVE, a feature-rich video annotation system.

Start a discussion with us to learn what approach might work best for you.
