What We Don't Talk About When We Talk About Data

— 10 minute read

When I was in school, my idea of data science was mainly shaped by: a) what was taught in class, and b) Kaggle. Little did I know how wildly different things are in real life. I learned that you need to do a lot of things before you get to run model.fit(), & there are (still) a bunch of other things you need to do after you finally run model.fit() in order for your data science product to make an impact for its intended users.

Dealing with datasets falls into the former category. Unlike problem sets at school or on Kaggle, real-life situations don't hand you a clean (or near-clean) dataset that you can use right away. You have to spend time crafting one that suits your problem. A lot of thought needs to go into it, because feeding your model a low-quality dataset vs a high-quality one can have a significant impact on its performance, perhaps even more than tweaking the algorithms themselves.

Here's the cold hard truth: most of the time, the dataset of your dreams might not exist.

We will often face situations where we have a very limited dataset or, worse, our dataset does not exist yet. This can happen for a few reasons:

Our company didn't know it would need the data, so it never collected it. This is less likely in large companies with the resources to maintain a data lake, but there's still a possibility that it happens. There's probably not much you can do other than convincing your team that you need to start collecting this data & justifying the budget & time needed to do it (I'd argue this is a skill of its own).

The data does exist in a data lake, but it needs to be transformed before you can start using it. That transformation might end up in someone's backlog and not get picked up for the next few sprints. Either way, you won't have the dataset you want immediately.

We have our data, but it is not labeled. This is a problem if you are trying to do a supervised learning task. A quick solution might be to hire annotators, e.g. through Mechanical Turk, but this might not be possible if you have domain-specific data or sensitive data where masking makes the annotation task impossible. I've also seen companies list "Data Labeler" among their job openings, but you have to think about whether it makes sense to hire someone full-time to label your data.

Once you have annotators, you might also want to strategize about the kind of labels you need so that they stay useful for future cases, saving the cost and time of labeling the same data twice. For example, if you need to label a large set of tweets as "Positive" vs "Negative", you might anticipate future needs by asking for more granular labels instead (e.g. "Happy", "Sad", etc.).
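As a rough illustration of why granular labels pay off later — a minimal sketch (the column and label names here are made up) showing that collapsing "Happy"/"Sad"-style labels into coarser "Positive"/"Negative" ones is a one-liner, while going the other way would mean sending the data back to the annotators:

```python
import pandas as pd

# Hypothetical annotated tweets with granular emotion labels
tweets = pd.DataFrame({
    "text": ["best day ever", "i miss you", "so proud of the team", "this is awful"],
    "emotion": ["Happy", "Sad", "Happy", "Angry"],
})

# Collapsing granular labels into coarse sentiment is a simple mapping...
emotion_to_sentiment = {"Happy": "Positive", "Sad": "Negative", "Angry": "Negative"}
tweets["sentiment"] = tweets["emotion"].map(emotion_to_sentiment)

# ...but going the other way ("Positive" -> "Happy"? "Proud"?) would mean
# re-annotating the same tweets.
print(tweets)
```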

I've been in situations where, to kickstart a project, I had to start labeling some of the data myself. This was either because the dataset did not have any labels at all or because the existing labels were low quality.

Now I can hear some of you sighing, "do I have to label my own data?!" If you spend more than a few hours on it, I don't think it's a good use of your time. But the silver lining is that you get to know your data in ways that are probably not possible through plotting graphs. For example, from this exercise I learned that humans typically misclassify certain classes, so the labels related to those classes are probably not reliable. I also learned that it does not make sense to have both class A and class B because they are too similar. Andrej Karpathy did this exercise on the ImageNet data & wrote about what he learned on his blog.

Though going through your data manually is an exercise I find very helpful, going through 10,000 rows on your own probably does not make much sense. My rule of thumb is to spend three hours on it at most. Sometimes you just have to accept that your labels are scarce and work with what you have. In such cases, you can:

  • Explore whether there is a way to tackle the problem with an unsupervised approach (a small sketch follows this list), or
  • Explore self-supervised methods—these are worth exploring if you have some data that are labeled, but you have even more data that are not labeled.
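For the first option, here's a minimal sketch of what an unsupervised starting point could look like, using scikit-learn with made-up example texts — cluster the unlabeled data and eyeball what lands together before committing to a labeling effort:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical unlabeled texts
docs = [
    "refund not processed after two weeks",
    "love the new update, super fast",
    "app crashes every time i open the camera",
    "great customer service, thank you",
]

# Vectorize and cluster; the number of clusters is a guess you'd iterate on
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)

# Eyeball which documents land together before deciding whether/what to label
for doc, label in zip(docs, labels):
    print(label, doc)
```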

We have our data, but it is weakly labeled. These labels do not necessarily correspond to the target of your model, but you can use them as a proxy for it. For example, you may not have data on whether a user likes an item or not, but perhaps you can infer that information from the number of times the user views the item.
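To make the "views as a proxy" idea concrete, here's a minimal sketch (the column names and the threshold are assumptions, not a recipe) that derives a weak "likes_item" label from view counts:

```python
import pandas as pd

# Hypothetical interaction log: we never observe "likes", only views
views = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item_id": ["a", "b", "a", "c"],
    "view_count": [7, 1, 2, 12],
})

# Weak label: treat repeated views as an (imperfect) signal of interest.
# The threshold is arbitrary here and is exactly the kind of assumption
# you'd want to sanity-check against whatever ground truth you can get.
views["likes_item"] = (views["view_count"] >= 3).astype(int)
print(views)
```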

Not all hope is lost

Though methods such as deep learning oftentimes triumph with a large dataset, working with limited data does not mean that all hope is lost. Just a couple of examples:

  • Transfer learning makes it possible to learn from limited data if there is a pre-trained model that generalizes to your use case too (see the sketch after this list).
  • Data augmentation strategies are also getting more advanced, & a lot of these are already baked into existing frameworks. For example, FastAI already implements MixUp in their framework. You can also use the results of AutoAugment's policies to see if it works well with your data.
  • Self-supervised learning
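As a hedged sketch of the transfer-learning route (torchvision is just an example choice here, and the number of classes is a placeholder): load a backbone pre-trained on ImageNet, freeze it, and train only a small head on your limited data.

```python
import torch.nn as nn
from torchvision import models

# Start from a model pre-trained on ImageNet (assuming your images are
# "natural" enough for ImageNet features to transfer)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the backbone so the limited data only has to fit a small head
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with one sized for your own classes
num_classes = 5  # placeholder: whatever your problem needs
model.fc = nn.Linear(model.fc.in_features, num_classes)

# From here you'd train only model.fc with your usual training loop.
```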

I personally find this topic interesting because, sigh, I can totally relate to the whole very-limited-dataset issue. I'm definitely not the only one who finds it interesting—there's a whole ICLR workshop on the topic too!

But that's not all

Now, even if you already have data that is pretty close to what you want, there is still a long way to go...[1]

You still need to go through the whole data cleansing process. I'll skip the "how" because there are already a lot of tutorials out there (& I have some materials on that myself :)), but I want to reiterate that the process is not as trivial as it sounds. Even something as "simple" as text normalization (if you're using a bag-of-words approach) might not be that simple—does it make sense to remove these stopwords? Does it make sense to remove numbers? What do you do with missing data? Do you impute them with an average or drop them altogether? How do you treat outliers?
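To make the "it's not trivial" point concrete, here's a tiny sketch (the column is made up) of two equally defensible answers to just one of those questions—what to do with missing values:

```python
import pandas as pd

df = pd.DataFrame({"age": [23, None, 31, None, 45], "bought": [0, 1, 1, 0, 1]})

# Option A: impute with the mean -- keeps every row, but invents values
# and shrinks the variance of the column
imputed = df.assign(age=df["age"].fillna(df["age"].mean()))

# Option B: drop rows with missing values -- no invented data, but you may
# throw away a biased subset (maybe users who hide their age behave differently)
dropped = df.dropna(subset=["age"])

print(len(imputed), len(dropped))
```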

We have the formulas but so much of it comes down to what your problem is. I've been asked these questions a lot & my answer is always "it depends" because it really depends.

If you apply data augmentation strategies to diversify your dataset, you also need to be careful: are the augmentations you apply label preserving? Rotating a picture of a coffee cup makes sense because even rotated 180 degrees it's still a picture of a coffee cup, but when you rotate a picture of a '9' by 180 degrees it's no longer a '9'—it's a '6'. Rotating an image is a straightforward example (& my examples here are deliberately simple ones), but sometimes you need domain expertise to determine whether an augmentation strategy is label preserving or not, which makes the task more complex than it might seem at first.
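A small sketch of what that looks like in practice with torchvision transforms (the specific choices are illustrative, not a recommendation): for the digit case, rotation is deliberately left out because of the 9-vs-6 problem.

```python
from torchvision import transforms

# For a coffee-cup classifier, flips and rotations arguably preserve the label
coffee_augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=180),
    transforms.ColorJitter(brightness=0.2),
])

# For digits, rotation (and horizontal flips!) can silently change the label,
# so a safer set sticks to small translations and brightness changes
digit_augment = transforms.Compose([
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
    transforms.ColorJitter(brightness=0.2),
])
```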

You also have to check whether your dataset is reliable. This is a tricky one, I'd say, because the definition of "reliable" differs from case to case, so you really have to define it yourself. Much of it comes down to how well you understand your problem, how well you understand your data, & how careful you are. Some common pitfalls include:

  • Erroneous labels, especially when they are assigned by humans. Going through the data by hand can help you get a sense of this—even if you don't manage to go through all of it, you can estimate the percentage of human error.
  • Missing data. Missing data does not only mean fields with empty values. Say you have a column called "Province" and, across all of your data, there is no value that says "West Java". Does that make sense?
  • Duplicated data. Sounds trivial—we can just call df.drop_duplicates(), no? It depends. If your dataset consists of images, it also depends on what kind of duplication you want to remove (only exact duplicates, or near-duplicates too? see the sketch after this list). You might also want to ask yourself (or your team): why do we have duplicates? Is this expected? Is our data reliable? Is something wrong in the pipeline? Oh, let's go back to the start, sings Coldplay.
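On the duplicates point, here's a minimal sketch of both flavors — exact duplicates in a DataFrame, and near-duplicate images via perceptual hashing (the imagehash package and the distance threshold are my own assumptions, not part of any particular pipeline):

```python
import pandas as pd

# Exact duplicates in tabular data: straightforward, but decide *which*
# columns define a duplicate first
df = pd.DataFrame({"user": [1, 1, 2], "text": ["hi", "hi", "hi"]})
exact_deduped = df.drop_duplicates(subset=["user", "text"])

# Near-duplicate images: one common approach is perceptual hashing
from PIL import Image
import imagehash

def is_near_duplicate(path_a, path_b, threshold=5):
    """Treat two images as near-duplicates if their perceptual hashes are close."""
    hash_a = imagehash.phash(Image.open(path_a))
    hash_b = imagehash.phash(Image.open(path_b))
    return hash_a - hash_b <= threshold  # Hamming distance between the hashes
```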

It might take ages for us to get our hands on the "perfect" data for our problem. We may never get it at all. Sometimes we'll end up with a weakly labeled dataset. Or a small dataset with the labels that we want. Or a small dataset with the labels that we want but can't really trust. Or a medium-sized, messy dataset. Most of the time, there's a better dataset out there that you will never have. C'est la vie.

You can practice this

I personally think building your own side project outside of Kaggle can be a great way to familiarize yourself with these challenges. Start by defining your own problem statement & then search for datasets relevant to your problem, instead of the other way around. If the perfect dataset doesn't exist, then it's your time to practice: fetch the data yourself using the Twitter API, find ways to use weakly labeled datasets... and don't worry if the data is limited—it's a good chance to see what you can do with limited data.
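If you go the fetch-it-yourself route, a rough sketch of what pulling tweets might look like with tweepy (the bearer token, query, and fields are placeholders, and the Twitter API's access rules change often, so treat this as a shape rather than a recipe):

```python
import tweepy

# Placeholder credentials -- you'd get these from the Twitter developer portal
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Pull a small batch of recent tweets matching a query you care about
response = client.search_recent_tweets(
    query="coffee lang:en -is:retweet",
    max_results=100,
    tweet_fields=["created_at", "lang"],
)

# Flatten into rows you can dump into a DataFrame later
rows = [{"id": t.id, "text": t.text, "created_at": t.created_at} for t in response.data]
```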

If you know JavaScript, you can build quick prototypes of anything you want using tools like ml5.js or TensorFlow.js. Otherwise, stick to libraries you're familiar with and use simple classifiers. Here, your focus is observing how different datasets, augmentation strategies, preprocessing strategies, or even splits affect your predictions.
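Sticking with Python for a concrete sketch: same toy data, same simple classifier, and the only thing that changes is one preprocessing choice (whether English stopwords are stripped), so any difference in the score comes from the data handling rather than the algorithm. The toy texts are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Tiny made-up dataset; in practice you'd plug in whatever you collected
texts = ["great product", "terrible support", "not bad at all",
         "would not recommend", "absolutely love it", "waste of money"] * 10
labels = [1, 0, 1, 0, 1, 0] * 10

for stop_words in (None, "english"):
    pipe = make_pipeline(TfidfVectorizer(stop_words=stop_words),
                         LogisticRegression(max_iter=1000))
    score = cross_val_score(pipe, texts, labels, cv=5).mean()
    print(f"stop_words={stop_words}: {score:.2f}")
```

Notice that stripping stopwords also strips "not", which is exactly the kind of preprocessing decision that quietly changes what your model can learn.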

Final notes

In real-world situations, there is a long journey to get to the point where we can confidently say, "OK, now I'm finally going to start training."

Sometimes I find myself spending more time on crafting the right dataset than thinking about algorithms, because improving the dataset usually yields a larger improvement than messing around with hyperparameters all day. It's also more intuitive: most of the time you do know how to improve your dataset—after an error analysis of an image classifier, for example, you might realize that you need more images taken in low-light conditions. When it's no longer possible to tweak my dataset, that's when I move on to other things: tweaking algorithms, hyperparameters, & finally, drumrolls, running model.fit().


  1. I'm aware that the following points still need to be done if you're doing some Kaggle problems, but I'm leaving these here because I find them important to reiterate. ↩︎