Reading Notes: Thinking with Data

I wasn’t really sure whether to pick up Max Shron’s Thinking with Data until I read the Preface in the book preview, which describes the motivation behind the book and outlines the topic of each chapter. These happen to be exactly the things I have had to ponder over the past few months of building a data product from the ground up, yet they are seldom discussed in other data-related books. Finding this book feels like finding another friend who has gone through the same thing and wants to tell you, “hey, your thoughts and concerns are indeed valid; here is some more advice that would be useful for you”.

In this post I’ll share some of my personal notes and takeaways from the book. The book is organized into several sections, but I think we can break it down into two large themes: scoping and arguments.

Scoping

The motivation for this section is that when working with data, most of us start at the wrong end: we get our data, apply our favorite techniques, evaluate the results, and call it a day. Is that all there is to it? Not really. Shron argues that this approach leads to shallow arguments and narrow questions.

We need to have a space to think.

I’m guilty as charged. I love thinking. I wrote this a few years ago when I was organizing a school event. I saw the value of thinking—argued for it, in fact:

It might sound time-consuming, and it’s easy to fall into thinking that finding the fundamental pieces is not really worth it. But speaking from personal experience, going in the wrong direction because you don’t know what actually matters is even more time-consuming. When that happened, I spent so much time looking for a way to make a detour. Sometimes I couldn’t afford a detour at all. Other times I didn’t even know what to fix because I didn’t understand the fundamentals; I didn’t know where the detour was. It’s the worst feeling, knowing that you could have done so much better had you just put in a bit more effort to think.

But I feel I barely do it anymore these days. The difference was that back then I was in a role that tasked me with seeing the big picture, a role that didn’t require me to go into execution mode (doing so would have meant micromanaging people). I wasn’t taking many classes, so yay for more free time. I had all the time in the world to think, to navigate all the vagueness that comes with the activity of thinking. Fast forward a few years: in between all of my current responsibilities, as I constantly shift between thinking and executing mode, it doesn’t take long for deep thinking to take a backseat.

Shron acknowledges that this is challenging, which is why he argues that we need structure. For data problems, we can start by scoping the problem itself. The first part of the book largely builds on the four parts of a project scope that he introduces: the context of the project, the needs the project is trying to meet, the vision of what success might look like, and the outcome, that is, how the organization will adopt the results and how their effects will be measured.

I won’t go any further into the content of the book itself; instead, I’ll outline some of the takeaways I got from the first part of the book:

  • Think about the what and so what before the how

Coming from an engineering background, I got tripped up by this a lot. My knee-jerk reaction whenever I come across a potential problem is: how? How would we engineer this? How complicated would the tech be? I would immediately go into “execution mode”. While these considerations have their merits, things can get convoluted when you think about them before even knowing the what.

  • Listen to the users

This is something I learned from studying UX (I guess I’m also partial to UX because I spent a year trying to pursue a career in it before switching to data). When building a data product, I believe it is important to sit down and engage with the users as early as possible to understand their needs, pain points, and possible pitfalls.

  • Write down your ideas to identify flaws

Interestingly, the previous point does not mean we shouldn’t generate our own ideas. Shron suggests we write them down, though the purpose is not to cling to them but to see their flaws.

  • Start from the needs, not the solutions

Starting from needs prevents us from getting too microscopic and fixating on a single possibility; an open-ended, action-oriented formulation opens up more possibilities. This means we should not start with “we need a dashboard” or “we need a predictive model”: these are not needs, they are potential solutions. Each of them is usually just one of several possible solutions, and different needs may call for different approaches.

  • Larger projects may have messier beginnings

Problem definitions for large applied problems are messier than those of toy problems. Information is often incomplete, expectations are miscalibrated… this is something I used to have trouble accepting: at some point I thought it was my fault for not having everything in order immediately. I have now accepted that real-world data problems do not come with a paragraph-long problem statement and a tidy CSV dataset.

I guess this is why I found it hard to imagine myself reading this book and internalizing the structure a few years ago, when I was mostly playing around with toy problems. The structure that Shron introduces shines the brightest when we are faced with messy, ambiguous problems.

  • Document every minor intellectual and technical decision

Why? a) Reproducibility, b) to catch errors.

For me, these technical decisions are things like: why are we raising the threshold to X? Why are we choosing X over Y?

As much as I love documentation and writing in general, this is one of those things that fall into the “easier said than done” category for me, not because I don’t think it is important but because it is one of the easiest things to procrastinate on.

One suggestion I recently gave to my team is to block two hours every two weeks to update our documentation, as a way to “force” ourselves. But then again, two weeks might be a long time; we may have already forgotten some details by the time we sit down to write. (This is a recent suggestion, so we haven’t had the chance to evaluate it yet.)

On a separate note, documentation for data projects could be a whole topic on its own. If any of you have suggestions, or you have already found a framework that works for you, please let me know!

  • Invest in building a good mental library

We need vision to tell us where we are going before we do the work of acquiring data, performing experiments, building models, and so on. If you do it the other way around and start with the work, you end up with a narrow vision instead.

Here’s an example of what a vision might look like, taken directly from the book:

“The developers working on the corruption project will get a piece of software that takes in feeds of media sources and rates the chances that a particular politician is being talked about. The staff will set a list of names and affiliations to watch for. The results will be fed into a database, which will feed a dashboard and email alert system.”

Vision largely comes from experience, and as you gain more of it, you will notice that the ideas you come up with are mostly variations on things you have seen before. We don’t have all the time in the world to build every possible product from the ground up, so it is important to invest some time in building your mental library. You can do this by reading, talking to other people, following blogs, attending conferences or meetups, and experimenting with new ideas all the time.

Personally, this is an issue I’ve been grappling with. I’ve been trying to avoid consuming too much, which has resulted in me reading fewer books and limiting my learning to things related to my current work (something that was also reflected in my 2019 year-in-review post). The idea of investing in a good mental library made me revisit this approach: my definition of “things related to my current work” may itself be shaped by my narrow vision. If I limit myself, I will never break out of that cycle and will keep circling around the same approaches.

So right now, I’m pretty much reverting to my old self: I’ll read and learn whatever I feel like learning—with moderation, of course.

  • Refine your vision by improving your intuition about the problem

The way I see it, building a good mental library is just one of the things you can do to improve your intuition. Other approaches include:

  • Kitchen Sink Interrogation: come up with questions. Good questions are new ways to frame a problem. Here you’re not looking for answers; the focus is on generating the questions and getting a sense of what is known and what is unknown.
  • Work backward from the vision to where we are now. Ask questions that will reveal the “blanks”. This can help us figure out that a task is not feasible before we commit resources to it.
  • Interview other people: they may not have exactly the same problem as we do, but other people’s perspectives are invaluable for building intuition.
  • Roleplaying: put yourself in the shoes of the users or stakeholders of your product.

My only gripe about this part is that although Shron acknowledges that scope evolves over time, he does not really discuss scope creep in the context of data projects. That feels like a missed opportunity, because knowing when to say “stop” is just as important as refining and evolving your scope.

Arguments

  • Arguments help us put tools and techniques into their proper place

When we’re working on a data problem, we build up our solutions based on our (limited) observations. Observations alone, however, are not enough—we need to connect our observations to the processes that actually shape our world.

But this sounds like statistics, I thought. We seldom, if ever, have the time and money to collect data on the whole population. So we come up with estimates, and we have statistical techniques to ensure that our estimates don’t stray too far from the real world. These tools help us make the case that our estimates are sound.

But even so, no single statistical tool is sufficient: we still have to make the case that it is appropriate beyond how well it fits the data. This is just one example of how arguments come into play, but I think it is the best one to illustrate why arguments matter in the context of data problems: they help us put tools and techniques into their proper place.

  • Arguments are not only for arguing with other people

Another thing that comes to mind when I hear the word “argument” is people arguing and shouting over each other, probably because the word brings back memories of my good old debating days.

But arguments are not merely for arguing with other people. A good understanding of arguments can help us get complicated ideas across, make our results more coherent, present our findings systematically, convince ourselves and others that our tools do what we expect, make sense of our solutions before releasing them into the wild… and much more.

  • Arguments are centered around claims, but we also need evidence and justification

I think the need for evidence when making an argument is pretty obvious. But one important thing I personally often miss is the justification: why this evidence should compel the audience to believe our claim.

Some types of justifications:

  1. Self-evident: the claim is the evidence. For example, we claim that Tuesdays had the highest average sales in the past year, and our evidence is a tabulation of sales for each day over the past year. No further justification is needed.
  2. A model as evidence, e.g. a regression model. When we present a regression model as part of our solution, there is an implicit sub-claim: the model is accurate enough. Our justification could be something like cross-validation with the error measured in dollars. (A rough sketch of both examples follows below.)
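
To make this concrete, here is a rough, hypothetical sketch of what those two justifications could look like in code. The daily-sales data, the one-day-lag feature, and the dollar-denominated error metric are all invented for illustration; they are not from the book.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# 1. Self-evident justification: the tabulation itself is the evidence.
#    `sales` is a made-up dataframe with one row of total sales per day.
rng = np.random.default_rng(0)
sales = pd.DataFrame({
    "date": pd.date_range("2019-01-01", "2019-12-31", freq="D"),
    "amount": rng.gamma(2.0, 100.0, size=365),
})
avg_by_weekday = (
    sales.assign(weekday=sales["date"].dt.day_name())
         .groupby("weekday")["amount"]
         .mean()
         .sort_values(ascending=False)
)
print(avg_by_weekday)  # if Tuesday tops this table, the claim needs no further justification

# 2. A model as evidence: the sub-claim "the model is accurate enough"
#    gets its own justification, e.g. cross-validated error in dollars.
X = sales[["amount"]].shift(1).dropna()  # yesterday's sales as a toy feature
y = sales["amount"].iloc[1:]             # today's sales, in dollars
dollar_error = -cross_val_score(
    LinearRegression(), X, y,
    scoring="neg_mean_absolute_error", cv=5,
).mean()
print(f"Cross-validated error: about ${dollar_error:,.0f} per day")
```
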
  • Be aware of rebuttals to our arguments

This is probably etched into the mind of everyone who has spent time debating: rebuttals matter because they help us make a watertight argument. Thinking about rebuttals is just as helpful in the context of data problems. For example, in a drug trial run as a randomized controlled trial, we might justify our claim of a causal relationship by the trial design itself. Common rebuttals would be along the lines of: what if the sample size is too small? What if the randomization was improper? What about other confounding factors?

  • Patterns of reasoning provide us a bag of tricks that we can use to build our arguments

The great thing about studying arguments is that they have long been studied, so we can draw inspiration from those prior studies and apply them to data problems. Once you know the common patterns of argument, you will a) start noticing that these patterns come up over and over again, even in data work, and b) be able to come up with rebuttals to your own arguments faster.

Of course, we cannot just lift argument structures from other disciplines and apply them to data problems without changing anything; we need to adapt them to our needs. This is the focus of the last third of the book.

There are many, many examples of these patterns, and they are worth going through one by one. By the end of the section I was pretty awed, because they have definitely come up a lot in my own work on data problems. I just didn’t know there were names for these patterns, and knowing the common rebuttals for each one helps me a lot in making sure my arguments are sound.

Here is an example: the general-to-specific pattern is reasoning from a general pattern to conclusions about particular examples. With data, this is something like using a statistical model to make inferences (building a statistical model, on the other hand, corresponds to a specific-to-general pattern). The common rebuttal would be something along the lines of “this example may not have the properties of the general pattern” or “this is probably an outlier” (see the sketch below).
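
As a minimal, hypothetical sketch of the two directions of this pattern (the data and the linear model here are invented for illustration, not taken from the book):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Specific-to-general: fit a model on particular observations to claim a
# general relationship (here, a made-up linear trend with noise).
x_observed = rng.uniform(0, 10, size=200).reshape(-1, 1)
y_observed = 3.0 * x_observed.ravel() + rng.normal(0, 1, size=200)
model = LinearRegression().fit(x_observed, y_observed)

# General-to-specific: use the fitted (general) model to say something about
# a particular new example. The stock rebuttals apply: this case may not share
# the properties of the general pattern, or it may simply be an outlier.
x_new = np.array([[7.5]])
print(model.predict(x_new))
```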

Conclusion

I think anyone who wants to be a good data practitioner, beyond fitting models, should read this book cover to cover. That said, if I had read it a few years ago, fresh out of school with no idea of what a real-world project looks like, I would probably have skimmed through it without getting much out of it. I’d still encourage my old self to read it and take notes, but I’d also tell her to revisit it sometime later.

I also think this book may be too elementary for those who have had training and experience in project/product management or business analysis, so if you already have ample experience in those areas, you may want to skip this one.

Personally, though, this book resonates with me so much, and I feel like I read it at the right time, too.

June 13, 2020
