Making Your First Open Source Contribution, Part 3: Navigating New Codebases

— 8 minute read

This post is part of a 3-part series titled "Making Your First Open Source Contribution".


You've got your development environment set up, and everything's working wonderfully. Sweet! Time to dive into the codebase... except...

You're lost.

This part is mostly about navigating a new codebase, especially a large one, so the advice below probably applies to any large codebase & not just open source projects. When you're looking to contribute to open source, though, it might take a while to find a project that feels right for you, & as a consequence you'll be meeting new codebases more often.

Navigating new, large codebases can be especially challenging for someone who:

a) is currently in school, with no access to large codebases (this was me!)

b) mostly works in self-initiated or small-sized codebases (me, currently!)

At a glance, this post might seem relevant only to contributions that involve writing code, but I've personally used these tips when contributing to documentation too. Sometimes documentation is so tightly coupled with code that it's impossible not to touch the project's codebase. Besides, when writing documentation for a piece of code, you still need to understand what the code does & sometimes how it interacts with other functions in the project.

Use it

You might come across a project that you've never used before, but you want to contribute to it, & that's fine! In fact, although I use the pandas library daily, I don't use all of its functions, so I did find myself working on something that I'd never used before.

My first tip is: use it. If it's a new library, go through its tutorials or "getting started" guides, & play around until you're comfortable enough with it.

If it's a new function, check out the documentation, run the examples or use it in a toy problem so you can get a better intuition on the problem you're trying to solve.

Even if it's a function that you have used before, you might need to modify parts that you're not familiar with, e.g. parameters that you've never had to use before, so it's probably still useful to do the things mentioned above.

Explore the tests

Sometimes it's not very clear from the documentation what a function is supposed to do. Sometimes there is no documentation at all. If that's the case, usually what I'd do next is explore the tests, especially unit tests, if there are any. Unit tests are great to learn from because they can show you how to correctly invoke a function or show you the expected behavior of a piece of code.

Tests can usually be found in their own folder, such as /tests.

Here's an example from pandas. Let's say that you want to know how to use the function rename_categories for CategoricalIndex & what should happen when you use it.

Test for the rename_categories function.

The test can give you some idea that, okay, if I have the following CategoricalIndex:

ci = CategoricalIndex(list("aabbca"), categories=list("cab"))

And then I apply the rename_categories function:

result = ci.rename_categories(list("efg"))

I'm supposed to get back a:

CategoricalIndex(list("ffggef"), categories=list("efg"))
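
If you want to confirm that reading of the test yourself, the three snippets above can be combined into a short script. This is only a minimal sketch, assuming pandas is installed; the `expected` variable and the use of `pd.testing.assert_index_equal` for the comparison are assumptions added for illustration, not necessarily what the pandas test itself uses.

```python
import pandas as pd

# Build the index from the test: values "aabbca" with categories ordered c, a, b.
ci = pd.CategoricalIndex(list("aabbca"), categories=list("cab"))

# Rename the categories positionally: c -> e, a -> f, b -> g.
result = ci.rename_categories(list("efg"))

# The index we expect back according to the test.
expected = pd.CategoricalIndex(list("ffggef"), categories=list("efg"))

# Raises an AssertionError if the two indexes differ.
pd.testing.assert_index_equal(result, expected)
print(list(result))
```

Running something like this in a scratch file is a quick way to check your understanding before you touch the real code.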

Find keywords in the issue & use them to find relevant parts in the codebase

I usually extract important keywords from the issue, type them into the search bar of my code editor (I use VS Code) & see which pieces of code pop up & where.

For example, I worked on an issue where I had to update the index parameter in pandas' to_parquet. The first thing I did was search for to_parquet in my code editor to see where the function is defined:

Searching for to_parquet in the pandas codebase.

There are a lot of search results including other pieces of code that are calling the function to_parquet, instead of the to_parquet function itself. For this issue, I'm not interested in these other parts of the codebase, so I had to narrow down my search.

I searched for def to_parquet instead. In Python, the keyword def starts a function header, so I can be sure that the results point to the to_parquet function itself rather than to its callers. Of course, other programming languages will be different. The key here is that sometimes you need to think of tricks that can help you get better search results.

In this case, there are two to_parquet functions that I had to update.
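
The same trick works outside the editor too. Below is a minimal sketch of a script that looks for lines starting with def plus the function name under a folder; the helper name `find_function_definitions` and the regular expression are hypothetical, made up for illustration, and plain text matching like this can also hit a definition that sits inside a docstring or string literal.

```python
import re
from pathlib import Path


def find_function_definitions(root, name):
    """Yield (path, line_number, line) for lines under `root` that define `name`."""
    # Match "def name(" at the start of a line, allowing indentation (methods)
    # and an optional "async" prefix. Calls such as "df.name(...)" do not match.
    pattern = re.compile(rf"^\s*(?:async\s+)?def\s+{re.escape(name)}\s*\(")
    for path in sorted(Path(root).rglob("*.py")):
        try:
            lines = path.read_text(encoding="utf-8").splitlines()
        except (OSError, UnicodeDecodeError):
            continue  # skip files that cannot be read as UTF-8 text
        for number, line in enumerate(lines, start=1):
            if pattern.match(line):
                yield path, number, line.strip()
```

From the root of a local checkout, looping over `find_function_definitions("pandas", "to_parquet")` & printing each result would list the files & line numbers to open first (here "pandas" is an assumed folder name, so adjust it to your own checkout).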

Search for similar issues & PRs

Other people might have made PRs that solved problems that are similar to the one that you're solving right now. You can use the keywords from the issue to search for other similar issues & PRs. A few things that you can learn from reading other issues & PRs:

  • Possibly relevant code & files: if the previous steps didn't work for you, this can help. In GitHub, you can find these by checking out the "Files Changed" tab in the PR. Here is an example.
  • Pointers on what to do: although the PR that I'm looking at is not solving the same exact problem, sometimes they do give clues on what I can do to solve my problem, e.g. an existing helper function that I didn't know about that can simplify my solution.
  • Feedback from maintainers: oftentimes, maintainers request changes before they approve your PRs. These requests are well recorded in the thread within the pull request, & there's always a thing or two that I can learn from them.
  • Bugs: a PR can introduce new bugs, which are often discovered after the PR is approved & merged. Learning about these bugs helps me become aware of the kinds of bugs that I may possibly introduce with my PR.

Most projects have platforms where discussions about the project's development are open to the public, be it Slack, Gitter, a mailing list, or other channels. These are usually listed either in the README or in the contributing guide. You can search for related discussions because it's possible that others have asked similar questions, but of course you can ask your own question as well... which brings me to my next point.

Ask for help

You might have done all of the above & still be stuck. That's fine! Don't be afraid to ask for pointers: you can do this by raising a question in the relevant issue or asking in the dev channel (see above). You might find this scary at first, but if the project you're working on has a Code of Conduct (it really should!), it can serve as a reminder that inappropriate behavior is not tolerated.

From browsing various repositories & joining communication channels, I also learned that people do ask questions all the time & it's OK! I guess I had this assumption that everyone (but me) knows everything & this also contributed to how I initially perceived open source: intimidating & overwhelming. Seeing how people ask questions & how maintainers positively respond really helps shatter that unrealistic assumption.

Diving into new codebases is not a trivial thing, so if you feel like you're having difficulty making progress, it's totally normal. Even the most experienced programmers still need time to understand a new codebase.

Final notes

One last thing I want to emphasize: you don't have to get it perfect the first time.

Your first contribution—or the ones after, really—does not have to be a pull request that provides a major feature with changes of thousands of lines of code. Your first pull request does not have to be fault-free—sometimes you mess up your git to the point that the only solution you can think of is deleting your repository & redoing your work (we all have been there, haven't we?). It's okay if you forget to write your commit message with the correct prefix per the convention. In fact, you might find that these hiccups still happen in your second, third, fourth... hundredth contribution. You'll find that it's not the end of the world. You'll learn. You'll continue contributing anyway.

The most important thing is to get started, & I hope this 3-part series helps you to do just that. :)