When trying to improve the speed of our code/program, it’s easy to fall into the “should I use multiprocessing/multithreading” trap, only to find that the speed-up isn’t that fast and we think our code is a lost cause. However, there are a few things that we can do before we go there:
Do you really need to run your program for the entire dataset? The simplest example: filter your rows based on some columns. Or try to see if there’s a pattern in which you can easily single out rows that are irrelevant. Patterns are cool. Example: I was trying to pair two strings by way of string matching between the two columns. This was, what, 10000 strings vs 50000 strings. I realize that (in my case) there’s very little chance that two strings with different first three letters are actually relevant. So now I only have to compare 10000 strings vs 100-ish strings. By making sure that you’re only running your program on the data that you need, you’re already doing yourself a huge favor.
Indexing. Ugh, I forget this one a lot, but this is essential. Be careful, though: indexes require additional spaces, and having too many indexes will eat up the space that you have. Choose which fields to be indexed wisely!
Know your language’s/library’s quirks. I guess each programming language/library has its own quirks, so know them well. Example: panda’s
applyis notoriously slow. Funnily enough, a simple hack with cython would yield you the 1/3 of the original speed. Another example: as discussed in my previous post, unlike other programming languages, using Python’s
threadingmodule would most likely get you nowhere thanks to the Global Interpreter Lock (GIL).
That’s all I can think of right now. More to come. :-)