Why I Use Nim instead of Python for Data Processing
Lazy programmers often prefer to substitute computing effort for programming effort. I am just such a programmer. For my research, I often need to design and run algorithms over large datasets ranging into the scale of terabytes. As a fellow at the NIH, I have access to Biowulf, a 100,000+ processor cluster, so it's usually not worth spending a ton of time optimizing single-threaded performance for a single experiment when I can just perform a big MapReduce.
Despite having access to these resources, I increasingly use the Nim programming language for data processing tasks. Nim is under-appreciated in computational science but it is a very capable Python replacement for non-numerical data processing. At a high level, Nim is as easy to write as Python and as fast as C. It's part of a newer generation of languages that have the performance of compiled languages with the expressiveness of dynamic languages. If you know Python, you're 90% of the way to knowing Nim and reaping all of the benefits.
Consider the following simple problem: you have a bunch of DNA sequences (strings composed of the four letters A, T, G, and C) stored in a text file with `>`-delimited comment lines (i.e. FASTA format). A standard task is computing the percentage of Gs and Cs in the sequence, a metric known as the GC content. In Python, a simple implementation would look like this:
```python
gc = 0
total = 0
for line in open("orthocoronavirinae.fasta"):
    if line.startswith(">"):  # ignore comment lines
        continue
    for letter in line.rstrip():
        if letter == "C" or letter == "G":
            gc += 1
        total += 1
print(gc / total)
```
This code takes 23.43 seconds to run on a 150 MB coronavirus genome dataset using my laptop. As an alternative, I can just use Nim and get a huge speed up essentially for free. Indeed, the Nim implementation is more-or-less identical:
```nim
import strutils

var gc = 0
var total = 0
for line in lines("orthocoronavirinae.fasta"):
  if line.startsWith(">"):  # ignore comment lines
    continue
  for letter in line:
    if letter == 'C' or letter == 'G':
      gc += 1
    total += 1
echo(gc / total)
```
These changes, though small in terms of code, are quite large in terms of performance:
| Program | Time | Relative to Nim |
|---|---|---|
| Python 3.9 | 23.43 s | 30.6x |
| PyPy 7.3 | 2.54 s | 3.3x |
| Nim 1.4 | 0.77 s | 1x |
For complete fairness in the comparisons, the Nim compilation process took an additional 702 ms, meaning that a complete optimized compile-and-run cycle is still faster than even using PyPy. Nim makes this easy since there is an `r` command that automatically runs your program after it compiles. Even though Nim is a compiled language, the compilation process is fast enough that it can be used in place of an interpreted one.
Though nearly the same, there are a few differences between the Nim code and the Python code:
- Variables are declared using `var`. Nim also allows run-time constants to be defined using `let`, which are checked at compile time for any possibility of mutation, as well as compile-time constants using `const`.
- The `lines` function is used rather than `open` to iterate over the contents of the file line-by-line. A nice feature of `lines` is that it automatically strips newline characters such as `CRLF`, so we no longer need to call `rstrip` on each line.
- Nim uses `echo` rather than `print`. A nice feature of `echo` is that the Python 2.x era syntax without parentheses is also valid (e.g. `echo "Hello world!"`).
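To make the declaration keywords concrete, here is a minimal sketch (the identifier names here are my own, not from the examples above):

```nim
var counter = 0    # mutable variable; reassignment is allowed
counter += 1

let fileName = "orthocoronavirinae.fasta"  # run-time constant; reassigning it is a compile-time error

const k = 25       # compile-time constant, evaluated during compilation

echo "counter is ", counter  # echo works with or without parentheses
```

Using `let` by default and `var` only where mutation is needed lets the compiler catch accidental reassignment for free.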
To illustrate just how useful Nim is, consider this real-life example I encountered in my research. I had a lot of palindromic artifacts within my data, which was messing up my hunt for new viroids. A hallmark of these artifacts was the presence of long subsequences and their reverse complements (the subsequence backward and with each base replaced with the corresponding pair). To gauge the scope of the problem, I needed to go through each sequence in my dataset, see if it was a likely artifact, and remove it. Let's first look at the standard Python way to do it:
```python
import sys

from Bio import SeqIO

# iterator for subsequences of length k
def kmers(seq, k):
    for i in range(len(seq) - k + 1):
        yield seq[i:i+k]

for record in SeqIO.parse(sys.argv[1], "fasta"):
    unique_kmers = set()
    palindrome = False
    for kmer in kmers(record.seq, 25):
        if kmer.reverse_complement() in unique_kmers:
            palindrome = True
            break
        unique_kmers.add(kmer)
    if not palindrome:
        print(record.format("fasta"))
```
And in Nim:
```nim
import os
import sets
import bioseq  # my library, has k-mer iterator and FASTA parsing

for record in readFasta[Dna](paramStr(1)):
  var uniqueKmers = initHashSet[Dna]()
  var palindrome = false
  for kmer in kmers(record, 25):
    if kmer.reverseComplement() in uniqueKmers:
      palindrome = true
      break
    uniqueKmers.incl(kmer)
  if not palindrome:
    echo(record.asFasta)
```
Again, these programs are essentially identical. The salient differences here are:
- Nim uses `paramStr(1)` from the `os` module rather than `sys.argv[1]` to read the command-line argument.
- The standard casing style is `camelCase` rather than `snake_case`. Nim treats identifiers as equal if they are the same after removing capitalization (except for the first letter) and underscores, which means that you can use whichever style you want.
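That identifier rule can be checked directly; a small sketch (the names are my own):

```nim
import sets

var uniqueKmers = initHashSet[string]()
unique_kmers.incl("GATTACA")  # same identifier as uniqueKmers: underscores and non-leading case are ignored
echo uniqueKmers.len
```

Both spellings resolve to the same variable, so code that mixes libraries written in different styles still reads consistently.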
Despite being equally easy to write, the Nim code is 20x faster. Assuming no significant algorithmic differences, it will be impossible for pure Python to beat pure Nim at raw performance since, under the hood, the Nim compilation process generates a C file which is in turn compiled by the C compiler of your choice. Because Nim generates standard C output, it is compatible with Python: you can call Python from Nim and call Nim from Python.
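As a sketch of the Nim-to-Python direction, the third-party `nimpy` library (my assumption here; it is not mentioned above) lets you mark a proc for export and compile the file as a Python extension module:

```nim
# gctools.nim -- compile with: nim c --app:lib --out:gctools.so gctools.nim
import nimpy

proc gcContent(dna: string): float {.exportpy.} =
  ## fraction of G and C bases, callable from Python as gctools.gcContent(...)
  var gc = 0
  for letter in dna:
    if letter == 'C' or letter == 'G':
      inc gc
  result = gc / dna.len
```

From Python, the compiled module is imported like any other: `import gctools; gctools.gcContent("GATTACA")`.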
While Nim makes it easy for experienced Python users to translate their knowledge, where it begins to shine as its own language (rather than a faster Python) is when you write more idiomatic code. I find that Nim programs are usually shorter and easier to write than their Python counterparts as a result of Nim's other great language features, but these features are beyond the scope of a single article. I have used Nim regularly for the past year as a daily driver and continue to be impressed by its performance, simplicity, and elegance. Give Nim a try next time you want to process data: your CPU will thank you.
Benjamin D. Lee is an NIH-OxCam scholar pursuing his doctorate at Oxford University. His research is focused on the computational identification and analysis of novel viruses and virus-like agents to better understand their evolution and origin.