How fast are modern languages when it comes to parsing files?
The bread and butter of most bioinformatics programmers (at least the ones I know) is writing parsers for different kinds of output. Packages like BioPython/BioPerl/Bio* provide parsers for common output formats, such as the output of BLAST. But we still have to write parsers for tab-delimited output, and I recently had a short discussion which got me thinking: which language is actually the fastest for that?
My candidates were: C++ (gcc 4.6.3, with the Boost library), Perl 5.14.2, Python 2.7 and 3.2, D, Ruby 1.9.2, Rubinius 2.0.0dev, JRuby 1.6.5 and, because it’s new and I wanted to have a look, Julia 0.0.0+1333823704.r2d0b/2d0bb43e7e.
The pseudocode for all languages except D goes roughly like this:
for each line in file_handle:
    split the line by tab
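As a rough illustration of that loop (this is not the benchmark code itself, which is at the end of the post), a minimal Python sketch might look like the following. The function name `count_fields` and the idea of counting the fields are assumptions added here so the loop has something to return.

```python
def count_fields(path):
    """Split every line of a tab-delimited file on tabs.

    Returns (number_of_lines, total_number_of_fields). The counting is
    only there for illustration; it is not part of the original pseudocode.
    """
    n_lines = 0
    n_fields = 0
    with open(path) as file_handle:
        for line in file_handle:
            # Drop the trailing newline, then split the line by tab.
            fields = line.rstrip("\n").split("\t")
            n_lines += 1
            n_fields += len(fields)
    return n_lines, n_fields
```

Called on a file with 8 columns per line, the second value would simply be 8 times the first.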
You’ll find the code for each language at the end of this post. Feel free to criticize! My D implementation in particular is weird, as D apparently doesn’t support iterating over the lines directly.
Now to have a look at how fast each language is! I used a 90 MB tab-delimited file with 8 columns for this. I played around with bigger files, but the time needed in each language just grew linearly anyway.
Here’s how I measured: I ran each implementation on the same file on my machine 5 times and took the fastest time because I’m nice. I measured the time using the language’s own time-library or, if not available, Bash’s
Here’s the list ordered by time, all in seconds:
- C++: 0.49
- D: 0.76
- Python 2.7: 0.82
- Python 3.2: 1.30
- Perl: 1.31
- JRuby: 1.32
- Ruby: 2.19
- Julia: 4.34
- Rubinius: 8.15
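For anyone who wants to reproduce numbers like these, the "run it 5 times and keep the fastest" measurement could be sketched in Python roughly as follows. The helper name `best_of` is made up for this example, and `time.perf_counter` needs Python 3.3 or newer, so this is an assumption about how one might do it today rather than the exact timing code used for the list above.

```python
import time


def best_of(func, runs=5):
    """Call func() `runs` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()  # requires Python 3.3+
        func()
        timings.append(time.perf_counter() - start)
    # Keep only the fastest run, as described in the post.
    return min(timings)


if __name__ == "__main__":
    # Stand-in workload; replace it with the actual parsing function.
    fastest = best_of(lambda: sum(range(100000)))
    print("fastest of 5 runs: %.4f s" % fastest)
```

Taking the minimum rather than the mean filters out runs that were slowed down by other activity on the machine, at the cost of showing each language in its best light.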
I feel it’s unfair to include Julia as it’s a very new language, but I just had to check out these claims of “being very fast”. Since I started comparing the languages, I have been in contact with its creators, and its speed has already increased greatly to reach the figure shown here.
Also surprising to me is the speed difference between Python and Perl: a lot of people told me I should suffer more Perl because it’s so much faster than Python, which (at least in this case) it isn’t! Another surprise is JRuby, being up there with Python 3.2. I would take the listing of D with a grain of salt, as the code I’ve written for it differs quite a lot from the other implementations. Anyone got a better one?
tl;dr: Python is faster than Perl, and what happened in Python 3.2? Also, Ruby is becoming quite a fast language thanks to JRuby.
tl;ak (too long; already knew): C++ is fast.