After a good five years of using Perl, the language finally crossed a line and forced me to start moving over to Python. This is a short cautionary tale with two messages:
- No language is suitable for every job
- Languages are only as good as their libraries/modules/packages/etc
I like Perl a great deal; it’s ideal for almost any task that can’t be achieved trivially on the command line with a few pipes and some awk. Perl is fast and easy to write, and it’s perfect for text editing and tying other bits of code together. You could even use it as a flexible wrapper for bash commands and scripts, as I have mentioned before.
Recently, however, Perl let me down in a big way. I wanted to run thousands of t-tests on thousands of arrays of numbers and report some basic metrics such as p-values. The hitch was that the input files were enormous text files (several GB). My language options were:
- R, with its simple built-in t.test() function, should have been great, but the input size was a problem. Reading the whole input file into memory was viable, as I work on systems with a lot of memory, but it’s not particularly fast or smart to do so. I could also catch up on some R reading and find a way to do the calculations as the data is read in, but on the other hand…
- Perl has excellent IO capabilities and can produce results as the files are read in, making the code almost as fast as the files can be read and written. How very efficient – all I needed was to find a nice t-test module.
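For illustration, the "produce results as the files are read in" idea might look something like the sketch below, written in Python using only the standard library. The input layout is purely an assumption of mine (one test per line: a label, then two comma-separated groups of numbers, separated by tabs), and the names `welch_t` and `stream_t_tests` are hypothetical. It computes Welch's unequal-variance t statistic and degrees of freedom, which is also what R's t.test() does by default. Turning those into a p-value needs the t distribution, which the standard library does not provide, so that step is left out here.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t statistic, degrees of freedom) for Welch's t-test.

    Both samples need at least two values, and at least one of them
    must have non-zero variance, otherwise the division below fails.
    """
    va = statistics.variance(a) / len(a)  # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)  # squared standard error of mean(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def stream_t_tests(lines):
    """Yield (label, t, df) one input line at a time.

    Assumed (hypothetical) line format: label<TAB>1,2,3<TAB>4,5,6
    Only the current line is held in memory, never the whole file.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, left, right = line.split("\t")
        a = [float(x) for x in left.split(",")]
        b = [float(x) for x in right.split(",")]
        t, df = welch_t(a, b)
        yield label, t, df
```

Passing an open file handle as `lines` means each result can be written out as soon as its line has been read, so memory use stays flat regardless of file size.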
I figured I could solve the problem faster using Perl, so I went fishing in CPAN (Comprehensive Perl Archive Network), the Perl module repository, to pull out a suitable module. The best match was very promising:
Written a decade ago (July 2003), surely the code was mature and the module had been used in anger again and again? The script was quickly completed and set to work, only to fail mysteriously with this cryptic error:
Can't take log of -0.0439586 at /usr/local/share/perl/5.14.2/Statistics/Distributions.pm line 325
After checking for obvious causes like stray zero values or NAs, some debugging revealed that Perl was failing because I was attempting a t-test on a pair of arrays of length 4710. I wasn’t passing in any illegal values; it was failing purely because the input was too big. There’s no way I’m going to accept that when a desktop PC running R can perform t-tests on arrays orders of magnitude bigger.
I decided to solve the problem in a memory-hungry fashion in R for the time being. To be fair to Perl, I haven’t yet attempted to write the same code in Python, but Python famously has both NumPy and SciPy, which are written specifically to augment its built-in maths functions, so I’m optimistic that it won’t fail me.
Shame on you, Perl.