In the first version I was going through a big zip file containing some 30 million Java source code files to convert the code into tokens.
However, the conversion tool had no support for stopping and resuming the processing. This meant running the same process 24/7 without any interruptions to complete the task. At the time I was working under a tight schedule and used only a subset of the data with some 4 million source code files.
In the first version I'd basically read the contents of a source code file, convert the content into a set of tokens and then write them to a destination file.
To add a stop/resume feature, one needs to persist on disk a record of the last data that was processed, in order to know where to begin on the next restart. Writing a file to disk every time doesn't scale under this stress. After some hundred thousand writes it outputs a "JVM error". It also wears down the HDD.
Another option I tried was MapDB, which is great for persistence with just a few lines of code. However, calling MapDB to create a checkpoint on a file-by-file basis reduced the performance from 4,000 files/minute to 300 files/minute. Clearly, this would make it difficult to index all the millions of source code files before 2015.
Writing a single line of text to a file is a bottleneck, even if the line is relatively small.
A very simple solution to this problem was adding a memory cache.
Basically, use a normal string to hold the contents of 1000 token files to be written. Once that threshold is reached, the whole text is written to the destination file in a single write instead of 1000 separate ones.
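To make the idea concrete, here is a minimal sketch of such a cache. It is not the code from the tool itself: the class name BatchedTokenWriter and its methods are made up for illustration, and it uses a StringBuilder instead of a plain string to avoid copying the text on every append.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: keeps token output in memory and writes it in batches. */
class BatchedTokenWriter implements AutoCloseable {
    private final Path destination;
    private final int batchSize;                 // e.g. 1000 files per disk write
    private final StringBuilder cache = new StringBuilder();
    private int pending = 0;                     // files cached since the last flush

    BatchedTokenWriter(Path destination, int batchSize) {
        this.destination = destination;
        this.batchSize = batchSize;
    }

    /** Caches the tokens of one source file; returns true if this call wrote to disk. */
    boolean add(String tokens) throws IOException {
        cache.append(tokens).append('\n');
        pending++;
        if (pending >= batchSize) {
            flush();
            return true;
        }
        return false;
    }

    /** Appends everything cached so far to the destination file in one write. */
    void flush() throws IOException {
        if (pending == 0) {
            return;
        }
        Files.write(destination,
                cache.toString().getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        cache.setLength(0);
        pending = 0;
    }

    /** Writes whatever is left in the cache, so the last partial batch is not lost. */
    @Override
    public void close() throws IOException {
        flush();
    }
}
```

With a batch size of 1000, a call to add() only touches the disk on every 1000th file. Anything still in the cache when the process dies is lost, so the final flush on close() matters.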
This simple measure not only made it possible to introduce the checkpoints with MapDB to record the current processing status, it also sped up the processing rate to ~16,000 files/minute. Really fast, just wow. I'm still mesmerized by the output speed, with some 3 million files already converted in a few hours. The whole set should be converted by tomorrow night.
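As a rough illustration of a checkpoint that is saved once per batch instead of once per file, here is a small sketch. Note the swap: it uses a plain text file rather than MapDB, so it shows the idea and not the MapDB calls. The Checkpoint class and its method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical checkpoint store: a plain text file standing in for MapDB. */
class Checkpoint {
    private final Path file;

    Checkpoint(Path file) {
        this.file = file;
    }

    /** Number of source files already converted and written; 0 if no checkpoint exists. */
    long load() throws IOException {
        if (!Files.exists(file)) {
            return 0L;
        }
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
        return text.isEmpty() ? 0L : Long.parseLong(text);
    }

    /**
     * Records the progress. The value goes to a temporary file first and is then
     * moved over the old checkpoint, which reduces the chance of ending up with
     * a half-written checkpoint after a crash.
     */
    void save(long filesDone) throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        Files.write(tmp, Long.toString(filesDone).getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

The intended use would be to call save() right after each batch of tokens has been written to the destination file, and on startup to skip as many source files as load() returns. Saving only after the write means a restart can at worst redo work, but should not skip files whose tokens never reached the disk.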
The tradeoff is memory usage. Some barriers were already in place to avoid processing large source code files, but let's see how this cache holds up.
You can find the relevant code snippet here.