As the year moved along, we kept reducing the disk space required for fingerprints and at the same time kept increasing the total number of fingerprints dispatched with each new edition.
Another multiplier for both speed and data size was compression. We ran tests to find a compression algorithm that would not need much CPU to decompress and would still reduce disk space. We settled on plain zip compression, since it consumed minimal CPU and gave a good-enough ratio of 5:1, meaning that something that used 5 GB before now used only 1 GB of disk space.
There is an added advantage to this technique besides disk space: we were now able to read the same data almost 5x faster than before. Where we previously needed to read 5 GB from the disk, we now read only 1 GB to access the same data (discounting CPU load). It then became possible to fit 1 TB of data inside a 240 GB drive, cutting the needed disk space by about 4x while increasing speed by 3x on the same data.
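The reasoning above can be sketched in a few lines of Python. This is only an illustration, not our production code: it uses `zlib` (the DEFLATE algorithm that zip is built on), and the sample records and the 500 MB/s disk throughput are made-up assumptions.

```python
import zlib

def compression_ratio(raw: bytes, level: int = 6) -> float:
    """Original size divided by compressed size (5.0 means a 5:1 ratio)."""
    return len(raw) / len(zlib.compress(raw, level))

def read_seconds(raw_bytes: float, ratio: float, disk_bytes_per_s: float) -> float:
    """Time to pull the data off the disk when it is stored compressed.

    Ignores the CPU time spent decompressing, like the estimate in the text.
    """
    return (raw_bytes / ratio) / disk_bytes_per_s

# Hypothetical sample: invented text lines standing in for fingerprint records.
sample = "".join(
    f"src/module_{i % 50}/file_{i}.java;license=apache-2.0\n" for i in range(20000)
).encode()

ratio = compression_ratio(sample)
# Round trip: decompressing gives back exactly the original bytes.
assert zlib.decompress(zlib.compress(sample)) == sample

GB = 10**9
DISK_SPEED = 500e6  # assumed throughput in bytes per second
plain = read_seconds(5 * GB, 1.0, DISK_SPEED)   # reading 5 GB uncompressed
packed = read_seconds(5 * GB, 5.0, DISK_SPEED)  # same data stored at 5:1

print(f"sample ratio: {ratio:.1f}:1")
print(f"uncompressed read: {plain:.1f}s, compressed read: {packed:.1f}s")
```

The ratio you get depends heavily on how repetitive the data is, so the sample ratio printed here says nothing about the 5:1 we measured on real fingerprints.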
All this leads to the question: how big was the triplecheck data last year?
These are the raw numbers:
source files: 519,276,706
artwork files: 157,988,763
other files: 326,038,826
total files: 1,003,304,295
snippet files: 149,843,377
snippet size real: 255 GB
snippet size compressed: 48 GB
One billion individual fingerprints for binary files were included. About 500 million (50%) of these fingerprints are source code files in 54 different programming languages. Around 15% are artwork, meaning icons and png or jpg files. The other files are the kind usually included with software projects, such as .txt documents.
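For anyone who wants to check the arithmetic, here is a small Python sketch that recomputes the total, the share of each file category and the snippet compression ratio from the raw numbers listed above. The variable names are just for illustration.

```python
# Raw file counts as listed above.
counts = {
    "source": 519_276_706,
    "artwork": 157_988_763,
    "other": 326_038_826,
}

total = sum(counts.values())
# The three categories add up to the reported total.
assert total == 1_003_304_295

# Share of each category in the full set of fingerprints.
shares = {kind: n / total for kind, n in counts.items()}
for kind, share in shares.items():
    print(f"{kind:8s} {share:6.1%}")

# Snippet data: 255 GB real size versus 48 GB compressed.
snippet_ratio = 255 / 48
print(f"snippet compression ratio: {snippet_ratio:.1f}:1")
```

The snippet ratio comes out slightly above the 5:1 quoted earlier, which fits the "good-enough" description.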
Over the year we kept adding snippet detection capabilities for mainstream programming languages. This means the majority of C-based dialects, Java, Python and PHP. On the portable offline edition we were unable to include the full C collection: it was simply too big and there wasn't much demand from customers to have it included (with only one notable customer exception across the year). In terms of qualified individual snippets, we are tracking a total of 700 million across 150 million source code files. A qualified snippet is one that contains enough valid logical instructions. We use a metric called "diversity", meaning that a snippet is only accepted when a given percentage of it consists of logical commands. For example, a long switch or if statement without other relevant code is simply ignored, because it is not typically relevant from an originality point of view.
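To make the idea of "diversity" more concrete, here is a minimal Python sketch. It is not the triplecheck implementation: the rule for what counts as scaffolding, the `diversity` and `is_qualified` names and the 0.5 threshold are all assumptions made up for this example.

```python
import re

# Hypothetical rule set: lines that are only control-flow scaffolding
# (case labels, bare if/else/switch headers, break, braces) do not count
# as logical commands.
_SCAFFOLD_PATTERNS = [
    r"case\b.*:",
    r"default\s*:",
    r"switch\s*\(.*\)\s*\{?",
    r"(?:else\s+)?if\s*\(.*\)\s*\{?",
    r"\}?\s*else\s*\{?",
    r"break\s*;",
    r"[{}]+",
]
SCAFFOLD = re.compile(r"^\s*(?:" + "|".join(_SCAFFOLD_PATTERNS) + r")\s*$")

def diversity(snippet: str) -> float:
    """Fraction of non-empty lines that are not pure scaffolding."""
    lines = [ln for ln in snippet.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    logical = sum(1 for ln in lines if not SCAFFOLD.match(ln))
    return logical / len(lines)

def is_qualified(snippet: str, threshold: float = 0.5) -> bool:
    """Accept a snippet only when enough of it is logical code (assumed threshold)."""
    return diversity(snippet) >= threshold

switch_only = """
switch (x) {
case 1:
    break;
case 2:
    break;
default:
    break;
}
"""

mixed = """
int total = 0;
for (int i = 0; i < n; i++) {
    total += values[i] * weights[i];
}
return total / n;
"""

print(diversity(switch_only), is_qualified(switch_only))
print(diversity(mixed), is_qualified(mixed))
```

The switch-only block is rejected because none of its lines carry logic of their own, while the small loop passes because only its closing brace counts as scaffolding.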
The body of data was built from relevant source code repositories available to the public and a selection of websites such as fora, mailing lists and social networks. We are picky about which files to include in the offline edition and only accept around 300 specific types of files. The raw data collected during 2015 went above 3 trillion binary files, and much effort went into iterating over this archive within weeks instead of months to build relevant fingerprint indexes.
For 2016 the challenge continues. There is a data explosion ongoing. We noticed a 200% growth between 2014 and 2015, although this might be because our own techniques for gathering data have improved and we are no longer limited by disk space as we were when we first started in 2014. More interesting is remembering that the NIST fingerprints index had a relevant compendium of 20 million fingerprints in 2011, and that we now need technology to handle 50x as much data.
So let's see. This year I think we'll be using the newer models with 512 GB. A big question mark is whether we can somehow squeeze out more performance by using the built-in GPU that you find on modern computers today. This is new territory for our context, though, and there is no certainty that moving data between disk, CPU and GPU will bring added performance or be worth the investment. The computation is already light as it is, and not particularly suited (IMHO) to GPU-type processing.
The other field to explore is image recognition. We have one of the biggest archives of miniature artwork (icons and such) that you would find applied in software. There are cases where the same icon is saved under different formats, and right now we are not detecting such cases. The second doubt is whether we should pursue this kind of detection because it is a necessary thing (I have no doubt it is a cool thing, though). What I'm sure of is that we have already doubled the archive compared to last year and that soon we'll be creating new fingerprint indexes. Then the optimization starts again to keep speed acceptable. Oh well, data everywhere. :-)