Big Data (at least for me)

5 thoughts
last posted Feb. 5, 2016, 11:51 p.m.


The example

Currently, I'm working with some collaborators to run published proteomics data through an analytical pipeline. The raw data consists of about 100 files, each 2-5 GB, which comes to (get the calculator) about 300 GB in total. Once it gets processed, it'll expand, and compress, and be refined into a golden nugget of about 5 GB. The problem here is that, if you are like several people I know, you don't have a huge cluster of servers at your disposal to store and process this much data. On top of that, some of the software used in this pipeline only runs on Windows, so there's that restriction.
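For the calculator step, here is a rough back-of-the-envelope sketch. It only uses the figures quoted above (about 100 files, 2-5 GB each, about 300 GB total); the variable names are just for illustration.

```python
# Back-of-the-envelope bounds for the raw data size.
n_files = 100               # "about 100 files"
min_gb, max_gb = 2, 5       # each file is 2-5 GB
quoted_total_gb = 300       # the "about 300 GB" estimate

low_gb = n_files * min_gb   # every file at the small end
high_gb = n_files * max_gb  # every file at the large end
implied_avg_gb = quoted_total_gb / n_files

print(f"possible range: {low_gb}-{high_gb} GB")
print(f"implied average file size: {implied_avg_gb} GB")
```

So the 300 GB figure sits inside the possible range and implies that a typical file is closer to 3 GB than to 5 GB.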

What we've done is break the data set into 3 chunks of approximately 100 GB each. Then, using three workstations of about the same power (quad- or 8-core, 16-32 GB RAM, lots of hard disk space), we set out to process them. The first application processes the data for about 3 days, and the next for about 1-2 days. Finally, a third application is required to do some very specific data processing to get us to the final condensed result set.
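One way to get chunks of roughly equal size is a simple greedy split: take the files largest first and always hand the next one to the chunk that is currently lightest. This is only a sketch of that idea, not necessarily how we actually divided things up; the function name, the file names, and the sizes below are all made up for illustration.

```python
import heapq


def split_into_chunks(file_sizes_gb, n_chunks=3):
    """Greedily assign files (largest first) to the currently lightest chunk.

    file_sizes_gb maps a file name to its size in GB.
    Returns (chunks, totals): a list of file-name lists and each chunk's total size.
    """
    # Min-heap of (current total size, chunk index); the lightest chunk pops first.
    heap = [(0.0, i) for i in range(n_chunks)]
    heapq.heapify(heap)
    chunks = [[] for _ in range(n_chunks)]

    by_size_desc = sorted(file_sizes_gb.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in by_size_desc:
        total, idx = heapq.heappop(heap)
        chunks[idx].append(name)
        heapq.heappush(heap, (total + size, idx))

    totals = [0.0] * n_chunks
    for total, idx in heap:
        totals[idx] = total
    return chunks, totals


# Hypothetical input: 100 files whose sizes cycle through 2, 3, 4, 3 GB.
example_sizes = {f"run_{i:03d}.raw": [2.0, 3.0, 4.0, 3.0][i % 4] for i in range(100)}
chunks, totals = split_into_chunks(example_sizes, n_chunks=3)
for i, (files, total) in enumerate(zip(chunks, totals), start=1):
    print(f"chunk {i}: {len(files)} files, {total:.0f} GB")
```

With this approach the heaviest and lightest chunks never differ by more than the size of the largest single file, which is plenty good enough when the machines are about the same power.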
