I've been thinking a lot about big data lately. I know the definition is somewhat fuzzy, but in my world, it's pretty straightforward: any chunk of data (be it a large scientific data set, a directory of videos or songs, or several large compressed server backups) that needs to be handled or transferred, and causes pain in the process, is considered "big."
Currently, I'm working with some collaborators on processing published proteomics data through an analytical pipeline. The raw data consists of about 100 files, each ranging from 2-5 GB. This is (get the calculator) about 300 GB of raw data. Once it gets processed, it'll expand, and compress, and be refined into a golden nugget of about 5 GB. The problem here is that, if you are like several people I know, you don't have a huge cluster of servers at your disposal to store and process these data. Not to mention, some of the software used in this pipeline only runs on Windows, so there's that restriction.
What we've done is break this data set into 3 chunks of approximately 100 GB each. Then, using three workstations of about the same power (quad- or 8-core, 16-32 GB RAM, lots of hard disk space), we set out to process them. The first application processes the data for about 3 days. The next takes about 1-2 days. Finally, a third application is required to do some very specific data processing to get us to the final condensed result set.
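For anyone curious how a split like this could be automated, here is a minimal sketch. To be clear, this is not the method we actually used, and the file names and sizes are made up for illustration (100 hypothetical files between 2 and 4 GB, totaling roughly 300 GB). It uses a simple greedy approach: sort the files from largest to smallest and always drop the next file into whichever chunk is currently lightest.

```python
import heapq

def split_into_chunks(file_sizes_gb, n_chunks=3):
    """Greedy largest-first split: each file goes to the currently lightest chunk.

    file_sizes_gb maps a file name to its size in GB.
    Returns (chunks, totals): a list of file-name lists and the GB total per chunk.
    """
    # Min-heap of (running total in GB, chunk index).
    heap = [(0.0, i) for i in range(n_chunks)]
    heapq.heapify(heap)
    chunks = [[] for _ in range(n_chunks)]

    # Placing the largest files first keeps the final totals close together.
    for name, size in sorted(file_sizes_gb.items(), key=lambda kv: kv[1], reverse=True):
        total, idx = heapq.heappop(heap)
        chunks[idx].append(name)
        heapq.heappush(heap, (total + size, idx))

    totals = [0.0] * n_chunks
    for total, idx in heap:
        totals[idx] = total
    return chunks, totals

# Hypothetical inventory: 100 files of 2, 3, or 4 GB (not the real data set).
sizes = {f"run_{i:03d}.raw": 2 + (i % 3) for i in range(100)}
chunks, totals = split_into_chunks(sizes, n_chunks=3)

for i, (files, total) in enumerate(zip(chunks, totals), start=1):
    print(f"workstation {i}: {len(files)} files, {total:.0f} GB")
```

The greedy approach does not guarantee a perfect split, but the heaviest and lightest chunks can never differ by more than the size of a single file, which is plenty good enough when each chunk runs for days anyway.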
That doesn't seem too bad, does it? Well, let's explore some of the pitfalls we experienced along the way.
First, the original data set was actually more than twice as large as the subset we decided to process. We took a subset because it still offered scientific intrigue, but was more manageable. These data were downloaded from the public repository onto one workstation. We actually attempted to process much of this data on the one server on which the data resided, but this failed. So then we had to transfer it to various shares and external drives, and then redistribute it to the final destinations for processing.
We kicked off these analyses on day 1. Day 2, one of the servers overheated and crashed. This particular machine had an 8-core AMD FX series processor and 32 GB RAM. The problem was, the first application pushed the CPU hard, and it got hot. I actually set up a household AC unit and two fans to help cool it down and circulate the air. This was unfortunate.
While this data set might not be considered "big" to a server farm responsible for thousands of users' data, for me, this is plenty big. My definition of big tends to be more like "some chunk that is hard to manage with the tools at hand, and is difficult and slow to transfer with the bandwidth available." Let's face it -- it's not trivial to just get what you need, when you need it. I pay my ISP a fee to handle the nominal flow of data in and out of my network. If I need more, I'm stuck. Likewise, my main machine is great for my day-to-day, but suffers when it processes something of this magnitude.
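To put a rough number on "slow to transfer," here is a back-of-the-envelope sketch. The link speeds are hypothetical (I'm not quoting my actual connection), and the math assumes decimal units (1 GB = 8,000 megabits) and a link that runs at full speed the whole time, which real transfers rarely manage.

```python
def transfer_hours(size_gb, link_mbps):
    """Best-case hours to move size_gb gigabytes over a link_mbps megabit/s link.

    Ignores protocol overhead, retries, and disk speed, so treat it as a floor.
    """
    megabits = size_gb * 8 * 1000   # GB -> megabits (decimal units)
    seconds = megabits / link_mbps
    return seconds / 3600

# Hypothetical link speeds, applied to the ~300 GB raw data set.
for mbps in (20, 100, 1000):
    print(f"{mbps:>5} Mbps: {transfer_hours(300, mbps):.1f} hours")
```

Even in the best case, the raw data is an hours-long transfer on anything short of a very fast link, and every extra hop (a share, an external drive, another workstation) pays that cost again.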