The scientific journal Nature recently had an issue devoted to “big data” in which it discussed the impact that Google-like thinking has had on science. Our ability to measure things and record them digitally has exploded in the last decade, and the idea of collecting lots of data systematically is a very powerful one. It has all kinds of cool and interesting implications.
First, and most importantly, I think that people are now seriously thinking about fully instrumenting natural and artificial phenomena of interest. Let’s collect everything we can, put it in a database, and then form hypotheses and test them later. This is a good idea–IF the data can be collected cheaply enough and with sufficient precision to allow some science to happen. The worst case scenario is probably that the data allows us to form hypotheses, but is insufficient to provide proof for these hypotheses. In that case, additional precise experiments may be required. The best case scenario, however, is that the data is sufficient to extract knowledge and “laws,” much as the data Tycho Brahe collected about planetary movements allowed the laws of planetary motion to be developed.
Second, as we collect “big data,” a market will form for powerful and general informatics methods for analyzing this data. Clustering and classification algorithms are obvious first ways to partition data. Pattern recognition methods are also likely to be useful. The scatter/gather class of algorithms, which allow parallel analysis of large data sets, is also emerging. Of course, there is always a place for domain-specific methods that take advantage of the known structure of the data, but we will almost always conduct initial explorations with general-purpose algorithms.
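To make the scatter/gather idea concrete, here is a minimal sketch in Python. It assumes a toy problem that I picked purely for illustration (the mean and variance of a non-empty list of numbers), and the function names `scatter`, `partial_stats`, `gather`, and `scatter_gather_stats` are hypothetical, not taken from any particular library. The data is split into chunks, each chunk is summarized independently, and the small summaries are combined at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter(data, n_chunks):
    """Split a list into at most n_chunks contiguous, roughly equal pieces."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_stats(chunk):
    """Per-chunk work: return (count, sum, sum of squares)."""
    return len(chunk), sum(chunk), sum(x * x for x in chunk)


def gather(partials):
    """Combine per-chunk summaries into a global mean and population variance.

    Assumes at least one data point overall (illustrative sketch only).
    """
    n = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    total_sq = sum(p[2] for p in partials)
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, variance


def scatter_gather_stats(data, n_workers=4):
    """Scatter the data, summarize the chunks in parallel, gather the results."""
    chunks = scatter(data, n_workers)
    # Threads keep the sketch simple; a real data-intensive job would more
    # likely use separate processes or separate machines for the workers.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_stats, chunks))
    return gather(partials)
```

The point of the design is that each worker returns only three numbers regardless of how large its chunk is, so the gather step stays cheap even when the scattered data is huge. Note that the sum-of-squares formula for the variance can lose precision on real data; it is used here only because it combines so easily across chunks.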
Third, data-intensive computing may drive the computing industry to provide new solutions. The computers we have now are not always optimized for disk-based analysis of data. They do great when everything is in RAM, but virtual memory solutions often break down in data-intensive applications. The rise of “big data” has made cluster computing attractive. It also may drive a new generation of hardware. I am not aware of any GPU (graphics processing unit) implementations of data-intensive codes, but I am very interested in trying to catalyze these. [GPUs have outpaced CPUs in computation speed and are now gaining attention in the scientific computing community. It turns out that the video game market created such great pressure on performance that GPUs evolved to be very powerful very quickly. Now, scientists, such as my colleague Vijay Pande, are showing that these can be used for serious computation]. Most GPU implementations have taken advantage of the GPU's speed (floating point operations per second, FLOPS), but GPUs have relatively low data capacity. Maybe someone can be clever and enable GPU-based computing
Finally, “big data” should provide us enough information to seriously consider building full computational models of many natural phenomena. In biology, for example, we have never been able to contemplate a full model of a cell or an organ because these models simply had more parameters than we could reasonably estimate. The availability of high-throughput data collection and storage technologies may lead to data stores sufficient to estimate these parameters, and create computational simulations of nature (or parts of nature) that are robust and can not only interpolate but also extrapolate.
So, lots of excitement. Also some pain–the next-generation DNA sequencing machines are producing so much data that many biological labs that thought their informatics algorithms and hardware/software platforms were “just fine” have discovered that they are woefully inadequate. The biologists who use these machines are not interested in innovating informatics methods (and certainly not in innovating hardware solutions)–they just want to do biology! However, some are being forced to pause and rethink their computational infrastructure.
As an informatician, I love big data. I have always told my students in bioinformatics that their best friend is a biologist with more than 7 data points. 7 petabytes is even better.