Obviously this is the way computing is going and it's going to bring a lot of benefits.


"The aim is to build a kind of cloud computing appliance. A rack of 64 19-inch boards would have over 16TB of memory and over a million cores."

One thing that everyone needs to remember is that not every interesting problem can be parallelised easily, or at all. Climate modelling is a difficult problem that, when broken into smaller and smaller pieces, requires more and more inter-process communication. That inter-process communication is what eventually swamps the computation, and it doesn't take very many processes before it does. It's not likely that present-day climate models could be run on tens or hundreds of thousands of processors and see a significant reduction in wall-clock time. Much of the CPU time would be wasted waiting for results from other parts of the model. Now, if everyone has millions of CPUs to throw at every problem, and if all of these cores don't cost us anything extra (buying the hardware, powering it, and cooling it), then maybe it makes sense to just use more cores. Embedding the CPUs in the memory is interesting. CPU-to-memory bandwidth is an important hardware performance limitation for the climate model we use at UVic. Maybe there is a significant improvement there.
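To make the "communication swamps computation" point concrete, here is a toy scaling model. It is only a sketch: the functional form (perfectly divisible work plus a communication term that grows like log2 of the process count) and the numbers `work` and `comm_cost` are assumptions chosen for illustration, not measurements from any real climate model.

```python
import math


def wall_clock(n_procs, work=1.0e6, comm_cost=100.0):
    """Toy wall-clock time for one run on n_procs processes.

    Assumptions (hypothetical, for illustration only):
    - the computation divides perfectly: work / n_procs
    - communication cost grows as comm_cost * log2(n_procs)
    """
    return work / n_procs + comm_cost * math.log2(n_procs)


if __name__ == "__main__":
    serial = wall_clock(1)
    for p in (1, 64, 1024, 8192, 100_000, 1_000_000):
        t = wall_clock(p)
        print(f"{p:>9} procs: time {t:12.1f}  speedup {serial / t:8.1f}")
```

In this toy model the run time falls quickly at first, bottoms out at a few thousand processes, and then rises again as the communication term takes over. A real model would have a different curve, but the shape of the problem is the same.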

I am not a computer scientist. I wonder if, by starting from the assumption that there will be an embarrassingly large number of CPUs available, the code in a climate model could be rewritten in such a way as to take better advantage of being split into many, many subdomains. That would take significant work for any of the climate modelling groups working now and would require some understanding of how results from the old and new versions of the model would be compared. Add in the fact that, even with a long mean time between failures for any one CPU, say several years, when you have millions of them there are going to be failures every few minutes! What do you do with a climate model that you hope will give bitwise identical results to a previous version, when sub-processes that the entire rest of the model may depend on fail at a rate of one every few minutes? How can you actually get any of the integration to complete?
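The "failures every few minutes" figure can be checked with some back-of-envelope arithmetic. This sketch assumes failures are independent and occur at a constant rate, and the 5-year per-CPU mean time between failures is just an example value standing in for "several years".

```python
import math

MINUTES_PER_YEAR = 365.25 * 24 * 60


def minutes_between_failures(n_cpus, mtbf_years):
    """Expected minutes between failures anywhere in the machine.

    Assumes independent CPUs failing at a constant rate, so the
    system failure rate is n_cpus times the per-CPU rate.
    """
    return mtbf_years * MINUTES_PER_YEAR / n_cpus


def prob_no_failure(run_minutes, n_cpus, mtbf_years):
    """Probability that a run of the given length sees no failure at all
    (exponential model, same assumptions as above)."""
    return math.exp(-run_minutes / minutes_between_failures(n_cpus, mtbf_years))


if __name__ == "__main__":
    gap = minutes_between_failures(1_000_000, 5)
    print(f"one failure roughly every {gap:.1f} minutes")
    print(f"chance a 1-hour run sees no failure: {prob_no_failure(60, 1_000_000, 5):.2e}")
```

Under these assumptions a million CPUs with a 5-year MTBF each lose one CPU every two to three minutes, and the chance of even a one-hour run finishing untouched is effectively zero.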

I think what will be required is a new kind of understanding of failure-tolerant modelling. Perhaps every part of the computation will have to be done redundantly on two or three processors. I'm imagining the entire model running on a million processors (say), but then also running on two other such hardware devices as well. Perhaps the operating system could manage that transparently so that the user doesn't have to plan for it. When two of three instances of a particular computation agree, the operating system accepts that result and allows that process to finish, passing on the results to the other processes that need them. However, will this kind of redundant computation add yet another layer of inter-process communication that will again swamp the potential benefits of more CPUs? Or perhaps we need to give up on getting bitwise identical results from different instances of the entire integration. If we go that route we'll have to do every model integration many times over, to build up a statistically meaningful ensemble that gives us assurance that our results reflect the true solution to the equations (and, one hopes, reality) represented by the model. We already build up ensembles in order to try to understand how models built in different ways compare to each other and to reality, so maybe this is the better route to go.
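The two-out-of-three idea can be sketched in a few lines. This is only an illustration of the voting step, not of how an operating system would actually do it; the function names `bit_pattern` and `vote` are made up here, and the sketch assumes each replica returns a single double-precision number that is compared bit for bit.

```python
import struct
from collections import Counter


def bit_pattern(x):
    """Raw 8-byte pattern of a double, so comparison is truly bitwise."""
    return struct.pack("<d", x)


def vote(replica_results, quorum=2):
    """Return the value that at least `quorum` replicas agree on bit for bit,
    or None if no such agreement exists (hypothetical helper)."""
    counts = Counter(bit_pattern(r) for r in replica_results)
    if not counts:
        return None
    pattern, count = counts.most_common(1)[0]
    if count >= quorum:
        return struct.unpack("<d", pattern)[0]
    return None


if __name__ == "__main__":
    print(vote([0.3, 0.3, 0.1 + 0.2]))  # two replicas agree exactly
    print(vote([1.0, 2.0, 3.0]))        # no agreement
```

Note that the voter has to collect a result from every replica before anything can be passed on, which is exactly the extra layer of communication worried about above.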

The future seems to hold the promise of vast networks of computation devices that can be harnessed in tandem to work on complex problems. It's going to take a lot of resources to move climate models into hardware like that, if it is possible at all.

[Edit -- It occurs to me that I didn't even mention Amdahl's Law, which has been around for a long time already and obviously needs to be considered as well. Not being a CS graduate I'm not sure if the points I've raised are really covered by this law already. I await enlightenment.]
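For reference, Amdahl's Law in its usual form says that if a fraction f of the work can be parallelised and the rest is strictly serial, the speedup on p processors is 1 / ((1 - f) + f / p). A small sketch, with the 99% parallel fraction chosen purely as an example:

```python
def amdahl_speedup(parallel_fraction, n_procs):
    """Speedup predicted by Amdahl's Law for a fixed problem size.

    parallel_fraction: share of the run time that parallelises perfectly (0..1)
    n_procs:           number of processors
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_procs)


if __name__ == "__main__":
    for p in (64, 1024, 1_000_000):
        print(f"{p:>9} procs: speedup {amdahl_speedup(0.99, p):6.1f}")
```

Even with 99% of the work parallel, the speedup can never exceed 100 however many cores are available. In this basic form the law only accounts for the serial fraction; communication that grows with the number of processes and hardware failures are extra costs on top of it.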


[Edit -- 2012-03-12 -- In an interview with Wired Magazine, George Dyson says ... "Vacuum tubes in the early machines had an extremely high failure rate, and von Neumann and Turing both spent a lot of time thinking about how to tolerate or even take advantage of that. If you had unreliable tubes, you couldn’t be sure you had the correct answer, so you had to run a problem at least twice to make sure you got the same result. Turing and von Neumann both believed the future belonged to nondeterministic computation and statistical, probabilistic codes. We’ve seen some success recently in using that kind of computation on very hard problems like language translation and facial recognition. We are now finally getting back to what was envisioned at the beginning."]
