Enhancing performance of computer thoroughbreds

Prof Jack Dongarra of the Innovative Computing Laboratory in the United States speaks to Karlin Lillington about the future of the supercomputing world.

Few people know the exact state of the supercomputing world as well as Prof Jack Dongarra, director of the Innovative Computing Laboratory at the University of Tennessee, who is working at University College Dublin this year as a Science Foundation Ireland Walton Visitor Award winner.

Along with two colleagues, he maintains a list of the top 500 high-performance systems in the world, benchmarked and recalculated every six months.

So he was well placed to take stock of where high-performance computing has been, and where it is going, during a lecture last week as part of the Royal Irish Academy's public lecture series.

High-performance systems, or supercomputers, are machines designed to accomplish heavyweight computing tasks at very high speeds, typically for scientific, military or industrial research and data-crunching.

These thoroughbreds of the computing world take performance as their distinguishing quality: they need to crunch through data at very high speed, which in computing terms means performing a phenomenal number of floating point (or mathematical) operations per second, measured as 'flops'.

Just how many? "Today, we have machines that can carry out trillions of operations per second," says Prof Dongarra, in his lecture entitled "Supercomputers and Clusters and Grids, Oh My!" - giving a nod toward The Wizard of Oz and our tendency to be somewhat intimidated by the supercomputing world.
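
To make that unit concrete, the short Python sketch below (assuming the NumPy library is available) estimates a sustained flop rate the way such figures are typically derived: time a piece of heavy arithmetic and divide the operation count by the elapsed time. It is only an illustration, not the benchmark actually used to rank the Top 500; a dense n-by-n matrix multiplication needs roughly 2 * n**3 floating point operations, which supplies the operation count.

import time
import numpy as np

# Illustrative only: estimate a sustained flop rate by timing a dense
# matrix multiplication. An n-by-n multiply needs roughly 2 * n**3
# floating point operations.
n = 2000                              # matrix size chosen for illustration
a = np.random.rand(n, n)
b = np.random.rand(n, n)

start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start

flops = 2 * n**3 / elapsed
print(f"Roughly {flops / 1e9:.1f} Gflops sustained on this machine")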

With so much focus in recent years on the swift improvements in memory, speed and performance of PCs, one might think that the development drive has shifted away from these big machines - which, increasingly, are not single computers but supercomputing systems formed by 'clusters' of machines.

But, as Prof Dongarra notes, over the past 10 years the power and range of the systems in the Top 500 list have actually increased at a greater rate than Moore's famous law, which predicts that computing power and capability will double every 18 months.

He offers a comparison: in 1993, the number one machine on the list of 500 could perform at the rate of 59.7 Gflops (a gigaflop is a billion flops), while the system at the 500th spot rolled in at 422 Mflops (megaflops, or millions of flops).

Today's front runner on the list comes in at a phenomenal 35.8 Tflops (teraflops, or trillions of flops), while the last place system clocks 403 Gflops.
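
A quick back-of-the-envelope calculation, sketched below in Python using only the figures quoted above, shows how far ahead of Moore's law that is: a doubling every 18 months would multiply performance by roughly 100 over ten years, while the measured jump from 59.7 Gflops to 35.8 Tflops is a factor of about 600.

# Compare the measured 1993-to-2003 growth of the number-one system with
# what a Moore's-law doubling every 18 months would have delivered.
top_1993 = 59.7e9       # 59.7 Gflops
top_2003 = 35.8e12      # 35.8 Tflops

actual_growth = top_2003 / top_1993        # roughly 600-fold
moore_growth = 2 ** (10 * 12 / 18)         # roughly 100-fold

print(f"Measured growth over ten years: about {actual_growth:.0f}x")
print(f"Moore's law over the same period: about {moore_growth:.0f}x")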

The front-running mammoth resides in Japan, was built by computer firm NEC, and takes up three floors of its own building (the top floor holds just the air-conditioning units to cool the system's 5,120 processors).

The computer's 'footprint' - or actual room it occupies - is equal to the area of four tennis courts.

Called the "Earth-Simulator", Prof Dongarra describes it with admirable understatement as "a very impressive computer". This "tour de force of engineering" costs $6 million (€4.8 million) a year simply to run.

A surprising shift in the computing landscape for Prof Dongarra has been the growing predominance on his list of clusters - single supercomputing systems assembled from a number of smaller, networked machines.

"I had this notion that clusters were bottom-feeders," he notes. "But now, they dominate the list." In 1999, almost no clusters appeared on the Top 500.But from that point on, their usage rises steeply when graphed, and now, 210 of the 500 list are clusters.

One reason for their popularity is that clustering together several smaller, less-expensive machines offers considerable savings over purchasing a full-fledged supercomputer, he says.

But a cluster, or its increasingly popular relative, a grid - a network of low-cost machines harnessed together, often across rooms or even geographies, to perform supercomputing tasks - cannot provide the same high level of performance as a supercomputer, Prof Dongarra says.

"When we start looking at how quickly we can move data in a machine, we start to see the differences," he notes.

He divides high-performance systems into those built for "capability" computing and those built as clusters. Capability machines are extremely high-performance, special-purpose supercomputers used for scientific computing, and they need to be able to push data through with little "latency", or delay.

Clusters cannot perform at such levels because of the delay in routing information through a system built of several machines. The overhead of managing the cluster slows down throughput, and he cites figures for the efficiency with which the different kinds of system perform calculations.

The NEC computer achieves 87.5 per cent efficiency, he says, whereas a huge cluster of Apple G5 computers recently assembled at Virginia Tech (which is placed number three on the list) operates at 60.9 per cent efficiency.
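
Those efficiency figures are sustained benchmark performance expressed as a fraction of a machine's theoretical peak. The small Python sketch below, using only the numbers quoted above, works backwards from the Earth Simulator's 35.8 Tflops and 87.5 per cent efficiency to its implied peak of roughly 41 Tflops.

# Efficiency here is sustained benchmark performance as a fraction of
# theoretical peak. Working backwards from the article's figures for the
# Earth Simulator implies a peak of roughly 41 Tflops.
sustained_tflops = 35.8
efficiency = 0.875

implied_peak_tflops = sustained_tflops / efficiency
print(f"Implied theoretical peak: about {implied_peak_tflops:.1f} Tflops")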

On the other hand, clusters are relatively cheap, and put a certain level of high-performance computing within the reach of a greater range of users, he says.

This is reflected in the fact that industry, as opposed to the scientific or military communities, is increasingly turning to high-performance systems.

He cites the example of Vodafone - the only Irish system on the list. American retailer Wal-mart also makes the list with a supercomputing system.

And a computing cluster that would certainly make the top 10, he says, doesn't appear on the list because his team can't run the benchmarking software to measure its performance. The system is that of search engine Google; the Google people said they'd love to be benchmarked but couldn't take the whole network offline to users in order to run the benchmark.

Given the speed at which supercomputing has been moving, Prof Dongarra predicts a supercomputer will hit a mind-boggling petaflop speed by 2009 - that's a thousand trillion floating point operations per second. Even a lowly laptop will likely be pushing through data at a teraflop by 2012, he says.
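
That petaflop prediction squares with a simple extrapolation of the list's own history. The Python sketch below, again using only the figures quoted earlier, works out the doubling time implied by the 1993-2003 growth and projects forward from 35.8 Tflops to a petaflop.

import math

# Extrapolate the 1993-2003 growth of the number-one system forward to
# estimate when a petaflop (10**15 flops) machine might appear.
growth_per_decade = 35.8e12 / 59.7e9                  # about 600-fold
doubling_time = 10 / math.log2(growth_per_decade)     # about 1.1 years

doublings_to_petaflop = math.log2(1e15 / 35.8e12)
years_needed = doublings_to_petaflop * doubling_time

print(f"Implied doubling time: about {doubling_time:.1f} years")
print(f"Petaflop reached about {years_needed:.1f} years after 2003, around 2008-2009")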

He expects the field to be pushed ahead by projects such as the one at the US Defense Advanced Research Projects Agency (DARPA), which is funding a supercomputing development competition to the tune of around $200 million over several years.

Grid projects are also underway around the world, which will further push the capability of these giant clusters of machines.

Still, he believes grids will be restricted by issues of scalability, reliability and security.

The challenge for high-performance computing actually comes from an unexpected place, he says: "The real crisis with high-performance computing is the software."

Software development and coding haven't changed much, while hardware capabilities have, he says, and for supercomputers to continue to roar ahead, programmers will need to find an entirely new way to think about software design for the systems.

Note: The Top 500 list can be found at www.top500.org
www.techno-culture.com
weblog: http://weblog.techno-culture.com