23 Years Of Supercomputer Innovation 1993-2016
There are hundreds of supercomputers in the world that help to drive
scientific discoveries, but only one of these systems can be the fastest
computer in the world. Competition between nations working to build the
best supercomputer has driven technological development and led to the
creation of ever-faster computer hardware.
In this article, we look at the fastest supercomputers in the world over
time.
Update, 6/29/16, 12:25pm PT: Added information on the Sunway TaihuLight,
which is now the undisputed fastest supercomputer in the world.
June 1993: CM-5/1024
The TOP500 ranking of supercomputers was first published in June 1993. At
that time, the most powerful computer in the world was a CM-5 manufactured
by Thinking Machines and located at Los Alamos National Laboratory, which
the University of California managed for the US Department of Energy.
The CM-5/1024 was composed of 1,024 SuperSPARC processors operating at 32
MHz. The theoretical computational power of this system was 131 GFlops,
but it reached less than half that (59.7 GFlops) under the LINPACK
benchmark used to determine the TOP500 rankings. The CM-5 also served
another purpose in 1993, when Steven Spielberg's production team chose it
to "embody" the brain of the control room in the film Jurassic Park (the
five black towers with red lights).
June 1994: XP/S 140
In June 1994, the CM-5 was dethroned by the Intel XP/S 140 Paragon.
This supercomputer, purchased by Sandia National Laboratories in New
Mexico, employed 3,680 Intel i860 XP processors. The i860 was one of the
few chips manufactured by Intel to implement a RISC instruction set. It
was innovative for its time, incorporating a 32-bit arithmetic unit and a
64-bit floating point unit (FPU). Each processor had access to 32 32-bit
registers, which could also be used as 16 64-bit registers or eight
128-bit registers. The instructions executable by the FPU also included
SIMD-type instructions that laid the foundation for the future MMX
instruction set used in Intel's Pentium line of products.
Each i860 XP processor, designed to run at 40 to 50 MHz, delivered a peak
of 0.05 GFlops of computational power. The theoretical power of the XP/S
140 was 184 GFlops, and in practice it reached 143.4 GFlops in Linpack.
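The relationship between the per-processor figure, the theoretical peak,
and the Linpack result can be checked with a little arithmetic. The
following is a minimal Python sketch, not anything taken from the TOP500
itself. The function names theoretical_peak_gflops and linpack_efficiency
are made up for illustration, and the inputs are the rounded figures quoted
in this article.

```python
def theoretical_peak_gflops(processors: int, gflops_per_processor: float) -> float:
    """Peak rate if every processor ran flat out at once (an upper bound)."""
    return processors * gflops_per_processor


def linpack_efficiency(linpack_gflops: float, peak_gflops: float) -> float:
    """Fraction of the theoretical peak actually reached under Linpack."""
    return linpack_gflops / peak_gflops


# Figures quoted in the text for the Intel XP/S 140 Paragon.
xps140_peak = theoretical_peak_gflops(3680, 0.05)
xps140_efficiency = linpack_efficiency(143.4, xps140_peak)

print(f"peak: {xps140_peak:.0f} GFlops")
print(f"efficiency: {xps140_efficiency:.1%}")
```

With these inputs the peak comes out at 184 GFlops, matching the figure
quoted for the XP/S 140, and the Linpack run reaches roughly 78 percent of
it.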
November 1994: Japan Takes The Win
In November 1994, Japan replaced the US on top of the TOP500 with the
Numerical Wind Tunnel, a supercomputer manufactured by Fujitsu for the
National Aerospace Laboratory of Japan.
This machine marked a change in tactics from the world's previously most
powerful supercomputers, in that it drew its power from only 140 vector
processors rather than scalar ones. Each processor was composed of 121
individual chips, arranged in an 11 x 11 matrix, and each chip had a
dedicated function. Each processor also contained four independent
pipelines, and was capable of handling two Multiply-Add instructions per
cycle. A single "processor" consumed 3,000 W and required water cooling.
Running at 105 MHz, these processors were particularly well suited to
simulating fluid flow. Each CPU delivered a theoretical 1.7 GFlops of
computational power. This added up to about 238 GFlops of theoretical
processing power, making the Numerical Wind Tunnel the first computer to
break the 200 GFlops bar in theory, although its performance in Linpack
was lower (124 GFlops, then 170 GFlops, and finally 192 GFlops).
June 1996: Hitachi Beats Fujitsu
In June 1996, Japan increased its standing in the TOP500 by introducing
the SR2201/1024. This supercomputer was built by Hitachi for the
University of Tokyo. This new system surpassed Fujitsu's Numerical Wind
Tunnel computer, giving Japan the top two spots of the TOP500, and
dropping the U.S. into third.
Unlike the Numerical Wind Tunnel, this system reverted to the use of
scalar processors, and utilized the HARP-1E CPU based on the PA-RISC 1.1
architecture. The SR2201/1024 contained a total of 1,024 of these CPUs
clocked at 150 MHz, each theoretically capable of 300 MFlops, giving the
SR2201/1024 a combined theoretical peak of about 307 GFlops. The HARP-1E
also introduced a mechanism called Pseudo Vector Processing to preload
data directly into CPU registers without going through the cache. Thanks
to this feature, among other things, the performance of the SR2201/1024
was exceptional for its time period. Under Linpack the SR2201/1024 reached
220.4 GFlops, 72% of its theoretical power.
June 1997: The Teraflop Threshold Is Vanquished
To take back technological leadership from Japan, the United States
launched the Accelerated Strategic Computing Initiative (or ASCI) in 1992.
The first successful project of this program was the development of ASCI
Red, a supercomputer built by Intel for Sandia National Laboratories, the
same facility which owned the Intel XP/S 140. ASCI Red impressed people
around the world, as it was the first computer in history to cross the
teraflop barrier.
With its 7,264 Pentium Pro processors operating at 200 MHz, it possessed a
theoretical 1.453 TFlops of computational power and generated 1.068 TFlops
under Linpack. ASCI Red was one of the first supercomputers to use
mass-produced components, and thanks to its modular and scalable
architecture, it stayed listed in the TOP500 for 8 years.
June 1998: ASCI Red 1.1
In June 1998, ASCI Red was expanded to incorporate an additional 1888
Pentium Pro processors. Although it took the lead on the TOP500 in 1997,
at that time it was only 75 percent complete. Now finished, with 9152
Pentium Pro CPUs clocked at 200 MHz, the system was theoretically capable
of 1830 GFlops, and managed to reach 1338 GFlops under Linpack.
June 1999: ASCI Red 2.0
In 1999, Intel updated the ASCI Red by replacing the older Pentium Pro
processors with Pentium II OverDrive CPUs, which used the Socket 8
interface. In addition to a refined architecture and a higher clock speed
(333 MHz on the Pentium II OverDrive vs 200 MHz on the Pentium Pro), Intel
took this opportunity to increase the number of CPUs from 9,152 to 9,472.
These improvements multiplied the theoretical computational power of the
ASCI Red by a factor of 1.7, pushing it past 3.1 TFlops. In practice, the
system achieved only about two-thirds of its theoretical performance,
topping out at 2.121 TFlops.
June 2000: ASCI Red 2.1
After its rise to the top, the ASCI Red would stay on top of the TOP500
for another three years. Eventually the system would see another increase
in processor count, climbing to a total of 9,632 processors. Its
theoretical performance would top out at 3.207 TFlops, and under the
Linpack test it would ultimately achieve 2.379 TFlops of computational
power. In its final configuration, the ASCI Red occupied an area of 230
square meters and consumed 850 kilowatts of power, not including the
energy required for cooling. The ASCI Red would remain in operation and on
the TOP500 as one of the world's fastest supercomputers until it was
retired in 2005, then decommissioned in 2006.
June 2001: ASCI White
Eventually the ASCI Red was dethroned by a supercomputer specifically
designed to replace it: the ASCI White. This new supercomputer was
installed at Lawrence Livermore National Laboratory. The system became
operational at half strength in November 2000, and was completed in June
2001.
Unlike the ASCI Red, which was built by Intel, the ASCI White was IBM's
chance to shine. ASCI White derived its power from 8,192 IBM Power3
processors clocked at 375 MHz. It represented a new trend among
supercomputers by adopting a cluster architecture: a collection of
individual nodes connected together to work as a single system. Today,
clustering is used by 85 percent of the supercomputers listed on the
TOP500.
ASCI White comprised 512 RS/6000 SP servers, each containing 16 CPUs.
Each CPU was capable of 1.5 GFlops of processing power, which made ASCI
White theoretically capable of reaching 12.3 TFlops. Its real-world
performance was considerably lower, only reaching 7.2 TFlops under Linpack
(7.3 TFlops from 2003).
ASCI White required 3,000 kW of power to operate, with an additional 3,000
kW consumed by the cooling system.
June 2002: Earth Simulator
In June 2002, the TOP500 was shaken up by the Earth Simulator. Built at
the Earth Simulator Center in Yokohama, the Earth Simulator ran circles
around ASCI Red and ASCI White. The system managed to achieve 87.5 percent
of its theoretical performance, landing at 35.86 TFlops in Linpack,
roughly five times more than ASCI White was capable of. The Earth
Simulator, dedicated to climate simulations, was constructed from
specially designed NEC processors, each containing a 4-way superscalar
unit and a vector unit. The system components were clocked at either 500
MHz or 1 GHz. Each CPU was capable of 8 GFlops of theoretical processing
power and consumed 140 W. The Earth Simulator was organized into 640 nodes
with 8 processors each, with each node consuming 10 kilowatts of power.
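The Earth Simulator figures can be cross-checked against each other. The
sketch below (Python, with constant and variable names chosen purely for
illustration) multiplies out the node and processor counts to get a
theoretical peak, then compares the quoted Linpack result against it. It
also totals the CPU power per node; that says nothing about what else in a
node draws power.

```python
# Figures quoted in the text for the Earth Simulator.
NODES = 640
CPUS_PER_NODE = 8
GFLOPS_PER_CPU = 8.0
WATTS_PER_CPU = 140.0
NODE_POWER_WATTS = 10_000.0
LINPACK_TFLOPS = 35.86

total_cpus = NODES * CPUS_PER_NODE
peak_tflops = total_cpus * GFLOPS_PER_CPU / 1000.0  # 1 TFlops = 1000 GFlops
efficiency = LINPACK_TFLOPS / peak_tflops

# Share of each node's power budget drawn by the eight CPUs themselves.
cpu_share_of_node_power = CPUS_PER_NODE * WATTS_PER_CPU / NODE_POWER_WATTS

print(f"{total_cpus} CPUs, peak {peak_tflops:.2f} TFlops")
print(f"Linpack efficiency: {efficiency:.1%}")
print(f"CPU share of node power: {cpu_share_of_node_power:.1%}")
```

This gives 5,120 CPUs and a peak of 40.96 TFlops, so 35.86 TFlops is about
87.5 percent of peak, consistent with the percentage quoted for the
system. The eight CPUs account for only about 11 percent of a node's 10
kilowatts.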
June 2003: ASCI Q And Alpha EV6
Earth Simulator was so far ahead that it continued to lead the TOP500
until June 2004. Meanwhile, competitors continued to fight for the number
two spot in the list. In June 2003, the number two spot belonged to the
ASCI Q. This system was built by HP at the Los Alamos National Laboratory.
Plans for the ASCI Q originally included three segments, each containing
1,024 HP AlphaServer SC45 servers. The TOP500, however, only shows the
machine with two segments. Each server contained four Alpha 21264
processors clocked at 1.25 GHz. The total theoretical capacity of the
system was 20.5 TFlops, which resulted in 13.9 TFlops in Linpack.
The Intruder: System X, AKA Big Mac
During the summer of 2003, Virginia Tech decided to build a "low-price"
supercomputer from consumer machines. System X (or Big Mac as it was
called) was composed of 1,100 Apple Power Mac G5 systems, each equipped
with two PowerPC 970 CPUs clocked at 2.3 GHz, working as a single system.
The construction of the Big Mac took only three months and cost 5.2
million dollars, significantly less than the 400 million dollar Earth
Simulator. In November 2003, Big Mac was ranked as the third fastest
supercomputer on the TOP500, with 10.3 TFlops of processing power
demonstrated on Linpack. The Big Mac was updated in 2004 by replacing its
Power Macs with Xserve servers, which boosted its processing power to
12.25 TFlops.
November 2004: Blue Gene/L
In September 2004, the Earth Simulator was finally defeated by IBM's
BlueGene/L. It reached 36 TFlops while still under construction. When it
was completed in November 2004, it amounted to 70.7 TFlops of processing
power, twice that of the Earth Simulator. In June 2005, the BlueGene/L was
extended and reached an exceptional 136.8 TFlops in Linpack, almost four
times more than the Earth Simulator. The BlueGene/L was thus the first
supercomputer to pass the 100 TFlops bar.
To achieve this record, IBM employed 65,536 PowerPC 440 processors clocked
at 700 MHz. These processors were not considered particularly powerful,
but they were compact and consumed relatively little power, which allowed
IBM to install two of them together on a small card that plugged into a
motherboard inside a rack. The BlueGene/L was shown to have excellent
efficiency, reaching 75 percent of its theoretical power under Linpack.
June 2006: BlueGene/L 2.0
In late 2005, the Blue Gene/L at Lawrence Livermore National Laboratory
doubled its number of processors to 131,072. As a result the BlueGene/L
2.0 easily held the number one spot in the TOP500, recording 280.6 TFlops
of performance under Linpack. Thanks to IBM's use of small
energy-efficient chips, this configuration of the BlueGene/L consumed only
1.2 MW of power.
At the time, the BlueGene/L was the only supercomputer to exceed 100
TFlops, with the runner-up on the TOP500 peaking at 91.3 TFlops. Note that
also in June 2006, the French Tera-10 supercomputer ranked 6th with 42.9
TFlops in Linpack.
June 2007: Jaguar
The Blue Gene/L stayed on top as the fastest supercomputer for another two
years. Although no other system could match its performance, other
supercomputers did edge closer and managed to pass the 100 TFlops mark. In
June 2007, both the Jaguar (No. 2) and the Red Storm (No. 3) surpassed the
100 TFlops mark. The Jaguar, which had been constantly updated since 2005,
was composed of Cray XT3 and XT4 servers. It marked the entry of AMD into
the big leagues, as these systems used dual-core 2.6 GHz Opteron
processors. In total, at the time Jaguar contained 23,016 cores and
reached 101.7 TFlops in Linpack.
June 2008: Roadrunner
Beep! Beep! In June 2008, IBM succeeded the BlueGene/L with the IBM
Roadrunner. The supercomputer did its nickname justice, as it was the
first supercomputer in history to exceed the petaflop threshold. It was
also a technological breakthrough as the first hybrid supercomputer,
taking advantage of two significantly different processor architectures.
The Roadrunner contained a total of 122,400 cores split between IBM and
AMD processors. The 6,562 AMD64 Opteron dual-core processors operated at
1.8 GHz and were capable of handling traditional x86 software. Each
Opteron core was paired with one PowerXCell 8i processor clocked at 3.2
GHz, which was composed of one PPE and eight SPEs. These IBM processors
are related to the ones used inside the Xbox 360 and PlayStation 3. In
this configuration the PowerXCell 8i processors were used as coprocessors,
which the Opteron cores could leverage for additional processing power.
The cumulative theoretical power of the Roadrunner was 1.38 PFlops. Its
performance under Linpack reached 1.03 PFlops, placing it on top of the
TOP500.
One of the advantages of the hybrid architecture was greater energy
efficiency. Roadrunner consumed only 2.35 MW of power, and thus was
capable of 437 MFlops/W. The system weighed 227 tons and occupied an
area of 483 square meters in the Los Alamos laboratory.
June 2009: Roadrunner
Just like ASCI Red and BlueGene/L before it, the Roadrunner retained
leadership of the TOP500 for several months, and experienced updates to
push its computational power higher. In November 2008, the total number of
computing cores increased to 129,600, and performance under Linpack jumped
to 1.1 PFlops.
This slight increase was just enough for Roadrunner to remain the
fastest supercomputer in the world. The runner-up, the updated Jaguar
using Cray XT5 servers in place of the older XT3 and XT4 systems, had
achieved 1.059 PFlops under Linpack. Jaguar and Roadrunner were the only
two supercomputers with computational power in excess of one petaflop.
June 2010: Jaguar 3.0
In November 2009, Jaguar finally managed to dislodge the Roadrunner and
become the fastest supercomputer in the world. It was composed of two
"partitions" of Cray servers. The old section consisted of 7,832 Cray XT4
servers, each containing a quad-core Opteron 1354 Budapest processor
clocked at 2.1 GHz. The new section was made up of 18,868 Cray XT5
servers, each containing two six-core Opteron 2435 Istanbul processors
clocked at 2.6 GHz.
Its theoretical power was estimated at 2.33 PFlops, and it achieved 1.76
PFlops under Linpack. Unlike the Roadrunner, Jaguar was not particularly
energy efficient, and it consumed about 7 MW of power (253 MFlops/W).
2010: China Enters The Race With GPU Power
In 2010, China entered the race with two supercomputers competing to be
the fastest in the world. In June 2010, the Nebulae had the highest
theoretical power of the TOP500 supercomputers, estimated at 2.98 PFlops,
but its real-world performance under Linpack remained below that of the
Jaguar. Then, in November 2010, the Tianhe-1A displaced both the Jaguar
and Nebulae, taking the lead in both theoretical power and Linpack
performance.
This system was theoretically capable of 4.7 PFlops, but only reached 2.57
PFlops under Linpack.
Both the Tianhe-1A and Nebulae drew much of their processing power from
the use of GPUs for general-purpose processing. Similar to Roadrunner,
these systems are considered to be hybrid supercomputers, as they combined
x86 Intel Xeon X5600 processors (X5650 in Nebulae, X5670 in Tianhe-1A)
with NVIDIA Tesla GPUs (C2050 for Nebulae, M2050 for Tianhe-1A). This
brought widespread recognition to GPGPU computing.
As a result of this hybrid configuration, these Chinese supercomputers
displayed excellent efficiency. The Tianhe-1A consumed only 4 MW, and thus
achieved 640 MFlops of performance per watt.
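The MFlops-per-watt figures quoted for Roadrunner, Jaguar, and the
Tianhe-1A all come from the same division: Linpack performance over power
draw. Here is a short Python sketch of that calculation. The helper name
mflops_per_watt is hypothetical, and the inputs are the rounded values
given in this article, so the results land within a few MFlops/W of the
quoted numbers rather than matching them exactly.

```python
def mflops_per_watt(linpack_pflops: float, power_megawatts: float) -> float:
    """Linpack performance divided by power draw, in MFlops per watt."""
    mflops = linpack_pflops * 1e9   # 1 PFlops = 1e9 MFlops
    watts = power_megawatts * 1e6   # 1 MW = 1e6 W
    return mflops / watts


# (Linpack PFlops, power in MW) as quoted in this article; rounded values.
systems = {
    "Roadrunner": (1.03, 2.35),
    "Jaguar": (1.76, 7.0),
    "Tianhe-1A": (2.57, 4.0),
}

efficiency = {name: mflops_per_watt(*figures) for name, figures in systems.items()}

# List the systems from most to least energy efficient.
for name, value in sorted(efficiency.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {value:.0f} MFlops/W")
```

With the rounded inputs this yields roughly 643 MFlops/W for the
Tianhe-1A, 438 for Roadrunner, and 251 for Jaguar, compared with the 640,
437, and 253 quoted in the text.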
June 2011: K Computer
In June 2011, Japan took over the performance crown with the Fujitsu K
Computer installed at the RIKEN Advanced Institute for Computational
Science.
The Fujitsu K Computer is one of the few machines to demonstrate
real-world performance relatively close to its theoretical power. The
system was composed of 68,544 SPARC64 VIIIfx octa-core processors, adding
up to a total of 548,352 cores. Unlike the Tianhe-1A, it did not rely on
GPUs for GPGPU. It was capable of 8.16 PFlops under Linpack.
Although the K Computer was considerably faster than the Tianhe-1A, it
also consumed significantly more power, 9,899 kW compared to Tianhe-1A's
4,000 kW. Per watt, however, it still came out ahead: 8.16 PFlops at 9,899
kW works out to roughly 824 MFlops per watt, compared to the Tianhe-1A's
640. Absolute consumption kept climbing when Fujitsu added more cores,
which propelled the K Computer to 705,024 cores with a power consumption
of over 12,650 kW.
June 2011 marked another significant event in the TOP500, as for the first
time, the top ten supercomputers in the world possessed computational
power in excess of one petaflop.
June 2012: Sequoia BlueGene/Q
In June 2012, the Sequoia BlueGene/Q became the first supercomputer to
surpass 1.5 million cores. Despite having more than twice the number of
cores of the K Computer, it consumed considerably less power (7,890 kW
versus 12,650 kW).
The system was composed of 16-core PowerPC processors clocked at 1.6 GHz,
and was the first supercomputer to exceed 20 PFlops of theoretical
computational power. In practice, the system achieved 16 PFlops. The
machine was installed in a national laboratory belonging to the US
Department of Energy. It is also important because it marked the return of
the United States to the top of the TOP500 list.
November 2012: Cray XK7 (Titan)
In November 2012, IBM was again beaten, this time by the Cray XK7-based
Titan. This system contained almost 300,000 Opteron 6274 CPU cores and
more than 260,000 NVIDIA K20x GPU cores. It marked the first time AMD
processors were used in the world's fastest supercomputer since the Jaguar
3.0 supercomputer that dominated the list in June 2010.
Its theoretical processing power did not surpass the BlueGene/Q, but its
practical performance was rated at 17.6 PFlops, edging out the BlueGene/Q.
It consumed roughly 8,209 kW of power. It was installed in the US
Department of Energy's Oak Ridge National Laboratory.
Another significant event in the top 10 supercomputers in November 2012
was the entry of the Xeon Phi.
June 2013: Tianhe-2 (MilkyWay-2)
In June 2013, China took back the lead with a supercomputer that broke
several records. The Tianhe-2 exceeded 50 PFlops of theoretical
computational power (54.9 PFlops). It also exceeded 33 PFlops of
real-world performance under Linpack, nearly double what the second-place
Cray XK7 was capable of.
To achieve this performance, Tianhe-2 used approximately 3.12 million
cores, breaking the record for the most cores in a supercomputer. It also
proved to be the most power-hungry supercomputer, consuming 17,808 kW.
Tianhe-2 is installed in the National University of Defense Technology.
The system was a surprise to everyone when it began operations two years
early. Each node in the Tianhe-2 is composed of two 12-core Xeon E5-2692
processors clocked at 2.2 GHz, and three Xeon Phi 31S1P compute cards
which deliver the majority of the performance. It remained the world's
fastest supercomputer until June 2016.
June 2016: Sunway TaihuLight
In June 2016, the Tianhe-2 was overtaken by China’s new Sunway TaihuLight
as the world’s fastest supercomputer. Following the Tianhe-2’s
development, the U.S. government restricted the sale of server-grade Intel
processors in China in an attempt to give the U.S. time to build a new
supercomputer capable of surpassing the Tianhe-2. As a result, China was
unable to obtain significant numbers of Intel processors to upgrade the
Tianhe-2 or build a successor, so instead the TaihuLight uses ShenWei RISC
CPUs developed by the National Research Center of Parallel Computer
Engineering and Technology (NRCPC) organization in China.
The TaihuLight contains 40,960 ShenWei SW26010 processors, one inside of
each supercomputer node. Each SW26010 contains 260 cores, which results in
a total of 10,649,600 cores in the TaihuLight. The supercomputer has a
peak theoretical processing power of approximately 125 petaflops, and it
scores 93 petaflops under Linpack, making it roughly three times faster
than the Tianhe-2. It is also incredibly efficient compared to the
Tianhe-2, as it consumes just 15.3 megawatts of power, a full 2.5
megawatts less than the Tianhe-2 while performing three times the work.
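The comparisons in this section follow directly from the quoted figures.
The Python sketch below recomputes them; the constant names are invented
for illustration, and the Tianhe-2 Linpack value of 33 PFlops is a lower
bound, since the article only says the system "exceeded" that number.

```python
# Figures quoted in the text.
TAIHULIGHT_PROCESSORS = 40_960
CORES_PER_SW26010 = 260
TAIHULIGHT_PEAK_PFLOPS = 125.0     # "approximately 125 petaflops"
TAIHULIGHT_LINPACK_PFLOPS = 93.0
TAIHULIGHT_POWER_MW = 15.3

TIANHE2_LINPACK_PFLOPS = 33.0      # lower bound: the text says "exceeded 33"
TIANHE2_POWER_MW = 17.808

total_cores = TAIHULIGHT_PROCESSORS * CORES_PER_SW26010
speedup = TAIHULIGHT_LINPACK_PFLOPS / TIANHE2_LINPACK_PFLOPS
power_saving_mw = TIANHE2_POWER_MW - TAIHULIGHT_POWER_MW
linpack_efficiency = TAIHULIGHT_LINPACK_PFLOPS / TAIHULIGHT_PEAK_PFLOPS

print(f"total cores: {total_cores:,}")
print(f"Linpack speedup over Tianhe-2: {speedup:.1f}x")
print(f"power saving: {power_saving_mw:.1f} MW")
print(f"Linpack efficiency: {linpack_efficiency:.0%}")
```

The core count comes out at exactly 10,649,600. The Linpack ratio is about
2.8, which the text rounds to "roughly three times", the power gap is about
2.5 MW, and the Linpack score is roughly 74 percent of the theoretical
peak.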
June 2018
At 200 petaflops, the US once again owns the world's fastest supercomputer
Summit is powered by more than 27,000 Nvidia GPUs
By Shawn Knight on June 8, 2018, 4:00 PM
Fun fact: At 200 petaflops, if every person on Earth completed one
calculation per second, it would take one year to do what Summit can do in
one second.
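The fun fact is easy to sanity-check. The Python sketch below assumes a
world population of about 7.6 billion, a rough 2018 estimate that does not
come from the article, and works out how long humanity would need to match
one second of Summit's quoted 200 petaflops.

```python
SUMMIT_FLOPS = 200e15          # 200 petaflops, as quoted
WORLD_POPULATION = 7.6e9       # assumption: rough 2018 estimate, not from the article
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

# Each person does one calculation per second, so humanity as a whole
# completes WORLD_POPULATION calculations per second.
seconds_needed = SUMMIT_FLOPS / WORLD_POPULATION
years_needed = seconds_needed / SECONDS_PER_YEAR

print(f"{years_needed:.2f} years")
```

Under that population assumption the answer is about 0.83 years, or
roughly ten months, which is in line with the "one year" in the fun fact.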
The US will soon lay claim to ownership of the world’s fastest
supercomputer, a title it was stripped of by China in 2013.
On Friday, the world’s fastest supercomputer – Summit – made its debut at
the Oak Ridge National Laboratory in Oak Ridge, Tennessee. The monster
machine packs a staggering 27,648 Volta Tensor Core GPUs and 9,216 CPUs
into 5,600 square feet of cabinet space that’s similar in size to two
tennis courts. It weighs nearly as much as a commercial jetliner and is
connected by 185 miles of fiber optic cables.
Oh, and it’s fast. Very fast.
According to Nvidia, the machine can perform 200 quadrillion
floating-point operations per second (FLOPS). By comparison, China’s
Sunway TaihuLight – officially the world’s fastest supercomputer – has a
benchmark rating of 93 petaflops.
The machine, built for the US Department of Energy, will assist scientists
with research in the fields of materials discovery, high-energy physics,
healthcare and more.
Nvidia CEO Jensen Huang described Summit as the world’s largest AI
supercomputer, a machine that learns. “Its software will write software,
amazing software that no human can write.”