





## Aim High To Petascale... and Beyond!

Stephen Pawlowski, Intel Senior Fellow, CTO, GM Architecture & Planning, Digital Enterprise Group

> Intel Developer FORUM

#### **Risk Factors**

Today's presentations contain forward-looking statements. All statements made that are not historical facts are subject to a number of risks and uncertainties, and actual results may differ materially. Please refer to our most recent Earnings Release and our most recent Form 10-Q or 10-K filing available on our website for more information on the risk factors that could cause actual results to differ.





#### **Definitions**

High Performance Computing (HPC) - A collection of hardware systems, software tools, languages and generic programming approaches which make previously unfeasible applications possible and which is available at an appropriate price.

**Peta Scale Computing –** "(the) wide spread use of systems that deliver sustained applications performance a level above a PFlop/s." - Horst D Simon, LBNL 8/24/2006





### Supercomputing Datacenters Today

Supercomputers Solve Complex Problems

- 60% of US profits are in financial services
- As many as ten top tier firms with IT Budgets > \$2B/year
- Performance Needs:
  - 60,000 Transactions Per Second
  - 30ms average latency
  - 32GB real time database
- Good Scale out Application
  - Application is amenable to both data and thread parallelization
  - Requires Double Precision FP, high memory BW, ...



Sample Application: Collateralized Debt Obligation (CDO) Monte Carlo Simulation

400,000 underlying assets (400 branches, ~1,000 loans each)

Monte Carlo Simulation considers external factors and their interrelationships

400B scenarios (simplified).

- For each asset there are 1M paths: 1,000 prepay \* 1,000 loss
- Future goal: compare a CDO to another CDO deal (CDO<sup>2</sup>), 160T scenarios (400B<sup>2</sup>). Considering CDO<sup>3</sup>
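The 400B figure follows directly from the counts on the slide: assets multiplied by paths per asset. A minimal sketch of that arithmetic, with function and variable names chosen here for illustration only:

```python
# Counts taken from the slide; the names are illustrative, not from the talk.
BRANCHES = 400
LOANS_PER_BRANCH = 1_000   # "~1,000 loans each"
PREPAY_PATHS = 1_000
LOSS_PATHS = 1_000


def cdo_scenarios(branches, loans_per_branch, prepay_paths, loss_paths):
    """Return the number of (asset, path) scenarios for one CDO."""
    assets = branches * loans_per_branch          # 400,000 underlying assets
    paths_per_asset = prepay_paths * loss_paths   # 1M paths per asset
    return assets * paths_per_asset


total = cdo_scenarios(BRANCHES, LOANS_PER_BRANCH, PREPAY_PATHS, LOSS_PATHS)
print(f"{total:,} scenarios")  # 400,000,000,000 scenarios
```

Because every scenario is independent, the work splits cleanly across cores and nodes, which is what makes this a good scale-out application.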

Wall Street: Building 2-4 Data Centers Per Company by 2010




#### Yesterday, Today and Tomorrow in HPC

ENIAC 20 Numbers in Main Memory





ASCI Red
(world's fastest on top500 until 2000)
First Teraflop Computer,
9298 Intel Pentium® II Xeon Processors

CDC 6600 - First successful Supercomputer 9MFlops





Intel ENDEAVOR

464 Intel® Xeon® Processors 5100 series, 6.85 Teraflop MP Linpack, #68 on top500

~2008 Beyond







PetaScale Platforms



Yesterday's Supercomputing is Today's Personal Computing













### ... together provide a powerful environment to rapidly develop solutions for computationally challenging problems

- Tyan\* Personal Supercomputer
- Intel Quad-Core Xeon Processor
- Mellanox\* High Performance InfiniBand Interconnect
- Microsoft\* Windows Compute Cluster Server
- Wolfram\* gridMathematica Supercomputing Environment





#### Performance Results

















#### Performance


















2008: Peak and Linpack PetaFlop — 2011: Sustained

Source: HPC - www.top500.org, June 2006; PC - Intel



# Let's Talk About ... Challenges and Outlook in Building a Petascale Machine







#### **Processor Performance**



Reaching Petascale with ~100,000 Processors in 2010\*





#### **Multi-threaded Cores**

All Large Core

Mixed Large and Small Core

All Small Core

Energy Efficient Petascale with Multi-threaded Cores


#### Increasing Throughput through Parallelism

Amdahl's Law: Parallel Speedup = 1/(Serial% + (1-Serial%)/N\*)



Single Core Performance
Relative Performance



System Performance



\* N = number of cores
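Amdahl's Law as stated above can be evaluated directly. The sketch below is illustrative only: the function name and the sample serial fractions are assumptions, not values from the talk.

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Parallel speedup = 1 / (serial + (1 - serial) / N)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)


# Hypothetical serial fractions, chosen only to show the trend.
for serial in (0.0, 0.01, 0.05, 0.10):
    row = ", ".join(
        f"N={n}: {amdahl_speedup(serial, n):7.1f}x" for n in (1, 8, 64, 1024)
    )
    print(f"serial={serial:4.0%}  {row}")
```

With a 10% serial portion the speedup can never exceed 10x, however many cores are added, which is why the serial fraction matters more as core counts grow.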

#### Inter-chip Interconnect Challenges: Bandwidth, Link Bandwidth and Power





#### **Energy Iso-Bandwidth**



#### **Interconnect Area Iso-Bandwidth**







#### **Increasing Processor Performance**

Through Multi-threaded Cores
Flops





Reaching Petascale with ~5,000 Processors
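The two processor counts quoted in the talk (~100,000 and ~5,000) imply very different per-processor throughput. A rough sketch of that arithmetic, assuming an even split of 1 PFlop/s across processors and ignoring any gap between peak and sustained performance:

```python
PETAFLOP = 1e15  # floating-point operations per second


def flops_per_processor(target_flops, n_processors):
    """Throughput each processor must deliver under an even split."""
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    return target_flops / n_processors


for n in (100_000, 5_000):
    gflops = flops_per_processor(PETAFLOP, n) / 1e9
    print(f"{n:>7,} processors -> {gflops:,.0f} GFlop/s each")
```

Going from ~100,000 to ~5,000 processors therefore requires roughly 20x more throughput per processor (about 10 versus 200 GFlop/s under these assumptions).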



#### Inter-Chip Interconnect Performance Today



#### In the Box

#### Copper-based links:

- FB DIMM 4GB/s per DIMM channel
- PCI Express, gen. 1 2.5Gb/s
- PCI Express, gen. 2 5Gb/s
- Intel Front-side Bus 17GB/s





#### Out of the Box

#### Copper-based links:

- InfiniBand DDR x4 20 Gb/s
- 1G Ethernet 1Gb/s

#### Optical-based links:

- OC-192 (long haul optical) 10Gb/s
- 10G Ethernet 10Gb/s
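The lists above mix bytes per second (GB/s) and bits per second (Gb/s), which makes the links hard to compare at a glance. The sketch below normalizes the quoted figures to GB/s. It assumes 8 bits per byte and ignores encoding and protocol overhead, so the results are raw line rates rather than usable throughput; the dictionary and function names are illustrative.

```python
# Raw link rates as quoted on the slides: (value, unit).
LINKS = {
    "FB DIMM (per DIMM channel)": (4.0, "GB/s"),
    "PCI Express gen. 1": (2.5, "Gb/s"),
    "PCI Express gen. 2": (5.0, "Gb/s"),
    "Intel Front-side Bus": (17.0, "GB/s"),
    "InfiniBand DDR x4": (20.0, "Gb/s"),
    "1G Ethernet": (1.0, "Gb/s"),
    "OC-192": (10.0, "Gb/s"),
    "10G Ethernet": (10.0, "Gb/s"),
}


def to_gbytes_per_s(value, unit):
    """Convert a raw rate to gigabytes per second (no overhead modeled)."""
    if unit == "GB/s":
        return float(value)
    if unit == "Gb/s":
        return value / 8.0
    raise ValueError(f"unknown unit: {unit}")


ranked = sorted(LINKS.items(), key=lambda kv: to_gbytes_per_s(*kv[1]), reverse=True)
for name, (value, unit) in ranked:
    print(f"{name:<28} {to_gbytes_per_s(value, unit):7.3f} GB/s")
```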





#### Memory Performance for Balanced Computing

Byte: Flop Ratio Has Been Consistent and Steady

0.7 Bytes Per FLOP




Continuing the Trend for Petascale Performance
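If the 0.7 bytes-per-flop ratio is held constant, the aggregate memory bandwidth a petascale system needs follows by multiplication. A small sketch, assuming 1 PFlop/s as the target and decimal units (1 TB = 10^12 bytes); the function name is illustrative:

```python
BYTES_PER_FLOP = 0.7   # ratio quoted on the slide
PETAFLOP = 1e15        # floating-point operations per second


def required_bandwidth_bytes_per_s(flops, bytes_per_flop=BYTES_PER_FLOP):
    """Aggregate memory bandwidth needed to keep the byte:flop ratio."""
    return flops * bytes_per_flop


bandwidth = required_bandwidth_bytes_per_s(PETAFLOP)
print(f"{bandwidth / 1e12:.0f} TB/s aggregate memory bandwidth")
```

Under these assumptions a 1 PFlop/s machine needs on the order of 700 TB/s of aggregate memory bandwidth, spread across all processors.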



## Increasing Memory Bandwidth to Keep Pace



#### 3D Memory Stacking

- Power and IO signals go through DRAM to CPU
- Thin DRAM die
- Through-DRAM vias






## PCI Express to Meet I/O Demand: Performance, Bandwidth and Functionality





Tracking Moore's Law

Source: Intel



## But ... How about Power and Reliability





#### Power and Cooling Cost Today

[Figure: power and cooling cost breakdown, including power delivery; legible value: \$14.6]



"DATACENTER ENERGY LABEL"



Assume: 9MW system power\*, 90% power delivery efficiency, cooling Coefficient of Performance (COP) = 1.5

\*Source: HPC Wire "A Petaflop Before its Time," June 28, 2006
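The stated assumptions are enough to estimate total facility power. The sketch below treats all delivered power as heat the cooling plant must remove and takes COP as heat removed divided by cooling power. The electricity price of \$0.10 per kWh is an assumption added here for illustration and is not stated on the slide; the function names are likewise illustrative.

```python
HOURS_PER_YEAR = 24 * 365


def facility_power_mw(system_mw, delivery_efficiency, cooling_cop):
    """Break facility power into system, delivery loss, and cooling (MW)."""
    input_mw = system_mw / delivery_efficiency   # power drawn before delivery losses
    cooling_mw = input_mw / cooling_cop          # power spent removing that heat
    return {
        "system": system_mw,
        "delivery_loss": input_mw - system_mw,
        "cooling": cooling_mw,
        "total": input_mw + cooling_mw,
    }


def annual_cost_usd(total_mw, usd_per_kwh):
    """Yearly electricity cost for a constant load."""
    return total_mw * 1000 * HOURS_PER_YEAR * usd_per_kwh


power = facility_power_mw(system_mw=9.0, delivery_efficiency=0.90, cooling_cop=1.5)
cost = annual_cost_usd(power["total"], usd_per_kwh=0.10)  # assumed price
print({k: round(v, 2) for k, v in power.items()})
print(f"about ${cost / 1e6:.1f}M per year at the assumed price")
```

Under these assumptions the 9MW system draws roughly 16.7MW at the facility level, so close to half of the electricity goes to power delivery losses and cooling rather than computation.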





#### Managing Power and Cooling Efficiency



#### Silicon:

Moore's law, Strained silicon, Transistor leakage control techniques, Clock gating

#### Processor:

Policy-based power allocation, Multi-threaded cores

#### **System Power Delivery:**

Fine grain power management, Ultra fine grain power management

#### **Facilities:**

Air cooling and liquid cooling options
Vertical integration of cooling solutions


Power Management: From Transistors to Facilities



## Reliability Challenge: Billions of Transistors

Soft Error FIT/Chip (Logic & Mem)



Assume: Chip size stays roughly the same with each generation.

Soft Errors or Single Event Upsets (SEU) are caused by charge collection following an energetic particle strike.

- FIT/bit (mem cell): expected to be roughly constant
- Moore's law: increasing the bit count exponentially: 2x every 2 years

An exponential growth in FIT/chip
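The argument above can be sketched numerically: with FIT/bit held constant and the bit count doubling every two years, FIT/chip doubles on the same schedule. FIT is the number of failures per 10^9 device-hours. The starting values below are hypothetical placeholders, not data from the talk.

```python
def fit_per_chip(fit_per_bit, bits_now, years, doubling_period_years=2.0):
    """Projected FIT/chip with constant FIT/bit and exponential bit growth."""
    bits = bits_now * 2 ** (years / doubling_period_years)
    return fit_per_bit * bits


# Hypothetical starting point, used only to show the growth rate.
FIT_PER_BIT = 1e-3
BITS_TODAY = 1e9

base = fit_per_chip(FIT_PER_BIT, BITS_TODAY, years=0)
for years in (0, 2, 4, 6, 8, 10):
    growth = fit_per_chip(FIT_PER_BIT, BITS_TODAY, years) / base
    print(f"+{years:2d} years: {growth:4.0f}x today's FIT/chip")
```

Over ten years this compounds to 32x, regardless of the starting values chosen.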





Soft Error: One of Many Challenges



### 100s & 1000s of Processors: An Example of Datacenter Growth ...



Compute Capacity Growth is Accelerating



#### Reliable Systems With Unreliable Components

Architectural Techniques (Micro Solutions and Macro Solutions)

- Parity
- SECDED ECC
- π bit
- Lockstepping
- Redundant multithreading (RMT)
- Redundant multi-core CPU

Circuit Techniques

- Device Param Tuning
- Rad-hard Cell Creation

Process Techniques

- State-of-Art Processes



Reducing Single-Bit Soft Errors
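To make SECDED (single error correct, double error detect) concrete, here is a toy (8,4) code: a Hamming(7,4) code plus one overall parity bit. It is a teaching sketch only. Production memory ECC protects much wider words, and nothing here is taken from the designs discussed in the talk.

```python
def encode(nibble):
    """Encode 4 data bits as 8 bits: index 0 is overall parity, 1..7 Hamming."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d      # data bit positions
    code[1] = code[3] ^ code[5] ^ code[7]       # parity over positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]       # parity over positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]       # parity over positions with bit 2 set
    code[0] = sum(code[1:]) % 2                 # overall parity (even)
    return code


def decode(code):
    """Return (nibble, status); nibble is None if a double error is detected."""
    c = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos                     # XOR of set positions locates a flip
    overall_odd = sum(c) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        c[syndrome] ^= 1                        # single flip; syndrome 0 is the parity bit
        status = "corrected"
    else:
        return None, "double-error detected"
    nibble = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return nibble, status


word = encode(0b1011)
word[6] ^= 1                                    # simulate a single-bit soft error
print(decode(word))                             # (11, 'corrected')
```

A single flipped bit is located by the syndrome and repaired. Two flipped bits leave the overall parity even with a non-zero syndrome, so they are reported rather than miscorrected.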



#### Do we need more than Petascale?

Parallel and programmable
10s to 100s cores and threads
100s and 1000s of processors
100s Gbps chip-to-chip signaling
10s to 100s billions transistors
Dynamic self-test, detect, reconfigure, & adapt

[Figure: compute requirements by application area (Laser Optics, Molecular Dynamics in Biology, Aerodynamic Design, Computational Cosmology, Turbulence in Physics, Computational Chemistry) on a scale marked 1 Petaflops, 10 Petaflops, 20 Petaflops, 1 Exaflops, 10 Exaflops, 100 Exaflops, 1 Zettaflops]

Source: Dr. Steve Chen, "The Growing HPC Momentum in China", June 30th, 2006, Dresden, Germany





Reaching Petascale with Energy Efficiency










