

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007



## Fundamentals of Computer Design

| Pe     | entium 3 (Coppermine)                                                                                               | Cray YMP      |  |
|--------|---------------------------------------------------------------------------------------------------------------------|---------------|--|
| Туре   | Desktop                                                                                                             | Supercomputer |  |
| Year   | 2000                                                                                                                | 1988          |  |
| Clock  | 1130 MHz                                                                                                            | 167 MHz       |  |
| MIPS   | > 1000 MIPS                                                                                                         | < 50 MIPS     |  |
| Cache  | 256 KB                                                                                                              | 0.25 KB       |  |
| Memory | 512 MB                                                                                                              | 256 MB        |  |
| Cost   | \$2000                                                                                                              | \$1,000,000   |  |
|        |                                                                                                                     |               |  |
|        | 07 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical a<br>Jniversity, Textbook figures © Elsevier Science 1990, 1 |               |  |
|        |                                                                                                                     |               |  |





- Architecture: Instruction Level Parallelism (ILP), caches, RISC
- RISC

•

- Now: 20% / year
  - 1. power and heat
  - 2. running out of ILP to exploit
  - 3. memory latency
    - Reduce clock frequency
    - Multicore

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007 5

7

#### Classes of computers

| Feature                              | Desktop                                                                   | Server                                      | Embedded                                                            |
|--------------------------------------|---------------------------------------------------------------------------|---------------------------------------------|---------------------------------------------------------------------|
| Price of system                      | \$500-\$5000                                                              | \$5000-<br>\$5,000,000                      | \$10-\$100,000<br>(including network<br>routers at the high<br>end) |
| Price of<br>microprocessor<br>module | \$50-\$500<br>(per processor)                                             | \$200-\$10,000<br>(per processor)           | \$0.01-\$100<br>(per processor)                                     |
| Critical system<br>design issues     | Price-performance,<br>graphics<br>performance                             | Throughput,<br>availability,<br>scalability | Price, power<br>consumption,<br>application-specific<br>performance |
|                                      | 007 V. Hayward, T. Arbel, W. Gros<br>I University, Textbook figures © Els |                                             |                                                                     |

## **Defining Computer Architecture**

- *Implementation*: how an abstract description is turned into hardware.
- The *instruction set architecture* (ISA) is such abstraction.
  - Specification of the functional behavior of a processor
  - HW / SW interface.
  - Assembly language, instruction format, addressing modes, programming model
  - ADD R1, R2, R3
- Examples of ISAs
  - MIPS64, ARM, PowerPC

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007

## **Defining Computer Architecture**

- · ISA vs. Computer Architecture
  - Old definition of computer architecture
     = instruction set design
    - Other aspects of computer design called implementation
    - Insinuates implementation is uninteresting or less
       challenging
  - Our view is computer architecture >> ISA
  - Architect's job much more than instruction set design; technical hurdles today more challenging than those in instruction set design

## **Defining Computer Architecture**

- Organization: high level aspects of the design
  - memory, bus structure, pipeline, cache, branch predictors...etc.
  - design of these sometimes called "microarchitecture".
- It is possible to *implement* the same instruction set architecture using different organizations, resulting in different systems.
  - E.x. Different Bus, memory organization, pipeline structure and so-on. E.g. AMD Opteron 64 and Intel Pentium 4 (same ISA)

9

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007

## **Defining Computer Architecture**

- *Hardware* refers to the specifics of an implementation.
  - For example the Pentium 4 and the Mobile Pentium have different hardware.
  - Different clock frequencies, different memory systems
  - Other differences: fabrication technology, packaging, clock, etc...
- It is even possible to *emulate* a function normally carried out in hardware (say floating point calculations) using software (lists of instructions).

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007

| Functional requirements          | Typical features required or supported                                                                                                                                       |
|----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Application area                 | Target of computer                                                                                                                                                           |
| General-purpose desktop          | Balanced performance for a range of tasks, including interactive performance for graphics, video, and audio (Ct 2,3,5, App. B)                                               |
| Scientific desktops and servers  | High-performance floating point and graphics (App. I)                                                                                                                        |
| Commercial servers               | Support for databases and transaction processing; enhancements for reliability and availability; support of<br>scalability (Ch. 4, App. B, E)                                |
| Embedded computing               | Often requires special support for graphics or video (or other application-specific extension); power limitations<br>and power control may be required (Ch. 2, 3, 5, App. B) |
| Level of software compatibility  | Determines amount of existing software for computer                                                                                                                          |
| At programming language          | Most flexible for designer; need new compiler (Ch. 4, App. B)                                                                                                                |
| Object code or binary compatible | Instruction set architecture is completely defined-little flexibility-but no investment needed in software or porting<br>programs                                            |
| Operating system requirements    | Necessary features to support chosen OS (Ch. 5, App. E)                                                                                                                      |
| Size of address space            | Very important feature (Ch. 5); may limit applications                                                                                                                       |
| Memory management                | Required for modern OS; may be paged or segmented (Ch. 5)                                                                                                                    |
| Protection                       | Different OS and application needs; page vs. Segment; virtual machines (Ch. 5)                                                                                               |
| Standards                        | Certain standards may be required by marketplace                                                                                                                             |
| Floating point                   | Format and arithmetic: IEEE754 standard (App. I), special arithmetic for graphics or signal processing                                                                       |
| I/O interfaces                   | For I/O devices: Serial ATA, Serial Attach SCSI, PCI Express (Ch. 6, App. E)                                                                                                 |
| Operating systems                | UNIX, Windows, Linux, CISCO IOS                                                                                                                                              |
| Networks                         | Support required for different networks: Ethernet, Infiniband (App. E)                                                                                                       |
| Programming languages            | Languages (ANSI C, C++, Java, FORTRAN) affect instruction set (App. B)                                                                                                       |

# Trends in Technology Valid over long time periods, e.g. ISA can last decades Integrated circuit logic technology Transistor Density: ~ 35%

• Die size: ~ 10-20%
 ⇒Transistors per chip: ~ 40-55%
 • Semiconductor DRAM
 • Capacity ~ 40%
 • Magnetic disk technology
 • Density ~ 30% (2004)
 • Network technology

12

#### Performance Trends: Bandwidth over Latency

- Compare ~1980 Archaic vs. ~2000 Modern
- Performance Milestones in each technology
- Compare for Bandwidth vs. Latency improvements in performance over time
- Bandwidth: number of events per unit time
  - E.g., M bits / second over network, M bytes / second from disk
- · Latency: elapsed time for a single event
  - E.g., one-way network delay in microseconds, average disk access time in milliseconds

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007

13

#### Performance Trends: Bandwidth over Latency

- Disks: Archaic v. Modern
- Seagate 373453, 2003 • CDC Wren I, 1983 • 15000 RPM (4X) • 3600 RPM • 73.4 GBytes (2500X) 0.03 GBytes capacity Tracks/Inch: 64000 (80X) Tracks/Inch: 800 • Bits/Inch: 533,000 (60X) • Bits/Inch: 9550 • Four 2.5" platters • Three 5.25" platters (in 3.5" form factor) • Bandwidth: Bandwidth: 86 MBytes/sec (140X) 0.6 MBytes/sec • Latency: 5.7 ms (8X) • Latency: 48.3 ms · Cache: 8 MBytes Cache: none © 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007 14



| Performance Trends                                                                      | s: E | Bandwidth over La                                              | tency   |
|-----------------------------------------------------------------------------------------|------|----------------------------------------------------------------|---------|
| Memory: Archaic                                                                         | v. N | Vodern                                                         |         |
| 1980 DRAM<br>(asynchronous)<br>0.06 Mbits/chip                                          | • 2  | 2000 Double Data Rate S<br>(clocked) DRAM<br>256.00 Mbits/chip | (4000X) |
| 64,000 xtors, 35 mm <sup>2</sup>                                                        | • 2  | 256,000,000 xtors, 204 n                                       | nm²     |
| 16-bit data bus per module<br>16 pins/chip                                              | -,   | 64-bit data bus per<br>DIMM, 66 pins/chip                      | (4X)    |
| 13 Mbytes/sec                                                                           | • '  | 1600 Mbytes/sec                                                | (120X)  |
| Latency: 225 ns                                                                         | • 1  | Latency: 52 ns                                                 | (4X)    |
| (no block transfer)                                                                     | • [  | Block transfers (page mo                                       | de)     |
|                                                                                         |      |                                                                |         |
|                                                                                         |      |                                                                |         |
| © 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W<br>McGill University, Textbook figures |      |                                                                | 16      |









## Trends in Technology

- Scaling of Transistor Performance and Wires
  - Feature size (min size of transistor or wire)
    - 1971: 10 µm
    - 2001: 0.18 μm
    - 2002: 0.13 µm
    - 2003: 0.10 μm
    - 2007: 65 nm
  - Quadratic increase in density, linear increase in performance

23

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007

## Trends in Technology

- Scaling of Transistor Performance and Wires
  - $\Rightarrow$  Architectural improvement!
    - 8, 16, 32, 64 bit architectures (buses, ALU's)
    - · Pipelines and caches
    - Transistor performance benefits from smaller resistance and capacitance.
  - Interconnect propagation delay major problem.
    - e.g. Pentium 4 accounts for propagation of signals across chip.



- Trends in Power in Integrated Circuits
  - Dynamic power:  $P_{dynamic}$  = 1/2 x f x C x V<sup>2</sup>
    - f = clock frequency, C = capacitance
    - V = voltage (5V-> 3.3V-> < 1 V)
  - 3.2 GHz Pentium 4 Extreme Edition
     →135 Watts: limits of air cooling
  - Temperature diodes used to regulate voltage and clock frequency
  - Shutdown parts of chip
  - Portable computing requires low power.
  - Static Power (leakage): P<sub>static</sub> = Current <sub>static</sub> x V (25%!!!!!)

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007



## **Trends in Technology**

- Example:
- Some microprocessors have adjustable voltage. A 15% reduction in voltage results in a 15% reduction in frequency.
- What is the impact on the dynamic power?

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007

27

## **Trends in Technology**

- Other "Famous" Predictions
  - "There is no reason for any individual to have a computer in his home."
    - Kenneth H. Olson, President of DEC, Convention of the World Future Society, 1977
  - "640 kilobytes (of computer memory) ought to be enough for anybody."

Bill Gates, Founder and head of Microsoft, 1981

## **Trends in Cost**

- What is the nature of the cost-performance tradeoff?
- Driven by cost of components

   one important aspect is their change over time.
- Time and volume
- Manufacturing learning curve –best measured by yield (# good chips / total # of chips made)

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007

29

## **Trends in Cost**

- Yield improves with time.
- Doubling the yield halves the cost.
  - Example: DRAM chips have strange business behaviors because of rapid changes (40 % / year drop in price per megabyte over long term) – sometimes sell at loss!
- Microprocessors are less predictable.
  - Roughly cost decreases 10% for volume doubling (learning curve improves with each chip made, efficiency increase, amortize development costs)
  - Expansion of low-end market has produced "commoditization" with fierce competition and razorthin margins.
- © 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007





## Cost of an Integrated Circuit

#### • Manufacturing Steps.

- Silicon Crystal Growth extracted from molten silicon bath
- Processed (cleaned to very high level of purity) into cylinder
- Cylinder sliced to make wafers
- Wafers cleaned, polished and chemically processed
- Long sequence of steps involving deposit and removal of substances to etch the circuit according to patterns specified by optical masks.
- Dies cut, tested and packaged.









© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007

37

39

## Cost of an Integrated Circuit

- Example:
  - Find the die yield for dies: i) 1.5 cm on a side, and ii) 1.0 cm on a side

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007

## Cost of an Integrated Circuit

- Example:
  - \$2000 / wafer, 350 raw dies / wafer
  - 60% good dies, \$80 to test wafer,
  - \$4/unit to package and final test, 97% final test yield
- DieCost =
- DieTestCost (good dies) =
- ICCost =

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007

## Cost of an Integrated Circuit

|                          |      |      | Wafer<br>cost |     |     | Dies/<br>wafer | Yield  | Die Cost       |
|--------------------------|------|------|---------------|-----|-----|----------------|--------|----------------|
| 386DX                    | 2    | 0.90 | \$900         | 1.0 | 43  | 360            | 71%    | \$4            |
| 486DX2                   | 3    | 0.80 | \$1200        | 1.0 | 81  | 181            | 54%    | \$12           |
| PowerPC 6                | 01 4 | 0.80 | \$1700        | 1.3 | 121 | 115            | 28%    | \$53           |
| HP PA 710                | 03   | 0.80 | \$1300        | 1.0 | 196 | 66             | 27%    | \$73           |
| DEC Alpha                | 3    | 0.70 | \$1500        | 1.2 | 234 | 53             | 19%    | \$149          |
| SuperSPAR                | RC 3 | 0.70 | \$1700        | 1.6 | 256 | 48             | 13%    | \$272          |
| Pentium                  | 3    | 0.80 | \$1500        | 1.5 | 296 | 40             | 9%     | \$417          |
| – From "Es<br>Microproce |      |      |               |     |     | y Linle        | y Gwei | cs252/Patterso |

## Cost of an Integrated Circuit

#### · Dependability

- Historically, integrated circuits were very reliable, error rates inside chips was very low.
- 65nm and smaller: transient and permanent faults
- Service Level Agreement (SLA) provided pays customer a penalty if the system is down more than a certain number of hours a month
  - Two states:
    - Service accomplishment (if service is delivers as specified)
    - Service interruption (if service is different from SLA)
    - Failures (state 1->2) and restorations (states 2->1).
- © 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007

## Cost of an Integrated Circuit

- Quantify: Reliability and Availability
  - Module Reliability: Mean Time To Failure (MTTF)
  - MTTF^-1 → failure rate (Failures in Time or FIT) reported as failures per billion hours of operation
    - e.g. MTTF = 1,000,000 hours  $\rightarrow$  10^9/10^6 = 1000 FIT
  - Service time: Mean Time To Repair (MTTR)

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007 42

## Cost of an Integrated Circuit

- <u>Quantify: Reliability and Availability</u>
   Mean Time Between Failures (MTBF)
   MTBF = MTTF + MTTR
- Module Availability:
   Module availability = MTTF / MTBF

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007 43

Cost of an Integrated Circuit

- Example: Disk subsystem
  - 10 disks, each rated at 1,000,000 hour MTTF
  - 1 SCSI controller, 500,000 hour MTTF
  - 1 power supply, 200,000 hour MTTF
  - 1 fan, 200,000 hour MTTF
  - 1 SCSI cable, 1,000,000 hour MTTF
  - Assume that failures are independent. Assume exponentially distributed lifetimes (the age of a module is not important in the probability of failure, hence the overall failure rate of the collection is the sum of the failure rates of the modules)
  - Compute the MTTF of the disk subsystem.



#### Measuring and Reporting Performance

• Relative performance: "X is n times faster than Y"

 $n = \frac{\text{ExecTime}_{Y}}{\text{ExecTime}_{X}} = \frac{1/\text{Performance}_{Y}}{1/\text{Performance}_{X}} = \frac{\text{Performance}_{X}}{\text{Performance}_{Y}}$ 

- · How do you measure "time"?
- The only consistent and reliable measure of performance is the *execution time of real programs.*

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007 46

#### Measuring and Reporting Performance

#### • Different "Times"

- Wall-clock time is the elapsed time to complete a task. This includes I/O, memory, OS overhead, ..., everything. With multiprogramming and multitasking (as in UNIX), for a given task, it changes with the load.
- *CPU time* is the time spent by the CPU on behalf of one task. It is subdivided into *user CPU time* (time spent by the CPU running user code) and *system CPU time* (time spent running OS code on the behalf of the user).
   2003 2002 V Haved T Adel W Gross Deut of Enclose and Computer Enclosed

#### © 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007

#### **Measuring and Reporting Performance**

- Different "Times"
  - The UNIX time command reports all three.
    - For example
    - prompt\$ time sleep 5
    - 0.00u 0.02s 0:05.02 0.3%
  - tells us that the sleep 5 command spent (almost) no CPU time to run, 20 milliseconds to execute OS code and that the elapsed was 5.02 s (per the definition of sleep). It also says that (0.0+0.02)/5.02=0.3% of the elapsed time was to do some work.

#### **Measuring and Reporting Performance**

#### • What is a task?

- We want to be able to predict the performance of a computer: how do we evaluate performance?
  - 1. *Real applications*: C compiler (if you are code developer), TeX if you are a typesetter, Photoshop if you are a graphic designer, Spice if you are an electronic engineer, MatLab, and so-on.
  - 2. *Kernels*: A general principle about computing (Knuth) is that programs tend to spend most of their time in a very small portion of the code. (For example, the integrator routine in MathLab, searching and sorting while compiling, manipulating matrices in scientific code).

49

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007

#### Measuring and Reporting Performance

- 3. *Toy benchmarks*: Small and interesting programs, Sieve of Eratosthenes (prime numbers), Towers of Hanoi, Puzzles, Quicksort.
- 4. Synthetic Benchmarks: Attempt of reproduce the load of a set of programs.
- Benchmark suites: a collection of benchmark applications.
  - SPEC (Standard Performance Evaluation Committee, www.spec.org) is a consortium dedicated to the design of documented benchmarks suites.







time than others? This will bias the arithmetic mean towards those programs, so we can choose weights:

WeightedArithmeticMean =  $\sum_{i=1}^{n} weight_i time_i$ 

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007

53

55



 The SPEC consortium is composed of competing companies who might have their own choice of favorite weights, so one approach is to choose weights to equalize running times:

Weight 
$$_{i} = \frac{1}{\text{Time}_{i} \times \sum_{i=1}^{n} \left(\frac{1}{\text{Time}_{i}}\right)}$$

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007 54

#### Measuring and Reporting Performance

#### • Ratios

- Another approach is to normalize execution times relative to a reference computer (SPECRatio = exec time ref / exec time).
- i.e. if the SPECRatio n of computer A on a benchmark is 1.25 times higher than computer B then
  - 1.25 = SPECRatioA / SPECRatioB =(tref/tA)/(tref/tB)= =tB / tA = perf A / perf B
- The choice of reference computer is not important.

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007

#### **Measuring and Reporting Performance**

- Ratios
  - Normalized times to a reference machine must be averaged geometrically:

GeometricM ean = 
$$\sqrt[n]{\prod_{i=1}^{n} sample_{i}}$$

- In the case of SPEC, sample, is the SPECRatio
- Note that the geometric mean of the ratios is equal to the ratio of the geometric means and the relative results do not depend on the machine taken as reference.

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007

56



#### Measuring and Reporting Performance

#### • Figure 1.14

- Example on page 37: The geometric means are calculated from data in Figure 1.14 for an Opteron and an Itanium 2. The Itanium 2's mean is higher than the Opteron's (27.12 vs 20.86) but the standard deviation for the Itanium 2 is much higher (1.93 vs. 1.38) indicating that the results differ more widely from the mean and are therefore likely less predictable.

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007

#### **Quantitative Principles of Computer Design**

- Take advantage of parallelism
- · Principle of locality
- Focus on the common case
  - Parallelism is related to performing many operations simultaneously. It is applicable to single processor or to memory management as well. In fact, modern CPUs can executes 10's of instructions simultaneously and perform many memory transactions simultaneously. It is applicable to basic circuits (such as carry-look-ahead adders) to entire systems (many CPU or hard drives operating simultaneously).
  - One example of parallelism at the instruction level that we will exploit is called pipelining. We will overlap the execution of several instructions that the total time to execute the sequence is reduced.

59

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007

#### **Quantitative Principles of Computer Design**

- The principle of locality:
  - Typically, a program spends 90% of its time executing only 10% of the code. Most programs are, by definition, highly structured. They rarely use data and instructions in a completely random fashion.
  - Using this principle, we will be able to predict which instructions and data a program will use in the near future with reasonable accuracy based on its accesses in the recent past.

#### **Quantitative Principles of Computer Design**

#### • The principle of locality:

- Temporal locality states that recently accessed items are likely to be accessed again in the near future. Spatial locality says that items who's addresses (in memory) are near one another tend to be referenced close together in time.
- Make the common case fast.
  - The impact of an improvement to a system is higher if the occurrence is frequent. So, when making choices in design favor the frequent case over the infrequent ones.

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007 61



## **Quantitative Principles of Computer Design** • Define *FractionEnhanced*, the fraction of computation time concerned by an enhancement. This fraction is sped up by *SpeedUpEnhanced*. $ExecTimeEnhanced = ExecTimeBase \times \left((1 - FractionEnhanced) + \frac{FractionEnhanced}{SpeedUpEnhanced}\right)$ $SpeedUp = \frac{1}{(1 - FractionEnhanced) + \frac{FractionEnhanced}{SpeedUpEnhanced}}$ • Remember, FractionEnhanced is the fraction of time that converted to use an enhancement, NOT the fraction of time that the enhanced portion takes after the enhancement.

#### **Quantitative Principles of Computer Design**

- · Computer example:
  - Consider the enhancement to a processor for Web serving. New CPU is 10 x faster than original for Web serving. Original CPU busy with computation 40% of time and is waiting for I/O 60% of time. What is overall speedup gained by enhancement?
  - Only spends 40% doing work so limited by that amount.

#### **Quantitative Principles of Computer Design**

#### · Reliability example

 The power supply of the disk subsystem in the last reliability example was improved from 200,000 hour to 830,000,000 hour MTTF (4150x better) by adding a redundant power supply. What is the reliability improvement by adding the redundant power supply?

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007

#### Quantitative Principles of Computer Design

- Amdahl's Law express the low of diminishing returns
  - Incremental improvement in speedup gained by an improvement of just a portion of the computation diminishes as improvements are added.
- Corollary

65

67

 If an enhancement is only usable for a fraction of a task, we can't speed up the task by more then the reciprocal of 1 minus that fraction.

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007

## Quantitative Principles of Computer Design • CPU Performance

## Total CPU time for a task (CPU is a clock driven sequential circuit): CPU\_Time = ClockCycles × ClockCycleTime = ClockCycles × 1 ClockRate

- Programs (tasks) are made of lists of instructions:

 $ClockCyclesPerInstruction = CPI = \frac{ClockCycles}{InstructionCount}$ 

CPU Time = InstructionCount × CPI × ClockCycleTime

 $CPU\_Time = \frac{Instructions}{Program} \times \frac{ClockCycles}{Instructions} \times \frac{Seconds}{ClockCycle}$ 

© 2002, 2003, 2004, 2007 V. Hayward, T. Arbel, W. Gross, Dept. of Electrical and Computer Engineering, McGill University, Textbook figures © Elsevier Science 1990, 1996, 2007

### **Quantitative Principles of Computer Design**

- $\frac{CPU Performance}{CPU_Time_{=} \frac{Instructions}{Program} \times \frac{ClockCycles}{Instructions} \times \frac{Seconds}{ClockCycle}}$ 
  - Significance:
    - First factor is a function of the ISA and of the compiler technology.
    - Second factor is a function of the organization and of the compiler.
    - Third factor is a function of the organization and of the technology.
- Faced with a giant tradeoff! The art of computer design is contained in this formula.



| - |         |      | notion o | of instru | er operations<br>ction mix, which<br>classes of instru | ,<br>is th |
|---|---------|------|----------|-----------|--------------------------------------------------------|------------|
|   |         |      |          |           | n benchmark.                                           | cuor       |
| - | Ex. gcc |      |          |           |                                                        |            |
|   | Ор      | Freq | CPli     |           | Term                                                   |            |
|   | ALU     | 50%  | 1        |           | 0.5                                                    |            |
|   | Load    | 20%  | 2        |           | 0.4                                                    |            |
|   | Store   | 10%  | 2        |           | 0.2                                                    |            |
|   | Branch  | 20%  | 2        |           | 0.4                                                    |            |
|   |         |      |          | CPI=      | 1.5                                                    |            |

Γ