$\qquad$
$\qquad$

Course 304-425B -- Computer Organization and Architecture
Final examination
April 18, 2000, 9:00-- 12:00

Examiner: Prof. V. Hayward

Associate Examiner: Prof. K. Khordoc


## INSTRUCTIONS

- This is a closed book examination. Calculators and up to two sheets of notes are allowed.
- Explain every result concisely when asked. Marks will be given for clear, concise solutions.
- State any assumption required for an answer if it is not clear in the text of the question.
- This exam has 12 pages including this one. It has 7 sections for 24 questions (including a bonus question) indicated by the bullet sign (•). The marks add up to 100 .
- Please sign this paper at the top of the page, write your name and student number legibly there.
- Put your answers in the space provided and keep all the pages together.


## PLEASE NOTE CAREFULLY

- Make sure that the signed paper in its entirety is handed in (along with all signed exam books) at the end of examination.
- Make sure that the answers are put in the space provided, answers in any other location will not be marked.
- You have approximately 180 minutes to complete the exam.


## Section 1: Performance ( 12 points)

- Apply Amdahl's law to compute the speed-up factor for a machine to which an enhancement is added to improve some mode of execution by a factor 10 . This mode is used $50 \%$ of the time, measured as a percenta of the original exec time.
(4 points)

$$
\begin{aligned}
& T_{e}=T_{u}\left[(1-F E)+\frac{F_{E}}{S U_{E}}\right] \\
& S U=\frac{T_{u}}{T_{e}}=\frac{1}{(1-F E)+\frac{F_{E}}{S U_{E}}}=\frac{1}{.5+\frac{.5}{10}}=\frac{1}{.55}=1.82
\end{aligned}
$$

- Derive a variant of Amdahl's law to compute the speed-up factor for a machine to which an enhancement is added to improve some mode of execution by a factor 10 . However in this question, the mode is used $50 \%$. the time measured as a percentage of the enhanced exec time.
(4 points)
Hint: start from the definition of speed-up: Speed_ up $=\frac{\text { ExecTime }_{\text {unenhanced }}}{\text { ExecTime }_{\text {enhanced }}}$, in short: $S U=\frac{T_{u}}{T_{e}}$.

$$
\begin{aligned}
S U=\frac{T_{u}}{T_{e}} & =\frac{1}{T_{e}} T_{e}\left[\left(1-F_{E}\right)+F_{E} S U_{E}\right] \\
& =.5+5=5.5
\end{aligned}
$$

- Assume that we have a Load/Store machine which behaves with a perfect cache as follows:

| ALU ops | $40 \%$ | 1 clock cycle |
| :--- | :--- | :--- |
| Load/Stores | $30 \%$ | 2 clock cycles |
| Branches and others | $30 \%$ | 2 clock cycles |

The machine is modified to add new ALU instructions that have one source operand in memory. These new register-memory instructions have a clock cycle count of 2. The total number of ALU operations, branches, ar others instruction remains the same, of course, but the number of loads and stores is divided by two. Is this enhancement worth implementing?

$$
\begin{array}{lcccc}
\text { CC } \\
\text { NEW CLASSES } & \text { ALUOP1 ALUOP2 } & L / S & B O \\
\text { NEW CPI'S } & : & 1 & 2 & 2 \\
\text { NEW FO's } & : 4-.15 & \frac{.15}{.85} & \frac{.15}{.85} & \frac{.3}{.85} \\
\text { NEW IC } & : .85 I C_{\text {OLD }} & & &
\end{array}
$$

$$
\frac{\operatorname{TMENEW}^{T}}{\operatorname{TimEOAD}^{.85}}(.25+.15 \times 2+.15 \times 2+.3 \times 2)=1.45
$$

Timing analysis reveals that the memory cycles in the standard DLX pipeline are the limiting factor for clock c time improvement. One design option is to split the memory cycles in an attempt to increase the clock rate. Th often called super pipelining and is illustrated in the diagram below. Complete instruction fetches takes two sti IA and IB. In the first stage, the memory addressed is specified, in the second, the instruction is read out. same technique is applied to the MEM stage, now split into a MA and a MB stage. The new design is fully pipeli This is symbolically represented by introducing two new pipe registers.


- Assuming full bypassing/forwarding (including to and from the memory) use the chart below to repor timing diagram for this code. . Also note that the branch must stall. Show the branch behavior to be del branch. Suppose that the jump instructions benefit from branch folding and that there is a hit. (10 points)

| LOOP: | LW | R1, O (R2) |
| :--- | :--- | :--- |
|  | SW | R1, 0 (R3) |
|  | BEQZ | R1, OUT |
|  | ADDI | R2, R2, 4 |
|  | ADDI | R3, R3, 4 |
|  | J | LOOP |

OUT:

| $L W$ | $I A$ | $I B$ | $I D$ | $E X$ | $M A$ | $M B$ | $W B$ |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| $S W$ |  | $I A$ | $I B$ | $I D$ | $E X$ | $S$ | $M A$ | $M B$ | $W B$ |  |  |  |  |  |  |  |  |  |  |  |  |
| $B E Q Z$ |  |  | $I A$ | $I B$ | $S$ | $S$ | $I D$ | $E X$ | $M A$ | $M B$ | $W B$ |  |  |  |  |  |  |  |  |  |  |
| $A D D I$ |  |  |  | $I A$ | $S$ | $S$ | $I B$ | $I D$ | $E X$ | $M A$ | $M B$ | $W B$ |  |  |  |  |  |  |  |  |  |
| $A D D I$ |  |  |  |  |  |  | $I A$ | $I B$ | $I D$ | $E X$ | $M A$ | $M B$ | $W B$ |  |  |  |  |  |  |  |  |
| $J$ |  |  |  |  |  |  |  | $I A$ | $I B$ | $I D$ |  |  |  |  |  |  |  |  |  |  |  |
| $L W$ |  |  |  |  |  |  |  | $I X$ | $M A$ | $I D$ | $E X$ | $M A$ | $M B$ | $W B$ |  |  |  |  |  |  |  |
| $S W$ |  |  |  |  |  |  |  |  | IIA IB | $I$ | $I D$ | $E X$ | $S$ | $M A$ | $M B$ | $W B$ |  |  |  |  |  |



STANDARD DLX PIPELINE
Recall that there are four basic techniques to handle branches in a pipeline like DLX's:
(A) flush (or freeze) a number of instructions after the branch; (B) static prediction such as "predict-not-taken (C) delayed branch which creates "delay slots"; (D) delayed branch with canceling.

Consider now the following sequence to compute the double of the absolute value of a number in memory:


Show the timing of this sequence for the DLX pipeline assuming full forwarding and bypassing hardware assuming a register read and a write in the same clock cycle implicitly "forwards" through the register file ( first and then read). Use the chart to show the timing of instructions starting at instruction SLTI when the br, is taken. Fill-in the two blank entries according to the case. (note: a similar question was given last term, hower is NOT the same question).

- (B)"predict-not-taken":

| LW | IF | ID | EX | TE | WB |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| SLTI |  | IF | ID | S | EX | TE | WB |  |  |  |  |  |  |  |  |  |  |  |  |
| BEQZ |  |  | IF | 5 | 5 | ID | Ex | ME | WB |  |  |  |  |  |  |  |  |  |  |
| 908 |  |  |  |  |  | IF | ID |  |  |  |  |  |  |  |  |  |  |  |  |
| $A D D$ |  |  |  |  |  |  | IF | ID | Ex | TE | WB |  |  |  |  |  |  |  |  |

- Assuming now that the machine can detect nazaras, has forwarding naraware, ana uses delayed branches (c C). Schedule the following code, to minimize the stalls.

(A) (D) DATA HAZARD
(B),(E) branch delay (DElayed branch): (B) OK, (E) needs
(C) LOAD HAZARD

MANY GTRATEGES TO FILL THE SLOTS, EG.

- rove the "adds's" up
- rove the "load" before baez: fills two slots.
- MOVE THE "AND" BEFORE BNEZ
- Double number in pelay slot (Knowledge of semant EXAMPLE
LOOP: SGT R4,RI,RG
LW RR, O(R3)
LOAD NUMBER REGARDLESS OF
BNEZ RA, OUT
SLTI RH, RI, 0
DO TEST IN DELAY SLOT
ADD RS, RB,
FILL WITH APP
BEQZ $R 4$, SKIP
ADDI $R 1, R I, 1$
$S \cup B \quad R 2, R o, R 2$
SKIP: $A D D \quad R 2, R 2, R 2$
SW $O(R 3), R 2$
J LOOP
OUT: AND R2,RO,RO
FILL SLOT WITH API

Consider the pipeline below. The integer units can be controlled to carry out any types of integer instructions an the FP units any types of floating point operations.


| LOOP: | LD | F2,0(R1) |
| :---: | :---: | :---: |
|  | MULTD | F4, F2, F0 |
|  | LD | F6, 0 (R2) |
|  | ADDD | F6,F4.F6 |
|  | SD | O(R2), F6 |
|  | ADDI | R1, R1, \#8 |
|  | ADDI | R2, R2, \#8 |
|  | SGTI | R3, R1, Defrre |
|  | BEQZ | R3, LOOP |

Consider the code at the left which implement the vector operation: $\mathrm{Y}=\mathrm{a}^{*} \mathrm{X}+\mathrm{Y}$ where X and Y are vector arrays.

Assume the latencies are 0 for all integer operations including loads, 4 for additions, 6 for the multiplication regardless of the instruction using result. The Common Data Bus is written and read on the same cycle and ( support multiple data transfers (so there is no structural hazard there).

Use "Mem[10+Reg[R1]]" to denote, for example, the value fetched by first load, "Reg[R1]" to denote the value to be held in register R1, and to denote the value 8 .

To illustrate operation, the table below indicates the status of the pipeline once the instructions of the first iterati have issued (that is at clock cycle 8), starting from a blank state.

|  | Instruction Status including the CC counts spent in each stage |  |  |
| :--- | :--- | :--- | :--- |
| Instruction | Issue | Execute | Write Result |
| LD | 1 | 2 | 3 |
| MULTD | 2 | $3--7$ | 8 |
| LD | 3 | 4 | 5 |
| ADDD | 4 | $8--? ? 9-10$ | $? ?$ |
| SD | 5 | $? ? 6$ | $? ?$ |
| ADDI | 6 | 7 | 8 |
| ADDI | 7 | 8 | $? ? 9$ |
| SGTI | 8 | $? 9$ | $? ?$ |

- Indicate in the table below the state of the reservation stations.
(4 points)

- Indicate in the table below the status of the registers.

|  |  | Register Status |  |  |  |  |  |  |  |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Field | F 0 | F 2 | F 4 | F 6 | $\ldots$. | R 1 | R 2 | R 3 | $\ldots$ |
| Qi |  |  | FPO | FPO 2 |  | INT | NT 2 | INT 3 |  |

- Indicate in the table below the state of the store buffers
(2 points)

| Field | Store 1 | Store Buffers |  |  |
| :--- | :---: | :---: | :---: | :---: |
| Qi | FPU 2 |  |  |  |
| Busy | $Y$ |  |  |  |
| Address | $0-[82]$ |  |  |  |

Ignoring the branch delay, now show the new state of the machine, one clock cycle later (this means a new load has been issued).

- Indicate in the table below the state of the reservation stations.
(4 points)

- Indicate in the table below the status of the registers.

| Field | F0 | F2 | F4 | F6 | Register Status |  |  |  |  |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Qi |  | LD1 |  | FDU2 | $\ldots$ | $R 1$ | R2 | R3 | $\ldots$ |

- Indicate in the table below the state of the store buffers

| Field | Store 1 | Store Buffers | Store 2 |
| :--- | :---: | :---: | :---: |
| Qi | FPU 2 |  |  |
| Busy | $y$ |  |  |
| Address | $0+[R 2]$ |  |  |

Section 4. Unrolling (5 points)
Consider a standard FP pipeline as in the mid-term:


Consider again the loop of the previous question:


- Unroll this loop twice and schedule it for minimal execution time on average when run on the pipeline above Ignore the branch delay and assume that all branches are correctly predicted.

2009: LD $F 2, O(R 1)$
HuctD F4, F2, FO
LD FF, O(R2)
ADD FF, F4,F6
SD $O(R 2), F 6$
LD FR, 8 (RI)
HOLD FlO, FE, FO
$\angle D \quad F 12,8\left(R_{2}\right)$
ADD $F R, F 10, F 12$
ADD ${ }^{8(R 2) \text {, } \# 16}$
ADDI \#16
SaTI
BERT
LOOP
Two STEPS


## Section 5: Branch predictors (15 pomes)

Consider this infinite loop and its assembly code translation

```
a = 1;
b = 1;
while (1) { /* for ever */
    if (a == 0)
        a = 1;
    else
        a = 0;
    if (a != 0)
        b}=0\mathrm{ ;
    if (b == 0)
        b = 1;
```



```
\}
```

In the table below, the successive values of $a$ and $b$ are listed. Notice the period two. The sequence of taken (T) an not taken ( N ) branch outcomes is also given in the table below.

| $a$ |  |  |
| :--- | :--- | :--- |
| 1 | 1 | $B 1$ outcome: $T$ |
| 0 | 1 | $B 2$ outcome: $T$ |
| 0 | 1 | $B 3$ outcome: $T$ |
| 0 | 1 | $B 1$ outcome: $N$ |
| 1 | 1 | $B 2$ outcome: $N$ |
| 1 | 0 | $B 3$ outcome: $N$ |
| 1 | 1 | $B 1$ outcome: $T$ |
| 0 | 1 | $B 2$ outcome: $T$ |

- A machine has a 2-bit branch predictor mechanism. What is the performance of this predictor while execution this code in the steady state in terms of correct predictions(s) per iteration? A concise explanation must be given to get the marks.
EACH BRANCH ( $B 1, B 2, B 3$ ) HAS $A$ TWO BT T PREDICTOR
BI SEQUENCE: $T, N, T, N \ldots$
BL SEQUENCE: $T, N, T, N \ldots$
BU SEQUENCE: T,N,T,N...
ONE CORRECT PREDICTION OUT OF TWO: $50 \%$ 1/2
- A machine has a $(1,1)$ correlating branch predictor. What is its performance while executing the same code the steady state in terms of correct predictions) per iteration? Fill the table below to get the marks. ( 5 point:
LAST BRANCH NOT TAKEN - LAST BRANCH TAKEN


Average number of correct predictions?
for the code below, supposing that there is a hit in the buffer (that is: predicted taken), but the prediction is incorrect.

| 1. | SLTI R5, R1, 0 | \I compare R1 with 0 |  |  |
| :--- | :--- | :--- | :--- | :--- |
| 2. | iNEZ RS, SKIP | II if R1 >= 0 skip |  |  |
| 3. | SKIP: | SUBI RI, RN, RI | MU LT RI, RI, RI | II double |


| SLTI | IF | ID | EX | ME | WB |  |  |  |  |  |  |  |  |  |  |  |  |  |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |

## Section 6: Loop level parallelism ( 15 points)

Consider this loop:

```
for (i = 1; i < 100; ++i) {
    a[i - 1] = c[i - 1] + n; /* S1 */
    b[i] =m + c[i]; /* S2 */
    a[i] = a[i] + b[i]; /* S3 */
\}
```

- List all the dependencies: output dependencies, anti-dependencies, and true data dependencies and indicate each dependency the pair of statements and which are "loop carried"
(5 points)
Output Dependencies:
Anti Dependencies:
loop carried
LOop carried
$a[i] \rightarrow a[i-1]$
sAmE


## Data Dependencies $52-53$

Rewrite the loop so it becomes parallel. Solve this problem in two different ways:

- First use software renaming, not changing the structure of the loop:

LOOP CARRIED DEPENDENGES INVOLVE A [G] THAT IS a [i-1] OF NEXT ALWAYS WRITTEN, NEVER READ. CALL IT TEMPCi]
for ( ) \&
$\operatorname{LEMP}[i-1]=c[i-1]+n$
$b[i]=m+c[i]$

- Second transform the loop without renaming so it becomes parallel:

$$
\left.\begin{array}{rl}
a[0]= & c[0]+n \\
f o r(i= & 1 ; i<99 ;++i)\{ \\
& b[i]=m+c[i] ; \\
& a[i]=a[i]+b[i] ; \\
& a[i]=c[i]+n ; \\
\} & b[0]=
\end{array}\right) m+c[92 ; 10\}+b[4] ;
$$

## Section 7: Memory Hierarchy (18 points)

A cache system has $B$ blocks of $N$ words and total storage capacity $L$ (for valid bits, tags, and data) measured in bits. Recall that the degree of associativity $A$ is defined as the number of blocks per set. Assume further that the memory address space is $2^{Z}$ and that the memory is word addressed (each word has $W$ bits). Call $H$ the hit time, $M$ the miss rate, and $P$ the miss penalty measured in clock cycles.

Consider now this contrived but interesting example (read the whole section before starting). The benchmark test is to visit (read only) all the addresses in the address space exactly once.

- Calculate the AMAT of the cache system for this test as a function of $B, N, W, Z, H$, and $P$ starting from a blank cache (all the valid bits are off). In developing the formula, take the case of a direct mapped cache, or equivalently $A=1$, that is $M$ sets.
(6 points)

- Work out the result for $B=16, N=4, A=1, Z=32, W=32, H=2$, and $P=20$.

1 MISS, 3 HITS IN CLOCK CYCLES

$$
\begin{aligned}
& 4 \times 2+20 \text { FOR EACH BLOCK OF } 4 \text { WORD } \\
& \text { AVERAGE ACCESS TIME }=\frac{28}{4}=7 \mathrm{CC} .
\end{aligned}
$$

Note that these last two questions are independent. You can solve the numerical example by reasoning it out and then derive the formula, or you can develop the formula first and then plug the numbers in.

- Bonus question!: Solve the same problems for $A=2$
( $5+5$ points
MORE ASSOCIATIVITY DOES NOT CHANGE RESULT.

ZPU requests this sequence of addresses:3AC54230, A35C2340, and 57BF2344. If there is a miss, ate a replacement by showing which tag gets changed and assume the blocks continue to hold the same In any case, indicate below the values returned to the CPU.

PHYS. ADDR. 3AC5 244; INDEX: 2 ; TAG: 3 ACE; MISS; RETURN $2 . .2$ ORD. 40: PHYS. ADDR. A35C123; INDEX: 1; TAG:A35C1; MISS; RETURN 1.I. OR E..E 44: PHYS. ADDR. 57 BF 244 ; INDEX: 2; TAG:57BF2; MISS; RETURN D..D OR $2 . .2$




Translation Lookaside Buffer


Cache Addresses Format



Note: this means that all the bytes in each block have the same value. In this case: "8"


FL

