| Last Name: | SAMPLE |   |
|------------|--------|---|
|            |        | - |

First Name: \_\_\_\_\_

Signature:\_\_\_\_\_

# Course 304-425B -- Computer Organization and Architecture

## Final examination

December 21, 2001, 14:00 -- 17:00

Examiner: Prof. V. Hayward Associate Examiner: Prof. J. Ferrie

# INSTRUCTIONS

- □ This is a closed book examination. Calculators and one or two sheets of notes are allowed.
- Explain every result concisely when asked. Marks will be given for clear, concise solutions.
- State any assumption required for an answer if it is not clear in the text of the question.
- This exam has 14 pages including this one. It has 7 sections for 25 questions. Each question is indicated by a bullet sign (•) and carries four (4) marks. The marks add up to 100.
- Please sign this paper at the top of the page, write your name and student number legibly there.
- Put your answers in the space provided and keep all the pages together.

## PLEASE NOTE CAREFULLY

- Make sure that the signed paper in its entirety is handed in (along with all signed exam books) at the end of the examination.
- Make sure that the answers are put in the space provided, answers in any other location will not be marked.
- You have approximately 180 minutes to complete the exam.

### PROBLEMS SECTION 1: Performance (5 questions)

 Consider adding an enhancement to a machine that would divide the number of load/store instructions by two. All other instructions would be left unchanged. However, the clock cycle of the machine would be 7% longer and the average CPI for all instructions would change from 1.1 clock cycle, before enhancement, to 1.2 after enhancement. What is the minimum percentage of load/stores in the instruction mix produced by an application would be required to make the enhancement worth implementing, assuming it is free of cost?

TIME NEW  $\leq$  TIME OLD  $\Rightarrow$  IC 11 CC  $\leq$  IC 1.2  $\left(1 - \frac{\alpha}{2}\right)$  1.07 CC  $1.1 \leq 1.2 \left(1 - \frac{\alpha}{2}\right) 1.07 \Rightarrow \frac{\alpha}{2} \geq \left(1 - \frac{11}{12107}\right)$  $\Rightarrow \chi > 0.286$ 

 Consider now a five stage pipeline which causes 20% of instructions to stall for one cycle on average. Consider the memory to be perfect, that is, there is no memory stalls. Assuming that the clock rate is four times faster than a non-pipelined version of the same CPU, what is the effective improvement in performance?



• Consider a single pipeline machine like DLX. For a given benchmark suite, measurements show that 10% of the instructions are unconditional jumps. A first enhancement is proposed in the form of a "jump target buffer" and a second in the form of "jump folding" temember that the jump target address is calculated in the ID stage and the buffer updated in the EX stage. For these enhancement, the hit rate is 90%. The average CPI of the base machine is estimated to be 1.5 for all other instructions. All other aspects of the machine are assumed to be unchanged. Compute the expected speed-up due to the implementation of:

• a "jump target buffer" NORMAL DLX  $CPI_0LD = 0.1 * 2 + 0.9 * 1.5 = 1.55$   $CPI_NEW = 0.1 (0.9 * 1 + 0.1 \times 2) + 0.9 * 1.5$ = 0.11 + 1.35 = 1.46Su =  $\frac{1.55}{1.46} = 1.061$ • "jump folding" CPI NEW = 0.1 (0 + 0.1 x2) + 0.9 + 1.5

2

 $SU = \frac{1.55}{137} = 1.13$ 

= 0.02 + 1.35 = 1.37

NOT APPROVED

· Consider now a machine with a virtual memory and two levels of cache. The performance figures are:

L1: the hit time is 1 cc and the miss rate is 5%

L2: the hit time is 10 cc and the miss rate is 2%

MM: the hit time for a block is 40 cc, the miss rate 0.0001%, and miss penalty is 10,000,000 cc

Compute the global average memory access time (AMAT)

| AMATi = | $\mu_{T_{\ell}}$ | + MR; MP; | $MP_i = AMAT_{i+1}$    |     |
|---------|------------------|-----------|------------------------|-----|
| Ат      | ANY              | LEVEL     | MISS PENALTY DETERMINE | Bγ  |
|         |                  |           | ACCESS TO HIGHER LEVE  | EL. |

| AMAT = | 1+ | 0.05 | (10 + | 0.02 (40 + | 10-6 | 107)) |
|--------|----|------|-------|------------|------|-------|
| =      | 1+ | 0.05 | (10+  | 0.02 50)   |      |       |
| Ξ      | 1+ | 0.05 | 11 =  | 1.55       |      |       |

We focus now on the performance of a DLX style machine with one level of cache that has a 1 cc hit time, a 5% miss rate, and a miss penalty of 16 cc. We improve this cache so it becomes non blocking and is such that the CPU can restart as soon as the first word arrives from the MM 4 cc after request. This improvement concerns the instruction fetches only. The data cache remains blocking and 30% of all instructions are loads. The data cache is "write through" but has a write buffer that eliminates data stalls for stores. What is the speed up introduced by the non blocking feature?

 $CPT_{AVE} = CPI_BASE + FETCH-MISS-RATE + PENALTY + FREQ-4/s + DATA MISS-RATE + PENALTY$ BLOCHING CASE $<math>CPI_{AVE} = 1.0 + 0.05 \ 16 + 0.3 \ 0.05 \ 16 = 2.04$ NON BLOCHING CASE  $CPI_{AVE} = 1.0 + 0.05 \ 4 + 0.3 \ 0.05 \ 16 = 1.44$ 

$$SU = \frac{2.04}{1.44} = 1.42$$

END OF SECTION 1 (5 questions)

# SECTION 2: Integer Pipelining (4 questions)

Timing analysis reveals that the memory cycles in the standard DLX pipeline are the limiting factor for clock cycle time improvement. The machine is super-pipelined. Complete instruction fetches takes two stages IA and IB. In the first stage, the memory addressed is specified, in the second, the instruction is read out. The same technique is applied to the MEM stage, now split into a MA and a MB stage. The new design is fully pipelined but has This is symbolically represented by indicating new pipe registers.





2a + 1 - c· We need a sequence of pipelined DLX code to compute:

```
2b+1-2c otherwise
```

if  $2a+1 \ge 0$ 

Upon entry, a is stored in register R1 and the result is left in register R1, b is in R2, c is in R3. The pipeline handles all structural and data hazards. The numbers processed are expected to be equally likely to yield one case or the other. Control hazards are handled via "delayed branch with canceling". Here is a semioptimized sequence:

|      |      | R4, |     |    | 2a   |
|------|------|-----|-----|----|------|
|      | ADDI | R4, | R4, | 1  | 22+1 |
|      | SGE  | R5, | R4, | R0 | <0 ? |
|      | BNEZ | R5, | SKI | Р  |      |
|      | ADD  | R3, | R3, |    | 20   |
|      | ADD  | R4, | R2, | R2 | 26   |
|      | ADDI | R4, | R4, | 1  | 26+1 |
| SKIP | SUB  | R1, | R4, | R3 |      |
|      |      |     |     |    |      |

SK

Use speculative execution to write an optimized sequence. The machine does not have hardware speculative features such as boosted instructions or poison bits. (Hint: do not try to manipulate the expressions compute them as given).

| PELAY SLOT 445 TWO PLACES.<br>HOVE CONTROL PEPENDENT INSTRUCTION<br>BEFORE BRANCH TO SPECULATIVELY<br>CALCULATE 26, FOR EXAMPLE.<br>ADD R4, R1, R1<br>ADPI R4, R4, 1 29+1 | ANOTHER POSSIBILITY<br>ADD $R4, R1, R1$<br>ADD $R4, R4, R1$<br>SGE $R5, R4, R0$<br>SUB $R1, R4, R2$ $29+1-C$ ?                |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|
| SGE R5, R4, R0<br>ADD R6, R2, R2 RENAME 26<br>BUEZ R5, SICIP<br>ADD R3, R3, R3 20<br>ADDE R4, R6, 1 26+1<br>(IP: SUB R1, R4, R3                                           | SUB RI, R4, R2 29+1-C (<br>BNEZ R5, SKIP<br>ADD R3, R3, R3 EC<br>ADD R4, R2, R2 26<br>ADD R4, R4, 1 26+1-2C<br>SUB R1, R4, R2 |
| <b>NOT APPROVED</b> 4                                                                                                                                                     | SKIP: OTHER CODE                                                                                                              |

 Rewrite the untransformed sequence above, assuming now that the pipeline has no bypassing and no canceling hardware at all, inserting NOPs (no operation instructions) to ensure correct execution.

| 3 CASES<br>ALU -> ALU | TA IB ID EX MAMB WB<br>X X X X IA IB ID  | 4 > ADD<br>ADD                           | R4 ,R1 , R1<br>R4 ,R4 , 1                           |
|-----------------------|------------------------------------------|------------------------------------------|-----------------------------------------------------|
| ALU -> BR             | TA IB ID EX MA MB WB<br>X X X X IA IB ID | 4 ><br>56E<br>4 ><br>BNEZ<br>2 >         | R5, R4, R0<br>R5, SKIP                              |
| BR PELAY              | TA IB ID<br>X X IA TB                    | APD<br>4 > APD<br>4 > ADDI<br>4 ><br>SUB | R3, R3, R3<br>R4, R2, R2<br>R4, R4, 1<br>R1, R4, R3 |

We now consider the CPU to be fully bypassed and to handle branches with pure delayed branch. Write an
optimized sequence that executes correctly.

| NETINAL<br>BRANCH                                              | SEQUENCE                                                                                 | NOTHING TO HOVE IN SLOT<br>FROM BEFORE BRANCH                                                                                                                                                       | ANOTHER POSSIBILITY                                                                                                                                                            |
|----------------------------------------------------------------|------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| APP<br>APDI<br>SGE<br>BNEZ<br>NOP<br>NOP<br>ADD<br>APD<br>APPI | R4, R1, R1<br>R4, R4, 1 2a+1<br>R5, R4, R0<br>R5, SUIP<br>R3, R3, R3 2C<br>R4, R2, R2 2b | BRING INSTRUCTION FROM<br>FALL THROUGH SEQUENCE<br>IF MISTAGEN EXEC OK<br>MUST OREATE A "TEMP"<br>ADDI R4, R1, R1<br>ADDI R4, R4, 1<br>SGE R5, R4, RO<br>BNEZ R5, SKIP<br>ADD R6, R2, R2 25 IN TEMP | ADD R4, R1, R1<br>ADDI R4, R4, R1<br>9GE R5, R4, R0<br>BNEZ R5, SKIP<br>SUB R1, R4, R3<br>ADD R3, R3, R3<br>ADD R4, R2, R2<br>ADDI R4, R4, R<br>SUB R1, R4, R3<br>P OTHER COPE |
|                                                                |                                                                                          |                                                                                                                                                                                                     |                                                                                                                                                                                |

• For the pipeline diagramed above with full bypassing, including around the MA MB stages, above what is the load latency for these two cases?

| ◇<br>LW<br>ADD | R1, 0(R2)<br>R1, R1, R1 | EX MA MB<br>X X EX     | 2 OYCLE |
|----------------|-------------------------|------------------------|---------|
| ◇<br>LW<br>SW  | R1, 8(R2)<br>R1, 0(R3)  | EX MA MB<br>Y EX MA MB | 1 CYCLE |

END OF SECTION 2 (4 questions, 9 so-far)

5

#### SECTION 3: Loop Transformations With FP Code (3 questions)

The problem is evaluate the benefit of loop unrolling. The FPUs are fully pipelined and bypassed. However the FP division is not pipelined. In case of write contention (one write per clock cycle) the earliest instruction has priority and stalls the contending instruction(s). Ignore the branch delay and assume the following data:

| Loa<br>com<br>FP / | d double<br>pletion of a<br>ADD<br>MULT          |                                 | PU operations                                                  | any FP<br>Store [<br>FPU op<br>FPU op | ction using result<br>PU operation<br>Double<br>peration<br>peration<br>peration | Latency<br>1<br>0<br>3<br>4<br>15 | in | clock | cycles |
|--------------------|--------------------------------------------------|---------------------------------|----------------------------------------------------------------|---------------------------------------|----------------------------------------------------------------------------------|-----------------------------------|----|-------|--------|
| LOC                | P: LD<br>MULTD<br>SD<br>LD<br>ADDD<br>DIVD<br>SD | F4,<br>F4,<br>F4,<br>F6,<br>F6, | 0(R1)<br>F2, F0<br>0(R1)<br>0(R2)<br>F4, F4<br>F6, F0<br>0(R2) |                                       | A[I] = A[I] * c<br>B[I] = (B[I] + B[I])                                          | / c                               |    |       |        |
|                    | ADDI<br>ADDI<br>ADDI<br>BNEZ                     | R1,<br>R2,<br>R5,               | R1, 8<br>R2, 8<br>R5, -1<br>LOOP                               |                                       | INTEGER<br>INDEX AND<br>LOOP<br>CONTROL                                          |                                   |    |       |        |

 Unroll the loop twice, and use other transformation to reduce stalls. For conciseness, ignore the integer code and schedule the FP instruction as if it did not exist.

OBSERVATIONS: 2 INDEPENDENT CALCULATIONS 30 CC, IE Z DIVD DIVD NOT PIPELINED, BEST WE CAN HOPE 15 BASIC LOOP STRATEGY : START WITH B SOFTWARE PIPELINE CAN AND SCHEPULE SECOND LOOP LD VERSION OR EVEN ORIGINAL ONE ATT THAT STALLS. LOTS OF IN THE 1 POSSIBILITIES, BUT THE DIVDS A[1] \* C MULTD OVERLAP. CANNOT F4,0(R2) BFI 4 LD It/ SD A[=] LD F4, dR2) B2I] F10, 8(R2) LD. F10,8(R2) BLEHI LD. LD BLIJ APDD F6, F4, F4 1 APPD F6, F4, F4 ADDD FIG, FID, FID 7 APDD BEJ+BEJ ADPD F16, F10, F10 F2, O(RI) LD A (77 3 LD F2, O(RI) ASE3 ASIT 4p F12,8(R1) ()/c DIVD LD F12, 8(RI) ALIHIJ SP F14,-16(R2) BSI-27 15 DIVD FIG, FG, FO AVD FIL, FG, FO MULTO FL, FZ, FO BJEJ SD HUAD F4, F2, FO 31 MULTD FID, FIZ, FO Cc. HUAD F4, F12, FD 15 SP F6, O(R2) IN OTHER WORPS, SD FG, C(R2) SP F16, 8(R2) SP F16,8(R2) 7 FETCHES ANP DIVD F20, F16, FO 24 STALLS SP F20, -8 (R2) B[I-D F14,0(R2) F20,8(R2) SD 15 PIVD FZO, F16, FD SP · Calculate the number of clock cycles to perform the basic computation, comparing the untransformed loop with the unrolled one: 30 CYCLES

47 CYCLES GIVE OR TAKE A FEW

б

TWO LOOPS.

NOT APPROVED

FOR

- Suppose now that the machine is a super-scalar, dual issue machine with two pipelines: one to process integer, branches and all load/stores instructions, and the other to process all the FP instructions.
  - Use the chart below to show a schedule that maximizes throughput of one single loop accounting for integer code. Assuming that the branch is folded and that there is a hit. You may use more registers if needed.

|          | Integer Instruction   | FP Instruction                        |                  |
|----------|-----------------------|---------------------------------------|------------------|
| cc1 200P | LD \$8, 0(R2)         |                                       |                  |
| CC2      | LD F2, OCRI)          |                                       | ASSUMING THAT    |
| CC3      | APPI RI, RI, 8        | APPD F6, F8, F8                       | AAZARD DETECTION |
| CC4      | APPI R2, R2, 8        | HULD FL, F2, FD                       | AUTOMATICALLY    |
| CC5      | ADDI R5, R5, -1       | DIVD F6, F6, F0                       |                  |
| CC6      | SD F4, -8 (RI)        |                                       | DELAYS EXECS     |
| CC7      | SD F6, -8 (R2)        |                                       |                  |
| CC8      | BNEZ RS, LOOP         |                                       |                  |
| 600      |                       |                                       |                  |
| CC10     | SD F6, -8(RL)         | APD F6, F8, F8                        |                  |
| CC11     | Sp F4, -8.(RI)        | MULTO F4, F2, FO                      |                  |
| CC12     | 4D F8,0(R2)           | PIVD F6, F6, FO                       |                  |
| CC13     | LD F2; O(RI)          |                                       |                  |
| CC14     | APDI R5, R5, -1       |                                       |                  |
| CC15     | ADDI RI, RI, I        |                                       |                  |
| CC16     | APDI R2, R2, I        |                                       | 1                |
| CC17     | BNEZ R'S, SHIP        |                                       |                  |
| CC18     |                       |                                       |                  |
| CC19     |                       |                                       |                  |
| CC20     | WITH SOFT PIPELININ G |                                       |                  |
| CC21     |                       |                                       |                  |
| CC22     |                       |                                       |                  |
| CC23     |                       |                                       |                  |
| CC24     |                       |                                       |                  |
| CC25     |                       |                                       |                  |
| CC26     |                       |                                       |                  |
| CC27     |                       | · · · · · · · · · · · · · · · · · · · |                  |
| CC29     |                       |                                       |                  |
| CC30     |                       |                                       |                  |

Second Explain in a few words the transformation(s) that you have used:

- · RENAMED F4 TO F8 TO AVOID OUT AUT DEPENDENCY
- PERFORME DIVD AS FARLY AS POSSIBLE TO ALLOW FOR MORE OVERLAP.

END OF SECTION 3 (3 questions, 12 so-far)

7

### SECTION 4:Tomasulo's Algorithm (4 questions)

Consider the pipeline below. The integer unit can be controlled to carry out any type of integer instructions. It has one FP ADD/SUB unit and one MULT/DIV unit. The load and store buffers have four entries each. The reservation stations have three entries.



Assume that the latencies are 0 for all integer operations, 1 for loads, 4 for the FP add/sub's, 6 for the multiplication and 25 for the divides, regardless of the instruction using the result. All units are fully pipelined. The Common Data Bus is written at the end of the last clock cycle of an operation. An operation that depends on the written value starts at the next clock cycle. The Common Data bus cannot support multiple data transfers. In case of contention for the bus, the instruction the earliest issued instruction has priority.

Consider now the following sequence of code assuming that a new instruction is fetched at each clock cycle and issued at the next clock cycle if there is free entry in the reservations or in the buffers. A table to indicate timing is partially filled.

· Complete this table first:

|        | fetched                        | Instr.                         | Operands    | Issue                   | Exec         | Write                                                                |
|--------|--------------------------------|--------------------------------|-------------|-------------------------|--------------|----------------------------------------------------------------------|
|        | CC1                            | LD                             | F2. 0(R1)   | CC2                     | CC3-4        | CC4                                                                  |
|        | CC2                            | LD                             | F4. 4(R1)   | CC3                     | CC4-5        | CC5                                                                  |
|        | CC3                            | MULTD                          | F2, F2, F4  | CC4                     | CC6-12       | CC12                                                                 |
|        | CC4                            | DIVD                           | F4. F2. F6  | CC5                     | CCB-02-38    | CC 38                                                                |
|        | CC5                            | SUBD                           | F6. F4. F2  | CC6                     | cc39-Cc43    | CC 43                                                                |
|        | CC6                            | ADDD                           | F6, F2, F2  | CC7                     | CC13-CC17    | CCIT                                                                 |
|        | CC7                            | ADDI                           | R1, R1, 8   | CC8                     | CC9          | CC9                                                                  |
|        | CC8                            | SD                             | F4,8(R1)    | CCA                     | CC 39 💥 🚍    | NEVER                                                                |
|        |                                |                                |             |                         | <b>▼</b> ▼ √ |                                                                      |
| AT CC7 | LCAPS<br>HUZTD<br>DIVD<br>SVBD | COMPLETER<br>EXECUTE<br>ISSUED | S WAITING F | OR HULTP<br>OR PIVD ANT | CC 14        | HULTD COMPLETED<br>AVD EXECUTES<br>SUBD ISSUED, WAITING-<br>FOR PIVD |
|        | APDD                           | ISSUEP                         | WAITING- F  | OR MULTD                |              | APPD EXECUTES<br>APDI COMPLETED                                      |
|        |                                |                                |             | 8                       |              | SD ISSUEP, WAITING FOR                                               |

Use "Mem[Reg[R1]]" to denote, for example, the value fetched by the first load, "Reg[R1]" to denote the value held in register R1, and #8 to denote the value 8.

Indicate in the tables below the state of the reservation stations and of the registers during clock cycle 7.

| Name       | Busy |       | Vj            | Vk               | Qj         | Qk         |
|------------|------|-------|---------------|------------------|------------|------------|
| MULT/DIV-1 | V    | HULTD | MAN [REG[RI]] | METT [REG[RI]+4] | 1          |            |
| MULT/DIV-2 | V    | DIVP  |               | REG [FG]         | TULT/OV-1  |            |
| MULT/DIV-3 |      |       |               |                  |            |            |
| ADD/SUB-1  | V    | SUBP  |               |                  | MULT/PIV-Z | HULT/DIV-1 |
| ADD/SUB-2  | V    | APPD  |               |                  | HULT/DIV-1 | HULT/PIV-1 |
| ADD/SUB-3  | 100  |       |               |                  |            |            |
| INT-1      |      |       |               |                  |            |            |
| INT-2      |      |       |               |                  |            |            |
| INT-3      |      |       |               |                  |            |            |

| Field | · F0 | F2         | F4         | F6        | F8 | R1 | R2 | R3 | R4 |  |
|-------|------|------------|------------|-----------|----|----|----|----|----|--|
| Qi    |      | MULT/DIV-I | HULT/DIV-2 | APD/SUB-2 |    |    |    |    |    |  |

Indicate in the table below the state of the reservation stations and of the store buffers during clock cycle 14.

| Name       | Busy | Op.  | Vj       | Vk       | Qj         | Qk |
|------------|------|------|----------|----------|------------|----|
| MULT/DIV-1 |      |      |          |          |            |    |
| MULT/DIV-2 | V    | PIVD | REG EF2] | REG [F6] |            |    |
| MULT/DIV-3 | 2    | 3    |          |          | = 1= 2     |    |
| ADD/SUB-1  |      | SUBD |          |          | MULT/DIV-2 |    |
| ADD/SUB-2  |      | ADDD | REG[F2]. | REG [F2] |            |    |
| ADD/SUB-3  |      |      |          |          |            |    |
| INT-1      |      |      |          |          |            |    |
| INT-2      |      |      |          |          |            |    |
| INT-3      |      |      |          |          |            |    |

| Field   | Store-1    | Store-2 | Store-3 | Store-4 |
|---------|------------|---------|---------|---------|
| Qi      | HULT/PIV-2 |         |         |         |
| Busy    |            |         |         |         |
| Address | 8+ REGERI] |         |         |         |

Explain in 20 words or less the motivation and the function of a reorder buffer:

• MAKES IT POSSIBLE TO EXCUTE INSTRUCTIONS SPECULATIVELY, I.E. BEFORE A BRANCH OUT COME IS KNOWN, HOLDING RESULTS BEFORE WRITTING THEN. • (INSTRUCTIONS HAVE AN EXTRA "COMMIT" STEP).

END OF SECTION 4 (4 questions, 16 so-far)

9

## SECTION 5: Branch predictors (2 questions)

Consider this nested loop: for (i = 0; i < 100; ++i) for (j = 0; j < 10; ++j) if (j % 2 == 0) if (i % 2 == 0) a[j] = 0; else a[j] = 1;

A machine has a 2-bit branch predictor. Estimate how many correct and how many incorrect predictions are
made while executing this code, assuming a 100% hit rate and no clashes in the predictor table (Hint: start with
the inner most code and work your answer step by step, giving just a number will earn you no marks).

| PISPREPICTONS                                               | CORRECT PREDICTIONS |
|-------------------------------------------------------------|---------------------|
| INNER "IF" VISITED 500 THES WITH PATTERN TITLE NUNNIN T 100 | 400                 |
| FIRST "IF" VISITED 1000 TIMES WITH PATTERN TNTN 500         | 500                 |
| INNER "FOR" VISITED 1000 TIMES EXITS 100 TIMES 100          | 900                 |
| OUTER "FOR" VISITED 100 TIMES EXITS ONCE 1                  | 99                  |
| 2600 VISITS OF WHICH 701                                    | 1899                |
| MISPRE DICTIONS                                             | CORRECT             |

Recall that a (1,2) predictor refers to a correlating predictor that implements correlation by choosing among two different 2-bit predictors. We consider a certain sequence of code that causes three branches B1, B2, B3 to be visited with the outcomes listed in the table (T taken, N not taken, hint, there is no obvious pattern). Each branch is visited 4 times. Assume that prior entering the sequence, all the predictors are in the state that would result from a series of taken branches and this applies to both predictors of each branch.

|             | Branch      | B1             | В2           | В3             | В2             | B1  | В3   | B1             | В3   | B2             | в3   | B1    | B2   |
|-------------|-------------|----------------|--------------|----------------|----------------|-----|------|----------------|------|----------------|------|-------|------|
| LAGT BRANXH | Outcome     | w <sup>N</sup> | RT           | ₩ <sup>N</sup> | w <sup>N</sup> | • N | R T  | ₩ <sup>N</sup> | R T  | w <sup>N</sup> | р T  | RN    | RT   |
| TAKEN       | Predictor 1 | PTO            | ρτο          | PTOJ           | Ρτο            | PT1 | PTI  | PNDU           | PTI  | PTO            | PTI  | PNO 1 | PTI  |
| NOT TAKEN   | Predictor 2 | PTO            | PTO)<br>PTO) | PTO            | PTO 2          | PTO | PTON | PT1            | PTON | PTI            | PTON | PTI   | PTOD |

• Work out the number of correct and incorrect predictions for each of the three branches. This can be done by filling in the table above. There is no need to restate the details of 2-bits predictors but nevertheless indicate the functions you assigned to Predictor 1 and to Predictor 2. To clarify your work, use the symbols PTO, PT1, PNO, PN1 to refer to the states of a 2-bit predictor and specify concisely what they mean.

DEF'S

6 RIGHT, 6 WRONG.

END OF SECTION 5 (2 questions, 18 so-far)

SECTION 6: Loop level parallelism (3 questions)

Consider this loop:

Rewrite the loop so it becomes parallel. Solve this problem in two different ways:

· First use software renaming, not changing the structure of the loop:

LOOP CAPALED LOUTRY DEPENDENCY SI OVERWAITES SB VIA ali], ali-il LOOP CARRIED ANTI-DEPENDENCY SB HUST-READ BEFORE SI WRITES HOWEVER all-il Always WRITTEN, NEVER READ, CALL IT T [i-i] for (-) f T[i-i] = c[i-i] + n; EACH ITERATION. IS b[i] = m + c[i]; INDEPENDENT FROM a[i] = a[i] + b[i]; THE PREVIOUS ONE g

Second transform the loop without renaming the variables so it becomes parallel:

ONLY ONE DATA DEPENDENCE,  $S2 \Rightarrow S3$ , START WITH SECOND STATEMENT a[0] = c[0] + n;  $for (i = 1; i < 99; ++i) {$  b[i] = m + c[i]; q[i] = a[i] + b[i]; a[i] = c[i] + m; a[i] = c[i] + m; consider now this loop: $<math>for (i = 1; i < 100; ++i) {$  a[a] = a[i] + n;  $for (i = 1; i < 100; ++i) {$  a[a] = a[a] + n;  $for (i = 1; i < 100; ++i) {$  n = a[i] + n; $for (i = 1; i < 100; ++i) {$ 

· Can it be transformed to become parallel? Explain.

}

```
NOMINALLY NO, BECAUSE EACH ITERATION DEPENDS ON
THE PREVIOUS ONE.
```

```
HOWEVER, SINCE ET'S A RUNNING SUM, WITH
PROFOUND TRANSFORMATIONS, PARTIAL SUMS COULD BE
COMPUTED IN PARALLEL.
```

END OF SECTION 6 (3 questions, 21 question so-far)

# SECTION 7: Memory Hierarchy (4 questions)

Consider this hierarchical memory design, along with the following facts: It is word addressed and the words are 32 bit long. The physical memory has a capacity of 1 Megabyte. The cache blocks have 4 words. The page size is 1 kByte



Set Associative Cache (4 sets of 4 blocks), Write Back, Write Allocate

Fill all the missing field size specifications (< >) on the diagram above.

• Compute the total size of the page table: # PAGES \* SIZE OF PAGE TABLE ENTRY SIZE (PTE) = SIZE (PERIT BITS) + SIZE (DIRTY BIT) + SIZE (TAG) + SIZE (PHYS. P. #)

$$= 3 + 1 + 24 + 10$$

$$= 48 \text{ BITS } 2 5 \text{ BYTES}$$

$$= 48 \text{ BITS } 2 5 \text{ BYTES}$$

$$= 48 \text{ BITS } 2 5 \text{ BYTES}$$

$$= 2^{32} = 2^{24} = 16 \text{ Megs}$$

$$= 2^{8} = 2^{8} = 2^{8} = 2^{8} = 16 \text{ Megs}$$

SIZE (PAGE TABLE) = 80 Mbytes

- Consider a benchmark such that all the addresses in the address space are sequentially read once for the machine described above. It starts from a blank cache.

NO LOCALITY, SO ONE CACHE MISS PER BLOCK ADDRESS SPACE =  $2^{32}$ ; SIZE OF BLOCK =  $2^2$  WORDS  $\Rightarrow 2^{30}$  CACHE MISSES

Owner of the number of page faults?

SIMILARLY, PAGE SIZE IS 2 WORDS

=> 2<sup>24</sup> PAGE FAULTS

· Answer these yes/no questions

- "A victim cache" is a method to reduce the miss rate in caches [yes/no]: YES
- $\diamond$  Software pipelining is used to minimize the number of page faults [yes/no]: NO (ILP)
- "Interleaved Memory" makes it possible to pipeline block replacement in caches [yes/no]: YES

## END OF SECTION 7 (4 questions)

AND

END OF EXAM (and end of the 25 questions)