| Last Name:_ | SAMPLE | First Name: | | |--------------|--------|-------------|--| | | | | | | Student ID:_ | | Signature: | | #### Course 304-425B -- Computer Organization and Architecture Final examination December 21, 2001, 14:00 -- 17:00 Examiner: Prof. V. Hayward Associate Examiner: Prof. J. Ferrie #### INSTRUCTIONS - □ This is a closed book examination. Calculators and one or two sheets of notes are allowed. - □ Explain every result concisely when asked. Marks will be given for clear, concise solutions. - ☐ State any assumption required for an answer if it is not clear in the text of the question. - □ This exam has 14 pages including this one. It has 7 sections for 25 questions. Each question is indicated by a bullet sign (•) and carries four (4) marks. The marks add up to 100. - □ Please sign this paper at the top of the page, write your name and student number legibly there. - Put your answers in the space provided and keep all the pages together. #### PLEASE NOTE CAREFULLY - □ Make sure that the signed paper in its entirety is handed in (along with all signed exam books) at the end of the examination. - Make sure that the answers are put in the space provided, answers in any other location will not be marked. - ☐ You have approximately 180 minutes to complete the exam. # PROBLEMS SECTION 1: Performance (5 questions) • Consider adding an enhancement to a machine that would divide the number of load/store instructions by two. All other instructions would be left unchanged. However, the clock cycle of the machine would be 7% longer and the average CPI for all instructions would change from 1.1 clock cycle, before enhancement, to 1.2 after enhancement. What is the minimum percentage of load/stores in the instruction mix produced by an application would be required to make the enhancement worth implementing, assuming it is free of cost? TIME NEW $$\leq$$ TIME OLD $\Rightarrow$ IC 11 CC $\leq$ IC 1.2 $\left(1-\frac{\alpha}{2}\right)$ 1.07 CC $1.1 \leq 1.2 \left(1-\frac{\alpha}{2}\right)$ 1.07 $\Rightarrow$ $\frac{\alpha}{2} \geqslant \left(1-\frac{11}{1.2 \cdot 1.07}\right)$ $\Rightarrow$ $\alpha \geqslant 0.286$ Consider now a five stage pipeline which causes 20% of instructions to stall for one cycle on average. Consider the memory to be perfect, that is, there is no memory stalls. Assuming that the clock rate is four times faster than a non-pipelined version of the same CPU, what is the effective improvement in performance? ONE STAGE TIME = IC CPI CC TIME = IC CPI CC 1.2 SU = $$\frac{1}{5}\frac{1}{4}$$ 1.2 = 16.7 5 INSTRUCTIONS CLOCK CYCLE TIME OVERLAP REDUCED • Consider a single pipeline machine like DLX. For a given benchmark suite, measurements show that 10% of the instructions are unconditional jumps. A first enhancement is proposed in the form of a "jump target buffer" and a second in the form of "jump folding". Remember that the jump target address is calculated in the ID stage and the buffer updated in the EX stage. For these enhancement, the hit rate is 90%. The average CPI of the base machine is estimated to be 1.5 for all other instructions. All other aspects of the machine are assumed to be unchanged. Compute the expected speed-up due to the implementation of: o a "jump target buffer" NORMAL DLX $$CPI_OLD = 0.1 * 2 + 0.9 * 1.5 = 1.55$$ $CPI_NEW = 0.1 (0.9 * 1 + 0.1 \times 2) + 0.9 * 1.5$ $CPI_NEW = 0.1 (0.9 * 1 + 0.1 \times 2) + 0.9 * 1.5$ $CPI_NEW = 0.1 (0 + 0.1 \times 2) + 0.9 * 1.5$ $CPI_NEW = 0.1 (0 + 0.1 \times 2) + 0.9 * 1.5$ $CPI_NEW = 0.1 (0 + 0.1 \times 2) + 0.9 * 1.5$ $CPI_NEW = 0.1 (0 + 0.1 \times 2) + 0.9 * 1.5$ $CPI_NEW = 0.1 (0 + 0.1 \times 2) + 0.9 * 1.5$ $CPI_NEW = 0.1 (0 + 0.1 \times 2) + 0.9 * 1.5$ · Consider now a machine with a virtual memory and two levels of cache. The performance figures are: L1: the hit time is 1 cc and the miss rate is 5% L2: the hit time is 10 cc and the miss rate is 2% MM: the hit time for a block is 40 cc, the miss rate 0.0001%, and miss penalty is 10,000,000 cc Compute the global average memory access time (AMAT) AMAT = $$1 + 0.05 (10 + 0.02 (40 + 10^{-6} 10^{7}))$$ = $1 + 0.05 (10 + 0.02 50)$ = $1 + 0.05 11 = 1.55$ • We focus now on the performance of a DLX style machine with one level of cache that has a 1 cc hit time, a 5% miss rate, and a miss penalty of 16 cc. We improve this cache so it becomes non blocking and is such that the CPU can restart as soon as the first word arrives from the MM 4 cc after request. This improvement concerns the instruction fetches only. The data cache remains blocking and 30% of all instructions are loads. The data cache is "write through" but has a write buffer that eliminates data stalls for stores. What is the speed up introduced by the non blocking feature? BLOCKING CASE $$SU = \frac{2.04}{1.44} = 1.42$$ # SECTION 2: Integer Pipelining (4 questions) Timing analysis reveals that the memory cycles in the standard DLX pipeline are the limiting factor for clock cycle time improvement. The machine is super-pipelined. Complete instruction fetches takes two stages IA and IB. In the first stage, the memory addressed is specified, in the second, the instruction is read out. The same technique is applied to the MEM stage, now split into a MA and a MB stage. The new design is fully pipelined but has This is symbolically represented by indicating new pipe registers. • We need a sequence of pipelined DLX code to compute: $\begin{cases} 2a+1-c & \text{if } 2a+1 \ge 0 \\ 2b+1-2c & \text{otherwise} \end{cases}$ Upon entry, a is stored in register R1 and the result is left in register R1, b is in R2, c is in R3. The pipeline handles all structural and data hazards. The numbers processed are expected to be equally likely to yield one case or the other. Control hazards are handled via "delayed branch with canceling". Here is a semi-optimized sequence: ``` R4, R1, R1 ADD R4, R4, 1 22+1 ADDI R5, R4, R0 <0 SGE BNEZ R5, SKIP ADD R3, R3, R3 2 C 26 ADD R4, R2, R2 26+1 ADDI R4, R4, 1 SKIP SUB R1, R4, R3 ``` Use speculative execution to write an optimized sequence. The machine does not have hardware speculative features such as boosted instructions or poison bits. (Hint: do not try to manipulate the expressions compute them as given). HAS TWO PLACES DELAY 920TMSTRUCTION HOVE CONTROL PEPENDENT BEFORE BRANCH TO SPECULATIVELY CALCULATE 26, FOR EXAMPLE. ADD R4, RI, RI ADPI R4, R4, 1 29+1 SGER5, R4, RO R6, R2, R2 RENAME 26 ADD BNFZ R5, SKIP R3, R3, R3 App 2C ADDI R4, R6, I 26+1 SUB RI, R4, R3 SKIP: ANOTHER PossiBILITY 124, RI, RI ADD R4, R4, 1 ADD 29+1 R5, R4, RO SGE RI, R4, R2 2a+1-C ട്യെ RE, SKIP BNEZ R3, R3,R3 ADD 20 R4, R2, R2 26 ADD R4, R4, 1 APPE 2b+1-2C R1, R4, R2 SUB SKIP: OTHER CODE Rewrite the untransformed sequence above, assuming now that the pipeline has no bypassing and no canceling hardware at all, inserting NOPs (no operation instructions) to ensure correct execution. We now consider the CPU to be fully bypassed and to handle branches with pure delayed branch. Write an optimized sequence that executes correctly. | opin | inced sequence and execute. | s conceny. | A COUNTY OF THE PARTY OF THE PARTY | |---------|-----------------------------|----------------------------------------------------------------------------------------|------------------------------------| | MATINAL | SEQUENCE | NOTHING TO YOUR IN SLOT | ANOTHER POSSIBILITY | | BRANCH | DELAYED | FROM BEFORE BRANCH. | | | APP | R4, R1, R1 | BRING INSTRUCTION FROM FALL THROUGH SEQUENCE IF HISTAMEN EXEC OK. MUST GREATE A "TEMP" | ADD R4, R1, R1 | | APDI | R4, R4, A 29+1 | | APPI R4, R4, I | | SGE | R5, R4, RO | | 9GE R5, R4, RO | | BNEZ | RS, SUIP | ADD R4, R1, R1 | BNEZ R5, SKIP | | Nop | | ADDI R4, R4, 1 | SUB R1, R4, R3 | | Nop | | SGE R5, R4, RO | ADD R3, R3, R3 | | ADD | R3, R3, R3 2C | ADD RG, RZ, RZ 26 IN TEMP | APP R4, R2, R2 | | ADD | R4, R2, R2 2b | | APPI R4, R4, I | | APDI | R4, R4, 1 2b+1 | | SUB R1, R4, R3 | | SUB | | ADD R3, R3, R3<br>ADDI R4, R6, 1 SKIP<br>SUB R1, R4, R3 | OTHER COPE | For the pipeline diagramed above with full bypassing, including around the MA MB stages, above what is the load latency for these two cases? SKIP END OF SECTION 2 (4 questions, 9 so-far) #### SECTION 3: Loop Transformations With FP Code (3 questions) The problem is evaluate the benefit of loop unrolling. The FPUs are fully pipelined and bypassed. However the FP division is not pipelined. In case of write contention (one write per clock cycle) the earliest instruction has priority and stalls the contending instruction(s). Ignore the branch delay and assume the following data: | Instruction producing result Load double completion of any FPU operations FP ADD FP MULT FP DIV | Instruction using result any FPU operation Store Double FPU operation FPU operation FPU operation | Latency 1 0 3 4 15 | in clock | cycles | |------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------|--------------------|----------|--------| | LOOP: LD F2, 0(R1) MULTD F4, F2, F0 SD F4, 0(R1) LD F4, 0(R2) ADDD F6, F4, F4 DIVD F6, F6, F0 SD F6, 0(R2) | A[I] = A[I] * c B[I] = (B[I] + B[I]) / | c | | | | ADDI R1, R1, 8<br>ADDI R2, R2, 8<br>ADDI R5, R5, -1<br>BNEZ R3, LOOP | INTEGER INDEX AND LOOP CONTROL | | | | Unroll the loop twice, and use other transformation to reduce stalls. For conciseness, ignore the integer code and schedule the FP instruction as if it did not exist. ``` OBSERVATIONS: 2 INDEPENDENT CALCULATIONS DIVD NOT PIPELINED, BEST WE CAN HOPE BASIC LOOP STRATEGY: START WITH B SOFTWARE PIPELINE CAN AND SCHEPULE SECOND LOOP LD VERSION OR EVEN ORIGINAL ONE THAT STALLS. LOTS OF IN THE POSSIBILITIES, BUT THE DIVES MULTO OVERLAP. CANNOT F4, O(R2) BŒI 4 I+/ SD A[=] LD F4, dR2) F10, 8(R2) Blitti LD F10,8(R2) LD BLIJ APPD F6, F4, F4 ADDD F6, F4, F4 Appid F16, F10, F10 B[E]+B[E] ADPO F16, F10, F10 F2, O(RI) LD 3 LD F2, OCRI) AIII 40 F/2, 8(RI) DIVD LD F12, 8(R1) ALIHI SP F14,-16 (R2) B [I-2] 15 DIVD F14, F6, FO AND F4, F6, F0 HULTO F4, F2, F0 HUAD F4, F2, FO MULTO FIO, FIZ, FO MUAD F4, F12, FO SP F6, O(RZ) IN OTHER WORPS, SD F6, C(R2) SP F16, 8(RZ) 7 FETCHES F16, 8(RZ) ANP Y DIVD F20, F16, FO 24 STALLS B [I-1] SP F20, -8 (R2) F14 1 O(RZ) F20, 8 (RZ) SD DIVD F20, F16, F0 SP Calculate the number of clock cycles to perform the basic computation, comparing the untransformed loop with the unrolled one: 30 CYCLES 47 CYCLES GIVE OR TAKE FOR TWO LOOPS ``` A FEW - Suppose now that the machine is a super-scalar, dual issue machine with two pipelines: one to process integer, branches and all load/stores instructions, and the other to process all the FP instructions. - Use the chart below to show a schedule that maximizes throughput of one single loop accounting for integer code. Assuming that the branch is folded and that there is a hit. You may use more registers if needed. | | Integer Instruction | FP Instruction | |----------|----------------------|------------------| | cc1 Loop | LD #8, O(R2) | | | CC2 | LD F2, O(R1) | | | CC3 | APPI RI, RI, 8 | APPD F6, F8, F8 | | CC4 | ADDI RZ, RZ, 8 | MULLD F4, F2, FO | | CC5 | ADDI R5' R5' -1 | DIVD F6, F6, F0 | | CC6 | SD F4, -8 (RI) | , , | | CC7 | SD F6, -8 (R2) | | | CC8 | BNEZ RS, LOOP | | | CC9 | | | | CC10 | SD F6, -8 (R1) | ADD F6, F8, F8 | | CC11 | Sp F4, -8 (R1) | MULTO F4, F2, FO | | CC12 | 4D F8,0(R2) | DIVD F6, F6, F0 | | CC13 | LD F2, 0 (RI) | | | CC14 | ADDI R5, R5 -1 | | | CC15 | ADDI RI, RI, I | | | CC16 | APRI R2, R2, I | | | CC17 | BNEZ RS, SKIP | | | CC18 | | | | CC19 | | | | CC20 | WITH SOFT PIPELINING | • | | CC21 | | | | CC22 | | | | CC23 | | | | CC24 | | | | CC25 | | | | CC26 | | | | CC27 | | | | CC29 | | | | CC30 | | | ASSUMING THAT AARARD RETECTION AUTOMATICALLY DELAYS EXECS - Explain in a few words the transformation(s) that you have used: - · RENAMED F4 TO F8 TO AVOID OUT PUT DEPENDENCY - O PERFORME DIVD AS FARLY AS POSSIBLE TO ALLOW FOR MORE OVERLAP. ## SECTION 4:Tomasulo's Algorithm (4 questions) Consider the pipeline below. The integer unit can be controlled to carry out any type of integer instructions. It has one FP ADD/SUB unit and one MULT/DIV unit. The load and store buffers have four entries each. The reservation stations have three entries. Assume that the latencies are 0 for all integer operations, 1 for loads, 4 for the FP add/sub's, 6 for the multiplication and 25 for the divides, regardless of the instruction using the result. All units are fully pipelined. The Common Data Bus is written at the end of the last clock cycle of an operation. An operation that depends on the written value starts at the next clock cycle. The Common Data bus cannot support multiple data transfers. In case of contention for the bus, the instruction the earliest issued instruction has priority. Consider now the following sequence of code assuming that a new instruction is fetched at each clock cycle and issued at the next clock cycle if there is free entry in the reservations or in the buffers. A table to indicate timing is partially filled. #### · Complete this table first: | fetched | Instr. | Operands | Issue | Exec | Write | |---------|--------|------------|-------|-----------|-------| | CC1 | LD | F2. 0(R1) | CC2 | CC3-4 | CC4 | | CC2 | LD | F4. 4(R1) | CC3 | CC4-5 | CC5 | | CC3 | MULTD | F2, F2, F4 | CC4 | CC6-12 | CC12 | | CC4 | DIVD | F4. F2. F6 | CC5 | CB-C38 | cc 38 | | CC5 | SUBD | F6. F4. F2 | CC6 | cc39-Cc43 | cc 43 | | CC6 | ADDD | F6, F2, F2 | CC7 | CC13-CC17 | CC17 | | CC7 | ADDI | R1, R1, 8 | CC8 | CC9 | CC9 | | CC8 | SD | F4,8(R1) | CC 9 | cc 39 | NEVER | | | | | | | | | | | | | | | | AT CC7 | LOADS<br>HULTID<br>DIVD<br>SVBD | COMPLETED<br>EXECUTES<br>ISSUED, WAITING<br>ISSUED, WAITING | FOR<br>FOR | HULTD<br>DIVD AND MULTD | CC 14 | MULTO COMPLETED AND EXECUTES SUBD ISSUED, WAITING- FOR DIND | |--------|---------------------------------|-------------------------------------------------------------|------------|-------------------------|-------|-------------------------------------------------------------| | | APAD | ISSUEP, WAITING | FOR | MULTID | | APAD EXECUTES<br>APAT COMPLETED | | | | | | 8 | | SD ISSUEP, WHITNE FOR | Use "Mem[Reg[R1]]" to denote, for example, the value fetched by the first load, "Reg[R1]" to denote the value held in register R1, and #8 to denote the value 8. Indicate in the tables below the state of the reservation stations and of the registers during clock cycle 7. | Name | Busy | | Vj | Vk | Qj | Qk | |------------|------|-------|--------------|-----------------|------------|------------| | MULT/DIV-1 | V | HULTD | MAT[REG[RI]] | HETI[REG[RI]+4] | l | | | MULT/DIV-2 | V | DIVD | | REG [FG] | HULT/OIV-1 | | | MULT/DIV-3 | | | | | | | | ADD/SUB-1 | V | SUBP | | | MULT/DIV-Z | MULT/DIV-1 | | ADD/SUB-2 | V | APAD | | | HULT/DIV-1 | MULT/DIV-1 | | ADD/SUB-3 | | | | | | | | INT-1 | | | | | | | | INT-2 | | | | | | - | | INT-3 | | | | | | | | Field | F0 | F2 | F4 | F6 | F8 | R1 | R2 | R3 | R4 | |-------|----|------------|------------|-----------|----|----|----|----|----| | Qi | | MULT/DIV-1 | HULT/DIV-2 | ADD/SUB-2 | | | | | | Indicate in the table below the state of the reservation stations and of the store buffers during clock cycle 14. | Name | Busy | | Vj | Vk | Qj | Qk | |------------|------|------|----------|----------|------------|---------------------------------------| | MULT/DIV-1 | | | | | | | | MULT/DIV-2 | V | DIVD | PEG [F2] | REG [F6] | | | | MULT/DIV-3 | | * | | | -/ | | | ADD/SUB-1 | | SUBD | | | MULT/DIV-2 | | | ADD/SUB-2 | | ADDD | REG[F2]. | REG [F2] | | | | ADD/SUB-3 | | | | | | · · · · · · · · · · · · · · · · · · · | | INT-1 | | | | | | | | INT-2 | | | | | | | | INT-3 | | | | | | | | Field | Store-1 | Store-2 | Store-3 | Store-4 | |---------|------------|---------|---------|---------| | Qi | HULT/PIV-2 | | | | | Busy | | | | | | Address | 8+ REGERI] | | | | - Explain in 20 words or less the motivation and the function of a reorder buffer: - TAVES IT POSSIBLE TO EXCUTE INSTRUCTIONS SPECULATIVELY, I.E. BEFORE A BRANCH OUT COME IS KNOWN, HOLDING RESULTS BEFORE WRITING THEM. (INSTRUCTIONS HAVE AN EXTRA "COMMIT" STEP), ## SECTION 5: Branch predictors (2 questions) Consider this nested loop: for (i = 0; i < 100; ++i) $$\frac{100}{100}$$ for (j = 0; j < 10; ++j) $\frac{1000}{100}$ if (j % 2 == 0) $\frac{1000}{100}$ if (i % 2 == 0) $\frac{500}{100}$ a[j] = 0; else A machine has a 2-bit branch predictor. Estimate how many correct and how many incorrect predictions are made while executing this code, assuming a 100% hit rate and no clashes in the predictor table (Hint: start with the inner most code and work your answer step by step, giving just a number will earn you no marks). | · · · · · · · · · · · · · · · · · · · | HIS PREPICTIONS | CORRECT PREPICTIONS | |--------------------------------------------------------|-----------------|------------------------| | INVER "IF" VISITED 500 THES WITH PATTERN TITT NUNNN T. | 100 | 400 | | FIRST "IF" VISITED 1000 TIMES WITH PATTERN THIN | 500 | 500 | | INNER "FOR" VISITED 1000 TIMES EXITS 100 TIMES | 100 | 900 | | OUTER "FOR" VISITED 100 TIMES EXITS ONCE | 1 | 94 | | 2600 VISITS OF WHICH | 701 | 1899 | | THE | SPREDICTIONS | CORRECT<br>PREDICTIONS | Recall that a (1,2) predictor refers to a correlating predictor that implements correlation by choosing among two different 2-bit predictors. We consider a certain sequence of code that causes three branches B1, B2, B3 to be visited with the outcomes listed in the table (T taken, N not taken, hint, there is no obvious pattern). Each branch is visited 4 times. Assume that prior entering the sequence, all the predictors are in the state that would result from a series of taken branches and this applies to both predictors of each branch. LAGI BRANCH TAKEN NOT TAKEN | | Branch | | B1 | B2 | В3 | В2 | В1 | В3 | B1 | В3 | В2 | В3 | B1 | B2 | |---|-------------|---|-----|------|-------|----------------|-----|-------|----------------|------|------|-------|-------|-----| | - | Outcome | , | w N | R T | w N | w <sup>N</sup> | w N | R T | w <sup>N</sup> | R T | w N | T | P N | R T | | | Predictor 1 | | PTO | Рто | PTO 1 | PTO | PT1 | PTI | PNO | PTI | PTO+ | PTI | PNO 1 | PTI | | | Predictor 2 | | PTO | PTO) | PTO | PTI 2 | PTO | PTOIL | PT1 | PTON | PTI | PTO 1 | PTI | PTO | Work out the number of correct and incorrect predictions for each of the three branches. This can be done by filling in the table above. There is no need to restate the details of 2-bits predictors but nevertheless indicate the functions you assigned to Predictor 1 and to Predictor 2. To clarify your work, use the symbols PTO, PT1, PNO, PN1 to refer to the states of a 2-bit predictor and specify concisely what they mean. DEF'S END OF SECTION 5 (2 questions, 18 so-far) ## SECTION 6: Loop level parallelism (3 questions) Consider this loop: Rewrite the loop so it becomes parallel. Solve this problem in two different ways: · First use software renaming, not changing the structure of the loop: ``` LOOP CAPALED COUTN'T DEPENDENCY SI OVERWRITES SI VIA ali], ali-i] LOOP CARRIED ANTI-DEPENDENCY SI HUST READ BEFORE SI WRITES HOWEVER ali-i] Always written, Never READ, CALL IT T(i-i) for ( - ) { T(i-i] = c(i-i] + m; EACH ITERATION. IS b(i) = m + c(i); INDEPENDENT FROM a(i) = a(i) + b(i); THE PREVIOUS ONE ``` Second transform the loop without renaming the variables so it becomes parallel: ``` ONLY ONE DATA DEPENDENCE, S2 \rightarrow S3 START WITH SECOND STATEMENT a[0] = c[0] + m; OR a[0] = c[0] + m; OR a[0] = c[0] + m; a[0] = a[0] + b[0]; b[0]. Consider now this loop: Consider now this loop: a[0] = c[0] + m; a[0] = a[0] + b[0]; a[0] = a[0] + b[0]. ``` · Can it be transformed to become parallel? Explain. ``` NOMINALLY NO, BECAUSE EACH ITERATION DEPENDS ON THE PREVIOUS DNE. HOWEVER, SINCE TT'S A RUNNING SUM, WITH PROFOUND TRANSFORMATIONS, PARTIAL SUMS COULD BE COMPUTED IN PARALLEL. ``` END OF SECTION 6 (3 questions, 21 question so-far) ## SECTION 7: Memory Hierarchy (4 questions) Consider this hierarchical memory design, along with the following facts: It is word addressed and the words are 32 bit long. The physical memory has a capacity of 1 Megabyte. The cache blocks have 4 words. The page size is 1 kByte Set Associative Cache (4 sets of 4 blocks), Write Back, Write Allocate Fill all the missing field size specifications (< >) on the diagram above. - Consider a benchmark such that all the addresses in the address space are sequentially read once for the machine described above. It starts from a blank cache. - What is the number of cache misses? NO LOCALITY, SO ONE CACHE MISS PER BLOCK APPRESS SPACE = $$2^{32}$$ ; SIZE OF BLOCK = $2^2$ WORDS $$\Rightarrow 2^{30}$$ CACHE MISSES What is the number of page faults? - · Answer these yes/no questions - o "A victim cache" is a method to reduce the miss rate in caches [yes/no]: - ♦ Software pipelining is used to minimize the number of page faults [yes/no]: NO (TLP) - ♦ "Loop fusion" is a software technique to speed up recursive code [yes/no]: NO SETTER USE OF CACHE) - "Interleaved Memory" makes it possible to pipeline block replacement in caches [yes/no]: "YES END OF SECTION 7 (4 questions) AND END OF EXAM (and end of the 25 questions)