How does pipelining improve the throughput of the 8086?

Here: as many instructions as possible should be executed per unit of time. Throughput.

Transcript

1 Pipelining with the DLX processor. Pipelining: an implementation technique widely used in computer architecture. Pipelining makes CPUs fast. Pipelining is like assembly-line processing: the stages of the pipeline execute one behind the other (pipe stages). Comparison with car production: it maximizes throughput, i.e. as many cars as possible are produced per hour, even though the production of a single car can take many hours. Here: as many instructions as possible should be executed per unit of time. Throughput. Since the stages of the pipeline hang one behind the other, the forwarding of the cars (the instructions) must be synchronous, i.e. clocked. Otherwise there would be a jam somewhere.

2 For us, the cycle for this transfer will be the machine cycle. The length of the machine cycle is therefore determined by the maximum processing time of the units in the pipeline for one computation step. The pipeline designer should therefore strive to design all stages so that processing takes about the same time in each stage. Ideally, the cycle time of a machine with pipelining is

    time for processing without pipelining / number of stages

As a result, the speedup through pipelining is at most equal to the number of stages. This corresponds to car production, in which an n-stage pipeline can produce n times as many cars.
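The two statements above (clock set by the slowest stage, speedup bounded by the number of stages) can be checked with a tiny calculation. The stage latencies below are invented for illustration, not measured DLX figures:

```python
# Hypothetical per-stage latencies in ns (illustrative values only).
stage_ns = {"IF": 1.0, "ID": 0.8, "EX": 1.0, "MEM": 1.2, "WB": 0.7}

unpipelined_ns = sum(stage_ns.values())        # one instruction, no overlap
cycle_ns = max(stage_ns.values())              # clock set by the slowest stage
ideal_cycle_ns = unpipelined_ns / len(stage_ns)  # perfectly balanced stages

speedup = unpipelined_ns / cycle_ns            # throughput gain, no overhead
print(cycle_ns, round(ideal_cycle_ns, 2), round(speedup, 2))  # → 1.2 0.94 3.92
```

Because the stages are not balanced, the speedup (3.92) stays below the stage count (5); only with perfectly equal stage times would it reach 5.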

3 Pipelining can
- reduce the CPI, or
- reduce the cycle time, or
- both.
If the original machine needed several cycles to execute an instruction, pipelining reduces the CPI. If the original machine used one long clock cycle per instruction, pipelining reduces the clock cycle time. Pipelining exploits parallel processing within the processor. Advantage: it is invisible to the user.

4 The DLX Pipeline. We will study the problems of pipelining on the DLX because it is simple and has all the essential features of a modern processor. First we assume a version of the DLX without pipelining:
- not optimal, but designed so that it can easily be converted into a version with pipelining;
- the execution of each instruction takes 5 clock cycles.

5 [Figure: DLX data path with clock cycles (without pipelining). Stages: Instruction fetch, Instruction decode / register fetch, Execute / address calculation, Memory access, Write back. Components: PC, instruction memory, IR, register file, temporary registers A, B, Imm and NPC, ALU with Output register, Zero? test ("branch taken", Cond), data memory, LMD, 16-to-32-bit sign extension, and several multiplexers.]

6 1. Instruction fetch cycle (IF):
IR <- Mem[PC]
NPC <- PC + 4
Fetch the value from memory at the address in the current PC and store it in the IR (instruction register). Increase the PC by 4, because that is where the next instruction is. Save this value in NPC (new PC).
2. Instruction decode / register fetch cycle (ID):
A <- Regs[IR6..10]
B <- Regs[IR11..15]
Imm <- ((IR16)16 ## IR16..31)
Decode the instruction in IR and fetch the register values. The register values are first latched temporarily in A and B for use in the following clock cycles. The lower 16 bits of IR are sign-extended (the sign bit IR16 replicated 16 times and concatenated with the lower 16 bits) and stored in the temporary register Imm.

7 Decoding and register reading are carried out in one cycle. This is possible because the register address bits are in fixed positions (6..10 and 11..15). It may be that we read a register that we don't even need, but that does no harm: it is simply not used. The immediate operand is extended by 16 copies of the sign bit and is also read, in case it is needed in the following cycle.
3. Execution / effective address cycle (EX): Four different operations can be carried out here, depending on the DLX instruction type:
Instruction with memory access (load/store):
Output <- A + Imm
The effective address is calculated and stored in the temporary register Output.

8 Register-register instruction:
Output <- A func B
The ALU applies the operation func (which is specified in bits 0..5 and the opcode) to A and B and saves the result in the temporary register Output.
Register-immediate instruction:
Output <- A op Imm
The ALU applies the operation op (which is specified in bits 0..5 of the opcode) to A and Imm and saves the result in the temporary register Output.
Branch instruction:
Output <- NPC + Imm
Cond <- (A op 0)
The ALU calculates the address of the jump destination relative to the PC. Cond is a temporary register in which the result of the condition is stored. op is specified in the opcode and is, e.g., "equal to zero" for BEQZ and "not equal to zero" for BNEZ.

9 Because of the load/store architecture, it never happens that a memory address for a data access has to be calculated and an arithmetic operation has to be carried out at the same time. Therefore the execution and effective-address phases can share the same cycle.
4. Memory access / branch completion cycle (MEM): The instructions active in this cycle are loads, stores and branches:
Load instruction:
LMD <- Mem[Output]
PC <- NPC
The word addressed by Output is written into the temporary register LMD.
Store instruction:
Mem[Output] <- B
PC <- NPC
The value from B is stored in memory at the address in Output.

10 Branch instruction:
if Cond then PC <- Output else PC <- NPC
If the branch is taken, the PC is set to the jump target stored in Output, otherwise to NPC. Whether or not the branch is taken was written to Cond in the execute cycle.
5. Write-back cycle (WB):
Register-register instruction: Regs[IR16..20] <- Output
Register-immediate instruction: Regs[IR11..15] <- Output

11 Load instruction: Regs[IR11..15] <- LMD
The result is written into the register. It comes either from the ALU (Output) or from memory (LMD). The destination register can be coded in two places in the instruction, depending on the opcode.
Explanation of the data path figure: At the beginning and at the end of each cycle, every result is in some storage element, either in a GPR, in a main memory location, or in a temporary register (LMD, Imm, A, B, NPC, Output, Cond). The temporary registers only hold their values during the current instruction, while all other storage elements are visible parts of the processor state.
In this implementation, branch instructions and store instructions take four cycles and all other instructions take five cycles. If you assume a branch frequency of 12% and a store frequency of 8% (as in the investigations in the last chapter), you get a CPI of 4.80.
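The CPI of 4.80 quoted above follows directly from the instruction mix; a quick check:

```python
# Unpipelined DLX: branches (12%) and stores (8%) take 4 cycles,
# everything else (80%) takes 5 cycles.
mix = {"branch": (0.12, 4), "store": (0.08, 4), "other": (0.80, 5)}
cpi = sum(freq * cycles for freq, cycles in mix.values())
print(round(cpi, 2))  # → 4.8
```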

12 We can now rebuild the machine, with almost no changes, so that we can start executing a new instruction in every cycle. This is called pipelining. It results in an execution pattern as shown on the following slide.

13 [Figure: Simplified representation of the DLX data path. Program execution (in instructions) is plotted against time (in cycles CC 1 .. CC 9); each instruction flows through instruction fetch, register read, ALU, data memory (DM) and register write-back, shifted by one cycle relative to its predecessor.]

14 Pipeline diagram (clock cycles run left to right):

    instr i     IF  ID  EX  MEM WB
    instr i+1       IF  ID  EX  MEM WB
    instr i+2           IF  ID  EX  MEM WB
    instr i+3               IF  ID  EX  MEM WB
    instr i+4                   IF  ID  EX  MEM WB
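Diagrams like the one above are easy to generate mechanically; the helper below (name and layout invented for illustration) shifts each instruction's stage list right by one cycle:

```python
# Prints a simple pipeline diagram for n back-to-back instructions.
def pipeline_diagram(n, stages=("IF", "ID", "EX", "MEM", "WB")):
    rows = []
    for i in range(n):
        cells = [""] * i + list(stages)   # instruction i enters IF in cycle i+1
        rows.append(f"instr i+{i}: " + " ".join(f"{c:>4}" for c in cells))
    return "\n".join(rows)

print(pipeline_diagram(3))
```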

15 Pipelining is not as easy as it first appears here. In the following we want to cover the problems associated with pipelining. Different instructions must use different parts of the hardware at any given time. To check this, we use the simplified representation of the DLX data path, shifted against itself according to the pipeline. Difficulties arise in three places:
1. The memory is accessed in the IF cycle and in the MEM cycle. If this were physically the same memory, both accesses could not take place in one clock cycle. We therefore use two different caches: a data cache that is accessed in the MEM cycle and an instruction cache that we use in the IF cycle. By the way: the memory in the pipelined version has to supply five times as much data per unit of time as in the simple version. The resulting memory bottleneck is the price for the higher performance.
2. The registers are used in the ID and in the WB cycle. You will see later in your studies how to do two reads and one write in one cycle.

16 3. The PC. We don't have the PC in the simplified representation of the data path. The PC must be incremented in every cycle, in the IF phase. The problem: a branch only changes the PC in the MEM phase, but incrementing already happens in the IF phase. How to deal with this is shown later.
Every stage is active in every cycle. Any combination of simultaneously active stages must be possible. Values that are passed on between two stages of the pipeline must be stored in registers (latched). These registers can be seen on the following slides. They are named after the stages between which they lie.

17 [Figure: DLX data path with clock cycles (without pipelining) — repeated from slide 5.]

18 [Figure: DLX data path with the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB between the stages. The fields IR, NPC, A, B, Imm, Output, cond and LMD travel through these latches (e.g. EX/MEM.Output, MEM/WB.IR, MEM/WB.LMD), with multiplexers selecting the PC source, the two ALU inputs and the write-back value; the Zero? test ("branch taken") and the 16-to-32-bit sign extension appear as before.]

19 All temporary registers from our first draft can now be absorbed into these pipeline registers. The pipeline registers, however, hold data and control information. Example: the IR information must move along with the stages of the pipeline so that, for example, in the WB phase the result is written to the correct register, the one belonging to the old instruction. Each switching step happens in one cycle: the inputs are taken from the register before the corresponding stage and the outputs are written into the register after the stage. The following slide shows the activities in the individual stages under this view.

20 Stage: what is done

IF:
    IF/ID.IR <- Mem[PC];
    IF/ID.NPC, PC <- (if EX/MEM.cond then EX/MEM.Output else PC + 4);

ID:
    ID/EX.A <- Regs[IF/ID.IR6..10]; ID/EX.B <- Regs[IF/ID.IR11..15];
    ID/EX.NPC <- IF/ID.NPC; ID/EX.IR <- IF/ID.IR;
    ID/EX.Imm <- (IF/ID.IR16)16 ## IF/ID.IR16..31;

EX, ALU instruction:
    EX/MEM.IR <- ID/EX.IR;
    EX/MEM.Output <- ID/EX.A func ID/EX.B; or EX/MEM.Output <- ID/EX.A op ID/EX.Imm;
    EX/MEM.cond <- 0;
EX, load or store instruction:
    EX/MEM.IR <- ID/EX.IR;
    EX/MEM.Output <- ID/EX.A + ID/EX.Imm;
    EX/MEM.cond <- 0; EX/MEM.B <- ID/EX.B;
EX, branch instruction:
    EX/MEM.Output <- ID/EX.NPC + ID/EX.Imm;
    EX/MEM.cond <- (ID/EX.A op 0);

MEM, ALU instruction:
    MEM/WB.IR <- EX/MEM.IR; MEM/WB.Output <- EX/MEM.Output;
MEM, load or store instruction:
    MEM/WB.IR <- EX/MEM.IR;
    MEM/WB.LMD <- Mem[EX/MEM.Output]; or Mem[EX/MEM.Output] <- EX/MEM.B;

WB, ALU instruction:
    Regs[MEM/WB.IR] <- MEM/WB.Output;
WB, load instruction:
    Regs[MEM/WB.IR] <- MEM/WB.LMD;
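The latch-to-latch copying in the table can be mimicked with a toy model. This is a simplified sketch under strong assumptions: all names (run, if_id, ...) are invented, only independent register-register additions are handled, and there is no hazard or forwarding logic:

```python
# Toy model of the pipeline latches for register-register ALU additions.
# One loop iteration is one clock cycle; stages are evaluated back to front
# so that every latch is read before it is overwritten. The IR travels along
# in the latches so that WB still knows the destination register.
def run(program, cycles):
    regs = {f"R{i}": i for i in range(8)}         # tiny register file: Ri = i
    if_id = id_ex = ex_mem = mem_wb = None
    pc = 0
    for _ in range(cycles):
        if mem_wb:                                # WB: write back the result
            regs[mem_wb["ir"][1]] = mem_wb["out"]
        mem_wb = ex_mem                           # MEM: pass-through for ALU ops
        if id_ex:                                 # EX: compute on latched operands
            ex_mem = {"ir": id_ex["ir"], "out": id_ex["a"] + id_ex["b"]}
        else:
            ex_mem = None
        if if_id:                                 # ID: read the source registers
            _, _, s1, s2 = if_id["ir"]
            id_ex = {"ir": if_id["ir"], "a": regs[s1], "b": regs[s2]}
        else:
            id_ex = None
        if_id = {"ir": program[pc]} if pc < len(program) else None  # IF
        pc += 1
    return regs

final = run([("ADD", "R1", "R2", "R3"), ("ADD", "R4", "R5", "R6")], 7)
print(final["R1"], final["R4"])  # → 5 11
```

The two instructions are deliberately independent; with a dependent pair this model would read a stale register value, which is exactly the data-hazard problem discussed later.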

21 The activities in the first two stages are not instruction-dependent. That has to be the case, because we can only interpret the instruction at the end of the second stage. Due to the fixed bit positions of the operand registers in the IR, decoding and register reading are possible in one phase.
In order to control the flow in this simple pipeline, the four multiplexers in the diagram must be steered:
- upper ALU-input mux: branch or not;
- lower ALU-input mux: register-register instruction or not;
- IF mux: EX/MEM.cond;
- WB mux: load or ALU operation.
There is a fifth mux (not shown) that selects where in MEM/WB.IR the address of the destination register is to be found, namely bits 16..20 in the case of a register-register instruction and bits 11..15 for an immediate or load instruction.

22 Performance improvement through pipelining. Throughput is improved, not the execution time of an individual instruction. On the contrary: through the pipeline overhead, through the fact that the stages are never perfectly balanced (the clock is determined by the slowest stage), and through the additional latches with their setup and switching times, the dwell time of an individual instruction in the processor is usually longer. But overall: programs run faster, even though not a single instruction runs faster individually.

23 Example: DLX without a pipeline: four cycles for ALU and branch operations, five for memory operations. Instruction mix: ALU 40%, branch 20%, load/store 40%. 1 GHz clock. Suppose the pipeline overhead costs 0.2 ns more per stage. What speedup do we get through the pipeline?

Average execution time for an instruction = cycle time * average CPI = 1 ns * ((40% + 20%) * 4 + 40% * 5) = 4.4 ns.

In the pipelined version, the clock runs with the cycle time of the slowest stage plus overhead, i.e. 1 ns + 0.2 ns = 1.2 ns.

speedup = time for unpipelined execution / time for pipelined execution = 4.4 ns / 1.2 ns = 3.67
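The same calculation, spelled out:

```python
# Speedup example: unpipelined average instruction time vs. pipelined clock.
cycle_ns = 1.0                       # 1 GHz base clock
mix = {"alu": (0.40, 4), "branch": (0.20, 4), "loadstore": (0.40, 5)}
avg_unpipelined_ns = cycle_ns * sum(f * c for f, c in mix.values())
pipelined_cycle_ns = cycle_ns + 0.2  # slowest stage plus latch overhead
speedup = avg_unpipelined_ns / pipelined_cycle_ns
print(round(avg_unpipelined_ns, 2), round(speedup, 2))  # → 4.4 3.67
```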

24 So far the pipeline would work fine for sequences of independent integer instructions. In reality, however, instructions depend on one another. This problem and that of the processing of floating-point instructions are discussed below.
Pipeline hazards: There are cases in which an instruction in the pipeline cannot be executed at the usual rate. These are called hazards. There are three types:
- Structural hazards occur when the hardware cannot perform the combination of two operations that are supposed to run at the same time. Example: a write to memory (MEM) simultaneously with IF when there is only one memory.
- Data hazards occur when the result of an operation is the operand of a subsequent operation, but this result is not available in time (i.e. in the ID phase).
- Control hazards occur with branches in the pipeline or other operations that change the PC.

25 Hazards lead to stalls in the pipeline. A pipeline stall is more complex than, e.g., a cache-miss stall, because some operations in the pipe have to continue while others have to be stopped. In the pipeline, all instructions that have been in the pipeline longer than the stalled one must continue to run, while all younger ones must also be stalled. If the older ones were not processed further, the stall could never be cleared. As a consequence, no new instructions are fetched as long as the stall lasts.

26 Performance of pipelines with stalls:

CPI pipelined = ideal CPI + pipeline stall clock cycles per instruction
             = 1 + pipeline stall clock cycles per instruction

If you disregard the pipeline overhead, the speedup is

speedup = CPI unpipelined / (1 + pipeline stall cycles per instruction)

The most important special case is the one in which all instructions need the same number of cycles:

speedup = pipeline depth / (1 + pipeline stall cycles per instruction)
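The formulas above as small helpers (function names invented for illustration):

```python
# Stall-aware pipeline performance, per the formulas above.
def cpi_pipelined(stall_cycles_per_instr, ideal_cpi=1.0):
    return ideal_cpi + stall_cycles_per_instr

def speedup(pipeline_depth, stall_cycles_per_instr):
    # Special case: every instruction takes pipeline_depth cycles unpipelined.
    return pipeline_depth / cpi_pipelined(stall_cycles_per_instr)

print(speedup(5, 0.0))              # → 5.0  (ideal five-stage pipeline)
print(round(speedup(5, 0.25), 2))   # → 4.0  (0.25 stall cycles per instruction)
```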

27 Structural hazards. The overlapping work on all phases of the pipeline at the same time requires that all resources are available often enough that all combinations of tasks in different stages of the pipeline can occur simultaneously. Otherwise we get a structural hazard. Typical case: a unit is not fully pipelined; then following instructions that use this unit cannot be processed at a rate of one per cycle. Another common case is that, e.g., the register file only allows one write per cycle, but two instructions in different phases both want to write back to a register at the same time. If a structural hazard is detected, the pipeline stalls, which increases the CPI to a value > 1. Another example: a machine does not have separate data and instruction caches. If data and an instruction are accessed at the same time, a structural hazard arises. The following slide shows such a case. The idle phase (caused by the stall) is commonly called a pipeline bubble, because it travels through the pipeline without carrying any meaningful work.

28 [Figure: pipeline overlap over cycles CC 1 .. CC 8 with a single memory: a Load followed by instructions 1-4, each accessing memory (Mem) in its IF and MEM phases; in CC 4 the Load's data access collides with instruction 3's instruction fetch.]

29 [Figure: the same sequence over cycles CC 1 .. CC 8 with a stall: instruction 3 is delayed by one cycle and the bubble travels through all stages; instruction 4 follows one cycle later.]

30 To represent the situation in a pipeline, some designers prefer a different form of diagram that is not as graphic, but more compact and just as informative. The following slide shows an example.

31 Clock cycles run left to right:

    Load        IF  ID  EX  MEM WB
    instr i+1       IF  ID  EX  MEM WB
    instr i+2           IF  ID  EX  MEM WB
    instr i+3               stall IF  ID  EX  MEM WB
    instr i+4                     IF  ID  EX  MEM WB
    instr i+5                         IF  ID  EX  MEM
    instr i+6                             IF  ID  EX

32 Data hazards. Let us consider the following assembler program section:

    ADD R1, R2, R3
    SUB R4, R5, R1
    AND R6, R1, R7
    OR  R8, R1, R9
    XOR R10, R1, R11

All instructions after the ADD need the result of the ADD. As you can see on the following slide, ADD only writes the result to R1 in its WB phase, but SUB already reads R1 in its ID phase. We call this a data hazard. If we don't do anything about it, SUB will read the old value from R1. Or even worse: we don't know which value SUB gets, because if an interrupt caused the pipeline to be interrupted between ADD and SUB, ADD would still be executed up to its WB phase, and when execution resumes with SUB, SUB would even get the correct operand from R1.
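Finding such hazards mechanically is straightforward. The sketch below (names invented for illustration) encodes each instruction as (opcode, destination, sources) and flags any read of a result that is fewer than 3 instructions behind the write — per the DLX timing discussed next, the third successor is safe because the register file writes in the first half of the cycle and reads in the second half:

```python
# Find RAW hazards in a straight-line instruction sequence (no forwarding).
def raw_hazards(program, safe_distance=3):
    hazards = []
    for i, (_, dst, *_srcs) in enumerate(program):
        for j in range(i + 1, min(i + safe_distance, len(program))):
            if dst in program[j][2:]:          # later instruction reads dst
                hazards.append((i, j, dst))
    return hazards

prog = [
    ("ADD", "R1", "R2", "R3"),
    ("SUB", "R4", "R5", "R1"),
    ("AND", "R6", "R1", "R7"),
    ("OR",  "R8", "R1", "R9"),
    ("XOR", "R10", "R1", "R11"),
]
print(raw_hazards(prog))  # → [(0, 1, 'R1'), (0, 2, 'R1')]
```

Only SUB and AND are flagged, matching the discussion that follows: OR is saved by the split-cycle register file and XOR reads after ADD's WB.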

33 [Figure: pipeline overlap for ADD R1, R2, R3 followed by SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11 over cycles CC 1 .. CC 6; SUB, AND and OR read R1 before or while ADD writes it back.]

34 The AND instruction suffers from the hazard as well; it too gets the wrong value from R1. The XOR works correctly, because the XOR's read comes after the WB of the ADD. We can make the OR safe by ensuring that the implementation writes the register in the first half of the clock cycle and reads it in the second half. This is indicated in the drawing by the half boxes. How do we avoid stalls for SUB and AND? The magic word is forwarding. It works as follows:
1. The ALU result (from EX/MEM) is (also) fed back to the ALU input.
2. A special forwarding logic, which watches for exactly such situations, selects the fed-back value instead of the value from the register file as the actual operand and forwards it to the ALU.
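The forwarding decision for one ALU source operand can be sketched as a small selection function. This is an illustrative model, not the real mux control: the latches are reduced to (destination register, value) pairs:

```python
# Forwarding: prefer a still-in-flight result over the register file.
# ex_mem / mem_wb are (dest_reg, value) pairs or None.
def forward(src_reg, regfile, ex_mem, mem_wb):
    if ex_mem and ex_mem[0] == src_reg:    # youngest producer wins
        return ex_mem[1]
    if mem_wb and mem_wb[0] == src_reg:
        return mem_wb[1]
    return regfile[src_reg]                # no hazard: normal register read

regfile = {"R1": 0, "R5": 7}
# ADD R1,R2,R3 just left EX with result 5; SUB now needs R1 in EX:
print(forward("R1", regfile, ex_mem=("R1", 5), mem_wb=None))  # → 5
print(forward("R5", regfile, ex_mem=("R1", 5), mem_wb=None))  # → 7
```

Checking EX/MEM before MEM/WB matters: if two in-flight instructions wrote the same register, the younger result must win.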

35 [Figure: the same instruction sequence as on the previous slide, now with forwarding paths drawn from the ALU output back to the ALU inputs, so that SUB and AND receive R1 in time.]

36 With forwarding it is noteworthy: if SUB is itself delayed, the instruction does not need the fed-back version of the result, but can read R1 from the register file as normal. The example also shows that the new result must also be kept available in fed-back form for the next-but-one instruction (AND). Forwarding is a technique for accelerating the delivery of results to the correct processing unit when the natural flow of the pipeline is insufficient for this transport. Another example:

    ADD R1, R2, R3
    LW  R4, 0(R1)
    SW  12(R1), R4

To avoid the data hazard here, we would have to bring R1 to the ALU input and R4 to the memory input in good time. The following slide shows the forwarding paths required for this problem.

37 [Figure: pipeline overlap for ADD R1, R2, R3; LW R4, 0(R1); SW 12(R1), R4 over cycles CC 1 .. CC 6, with forwarding paths from the ALU output to the ALU input and to the data-memory input.]

38 Required forwarding paths for RAW hazards. Very different data dependencies can appear between successive instructions. In order to handle all of them, the two data-path multiplexers in front of the ALU must have more than two inputs, and an additional data-path multiplexer must be inserted in front of the input to the data memory. These additional multiplexer inputs are connected to suitable values from the pipeline latches, which without these extra connections would not arrive at the ALU or the data memory in good time and would thus lead to a stall in the pipeline. These connections are called forwarding paths. The following slide shows all forwarding paths that can eliminate possible RAW hazards. For example, if the instruction sequence

    ADD R3, R2, R1
    SUB R4, R5, R3

is to be processed, there is a data dependency through the use of R3. R3 is normally not written to the register file until the WB phase of the first instruction, but is already expected there in the ID phase of the second instruction. The hardware will therefore use multiplexer input 8 to tap this value directly after the EX phase of the first instruction and to feed it into the lower data-path multiplexer in the next cycle, in the EX phase of the second instruction.

39 [Figure: DLX data path with the ID/EX, EX/MEM and MEM/WB latches, the Zero? test and the data memory, showing all forwarding paths into the ALU-input multiplexers and into the data-memory input.]

40 Classification of data hazards. A data hazard arises when two operations that are too close together access the same datum. In our examples these were always register operations. That does not have to be the case: many data hazards also occur with memory accesses, though not in the DLX pipeline, since memory accesses there are always carried out in program order. There are three types of data hazards:
RAW (read after write): This is the type we had in the examples. One operation writes; one that follows shortly afterwards tries to read the result and would still get the old value. Often these hazards can be avoided through forwarding.

41 WAW (write after write): Two consecutive operations want to write to the same location. If the first takes longer than the second, it is possible that the second writes first and the first then overwrites the newer value. This cannot happen in our simple integer DLX, but a slight modification of the DLX pipeline would make such WAW hazards possible: ALU instructions write back to the target register in the fourth phase, and memory accesses need two clocks, i.e. two MEM phases:

    LW  R1, 0(R2)    IF ID EX MEM1 MEM2 WB
    ADD R1, R2, R3      IF ID EX   WB

If write-back happens in different phases, structural hazards can also arise, because two instructions (which are in different stages of their pipeline) want write access to the same location at the same time.

42 WAR (write after read): This type is rare: an instruction reads a value, but before the read takes effect, its successor has already written a new value to this location. It is rare because it only happens when reading occurs very late in the pipeline and writing very early. This cannot happen in our simple integer DLX, but the above modification of the DLX pipeline would make such WAR hazards possible: memory accesses need two clocks, i.e. two MEM phases, and a store only performs its read access towards the end of MEM2:

    SW  0(R1), R2    IF ID EX MEM1 MEM2 WB
    ADD R2, R4, R3      IF ID EX   WB

RAR (read after read) is not a hazard.
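The three hazard types above reduce to simple set tests on the registers each instruction reads and writes. A sketch (function name and encoding invented for illustration):

```python
# Classify the dependency between an earlier and a later instruction on one
# register, given the registers each one writes and reads.
def classify(reg, first_writes, first_reads, second_writes, second_reads):
    kinds = []
    if reg in first_writes and reg in second_reads:
        kinds.append("RAW")
    if reg in first_writes and reg in second_writes:
        kinds.append("WAW")
    if reg in first_reads and reg in second_writes:
        kinds.append("WAR")
    return kinds or ["RAR or none"]

# ADD R1,R2,R3 followed by SUB R4,R5,R1 → RAW on R1:
print(classify("R1", {"R1"}, {"R2", "R3"}, {"R4"}, {"R5", "R1"}))  # → ['RAW']
# SW 0(R1),R2 followed by ADD R2,R4,R3 → WAR on R2:
print(classify("R2", set(), {"R1", "R2"}, {"R2"}, {"R4", "R3"}))   # → ['WAR']
```

Whether a classified dependency actually becomes a hazard then depends on the pipeline timing, as the WAW and WAR examples above show.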

43 Data hazards that must cause stalls. There are data hazards that cannot be defused by forwarding:

    LW  R1, 0(R2)
    SUB R4, R1, R3
    AND R6, R1, R7
    OR  R8, R1, R9

The following slide shows the problem: the LW instruction only has the value for R1 after its MEM phase, but the EX phase of the SUB instruction, which requires the value of R1, runs during that very MEM phase. A forwarding path that could prevent this would have to be able to jump backwards in time, a skill that even modern computer architects have not yet mastered. We can forward to the AND and OR operations, but not to the SUB. In this case we need a special hardware mechanism, called a pipeline interlock.
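The load-use check the interlock performs can be sketched as a scheduling pass that inserts one bubble whenever a load's destination is a source of the immediately following instruction (names and encoding invented for illustration; only the one-cycle DLX load-use case is modeled):

```python
# Insert one stall cycle after a load whose result the next instruction needs.
def insert_stalls(program):
    scheduled = []
    for i, instr in enumerate(program):
        if i > 0:
            prev = program[i - 1]
            if prev[0] == "LW" and prev[1] in instr[2:]:
                scheduled.append(("STALL",))   # one pipeline bubble
        scheduled.append(instr)
    return scheduled

prog = [
    ("LW",  "R1", "R2"),           # LW R1, 0(R2): writes R1
    ("SUB", "R4", "R1", "R3"),     # reads R1 one instruction later → stall
    ("AND", "R6", "R1", "R7"),     # one slot further: forwarding suffices
]
for entry in insert_stalls(prog):
    print(entry[0])                # prints LW, STALL, SUB, AND (one per line)
```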

44 [Figure: pipeline overlap for LW R1, 0(R2); SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9 over cycles CC 1 .. CC 5; the value of R1 only leaves the data memory (DM) at the end of LW's MEM phase, too late for SUB's EX phase.]

45 With a pipeline interlock, the pipeline stalls only the instructions from the consumer instruction of the data hazard onwards; the source instruction and all earlier instructions must be processed further. In between there is a gap, a so-called pipeline bubble, in which nothing meaningful is computed. In the example it looks like this: from the bubble onwards, everything is delayed by one cycle. In cycle four no instruction is started, and correspondingly there is later a cycle in which no instruction completes.

46 [Figure: the same sequence over cycles CC 1 .. CC 6 with the interlock: SUB, AND and OR are each delayed by one bubble cycle.]

47 Without interlock:

    LW  R1, 0(R1)    IF ID EX MEM WB
    SUB R4, R1, R5      IF ID EX  MEM WB
    AND R6, R1, R7         IF ID  EX  MEM WB
    OR  R8, R1, R9            IF  ID  EX  MEM WB

With interlock:

    LW  R1, 0(R1)    IF ID EX MEM WB
    SUB R4, R1, R5      IF ID stall EX MEM WB
    AND R6, R1, R7         IF stall ID EX MEM WB
    OR  R8, R1, R9            stall IF ID EX MEM WB

48 Example: 30% of the instructions are loads. In half of the cases, the instruction following a load requires the loaded operand. This hazard produces a delay of 1 cycle. How much slower is this machine compared to the ideal CPI = 1 machine?

CPI of a load = 0.5 * 1 + 0.5 * 2 = 1.5
CPI = 0.70 * 1 + 0.30 * 1.5 = 1.15

Answer: the ideal machine is faster by a factor of 1.15.
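The same calculation, spelled out:

```python
# Load-stall CPI example: 30% loads, half of them followed by a consumer.
load_freq = 0.30
consumer_prob = 0.5                      # half of all loads cause a 1-cycle stall
load_cpi = consumer_prob * 1 + consumer_prob * 2   # = 1.5 cycles per load
cpi = (1 - load_freq) * 1.0 + load_freq * load_cpi
print(load_cpi, round(cpi, 2))  # → 1.5 1.15
```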

49 As a rule, load stalls occur more often in integer programs than in FP programs. This is because FP programs are often structured more regularly. The time slot after a load is called the load slot. After global optimization, many of the load slots are often already occupied, but you need empty load slots for scheduling. The typical load frequency in these programs is between 19% and 34%, on average 24%. The CPI grows here by between 0.01 and 0.15 due to load stalls.