MIPS Pipeline Stages

IF: Instruction Fetch
ID: Instruction Decode & Register Read
EX: Execute operation or calculate address
MEM: Access operand in data memory
WB: Write back the result into a register

Pipeline Registers

Datapath

Store data used by the “current” instruction, and pass it down step by step. Here is what they store at the end of a cycle

IF Stage, IF/ID stores
- Instruction read from InstructionMemory[PC] (Split into register numbers and 16-bit offset to be sign extended)
- PC+4
ID Stage, ID/EX stores
- Data read from register file
- 32-bit immediate value
- PC+4
- Write register number
EX Stage, EX/MEM stores
- PC+4 + Immediate $\times$ 4
- ALU result
- isZero? signal
- Data read 2 from register file
- Write register number
MEM Stage, MEM/WB stores
- ALU result
- Memory read data
- Write register number
WB Stage
- Result is written back to register file if applicable

Control

Control signals are generated in the ID stage, and passed down through the pipeline registers: WB, MEM and EX control are all passed to ID/EX, then WB and MEM are passed to EX/MEM etc.

Performance Comparison

Single Cycle Processor (1 instr takes 1 cycle)

Cycle time is the time taken by the slowest instruction: $CT_{seq} = max (\sum_{k = 1}^{N} T_{k})$
- $T_{k}$ : Time for operation in stage $k$ (e.g. IF, ID stage)
- $N$ : Number of stages
Execution time for $I$ instructions: $Time_{seq} = I \times CT_{seq}$

Multi Cycle Processor (1 instr at a time, many cycles per instr)

Cycle time is the time taken by the slowest stage: $CT_{multi} = max (T_{k})$
Execution time for $I$ instructions: $Time_{multi} = I \times CT_{multi} \times Average CPI$
- Average cycles per instruction needed since instructions take different number of cycles

Pipelining Processor

Cycle time is time taken by slowest stage + overhead for pipelining: $CT_{pipeline} = max (T_{k}) + T_{d}$
- $T_{d}$ : Overhead for pipelining e.g. pipeline register
Cycles needed for $I$ instructions: $I + N - 1$ (need $N - 1$ cycles to fill up the pipeline)
Execution time for $I$ instructions: $Time_{pipeline} = (I + N - 1) \times CT_{pipeline}$
Analysing ideal case:
- Every stage takes same amount of time → $\sum_{k = 1}^{N} T_{k} = N \times T_{1}$
- No pipeline overhead → $T_{d} = 0$
- $I >> N$ Number of instructions executed much greater than number of stages
- Pipeline speedup = $Time_{seq} / Time_{pipeline} = I \times N \times T_{1} / (I + N - 1) \times T_{1} \approx I \times N \times T_{1} / I \times T_{1} = N$
- So pipeline processor can have a $N$ times speedup, where $N$ is the number of stages

Hazards

A hazard is something that prevents the next instruction from immediately executing after the previous one

Structural

Simultaneous use of hardware resource

Reading instruction from memory at IF stage access memory at the same time as MEM stage of previous instruction
- Solved by either: delaying the instruction by 3 clock cycles
- or having a separate instruction memory and data memory (this is what we have)
Writing to register in WB stage at the same time as reading in ID stage
- Solved by splitting the clock cycle in half, write in first half, read in second half
- This order is used because the later instruction could depend on the data written by the earlier one

Data

Dependencies between instructions (instruction reading from previous instruction’s result) WAR & WAW dependencies exist, but will not cause problem here

RAW read after write: instruction reads register written by previous instruction (aka true data dependency)
- e.g. add $1, $2, $3 then sub $4, $1, $5 the issue is $1
- Depending on non-lw instruction: required data is already available, just not written to register yet, so it can be forwarded from one stage to another
- Depending on lw instruction: data is not available when required, so delay execution, then forward data
- Examples

sub $2,  $1, $3
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)

Without forwarding: With forwarding:

lw  $2,  20($3)
and $12, $2, $5
or  $13, $6, $2
add $14, $2, $2
sw  $15, 100($2)

Without forwarding: With forwarding:

Control

When there is a change in program flow (branch) When there is a branch instruction, decision whether to branch is made in the MEM stage. By then, the subsequent instructions could have partially executed

Solution 1: Stall pipeline (by 3 cycles)
- This is bad because many instructions are branch instructions, so this will lead to a big slow down
Solution 2: Early branch resolution (compute branch decision early)
- Add a comparator to the ID stage, so that the branch result can be known after ID stage
- Then only need to stall for 1 cycle while waiting for ID stage of branch instruction
- But the register that decides whether to branch might be written by the previous instruction, causing issues
  - So, add forwarding path from EX to ID stage, then add 1 cycle of delay so the register value is available
  - If previous instruction is lw, need to add forwarding path from MEM to ID, and more delay, but this is the same 3 cycle delay as no early branch resolution
  - Summary: branch → 1 cycle delay. Depend on previous instruction → 2 cycle delay. Depend on previous lw instruction → 3 cycle delay
Solution 3: Branch prediction (only simplest implementation in 2100: predict that branch is not taken)
- Right prediction: no stall, since we can execute as per normal
- Wrong prediction: stop executing, flush the pipeline and execute the correct instruction in the next cycle (1 cycle delay with early branch resolution, 3 cycle delay without)
Solution 4: Delayed branch Move instructions that do not affect the branch (non-control dependent instructions) to after the branch instruction. They can be executed while waiting for the branch to be decided
- The number of instructions that can be moved to after the branch: Branch-Delay slot (early branch: 1, no early branch: 3)
- Best case: there is an instruction that can be moved into branch delay slot
- Worst case: nothing can be moved there, so add a nop instruction in the slot
- The compiler has to do this

or  $8, $9, $10 //does not affect branch
add $1, $2, $3
sub $4, $5, $6
beq $1, $4, Exit
xor $10, $1, $11
Exit:

becomes

add $1, $2, $3
sub $4, $5, $6
beq $1, $4, Exit
or  $8, $9, $10
xor $10, $1, $11
Exit:

🪴 Notes Hehe

Explorer

Pipelining

MIPS Pipeline Stages

Pipeline Registers

Datapath

Control

Performance Comparison

Single Cycle Processor (1 instr takes 1 cycle)

Multi Cycle Processor (1 instr at a time, many cycles per instr)

Pipelining Processor

Hazards

Structural

Data

Control

Graph View

Table of Contents

Backlinks

Explorer