Pipelining in processors

2026-06-15

How pipelining in architecture works and improves efficiency.

Pipelining in processors

Executing one instruction involves several steps: instruction fetch (IF), instruction decode (ID), register read (RD), execute (EX), data access (DA), and write back (WB). Without pipelining, the minimum clock cycle time is set by the slowest instruction, which has to go through all of these steps within a single cycle.

Instead, we make the clock cycle as long as the slowest of these individual steps. We can do that because we add buffers between the stages of the processor.

This shortens the clock cycle. We can also overlap the stages of different instructions, because they may not depend on each other. What happens if they do? Problems.
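As a rough sketch of the arithmetic, the snippet below compares the two clocking schemes. The stage latencies are made-up illustrative numbers, and the model assumes no buffer overhead, no stalls, and no dependencies between instructions.

```python
# Hypothetical per-stage latencies in nanoseconds (illustrative only).
STAGE_LATENCY_NS = {"IF": 2, "ID": 1, "RD": 1, "EX": 3, "DA": 2, "WB": 1}


def unpipelined_cycle(latencies):
    """The whole instruction runs in one cycle, so the cycle covers every stage."""
    return sum(latencies.values())


def pipelined_cycle(latencies):
    """Each stage gets its own cycle, so the slowest stage sets the clock."""
    return max(latencies.values())


def total_time(latencies, n_instructions, pipelined):
    """Total time in ns for n independent instructions, ignoring stalls."""
    if not pipelined:
        return n_instructions * unpipelined_cycle(latencies)
    # The first instruction needs one cycle per stage; every later
    # instruction finishes one cycle after the one before it.
    cycles = len(latencies) + n_instructions - 1
    return cycles * pipelined_cycle(latencies)


print(unpipelined_cycle(STAGE_LATENCY_NS))       # 10
print(pipelined_cycle(STAGE_LATENCY_NS))         # 3
print(total_time(STAGE_LATENCY_NS, 100, False))  # 1000
print(total_time(STAGE_LATENCY_NS, 100, True))   # 315
```

With these numbers a single instruction actually takes longer in the pipelined design (6 cycles of 3 ns), but throughput improves because a new instruction can finish every cycle.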

Out of order write-back problems

Suppose instruction A takes 5 pipeline stages and instruction B takes 3, and B is issued right after A. Then B will have written back into the registers before A is finished. If an interrupt happens after B has finished but before A has, then on returning from the interrupt execution resumes at A, so B is executed a second time. This can lead to wrong values being stored in the registers.
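The timing of this scenario can be worked through with a small sketch. It assumes, hypothetically, that A is issued in cycle 1, B in cycle 2, and that every stage takes exactly one cycle; the function names are illustrative.

```python
def write_back_cycle(issue_cycle, n_stages):
    """Cycle in which an instruction completes its last stage (write back)."""
    return issue_cycle + n_stages - 1


# Assumed issue cycles: A first, B right after it.
A_DONE = write_back_cycle(issue_cycle=1, n_stages=5)
B_DONE = write_back_cycle(issue_cycle=2, n_stages=3)


def written_back_at(cycle):
    """Which instructions have already updated the registers by `cycle`."""
    return {"A": A_DONE <= cycle, "B": B_DONE <= cycle}


print(A_DONE, B_DONE)      # 5 4
print(written_back_at(4))  # {'A': False, 'B': True}
```

An interrupt taken at the end of cycle 4 sees B's result in the registers but not A's, which is exactly the inconsistent state described above.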

Superscalar Pipelining

Basically several scalar pipelines running in parallel. They are called s-issue pipelines.

An s-issue pipeline can execute s instructions in parallel.
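To get a feel for what the issue width buys, here is a minimal sketch of the ideal cycle count. It assumes no dependencies, no stalls, and one cycle per stage, so it is an upper bound on the benefit rather than a realistic estimate; `ideal_cycles` is an illustrative name.

```python
import math


def ideal_cycles(n_instructions, n_stages, issue_width):
    """Best-case cycles for n instructions on an s-issue pipeline."""
    # Instructions enter the pipeline in groups of `issue_width`.
    groups = math.ceil(n_instructions / issue_width)
    # The first group needs one cycle per stage; each later group
    # completes one cycle after the previous one.
    return n_stages + groups - 1


print(ideal_cycles(12, 6, issue_width=1))  # 17
print(ideal_cycles(12, 6, issue_width=3))  # 9
```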

There are also diversified pipelines, which have multiple different execute stages in parallel. This is because some operations occupy the execute stage for more cycles than others, like store instructions or floating point operations.

Unlike the buffers in single-pipeline processors, the buffers in superscalar pipelines are multi-entry and multi-exit. Instructions can also leave these buffers out of order, except for one special buffer called the reorder buffer.

Before the parallel execution stages in a dynamic pipeline, there is a dispatch buffer that takes its entries from the instruction decode stage. Up to this point everything is in program order.

From the dispatch buffer, instructions can complete either in order or out of order, depending on the execution time of the pipeline each one takes. Once an instruction has gone through its execution pipeline, the intermediate result is stored in the reorder buffer.

The reorder buffer may have to be significantly larger than the other buffers, because situations may arise where a significant number of write backs are waiting for an earlier instruction to complete.

For example, suppose a floating point op takes 5 clock cycles in the execution stage and an add takes 1 clock cycle. Before the second floating point operation is over, the 4 adds will already be done. In a 3-issue processor this exceeds the look-ahead you would usually need, and therefore the reorder buffer must be significantly larger than the other buffers.
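A simplified model makes the buildup visible. The latencies come from the example above; everything else is an assumption for illustration: the 5-instruction program is hypothetical, dispatch never stalls, and any number of finished instructions can retire in the same cycle as long as program order is kept.

```python
# Execute-stage latencies in cycles, taken from the example above.
LATENCY = {"fp": 5, "add": 1}


def max_waiting_results(program, issue_width):
    """Peak number of finished results stuck behind an older instruction."""
    # Instruction i is dispatched in cycle i // issue_width and its result
    # becomes available LATENCY cycles later.
    finish = [i // issue_width + LATENCY[op] for i, op in enumerate(program)]
    # In-order retirement: nothing retires before the instruction ahead of it.
    retire = []
    for f in finish:
        retire.append(max(f, retire[-1]) if retire else f)
    # Count results that are finished but not yet retired in each cycle.
    return max(
        sum(1 for f, r in zip(finish, retire) if f <= t < r)
        for t in range(max(retire) + 1)
    )


program = ["fp", "add", "add", "add", "add"]  # hypothetical instruction mix
print(max_waiting_results(program, issue_width=3))  # 4
print(max_waiting_results(program, issue_width=1))  # 3
```

In this model a single slow floating point op at the head keeps all four finished adds parked in the reorder buffer on the 3-issue machine, while a program made only of adds never has to wait.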

Stages of a superscalar pipeline

Instruction Fetch (IF): In an s-issue pipeline, fetch s instructions at once and increment the program counter accordingly.
Instruction Decode (ID): Unlike in scalar processors, this stage decodes the instructions but does not read the register contents. It also identifies dependencies between instructions.
Instruction Dispatch: This is done based on the availability of the ALU pipeline, the order established by the instruction decode stage, and the availability of the operands.

The dispatch buffer can be a single continuous buffer or split into multiple buffers.

PROS of singular: Better utilization of resources. Stalls do not happen as easily, because the share of the buffer that one type of instruction can take isn't fixed.

CONS of singular: Makes the hardware much more complicated, as the buffer now needs to be multi-ported. It is also very bulky, as it behaves somewhat like a fully associative cache.

PROS of distributed: Simple hardware. Single-ported, multi-entry buffers. Each buffer holds only one type of instruction.

CONS of distributed: There may be more stalls now, as one of the buffers may run out of space even while the others still have free entries.

Hybrid is also possible. Intel does that.
Execution Stage (EX): Utilization can be improved by having specialized hardware for certain operations. This stage also supports SIMD, which takes the addresses of 3 fixed-length buffers and performs the operation in parallel across the entire buffer at once in one clock cycle. This is a pipeline in and of itself.
Instruction Completion and Retiring (WB): There is a reorder buffer. It waits for the instruction results to arrive and puts them back in program order before they are written back. This is necessary to prevent the interrupt problem described earlier.

Store instructions are retired only once their data has been written to the cache.
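The in-order retirement done by the reorder buffer can be sketched as a simple queue. This is a minimal software illustration, not how the hardware is built; the class and method names are hypothetical, and details such as store handling and a fixed capacity are left out.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: results may arrive in any order, but leave in program order."""

    def __init__(self):
        self._entries = deque()  # oldest instruction on the left

    def allocate(self, tag):
        """Reserve a slot at dispatch time, in program order."""
        entry = {"tag": tag, "done": False, "value": None}
        self._entries.append(entry)
        return entry

    def complete(self, entry, value):
        """Record a result when its execution pipeline finishes."""
        entry["done"] = True
        entry["value"] = value

    def retire(self):
        """Release finished entries from the head; stop at the first unfinished one."""
        retired = []
        while self._entries and self._entries[0]["done"]:
            entry = self._entries.popleft()
            retired.append((entry["tag"], entry["value"]))
        return retired


rob = ReorderBuffer()
a = rob.allocate("A")  # slow instruction, dispatched first
b = rob.allocate("B")  # fast instruction, dispatched second
rob.complete(b, 7)     # B finishes first
print(rob.retire())    # []
rob.complete(a, 3)     # A finishes later
print(rob.retire())    # [('A', 3), ('B', 7)]
```

Because B's result stays in the buffer until A is done, an interrupt can never observe B's write back without A's.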
