Pipelining in processors

innocentzero

2026-06-15

#hardware microarchitecture | Status: Ongoing

How pipelining in architecture works and improves efficiency.

Pipelining in processors

Executing one instruction involves several steps: instruction fetch (IF), instruction decode (ID), register read (RD), execute (EX), data access (DA) and write back (WB). Without pipelining, the minimum clock cycle time is set by the slowest instruction, since it has to get through all of these steps in a single cycle.

Instead, we make the clock cycle as long as the slowest of these individual steps. We can do that because we add buffers between the stages of the processor, so each stage can hold on to its result until the next cycle.

This makes the clock cycle shorter. We can also overlap the stages of different instructions now, one starting while another is still in flight, because they may not depend on each other. What happens when they do? Problems.
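To put rough numbers on the cycle-time argument above, here is a minimal sketch. The stage latencies are made-up illustrative values, not measurements of any real processor, and the function names are hypothetical. It also assumes an ideal pipeline with no stalls and no buffer overhead.

```python
# Hypothetical per-stage latencies in picoseconds (illustrative values only).
STAGE_LATENCY_PS = {"IF": 200, "ID": 100, "RD": 100, "EX": 250, "DA": 300, "WB": 100}


def unpipelined_cycle_ps(stages):
    """One instruction per cycle, so the cycle must cover every stage."""
    return sum(stages.values())


def pipelined_cycle_ps(stages):
    """One stage per cycle, so the cycle only has to cover the slowest stage."""
    return max(stages.values())


def total_time_ps(n_instructions, stages, pipelined):
    """Time to finish n independent instructions (assumes no stalls)."""
    if pipelined:
        # The first instruction needs len(stages) cycles to drain through;
        # every later one finishes one cycle after its predecessor.
        cycles = len(stages) + n_instructions - 1
        return cycles * pipelined_cycle_ps(stages)
    return n_instructions * unpipelined_cycle_ps(stages)


if __name__ == "__main__":
    n = 100
    slow = total_time_ps(n, STAGE_LATENCY_PS, pipelined=False)
    fast = total_time_ps(n, STAGE_LATENCY_PS, pipelined=True)
    print(f"unpipelined: {slow} ps, pipelined: {fast} ps, speedup: {slow / fast:.2f}x")
```

With these assumed latencies the speedup stays below the number of stages (6), because the stages are not equally long and the cycle is pinned to the slowest one.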

Out of order write-back problems

Suppose instruction A goes through 5 pipeline stages and instruction B through 3, and B is issued right after A. Then B writes back into the registers before A is over. If an interrupt happens when B has finished but A hasn't, then on returning from the interrupt execution resumes from A, so B is also executed a second time, and this can lead to wrong values being stored in the registers.
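The timing of the A/B scenario can be sketched in a few lines. This is a toy model, assuming one stage per clock cycle and that B is issued one cycle after A. The helper names are hypothetical.

```python
def write_back_cycle(issue_cycle, n_stages):
    """Cycle in which the instruction's last stage (write back) happens."""
    return issue_cycle + n_stages - 1


# (name, issue cycle, number of pipeline stages), listed in program order.
PROGRAM = [("A", 0, 5), ("B", 1, 3)]


def state_at_interrupt(program, interrupt_cycle):
    """Split instructions into those that have written back and those in flight."""
    done, in_flight = [], []
    for name, issue, stages in program:
        if write_back_cycle(issue, stages) <= interrupt_cycle:
            done.append(name)
        else:
            in_flight.append(name)
    return done, in_flight


def is_precise(program, interrupt_cycle):
    """True if no finished instruction comes after an unfinished one."""
    done, in_flight = state_at_interrupt(program, interrupt_cycle)
    order = [name for name, _, _ in program]
    if not in_flight:
        return True
    first_unfinished = min(order.index(name) for name in in_flight)
    return all(order.index(name) < first_unfinished for name in done)


if __name__ == "__main__":
    for cycle in range(5):
        done, in_flight = state_at_interrupt(PROGRAM, cycle)
        print(cycle, "done:", done, "in flight:", in_flight,
              "precise:", is_precise(PROGRAM, cycle))
```

In this model A writes back in cycle 4 and B in cycle 3, so an interrupt taken at cycle 3 sees B finished and A still in flight, which is exactly the problematic case described above.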

Superscalar Pipelining

Basically scalar pipelines, but several of them in parallel. They are called s-issue pipelines.

An s-issue pipeline can execute s instructions in parallel.
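As a rough illustration of what the issue width buys, the sketch below counts cycles for an idealised s-issue pipeline. It assumes every instruction is independent, every stage takes one cycle, and nothing ever stalls, which real programs do not satisfy. The function name and parameters are hypothetical.

```python
import math


def ideal_cycles(n_instructions, n_stages, issue_width):
    """Cycles to finish n independent instructions on an ideal s-issue pipeline.

    Instructions enter in groups of `issue_width`. The first group needs
    `n_stages` cycles to drain, and each later group finishes one cycle
    after the group before it.
    """
    groups = math.ceil(n_instructions / issue_width)
    return n_stages + groups - 1


if __name__ == "__main__":
    for s in (1, 2, 3):
        print(f"{s}-issue: {ideal_cycles(12, 6, s)} cycles")
```

Setting `issue_width` to 1 gives the scalar pipeline back, so this is an upper bound on the benefit rather than a prediction.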

On the other hand, there are diversified pipelines, which have multiple different execute stages in parallel. This is because some kinds of instructions, like store instructions or floating point operations, hog the execute stage for more clock cycles than others.

Unlike the buffers in single-pipeline processors, the buffers in superscalar pipelines are multi-entry and multi-exit in nature. Instructions can also leave a buffer out of order, except for one special buffer called the reorder buffer.

In a dynamic pipeline, a dispatch buffer sits in front of the parallel execution stages and takes its entries from the instruction decode stage. Up to this point everything is in order.

From the dispatch buffer, instructions can go either in order or out of order, depending on the execution time of the pipeline each one takes. Once an instruction's pipeline has completed, its intermediate result is stored in the reorder buffer.

The reorder buffer may have to be significantly larger than the other buffers because situations may arise where there is a significant number of write backs waiting for an incomplete instruction to complete.

For example:

Suppose a floating point operation takes 5 clock cycles in the execute stage and an add takes 1 clock cycle. Then 4 adds can be done before the second floating point operation is over, and their results have to wait in the reorder buffer until it completes. In a 3-issue processor this exceeds the look-ahead the other buffers usually need, and therefore the reorder buffer must be significantly larger than the other buffers.
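The effect can be reproduced with a small toy model. This is only a sketch under simplifying assumptions: instructions are dispatched in program order, `issue_width` per cycle; each one starts executing in its dispatch cycle; and any number of results may leave the reorder buffer in the same cycle, as long as they leave in program order. The instruction sequence (one floating point operation followed by four adds) and the function name are hypothetical.

```python
def max_reorder_occupancy(latencies, issue_width):
    """Peak number of finished results waiting for an older instruction."""
    # Instruction i is dispatched in cycle i // issue_width and finishes
    # executing `latency` cycles later.
    finish = [i // issue_width + lat for i, lat in enumerate(latencies)]

    # In-order exit: an instruction can leave the reorder buffer only once
    # it and every older instruction have finished.
    retire = []
    latest = 0
    for f in finish:
        latest = max(latest, f)
        retire.append(latest)

    # A result occupies the buffer from the cycle it finishes until the
    # cycle it is allowed to leave.
    peak = 0
    for cycle in range(max(retire) + 1):
        waiting = sum(1 for f, r in zip(finish, retire) if f <= cycle < r)
        peak = max(peak, waiting)
    return peak


if __name__ == "__main__":
    fp_then_adds = [5, 1, 1, 1, 1]  # execute latencies in clock cycles
    print(max_reorder_occupancy(fp_then_adds, issue_width=3))
    print(max_reorder_occupancy([1, 1, 1, 1, 1], issue_width=3))
```

With the slow floating point operation at the front, all four add results end up parked behind it, which is more than the issue width of 3. With five adds alone, nothing ever has to wait.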

Stages of a super-scalar pipeline