IR in compilers

Intermediate Representation

In the syntax tree itself, we can do a lot of optimizations to have a more concise form. We can reuse subtrees for computations. As such the subgraph isomorphism problem is NP-complete.

However, we can still do it in a restricted scope. Then we number the nodes of the DAG. That gives us an evaluation ordering.

Three address code is the most common form of IR that compilers use. It has some basic forms but all of them obey one basic principle, there are no more than three variables involved.

Static single assignments

This is later used for determining lifetime of a variable and assigning registers to them. Each definition is abstracted as a separate node. This leads to the a new register allocation for each write.

Here is where we do the register allocation optimization. Mostly heuristics and approximate algorithms are used, but even then it is an NP-hard problem.

We also introduce something called as a phi node that helps us in decision-making. It is basically an operator that lets you choose at runtime which value you want to use so that you can have an assignment for it. It is not actually there, just used for analysis.

Types and type equivalence

With types come three questions:

Storage requirement (size of the object)
Storage interpretation (int or 4 chars)
Valid operations (Add file pointer to char array data).

We say that two types are structurally equivalent iff one the following conditions is true:

They are the same basic type (name equivalence)
They are constructed in the same fashion from structurally equivalent types.
One is a type name that denotes the other. (typedef)

Name equivalence is easy to check, but it is very strict. Eg: two typedefs for int.

Structural equivalence avoids this, but then may allow nonsensical comparisons like doubly linked list node == binary search tree nodes.

Type checking involves two different techniques:

Type synthesis: C++ templates for example. A template defines a skeleton. A type gets constructed when we define vector<int> v>; this involves type synthesis.
Type inference: auto keyword. Function polymorphism.

A strongly typed language is one where the compiler guarantees that the valid source program will run without type errors.

Type conversions

Widening conversions are usually safe.
Narrowing conversions are unsafe.

Array expressions

For IR of array expressions, we need the type of the array to calculate the offset correctly. In case of multi-level arrays, we also need the previous size parameters to calculate that. We use row major storage, meaning that one row is together in the memory followed by the second row and so on.

Control Flow

We mostly use short circuiting for boolean conditions so that evaluation is made easier and expensive computations are not done without a reason.