Understanding the need for additional metadata – Advanced IR Generation-1
To demonstrate the problem, let’s look at the following function:
void doSomething(int *p, float *q) {
*p = 42;
*q = 3.1425;
}
The optimizer cannot decide if the pointers, p and q, point to the same memory cell or not. During optimization, an important analysis can be performed called alias analysis. If p and q point to the same memory cell, then they are aliases. Moreover, if the optimizer can prove that both pointers never alias each other, this enables additional optimization opportunities. For example, in the doSomething() function, the stores can be reordered without altering the result in this case.
In addition, it depends on the definition of the source language if a variable of one type can be an alias of another variable of a different type. Please note that languages may also contain expressions that break the type-based alias assumption – for example, type casts between unrelated types.
The solution chosen by the LLVM developers is to add metadata to the load and store instructions. The added metadata serves two purposes:
- First, it defines the type hierarchy based on which type may alias another type
- Second, it describes the memory access in a load or store instruction
Let’s have a look at the type hierarchy in C. Each type of hierarchy starts with a root node, either named or anonymous. LLVM assumes that root nodes with the same name describe the same type of hierarchy. You can use different type hierarchies in the same LLVM modules, and LLVM makes the safe assumption that these types may alias. Beneath the root node, there are the nodes for scalar types. Nodes for aggregate types are not attached to the root node, but they refer to scalar types and other aggregate types. Clang defines the hierarchy for C as follows:
- The root node is called Simple C/C++ TBAA.
- Beneath the root node is the node for the char types. This is a special type in C because all pointers can be converted into a pointer to char.
- Beneath the char node are the nodes for the other scalar types and a type for all pointers, called any pointer.
In addition to this, aggregate types are defined as a sequence of member types and offsets.
These metadata definitions are used in access tags attached to the load and store instructions. An access tag is made up of three parts: a base type, an access type, and an offset. Depending on the base type, there are two possible ways the access tag describes memory access:
- If the base type is an aggregate type, then the access tag describes the memory access of a struct member with the necessary access type and is located at the given offset.
- If the base type is a scalar type, then the access type must be the same as the base type and the offset must be 0.