The Good IR: Reporting Semantic Errors via Type Checking
This post is part of a seres:
The Good IR (BasicBlocks and control flow instructions)
The Good IR: Other Control Flow Structures
The Good IR: Instructions and Values
The Good IR: Reporting Semantic Errors via Type Checking
The Good IR: Multiple Returns
—–
It’s actually, quite fascinating: not only is type-checking relatively simple to do, but I found it can also catch more errors than I naively expected.
Let’s consider how the parser might handle the grammar rule: designator := ident { "[" expression "]" }
.
Algorithm A
1. Ensure that an Identifier was seen.
2. Check that the identifier’s name is present in the symbol table, and signal a ‘use before definition’ if not.
3. Issue a LoadAddress if the identifier is an array, otherwise update an SSA binding.
4. In the event of an array, resolve the indexing
4.1 Perform the address arithmetic
4.2 Ensure at each step that the type can be actually be indexed.
When coding, you might face (as I did) an irresistibly strong temptation to put all this functionality into the designator
method of your parser. If you don’t resist that urge, you’ll almost certainly end up (as I did) with a
// designator = ident{ "[" expression "]" }. private Result designator(Node cur, Node join) { Result res = new Result(); if (sym.isType(Lexeme.IDENT)) { res.syminfo = SymbolTable.getInfo(sym.getName()); if (res.syminfo == null) { ErrorHandler.printSyntaxErr(sym.type.toString(), sym.getLine(), "symbol \"" + sym.getName() + "\" used but not defined!"); } res.name = sym.getName(); if (cur.getCurSsa().get(res.name) != null && res.syminfo.isType(SymType.var)) { res.instruction = cur.getCurSsa().get(sym.getName()); } else { res.instruction = new UnaryInstruction(cur,UnaryInstruction.Operator.loada, null, res.name); res.kind = Result.Kind.address; } sym = Scanner.getSym(); } else { ErrorHandler.printSyntaxErr(sym.getType().toString(), sym.getLine(), "identifier expected"); } // resolve array access Vector<Integer> dim = SymbolTable.getInfo(res.name).getDimensions(); if (dim != null) { int multiplier = 1; for (int i=1; i<dim.size(); i++) { multiplier *= dim.get(i); } int index = 1; while (sym.isType(Lexeme.OPENSQR)) { sym = Scanner.getSym(); Instruction instr = expression(cur, join).instruction; Instruction mul = new BinaryInstruction( cur, BinaryInstruction.Operator.mul, instr, new ConstantInstruction(cur, multiplier*8, "adda(" + multiplier +")")); res.instruction = new BinaryInstruction( cur, BinaryInstruction.Operator.adda, res.instruction, mul); multiplier /= index<dim.size() ? dim.get(index) : 1; index++; expect(Lexeme.CLOSESQR); } res.kind = Result.Kind.address; } // return the result return res; } |
But all of the manually inlined functionality that leads to such spaghetti can be fixed with a few refactorings! To clean up the Parser, we’ll delegate some functionality (such as error checking) to other pieces of code. This way the Parser code itself can read as closely to English as possible.
Let me first address two obvious drawbacks in the above sample, before getting into the type-checking.
Step 1. Ensure that an Identifier was seen.
The Parser is constantly checking if some particular token is ‘current’. This check is also tied to reporting errors and/or advancing the stream one token. My parser ended up with four functions combining three functionalities:
- inspecting the current token
- error reporting on unexpected tokens
- advancing the token stream
You can see how some of these functionalities are combined into the traditional accept/expect pattern for recursive descent parsers. I merely extended this common approach with a subtle case: When parsing an expression and checking the current token for a PlusToken or MinusToken, I wanted the parser to advance the stream only if the current token matched my query, otherwise leave the stream alone and don’t report an error.
Introducing Types
Ok. The remaining issues are semantic in nature, and I found that a clever use of type-checking can catch them all. The code sample above felt seriously ad-hoc. It’s missing a type-checking framework, which I wasn’t knowledgeable enough to model on my own. By occasionally surfing google scholar for compiler implementation details I eventually found an awesome paper about Design Patterns for Teaching Type Checking in a Compiler Construction Course. The authors demonstrate how to use the Composite pattern to represent a set of base types together with the ability to compose those types. Any reasonable programming language has within it a mini-language so that the program author may extend the built-in type system with knowledge of custom types. Naturally, a Builder automatically constructs any programmer specified types by using the available composition operators present in the type hierarchy.
Once we can successfully model arbitrary types, we can attach them to Symbols (representing variables/functions) and Values (both Instructions and Phi’s). Every constructor of every Value shall have a line of code which calculates its type based on the given arguments. The calculation itself is best delegated to the type hierarchy:
AddInstruction::AddInstruction(ValuePtr x, ValuePtr y) :Instruction() { m_args.append(x); m_args.append(y); m_type = x->type()->add(y->type()); } |
Should the types be incompatible, the Instruction will be of the ErrorType, which can hold a message describing the error. Now, it becomes possible for the Parser to always generate an Instruction without first checking the validity of the operation. The validity check is left up to an Instruction’s constructor, which in turn delegates the check to the type hierarchy, by calling a type composition function on its argument’s types.
But how/when do you detect the errors?
We add to the Parser another helper function: issue
which will append to the List<Instruction> any Instruction which has a non-Error type. Should the Parser attempt to issue an erroneous instruction, it can report the corresponding error message and then quit (or resume).
Designator Digression
Because a designator may appear on either the right-hand side of an assignment or the left-hand side in an expression, the grammar rule for designator
has some subtle complexities that require this digression. Let’s tabulate a few examples, to explore the territory.
Statement | Description | Result Type and Reason |
---|---|---|
a[4] = |
Non-SSA on LHS | Address(Integer) to receive the value stored. |
= a[4]; |
Non-SSA on RHS | Integer which provides a value. |
a = |
SSA on LHS | No instruction issued. Be careful to leave updating the defining Instruction (from the RHS) in the StateVector to the Parser method for AssignmentStatement. |
= a; |
SSA on RHS | No instruction issued, return the defining instruction found in the StateVector |
Inspecting the table for some patterns, we notice that the designator
shall be easier to write if it’s given knowledge of which side of an assignment it appears on. We may accomplish this by passing to designator
such a flag: enum Side {LeftHandSide, RightHandSide}
. Armed with this knowledge does not alleviate all issues, however.
We should also take notice that in only one of these cases (SSA on LHS) should the designator
take no action. In all other cases, designator
must return an Instruction. After some thought, I could find no way of avoiding some cooperation with assignmentStatement. It would not be sufficient for designator
to return a NULL to signal AssignmentStatement that the SSA mapping should be updated, because AssignmentStatement needs to know which Symbol to be update in the StateVector. For this reason, I opted for creating a special pseudo-instruction, SymbolInstructon, which is never issued to the instruction stream, and serves only to communicate the situation (SSA on LHS) to AssignmentStatement through an Instruction return type (forced by the other three cases).
Back to the Types
Let’s see how this approach solves the rest of the inlining issues.
Step 2. Check that the identifier’s name is present in the symbol table, and signal a ‘use before definition’ if not.
We delegate variable lookup to a helper method tryResolve
, which reports an error if a name could not be found in the SymbolTable. This guarantees that we have a valid symbol upon which to operate. It is ok for this Symbol to appear on the left hand side of an assignment, so we delegate the ‘use before definition’ check to another method, lookupValue
, which inspects the StateVector for the Symbol’s defining Instruction. It signals an error if the Symbol is bound to Uninitialized.
Step 3. Issue a LoadAddress if the identifier is an array or other non-SSA variable, otherwise return the SSA binding.
Now that we have handled both SSA cases, we can safely write the rest of the designator
assuming a non-SSA access. Whether we have a local array or a variable in another scope (ex. global), we begin with loading the address. Only if the designator
is on the right hand side, will that address have to changed to an actual value. In this event, the LoadAddress must be inserted into the instruction stream. We delegate the insertion to a method called issue
which verifies that the instruction is not of the ErrorType.
Step 4. In the event of an array, resolve the indexing
Step 4.1 Perform the address arithmetic
Step 4.2 Ensure at each step that the type can be actually be indexed.
Here’s where the use of typed instructions really comes into play. Beginning with the address, we must calculate the correct offset, which depends on the dimensions of the array. Rather than inlining the address arithmetic, we delegate it to IndexInstruction, which can perform the calculation by inspecting the types of its arguments. However, for multi-dimensional arrays, the result of IndexInstruction is not necessarily an Integer, rather it could be a subarray. The IndexInstruction simply peels off one of the ArrayType composite wrappers. For an example of how this works, consider the following erroneous access:
array a[3][4]; x = a[expr1][expr2][expr3];
and the list of issued instructions, along with their types:
i1 : LoadAddress(a) : Address(Array(4, Array(3, Integer))) i2 : expr1 : Integer i3 : Index(i1, i2) : Address(Array(3, Integer)) i4 : expr2 : Integer i5 : Index(i3, i4) : Address(Integer) i6 : expr3 : Integer xx : Index(i5, i6) : ErrorType("Could not index an Address(Integer) with Integer.")
In that last index operation, we actually produce an IndexInstruction. The error is caught and reported when the Parser tries to issue
that instruction into the stream. Using all these techniques, we can simplify the previous designator
spaghetti into something much more
// designator = ident { "[" expression "]" }. InstructionPtr Parser::designator(Side side) { assume(Token::IDENT); SymbolPtr const &sym = tryResolve(tok.name()); eat(Token::IDENT); if (isSSA(sym)) { switch (side) { case LeftHandSide: return new SymbolInstruction(sym); // not issued on purpose, signals caller case RightHandSide: return lookupDefiningValue(sym); // that instruction was already issued } } // else, must be an array InstructionPtr addr = issue(new LoadAddressInstruction(sym)); while (accept(Token::OpenBracket)) { InstructionPtr offset = expression(); expect(Token::CloseBracket); addr = issue(new IndexInstruction(addr, offset)); } switch (side) { case LeftHandSide: return addr; // it was issued in the loop case RightHandSide: return issue(new LoadValueInstruction(addr)); } } |
But That’s Not All!
We can also use this framework of helper functions and typed instructions to detect other semantic errors. You should be able to see that it can already detect type-invalid operations such as Add, Sub, Mul, Div, Index. It also directly generalizes to Call, where the constructor type-checks the supplied list of argument types against that of the function’s Symbol.
We can also gift Phi’s with a type
method that verifies that all of its arguments are valid. This technique allows us to detect a rather more difficult semantic rule: that not only does a function return a value with appropriate type, but also for any functions declaring a return type all paths through that function must encounter a valid return. I haven’t yet written the post on multiple returns, but the verification is conceptually simple: Inspect the Phi of the return value in the function’s ExitBlock and verify that it does not have the ErrorType.
Design Rule of Thumb: Write the Parser code for the expected case, delegating error checking and reporting to helper functions.
In discussion with Urs today at lunch, I discovered something missing from the above discussion. It is incorrect to assume that because a designator (array cell) occurs on the left hand side of an assignment that it necessarily must result in an address to store to. Consider the following line:
Indexing
A
should produce an address, but indexingB
should produce a value. When sendingSide
as an argument to thedesignator
method, be careful to handle this case: array index arguments should always be considered rvalues.