Building a Human Resource Machine Compiler - Lexical Analysis
This article is part of the Human Resource Machine Compiler series.
Parsing a raw program word by word quickly becomes unmanageable: identifier names are unpredictable and varied, and passing raw strings around everywhere is inefficient. This is where the concept of lexing naturally comes into play.
During lexing, the program is transformed into tokens. Keywords are simple tokens, but other tokens may carry metadata, such as integer values or variable names. The lexer is the first component of a compiler.
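For example, a statement like let a = 5; might come out of the lexer as five tokens, two of them carrying metadata (the token names here are just illustrative, matching the ones we define later):

LET   IDENTIFIER("a")   ASSIGN   NUMBER(5)   SEMICOLON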
Transforming the source text into tokens is fairly straightforward. For a language like ours, you could write a lexer over a cup of coffee.
A naïve hand-written lexer
A simple approach involves scanning the text directly. For an operator like >, if the next character isn't another combinable operator, it can be mapped to a token immediately; if it is, the match is extended by one character and the pair is emitted as a single token. For keywords, if the entire word matches, it's converted into a keyword token. Words containing only digits are clearly integer tokens. Brackets are also easy to tokenize. Comments and whitespace are simply skipped. Since our language doesn't include strings, the whole job becomes much easier.
Revisit our language
import lang_mod;

// only single line comments allowed
init floor[1] = 1; // init floor[id] sets the initial value for a box on the floor
init floor[0] = 0;
init floor_max = 10;

// A function is defined, which takes an argument.
// Only one or zero argument is allowed.
function countdown(a) {
    for (let i = a, i >= 0, --i) {
        if (a % 2 == 0) {
            continue;
        }
        outbox(i);
    }
    return a;
}

// sub marks a no-return function
sub no_return_func() {
    // and it does nothing
}

// the program starts here
sub start() {
    // a loop construct
    // int(true) = 1, int(false) = 0, boolean and int are basically the same thing
    // bool(1) = true, bool(0) = false, bool(anything except 0) = true
    while (true) {
        // inbox and outbox
        let a = inbox();
        let b = inbox();
        if (a > b) {
            outbox(a);
        } else {
            outbox(b);
        }
        let c = a * b;
        let num = countdown(a);
        floor[2] = num;
        floor[3] = floor[b] + c;
        outbox(c);
    }
}
Let's implement this naive lexer.
from enum import Enum, auto

# Define token types
class TokenType(Enum):
    IMPORT = auto()
    IDENTIFIER = auto()
    NUMBER = auto()
    ASSIGN = auto()
    SEMICOLON = auto()
    COMMA = auto()
    LPAREN = auto()
    RPAREN = auto()
    LBRACKET = auto()
    RBRACKET = auto()
    LBRACE = auto()
    RBRACE = auto()
    FUNCTION = auto()
    SUB = auto()
    FOR = auto()
    WHILE = auto()
    IF = auto()
    ELSE = auto()
    RETURN = auto()
    LET = auto()
    OPERATOR = auto()
    INIT = auto()
    FLOOR = auto()
    FLOOR_MAX = auto()
    END_OF_INPUT = auto()

# Define keywords
keywords = {
    "import": TokenType.IMPORT,
    "function": TokenType.FUNCTION,
    "sub": TokenType.SUB,
    "for": TokenType.FOR,
    "while": TokenType.WHILE,
    "if": TokenType.IF,
    "else": TokenType.ELSE,
    "return": TokenType.RETURN,
    "let": TokenType.LET,
    "init": TokenType.INIT,
    "floor": TokenType.FLOOR,
    "floor_max": TokenType.FLOOR_MAX
}
# Operator characters and combinations
operators = {'+', '-', '*', '/', '%', '>', '<', '=', '!', '++', '--', '>=', '<=', '==', '!=', '&&', '||'}

def is_operator(char):
    return char in '+-*/%><=!&|'

def is_whitespace(char):
    return char in ' \t\n\r'

def is_digit(char):
    return '0' <= char <= '9'

def is_identifier_start(char):
    return char.isalpha() or char == '_'

def is_identifier_part(char):
    return char.isalnum() or char == '_'

def scanner(code):
    tokens = []
    i = 0
    length = len(code)
    while i < length:
        char = code[i]

        # Skip whitespace
        if is_whitespace(char):
            i += 1
            continue

        # Handle comments
        if char == '/' and i + 1 < length and code[i + 1] == '/':
            while i < length and code[i] != '\n':
                i += 1
            continue

        # Handle numbers
        if is_digit(char):
            start = i
            while i < length and is_digit(code[i]):
                i += 1
            tokens.append((TokenType.NUMBER, int(code[start:i])))
            continue

        # Handle identifiers and keywords
        if is_identifier_start(char):
            start = i
            while i < length and is_identifier_part(code[i]):
                i += 1
            word = code[start:i]
            token_type = keywords.get(word, TokenType.IDENTIFIER)
            tokens.append((token_type, word))
            continue

        # Handle operators, preferring the longer two-character match
        if is_operator(char):
            if i + 1 < length and is_operator(code[i + 1]):
                two_char_operator = char + code[i + 1]
                if two_char_operator in operators:
                    tokens.append((TokenType.OPERATOR, two_char_operator))
                    i += 2
                    continue
            if char == '=':
                # a lone '=' is an assignment, not a general operator
                tokens.append((TokenType.ASSIGN, char))
            else:
                tokens.append((TokenType.OPERATOR, char))
            i += 1
            continue

        # Handle brackets and punctuation
        if char == '(':
            tokens.append((TokenType.LPAREN, char))
        elif char == ')':
            tokens.append((TokenType.RPAREN, char))
        elif char == '{':
            tokens.append((TokenType.LBRACE, char))
        elif char == '}':
            tokens.append((TokenType.RBRACE, char))
        elif char == '[':
            tokens.append((TokenType.LBRACKET, char))
        elif char == ']':
            tokens.append((TokenType.RBRACKET, char))
        elif char == ',':
            tokens.append((TokenType.COMMA, char))
        elif char == ';':
            tokens.append((TokenType.SEMICOLON, char))
        # anything else is silently skipped
        i += 1

    tokens.append((TokenType.END_OF_INPUT, ''))
    return tokens
You can also find the code here.
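As a quick sanity check, we can run the scanner over a couple of statements and print the token stream; comments and whitespace should disappear, and everything else comes out as a (type, value) pair:

if __name__ == "__main__":
    source = """
    // multiply two boxes and send the result to the outbox
    let c = a * b;
    outbox(c);
    """
    for token_type, value in scanner(source):
        print(token_type.name, repr(value))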
A better way ...?
While examining the matching procedure, you'll notice that a character might be evaluated multiple times against several candidate tokenizations. To optimize this, we can build a state transition table that, on every character read, jumps straight to the next valid state, so no character has to be re-examined. For instance, if we have an identifier called floor1, and we know that floor is a keyword, we can create a simplified automaton to tokenize it, like the chart below.
Let's skip the math jargon and get straight to the point. Although not mathematically precise, this summary is easier to grasp:
- The matching process can be abstracted as a Finite State Automaton (FSA), as sketched below.
- An FSA is equivalent in power to a Regular Expression; both describe the regular languages, the least expressive class of formal languages.
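To make that concrete, here is a tiny hand-rolled sketch in Python of the floor-versus-floor1 automaton: a transition function plays the role of the table, each character is read exactly once, and the state reached at the end decides whether we saw the keyword or a plain identifier. The names (next_state, classify) are purely illustrative; flex builds one big table of this kind covering every token rule at once.

KEYWORD = "floor"
IDENT_STATE = len(KEYWORD) + 1    # state 6: "some identifier other than the keyword"

def next_state(state, ch):
    # Next state for one character, or None if the character cannot be part of an identifier.
    if not (ch.isalnum() or ch == '_'):
        return None
    if state < len(KEYWORD) and ch == KEYWORD[state]:
        return state + 1          # still following the keyword's spelling
    return IDENT_STATE            # diverged from (or went past) "floor": plain identifier

def classify(word):
    state = 0                     # start state
    for ch in word:               # each character is examined exactly once
        state = next_state(state, ch)
        if state is None:
            return "not an identifier"
    return "FLOOR keyword" if state == len(KEYWORD) else "IDENTIFIER"

for w in ["floor", "floor1", "flo", "flower"]:
    print(w, "->", classify(w))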
Yeah lex!
By constructing regular expressions for the tokens, we can generate the automaton that speeds up token production. Writing the transition table by hand is impractical, so we rely on tools like lex. In our implementation, we'll use GNU flex.
flex is a tool that generates a scanner. Essentially, we hand it our token definitions, usually written as regular expressions, and it produces an automaton that reads the text input and emits tokens. The following is an example of a lex definition file; the capitalized names are our defined tokens.
%{
// some preamble declarations
%}
%option nounistd
%option never-interactive
%option noyywrap
%option yylineno
%%
"//".* { __currentToken.preceding_metadata.emplace_back(TokenMetadata {.type=TokenMetadata::Comment, .value = std::make_shared<std::string>(yytext)}); return COMMENT; }
"import" { return IMPORT; }
"return" { return RETURN; }
"let" { return LET; }
"init" { return INIT; }
"floor" { return FLOOR; }
"floor_max" { return FLOOR_MAX;}
"function" { return FUNCTION;}
"sub" { return SUBWORD;}
"if" { return IF; }
"else" { return ELSE; }
"while" { return WHILE; }
"for" { return FOR; }
"break" { return BREAK; }
"continue" { return CONTINUE; }
">=" { return GE; }
"<=" { return LE; }
"==" { return EE; }
"!=" { return NE; }
">" { return GT; }
"<" { return LT; }
"(" { return OPEN_PAREN; }
")" { return CLOSE_PAREN; }
"{" { return OPEN_BRACE; }
"}" { return CLOSE_BRACE; }
"[" { return OPEN_BRACKET; }
"]" { return CLOSE_BRACKET;}
"," { return COMMA; }
"&&" { return AND; }
"||" { return OR; }
"++" { return ADDADD; }
"--" { return SUBSUB; }
"!" { return NOT; }
"+" { return ADD; }
"-" { return SUB; }
"*" { return MUL; }
"/" { return DIV; }
"%" { return MOD; }
"=" { return EQ; }
";" { return T; }
"true" { __currentToken.boolean = true; return BOOLEAN; }
"false" { __currentToken.boolean = false; return BOOLEAN; }
[0-9]+ { __currentToken.integer = atoi(yytext); return INTEGER; }
"'"[A-Za-z]"'" { __currentToken.integer = toupper(yytext[1]); __currentToken.is_char = true; return INTEGER; }
[A-Za-z_][A-Za-z0-9_]* { __currentToken.identifier = std::make_shared<std::string>(yytext); return IDENTIFIER; }
[ \t]+ { }
[\n\r] { __currentToken.preceding_metadata.emplace_back(TokenMetadata {.type=TokenMetadata::Newline}); return NEWLINE; }
. { return TOKEN_ERROR; }
%%
The rules between the two %% markers define the tokens. We've defined the token enums, such as IMPORT, RETURN, LET, and so on, in HRLToken.h. flex generates a C/C++ source file from this .l lex file, and ultimately we use the scanner simply by calling yylex(), which returns the ID of the next token.

Our fully defined .l file can be found here.
Below is our tokenized Human Resource Machine LazyCoder Language program.
IMPORT IDENTIFIER ;
INIT FLOOR OPEN_BRACKET INTEGER CLOSE_BRACKET EQ INTEGER ;
INIT FLOOR OPEN_BRACKET INTEGER CLOSE_BRACKET EQ INTEGER ;
INIT FLOOR_MAX EQ INTEGER ;
FUNCTION IDENTIFIER OPEN_PAREN IDENTIFIER CLOSE_PAREN OPEN_BRACE
FOR OPEN_PAREN LET IDENTIFIER EQ IDENTIFIER , IDENTIFIER GE INTEGER , SUBSUB IDENTIFIER CLOSE_PAREN OPEN_BRACE
IDENTIFIER OPEN_PAREN IDENTIFIER CLOSE_PAREN ;
CLOSE_BRACE
RETURN IDENTIFIER ;
CLOSE_BRACE
SUBWORD IDENTIFIER OPEN_PAREN CLOSE_PAREN OPEN_BRACE
CLOSE_BRACE
SUBWORD IDENTIFIER OPEN_PAREN CLOSE_PAREN OPEN_BRACE
WHILE OPEN_PAREN BOOLEAN CLOSE_PAREN OPEN_BRACE
LET IDENTIFIER EQ IDENTIFIER OPEN_PAREN CLOSE_PAREN ;
LET IDENTIFIER EQ IDENTIFIER OPEN_PAREN CLOSE_PAREN ;
IF OPEN_PAREN IDENTIFIER GT IDENTIFIER CLOSE_PAREN OPEN_BRACE
IDENTIFIER OPEN_PAREN IDENTIFIER CLOSE_PAREN ;
CLOSE_BRACE
ELSE OPEN_BRACE
IDENTIFIER OPEN_PAREN IDENTIFIER CLOSE_PAREN ;
CLOSE_BRACE
LET IDENTIFIER EQ IDENTIFIER MUL IDENTIFIER ;
LET IDENTIFIER EQ IDENTIFIER OPEN_PAREN IDENTIFIER CLOSE_PAREN ;
FLOOR OPEN_BRACKET INTEGER CLOSE_BRACKET EQ IDENTIFIER ;
FLOOR OPEN_BRACKET INTEGER CLOSE_BRACKET EQ FLOOR OPEN_BRACKET IDENTIFIER CLOSE_BRACKET ADD IDENTIFIER ;
IDENTIFIER OPEN_PAREN IDENTIFIER CLOSE_PAREN ;
CLOSE_BRACE
CLOSE_BRACE
END