Building a Human Resource Machine Compiler - Lexical Analysis
This article is part of the Human Resource Machine Compiler series.
Parsing a raw program word by word quickly becomes unmanageable: identifier names are unpredictable and varied, and passing raw strings around everywhere is inefficient. This is where the concept of lexing naturally comes into play.
During lexing, the program is transformed into tokens. Keywords are simple tokens, but other tokens may carry metadata, such as integer values or variable names. The lexer is the first component of a compiler.
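For example, a statement like let a = 5; might come out of the lexer as five tokens, two of them carrying metadata (the token names here are just illustrative, matching the ones we define later):

LET   IDENTIFIER("a")   ASSIGN   NUMBER(5)   SEMICOLON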
Transforming the source text into tokens is fairly straightforward. For a language like ours, you could write a lexer over a cup of coffee.
A naïve hand-written lexer
A simple approach involves scanning the text directly. For an operator like >, if the next character isn't another combinable operator, it can be mapped to a token immediately; if it is, the match is extended by one character and the pair is emitted as a single token. For keywords, if the entire word matches, it's converted into a keyword token. Words containing only digits are clearly integer tokens. Brackets are also easy to tokenize. Comments and whitespace are simply skipped. Since our language doesn't include strings, the whole job becomes much easier.
Revisit our language
import lang_mod;

// only single line comments allowed
init floor[1] = 1; // init floor[id] sets the initial value for a box on the floor
init floor[0] = 0;
init floor_max = 10;

// A function is defined, which takes an argument.
// Only one or zero argument is allowed.
function countdown(a) {
    for (let i = a, i >= 0, --i) {
        if (a % 2 == 0) {
            continue;
        }
        outbox(i);
    }
    return a;
}

// sub marks a no-return function
sub no_return_func() {
    // and it does nothing
}

// the program starts here
sub start() {
    // a loop construct
    // int(true) = 1, int(false) = 0, boolean and int are basically the same thing
    // bool(1) = true, bool(0) = false, bool(anything except 0) = true
    while (true) {
        // inbox and outbox
        let a = inbox();
        let b = inbox();
        if (a > b) {
            outbox(a);
        } else {
            outbox(b);
        }
        let c = a * b;
        let num = countdown(a);
        floor[2] = num;
        floor[3] = floor[b] + c;
        outbox(c);
    }
}
Let's implement this naive lexer.
from enum import Enum, auto

# Define token types
class TokenType(Enum):
    IMPORT = auto()
    IDENTIFIER = auto()
    NUMBER = auto()
    ASSIGN = auto()
    SEMICOLON = auto()
    COMMA = auto()
    LPAREN = auto()
    RPAREN = auto()
    LBRACKET = auto()
    RBRACKET = auto()
    LBRACE = auto()
    RBRACE = auto()
    FUNCTION = auto()
    SUB = auto()
    FOR = auto()
    WHILE = auto()
    IF = auto()
    ELSE = auto()
    RETURN = auto()
    LET = auto()
    OPERATOR = auto()
    INIT = auto()
    FLOOR = auto()
    FLOOR_MAX = auto()
    END_OF_INPUT = auto()

# Define keywords
keywords = {
    "import": TokenType.IMPORT,
    "function": TokenType.FUNCTION,
    "sub": TokenType.SUB,
    "for": TokenType.FOR,
    "while": TokenType.WHILE,
    "if": TokenType.IF,
    "else": TokenType.ELSE,
    "return": TokenType.RETURN,
    "let": TokenType.LET,
    "init": TokenType.INIT,
    "floor": TokenType.FLOOR,
    "floor_max": TokenType.FLOOR_MAX
}
# Operator characters and combinations
operators = {'+', '-', '*', '/', '%', '>', '<', '=', '!', '++', '--', '>=', '<=', '==', '!=', '&&', '||'}

def is_operator(char):
    return char in '+-*/%><=!&|'

def is_whitespace(char):
    return char in ' \t\n\r'

def is_digit(char):
    return '0' <= char <= '9'

def is_identifier_start(char):
    return char.isalpha() or char == '_'

def is_identifier_part(char):
    return char.isalnum() or char == '_'

def scanner(code):
    tokens = []
    i = 0
    length = len(code)
    while i < length:
        char = code[i]

        # Skip whitespace
        if is_whitespace(char):
            i += 1
            continue

        # Handle comments
        if char == '/' and i + 1 < length and code[i + 1] == '/':
            while i < length and code[i] != '\n':
                i += 1
            continue

        # Handle numbers
        if is_digit(char):
            start = i
            while i < length and is_digit(code[i]):
                i += 1
            tokens.append((TokenType.NUMBER, int(code[start:i])))
            continue

        # Handle identifiers and keywords
        if is_identifier_start(char):
            start = i
            while i < length and is_identifier_part(code[i]):
                i += 1
            word = code[start:i]
            token_type = keywords.get(word, TokenType.IDENTIFIER)
            tokens.append((token_type, word))
            continue

        # Handle operators, preferring the longer two-character match
        if is_operator(char):
            if i + 1 < length and is_operator(code[i + 1]):
                two_char_operator = char + code[i + 1]
                if two_char_operator in operators:
                    tokens.append((TokenType.OPERATOR, two_char_operator))
                    i += 2
                    continue
            if char == '=':
                # a lone '=' is an assignment, not a general operator
                tokens.append((TokenType.ASSIGN, char))
            else:
                tokens.append((TokenType.OPERATOR, char))
            i += 1
            continue

        # Handle brackets and punctuation
        if char == '(':
            tokens.append((TokenType.LPAREN, char))
        elif char == ')':
            tokens.append((TokenType.RPAREN, char))
        elif char == '{':
            tokens.append((TokenType.LBRACE, char))
        elif char == '}':
            tokens.append((TokenType.RBRACE, char))
        elif char == '[':
            tokens.append((TokenType.LBRACKET, char))
        elif char == ']':
            tokens.append((TokenType.RBRACKET, char))
        elif char == ',':
            tokens.append((TokenType.COMMA, char))
        elif char == ';':
            tokens.append((TokenType.SEMICOLON, char))
        # anything else is silently skipped
        i += 1

    tokens.append((TokenType.END_OF_INPUT, ''))
    return tokens
You can also find the code here.
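As a quick sanity check, we can run the scanner over a couple of statements and print the token stream; comments and whitespace should disappear, and everything else comes out as a (type, value) pair:

if __name__ == "__main__":
    source = """
    // multiply two boxes and send the result to the outbox
    let c = a * b;
    outbox(c);
    """
    for token_type, value in scanner(source):
        print(token_type.name, repr(value))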
A better way ...?
While examining the matching procedure, you'll notice that a character might be evaluated multiple times against several candidate tokenizations. To optimize this, we can build a state transition table that, on every character read, jumps straight to the next valid state, so no character has to be re-examined. For instance, if we have an identifier called floor1, and we know that floor is a keyword, we can create a simplified automaton to tokenize it, like the chart below.
Let's skip the math jargon and get straight to the point. Although not mathematically precise, this summary is easier to grasp:
- The matching process can be abstracted as a Finite State Automaton (FSA), as sketched below.
- An FSA is equivalent in power to a Regular Expression; both describe the regular languages, the least expressive class of formal languages.
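To make that concrete, here is a tiny hand-rolled sketch in Python of the floor-versus-floor1 automaton: a transition function plays the role of the table, each character is read exactly once, and the state reached at the end decides whether we saw the keyword or a plain identifier. The names (next_state, classify) are purely illustrative; flex builds one big table of this kind covering every token rule at once.

KEYWORD = "floor"
IDENT_STATE = len(KEYWORD) + 1    # state 6: "some identifier other than the keyword"

def next_state(state, ch):
    # Next state for one character, or None if the character cannot be part of an identifier.
    if not (ch.isalnum() or ch == '_'):
        return None
    if state < len(KEYWORD) and ch == KEYWORD[state]:
        return state + 1          # still following the keyword's spelling
    return IDENT_STATE            # diverged from (or went past) "floor": plain identifier

def classify(word):
    state = 0                     # start state
    for ch in word:               # each character is examined exactly once
        state = next_state(state, ch)
        if state is None:
            return "not an identifier"
    return "FLOOR keyword" if state == len(KEYWORD) else "IDENTIFIER"

for w in ["floor", "floor1", "flo", "flower"]:
    print(w, "->", classify(w))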
Yeah lex!
By constructing regular expressions for the tokens, we can generate the automaton that speeds up token production. Writing the transition table by hand is impractical, so we rely on tools like lex. In our implementation, we'll use GNU flex.
flex is a tool that generates a scanner. Essentially, we hand it our token definitions, usually written as regular expressions, and it produces an automaton that reads the text input and emits tokens. The following is an example of a lex definition file; the capitalized names are our defined tokens.
%{
// some preamble declarations
%}
%option nounistd
%option never-interactive
%option noyywrap
%option yylineno
%%
"//".* { __currentToken.preceding_metadata.emplace_back(TokenMetadata {.type=TokenMetadata::Comment, .value = std::make_shared<std::string>(yytext)}); return COMMENT; }
"import" { return IMPORT; }
"return" { return RETURN; }
"let" { return LET; }
"init" { return INIT; }
"floor" { return FLOOR; }
"floor_max" { return FLOOR_MAX;}
"function" { return FUNCTION;}
"sub" { return SUBWORD;}
"if" { return IF; }
"else" { return ELSE; }
"while" { return WHILE; }
"for" { return FOR; }
"break" { return BREAK; }
"continue" { return CONTINUE; }
">=" { return GE; }
"<=" { return LE; }
"==" { return EE; }
"!=" { return NE; }
">" { return GT; }
"<" { return LT; }
"(" { return OPEN_PAREN; }
")" { return CLOSE_PAREN; }
"{" { return OPEN_BRACE; }
"}" { return CLOSE_BRACE; }
"[" { return OPEN_BRACKET; }
"]" { return CLOSE_BRACKET;}
"," { return COMMA; }
"&&" { return AND; }
"||" { return OR; }
"++" { return ADDADD; }
"--" { return SUBSUB; }
"!" { return NOT; }
"+" { return ADD; }
"-" { return SUB; }
"*" { return MUL; }
"/" { return DIV; }
"%" { return MOD; }
"=" { return EQ; }
";" { return T; }
"true" { __currentToken.boolean = true; return BOOLEAN; }
"false" { __currentToken.boolean = false; return BOOLEAN; }
[0-9]+ { __currentToken.integer = atoi(yytext); return INTEGER; }
"'"[A-Za-z]"'" { __currentToken.integer = toupper(yytext[1]); __currentToken.is_char = true; return INTEGER; }
[A-Za-z_][A-Za-z0-9_]* { __currentToken.identifier = std::make_shared<std::string>(yytext); return IDENTIFIER; }
[ \t]+ { }
[\n\r] { __currentToken.preceding_metadata.emplace_back(TokenMetadata {.type=TokenMetadata::Newline}); return NEWLINE; }
. { return TOKEN_ERROR; }
%%
The rules between the two %% markers define the tokens. We've defined the token enums, such as IMPORT, RETURN, LET, and so on, in HRLToken.h. flex generates a C/C++ source file from this .l lex file, and ultimately we use the scanner simply by calling yylex(), which returns the ID of the next token.

Our fully defined .l file can be found here.
Below is our tokenized Human Resource Machine LazyCoder Language program.
IMPORT IDENTIFIER ;
INIT FLOOR OPEN_BRACKET INTEGER CLOSE_BRACKET EQ INTEGER ;
INIT FLOOR OPEN_BRACKET INTEGER CLOSE_BRACKET EQ INTEGER ;
INIT FLOOR_MAX EQ INTEGER ;
FUNCTION IDENTIFIER OPEN_PAREN IDENTIFIER CLOSE_PAREN OPEN_BRACE
FOR OPEN_PAREN LET IDENTIFIER EQ IDENTIFIER , IDENTIFIER GE INTEGER , SUBSUB IDENTIFIER CLOSE_PAREN OPEN_BRACE
IDENTIFIER OPEN_PAREN IDENTIFIER CLOSE_PAREN ;
CLOSE_BRACE
RETURN IDENTIFIER ;
CLOSE_BRACE
SUBWORD IDENTIFIER OPEN_PAREN CLOSE_PAREN OPEN_BRACE
CLOSE_BRACE
SUBWORD IDENTIFIER OPEN_PAREN CLOSE_PAREN OPEN_BRACE
WHILE OPEN_PAREN BOOLEAN CLOSE_PAREN OPEN_BRACE
LET IDENTIFIER EQ IDENTIFIER OPEN_PAREN CLOSE_PAREN ;
LET IDENTIFIER EQ IDENTIFIER OPEN_PAREN CLOSE_PAREN ;
IF OPEN_PAREN IDENTIFIER GT IDENTIFIER CLOSE_PAREN OPEN_BRACE
IDENTIFIER OPEN_PAREN IDENTIFIER CLOSE_PAREN ;
CLOSE_BRACE
ELSE OPEN_BRACE
IDENTIFIER OPEN_PAREN IDENTIFIER CLOSE_PAREN ;
CLOSE_BRACE
LET IDENTIFIER EQ IDENTIFIER MUL IDENTIFIER ;
LET IDENTIFIER EQ IDENTIFIER OPEN_PAREN IDENTIFIER CLOSE_PAREN ;
FLOOR OPEN_BRACKET INTEGER CLOSE_BRACKET EQ IDENTIFIER ;
FLOOR OPEN_BRACKET INTEGER CLOSE_BRACKET EQ FLOOR OPEN_BRACKET IDENTIFIER CLOSE_BRACKET ADD IDENTIFIER ;
IDENTIFIER OPEN_PAREN IDENTIFIER CLOSE_PAREN ;
CLOSE_BRACE
CLOSE_BRACE
END