Building a Human Resource Machine Compiler - Syntax Parser (1)
This article is part of the Human Resource Machine Compiler series.
Writing a parser isn’t exactly intuitive, and the process can feel, well, a bit tedious at times.
For simplicity, let's design a mini language called `zoo` for this example.
animal : 'ostrich' | 'pig';
action : 'eats' | 'smells' | 'takes';
fruit : 'apple' | 'banana' | 'cherry';
vegetable : 'broccoli' | 'cabbage';
state : 'fresh' | 'spoiled';
color : 'green' | 'yellow' | 'pink';
subject : [color] animal;
object : [color] fruit | [state] vegetable;
sentence : subject action object;
A valid sentence could be `pig eats green apple` or `pink ostrich takes spoiled cabbage`. However, sentences like `pig eats pig` or `pig smells yellow broccoli` are invalid in this context. The terminals are `animal`, `action`, `fruit`, `color`, `vegetable`, and `state`. The non-terminals are `subject`, `object`, and `sentence`, with `sentence` being our top rule.
To build a parser, you can start either bottom-up (beginning with terminals) or top-down (starting from `sentence`).
The Simplest Top-Down Parser - Zoo Language
A natural approach is to start from the root rule and match tokens accordingly. Let’s take a straightforward, beginner-friendly route.
Consider parsing the sentence `pig eats green apple`. We start from the topmost rule, `sentence`, and look ahead one token: `pig`. We then descend into the production rules until we reach a terminal rule. Here, that would be `subject`, then `[color]`. We try to match the token against the rule, but it fails: `pig` is not a color. Since `color` is optional, we backtrack and try matching the next rule, `animal`. This time, we have a match! We add `pig` to our Abstract Syntax Tree (AST). The process is repeated until we either run out of matching rules for a token (a failure) or successfully match all tokens against our rules.
Does this look familiar? It's essentially a graph search problem, specifically a depth-first search against our syntax production space!
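To make the search concrete, here is a minimal sketch of such a backtracking descent for the `zoo` grammar. The plain-string tokens, tuple-based AST, and helper names (`match_optional`, `match_required`) are all illustrative assumptions, not part of the series' real compiler:

```python
# A sketch of the backtracking descent described above for the zoo grammar.
# Tokens are plain strings and the AST is built from tuples, purely for brevity.
COLORS = {"green", "yellow", "pink"}
ANIMALS = {"ostrich", "pig"}
ACTIONS = {"eats", "smells", "takes"}
FRUITS = {"apple", "banana", "cherry"}
VEGETABLES = {"broccoli", "cabbage"}
STATES = {"fresh", "spoiled"}


def match_optional(tokens, pos, vocab):
    """Try an optional terminal; on failure, backtrack by standing still."""
    if pos < len(tokens) and tokens[pos] in vocab:
        return tokens[pos], pos + 1
    return None, pos


def match_required(tokens, pos, vocab, rule):
    """Match a required terminal or report a syntax error."""
    if pos < len(tokens) and tokens[pos] in vocab:
        return (rule, tokens[pos]), pos + 1
    raise SyntaxError(f"expected {rule} at position {pos}")


def parse_sentence(tokens):
    # sentence : subject action object
    color, pos = match_optional(tokens, 0, COLORS)  # subject : [color] animal
    animal, pos = match_required(tokens, pos, ANIMALS, "animal")
    subject = ("subject", color, animal)
    action, pos = match_required(tokens, pos, ACTIONS, "action")
    # object : [color] fruit | [state] vegetable
    # Try the fruit branch first; fall back to the vegetable branch if it fails.
    color2, fruit_pos = match_optional(tokens, pos, COLORS)
    if fruit_pos < len(tokens) and tokens[fruit_pos] in FRUITS:
        obj = ("object", color2, ("fruit", tokens[fruit_pos]))
        pos = fruit_pos + 1
    else:
        state, pos = match_optional(tokens, pos, STATES)
        vegetable, pos = match_required(tokens, pos, VEGETABLES, "vegetable")
        obj = ("object", state, vegetable)
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing tokens")
    return ("sentence", subject, action, obj)
```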
There are some optimization opportunities. Currently, when a match fails during the descent, we backtrack. However, for simple grammars like this, we can precompute the possible valid tokens (a technique known as predictive parsing). These valid tokens are called the FIRST set. For example, the FIRST set of `subject` is `[green, yellow, pink, ostrich, pig]`. During parsing, we can look up the FIRST set to decide whether to descend further. If the token isn't in the set, it indicates a syntax error. Since `zoo` is a simple language, we only need to look ahead by one token. More complex languages may require more lookahead tokens, which complicates the parser.
In this type of parser, we recursively descend through the production rules to find matches and build the AST. This approach is called a *recursive descent parser*. Since we parse from left to right and derive the leftmost production, it’s also known as an LL parser. If we implement predictive parsing with a single-token lookahead, we create an LL(1) parser (our example here isn’t quite there yet, as it uses DFS and requires backtracking). An LL(1) parser can also be implemented using a state transition table.
There’s another concept called the FOLLOW set, which contains the tokens that can validly follow a non-terminal. For instance, the FOLLOW set of `subject` is `[eats, smells, takes]`. In simple grammars like `zoo`, this can also be precomputed, though we don’t need it for our current example.
We won’t define a meaningless set 🤪. One useful application of the FOLLOW set is in handling ε-productions during parsing. Sometimes, a transformed rule might produce an empty string, denoted by ε. We’ll discuss these transformations later. When a rule's FIRST set contains ε, it means the rule can accept an empty string, so it’s safe to proceed to the next production rule if there's no match in FIRST. In such cases, we match the lookahead token against the FOLLOW set.
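As a sketch of that decision, using hypothetical FIRST and FOLLOW sets for an optional rule `r` that can derive the empty string:

```python
EPSILON = "ε"

# Hypothetical sets for an optional rule r that can derive the empty string.
FIRST_R = {"and", EPSILON}
FOLLOW_R = {"eats", "smells", "takes"}


def should_descend_into_r(lookahead: str) -> bool:
    """True: descend into a non-empty r. False: take the ε branch."""
    if lookahead in FIRST_R:
        return True  # the token can start a non-empty r
    if EPSILON in FIRST_R and lookahead in FOLLOW_R:
        return False  # r matches the empty string; consume nothing
    raise SyntaxError(f"unexpected token '{lookahead}'")
```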
Here are the FIRST and FOLLOW tables ($ marks the end of input):
Symbol | FIRST | FOLLOW |
---|---|---|
action | 'eats', 'smells', 'takes' | 'apple', 'banana', 'broccoli', 'cherry', 'cabbage', 'fresh', 'green', 'pink', 'spoiled', 'yellow' |
animal | 'ostrich', 'pig' | 'eats', 'smells', 'takes' |
color | 'green', 'pink', 'yellow' | 'apple', 'banana', 'cherry', 'ostrich', 'pig' |
fruit | 'apple', 'banana', 'cherry' | $ |
object | 'apple', 'banana', 'broccoli', 'cherry', 'cabbage', 'fresh', 'green', 'pink', 'spoiled', 'yellow' | $ |
sentence | 'green', 'ostrich', 'pig', 'pink', 'yellow' | $ |
state | 'fresh', 'spoiled' | 'broccoli', 'cabbage' |
subject | 'green', 'ostrich', 'pig', 'pink', 'yellow' | 'eats', 'smells', 'takes' |
vegetable | 'broccoli', 'cabbage' | $ |
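With these sets precomputed, the parser can dispatch without backtracking. As a sketch, the dictionary below transcribes a few FIRST entries from the table; the `predict` helper is a hypothetical name:

```python
# A few FIRST sets transcribed from the table above, for illustration.
FIRST = {
    "subject": {"green", "yellow", "pink", "ostrich", "pig"},
    "action": {"eats", "smells", "takes"},
    "object": {"apple", "banana", "broccoli", "cherry", "cabbage",
               "fresh", "green", "pink", "spoiled", "yellow"},
}


def predict(rule: str, lookahead: str) -> bool:
    """Decide whether to descend into `rule`; any other token is a syntax error."""
    if lookahead not in FIRST[rule]:
        raise SyntaxError(f"unexpected '{lookahead}' while parsing {rule}")
    return True
```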
Generalized Parsing Process
Let’s break down the parsing process step by step:
Start the Parsing Process:
- Initialize the parser by reading the first token from the input stream and begin with the topmost production rule.
Use the Set of Possible Starting Tokens for Decision:
- For each non-terminal rule, examine the current token (lookahead) and decide which production rule to apply, based on the set of possible tokens that could appear at the start of the sequences generated by each production rule (the FIRST set).
- In our earlier example, we didn't implement this; instead, we descended into the production rules and allowed them to determine the course of action.
Match and Consume Tokens:
- If the current production rule expects a specific terminal (token), compare it with the current token.
- If they match, consume the token and move to the next one.
Recursive Function Calls:
- If the production rule involves other non-terminals, recursively call the corresponding functions for those non-terminals.
Handle Epsilon Productions:
- If a non-terminal can produce an empty string (ε), check whether the current token is in the set of tokens that could validly follow this non-terminal (the FOLLOW set) to decide whether the function should return without consuming any tokens.
- Our current simple grammar doesn’t include ε-productions, but we’ll explore this concept in more detail later.
Detect Syntax Errors:
- If no rule applies (i.e., the current token doesn’t match any of the possible starting tokens for the rules), a syntax error is raised.
Complete Parsing:
- The process continues recursively until the start symbol's function has completed and all tokens are consumed, confirming that the entire input adheres to the grammar.
End with EOF:
- Ensure the final token is the end-of-file (EOF) to verify that the input was fully parsed.
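The steps above can be sketched as a small parser skeleton; the class and method names are assumptions for illustration, and grammar-specific rule methods would be layered on top:

```python
class LL1Skeleton:
    """Skeleton of the generic LL(1) process described above."""

    def __init__(self, tokens: list[str]):
        # Step 1: start reading from the beginning of the token stream.
        self.tokens = tokens
        self.pos = 0

    def peek(self) -> str:
        # Step 2: the one-token lookahead that drives every decision.
        return self.tokens[self.pos] if self.pos < len(self.tokens) else "<EOF>"

    def expect(self, terminal: str) -> str:
        # Steps 3 and 6: match and consume a terminal, or raise a syntax error.
        if self.peek() != terminal:
            raise SyntaxError(f"expected {terminal!r}, got {self.peek()!r}")
        self.pos += 1
        return terminal

    def at_eof(self) -> bool:
        # Step 8: the input is valid only if every token was consumed.
        return self.pos == len(self.tokens)
```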
A More Complicated Happy Zoo
Let’s tweak our grammar a bit:
animal : 'ostrich' | 'pig';
fruit : 'apple' | 'banana' | 'cherry';
happiness : 'happy' | 'joyful' | happiness 'and' happiness;
color : 'green' | 'yellow' | 'pink';
subject : [color] happiness animal;
object : [color] fruit;
sentence : subject 'feeds on' object | object 'feeds' subject;
This grammar might produce a rather silly sentence like `happy and happy and happy and joyful pig feeds on banana` 🤪. Don't mind that; it serves our example's purpose. With this grammar, a significant issue arises in our parser.
The Dreaded Left-Recursion
Recall the production-rule processing order in our parser: we derive the leftmost rule to search for a valid match. The problem here lies in the rule for `happiness`, because `happiness` is the leftmost production of itself. Our leftmost-derivation strategy could fall into an infinite loop expanding `happiness`. While other parsing techniques can handle this, for those of us who are fond of LL parsers, transforming the grammar to avoid the issue is highly desirable.
Fortunately, this problem only occurs with left-recursive productions. While we can’t eliminate recursion entirely, we can convert left recursion into right recursion. Generally, left recursion can be systematically transformed into a right-recursive form.
Horrible Math Fact
Generally, for a grammar
A : A α | β
where A is a non-terminal, and α and β are sequences of grammar symbols, the grammar can be rewritten as:
A : β A'
A' : α A' | ε
This transformation effectively converts left recursion into right recursion.
Let's use `happiness : 'happy' | 'joyful' | happiness 'and' happiness` as an example. No matter how we expand this rule, the result always starts with either `happy` or `joyful`, e.g., `happy and joyful`, `happy and happy`, `joyful and happy and happy`, or just `joyful`. In the notation A : A α | β, a β (`happy` or `joyful`) will always appear first, followed by zero or more instances of α (`'and' happiness`).
Naturally, we define a new rule `r` that allows the repetition of the `'and' happiness` part without left recursion: `r : 'and' happiness r | ε`. Substituting this back into the original rule, we get `happiness : 'happy' r | 'joyful' r`. The ε allows `r` to be empty, as `happiness` can be just `happy` or `joyful`. Now we have:
happiness : 'happy' r | 'joyful' r
r : 'and' happiness r | ε
This successfully transforms left recursion into right recursion.
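Under the transformed rules, recursive descent now terminates, since each call consumes a token before recursing. A sketch with plain-string tokens (the flat word-list result is a simplification for illustration):

```python
def parse_happiness(tokens, pos):
    # happiness : 'happy' r | 'joyful' r
    if pos >= len(tokens) or tokens[pos] not in ("happy", "joyful"):
        raise SyntaxError(f"expected 'happy' or 'joyful' at position {pos}")
    words = [tokens[pos]]
    rest, pos = parse_r(tokens, pos + 1)
    return words + rest, pos


def parse_r(tokens, pos):
    # r : 'and' happiness r | ε  (the ε branch returns without consuming)
    if pos < len(tokens) and tokens[pos] == "and":
        words, pos = parse_happiness(tokens, pos + 1)
        rest, pos = parse_r(tokens, pos)
        return ["and"] + words + rest, pos
    return [], pos
```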
FIRST Conflicts
In an LL(1) parser, we look ahead one token and match it against the FIRST set. Now, consider the following production rules:
color : 'green' | 'yellow' | 'pink';
subject : [color] happiness animal;
object : [color] fruit;
sentence : subject 'feeds on' object | object 'feeds' subject;
Both `subject` and `object` derive `color` in their rules. This causes FIRST(subject) and FIRST(object) to share `[green, yellow, pink]`. LL(1) parsers don’t allow such FIRST conflicts, because the parser cannot immediately choose the correct path to descend.
Fortunately, this issue can be easily resolved through left factoring. By factoring out the common part, we can eliminate the conflict:
sentence : [color] sentence_prime;
sentence_prime : happiness_prime 'feeds on' object | fruit_prime 'feeds' subject;
happiness_prime : happiness animal;
fruit_prime : fruit;
If modifying the grammar is undesirable, reverting to a backtracking LL parser can also address the issue, though at the cost of performance. A backtracking parser rewinds when it finds that the chosen descent path is unviable. Another approach is to increase the number of lookahead tokens. In this case, two tokens would suffice, as we could differentiate between `subject` and `object` by looking at the token following the color.
This example highlights that in an LL(1) grammar, the prediction sets of a rule's alternatives (their FIRST sets, together with FOLLOW where ε is involved) must be disjoint to avoid conflicts in the parsing table.
Horrible Math Fact
For a grammar with a non-terminal A, the condition for LL(1) parsing is satisfied if and only if, for any pair of distinct productions A : α and A : β, the following holds:
FIRST(α) ∩ FIRST(β) = ∅, and if ε ∈ FIRST(β), then FIRST(α) ∩ FOLLOW(A) = ∅ (and symmetrically if ε ∈ FIRST(α)).
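One way to sanity-check a grammar against this condition is to intersect the prediction sets of each rule's alternatives. A sketch, with the FIRST sets of the two `sentence` alternatives transcribed by hand (the function name is an assumption):

```python
# FIRST sets of the two alternatives of `sentence`, transcribed by hand:
#   subject 'feeds on' object   and   object 'feeds' subject.
FIRST_ALTERNATIVES = {
    "sentence": [
        {"green", "yellow", "pink", "happy", "joyful"},
        {"green", "yellow", "pink", "apple", "banana", "cherry"},
    ],
}


def ll1_conflicts(first_alternatives):
    """Return, per rule, the tokens shared by two alternatives' FIRST sets."""
    conflicts = {}
    for rule, alternatives in first_alternatives.items():
        for i in range(len(alternatives)):
            for j in range(i + 1, len(alternatives)):
                overlap = alternatives[i] & alternatives[j]
                if overlap:
                    conflicts.setdefault(rule, set()).update(overlap)
    return conflicts
```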
The Annoying Expressions...
There’s one context where left recursion frequently occurs: expression parsing. Consider the following grammar:
expr : primary | expr operator expr;
primary : INTEGER | '(' expr ')';
operator : '+' | '-' | '*' | '/' | '%' | '^';
It’s certainly possible to restructure the grammar to eliminate left recursion. However, expression parsing presents challenges beyond just recursion. Operators have precedence, which determines the order in which operations are performed. For instance, exponentiation has the highest precedence, followed by multiplication, division, and modulus, and finally, addition and subtraction. Parentheses can override this precedence. In addition to precedence, there’s also associativity. Associativity determines the direction in which operators of the same precedence are grouped and evaluated (i.e., left-to-right or right-to-left).
In our set of operators, only exponentiation is right-associative, while the others are left-associative. Consider the expression `1 + 2 * 3 / 4 ^ 5 ^ 6`. Here, `4 ^ 5 ^ 6` has the highest precedence, and due to right-associativity, it's grouped as `4 ^ (5 ^ 6)`. The entire expression is evaluated as `1 + ((2 * 3) / (4 ^ (5 ^ 6)))`. For a parser, it’s crucial to correctly recognize both precedence and associativity.
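Precedence and associativity can be captured by small lookup tables. The code below sketches how helpers along the lines of the `get_precedence`/`get_associativity` functions used later in this article might look; the numeric levels and names here are assumptions:

```python
# Hypothetical precedence levels: higher binds tighter.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "%": 2, "^": 3}
# Of our operators, only exponentiation groups right-to-left.
RIGHT_ASSOCIATIVE = {"^"}


def get_precedence(op: str) -> int:
    return PRECEDENCE[op]


def is_right_associative(op: str) -> bool:
    return op in RIGHT_ASSOCIATIVE
```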
One approach is to tweak the grammar to ensure it respects these rules during parsing:
expr : term | expr '+' term | expr '-' term;
term : factor | term '*' factor | term '/' factor | term '%' factor;
factor : primary | primary '^' factor;
primary : '(' expr ')' | INTEGER;
However, in some cases, the grammar rules can become overly complex, making them difficult to manage and resulting in unreadable EBNF. In such situations, it may be more practical to treat the expression as a whole and use specialized techniques for parsing.
Precedence Climbing
We won't go into the fine details of this algorithm, since there are abundant resources available on the internet; Wikipedia has a fine page on it. Here's a basic introduction.
The algorithm starts from the left. A minimum precedence level is set, and the parsing function only accepts operators whose precedence is equal to or higher than this level. Assuming there's no syntax error, it parses the primary expression (like a literal or a parenthesized expression) for the left-hand side and then takes a look at the operator.
If the operator is left-associative (the commonly seen case) and its precedence is greater than the current minimum precedence, the algorithm will recursively parse with a minimum precedence set to the current operator's precedence + 1. This ensures that operators with higher precedence are handled first. For operators of the same precedence level (where the precedence is equal), the algorithm does not recurse, ensuring that these operators are grouped correctly from left to right.
When the algorithm encounters a right-associative operator, it needs to ensure that the right-hand side (RHS) of the expression is fully parsed before the operator is applied to the left-hand side (LHS). To achieve this, the algorithm descends recursively with the same precedence level. This allows the rightmost operation to be handled first, ensuring that the right-associative operator groups expressions from right to left.
In essence, the algorithm evaluates two operators sequentially and decides whether to construct the expression tree with the current precedence or let the higher precedence operator's parsing function build its part first.
Here's the code:
```python
from parse_token import *
from parse_node import *


def parse_primary() -> Node:
    # Parse a literal or a parenthesized sub-expression.
    lookahead = peek_token()
    if lookahead.token_type == TokenType.INTEGER:
        consume_token()
        return Node(lookahead)
    elif lookahead.token_type == TokenType.LEFT_PARENTHESIS:
        consume_token()
        node = parse_expression_root()
        lookahead = peek_token()
        if lookahead.token_type == TokenType.RIGHT_PARENTHESIS:
            consume_token()
            return node
        else:
            raise ParsingError(
                f"failed to parse primary. expected ')' but got '{lookahead}'"
            )
    else:
        raise ParsingError(f"unknown token {lookahead}")


def parse_expression_root() -> Node:
    lhs = parse_primary()
    return parse_expression(lhs, 0)


def parse_expression(lhs: Node, min_precedence) -> Node:
    lookahead: Token = peek_token()
    # Only consume operators at or above the current minimum precedence.
    while (
        lookahead.token_type == TokenType.BINARY_OPERATOR
        and get_precedence(lookahead.content) >= min_precedence
    ):
        operator = lookahead
        current_precedence = get_precedence(operator.content)
        consume_token()
        rhs = parse_primary()
        lookahead = peek_token()
        while lookahead.token_type == TokenType.BINARY_OPERATOR:
            lookahead_left_associative = (
                get_associativity(lookahead.content) == Associativity.Left
            )
            lookahead_right_associative = not lookahead_left_associative
            lookahead_precedence = get_precedence(lookahead.content)
            if (
                lookahead_left_associative and lookahead_precedence > current_precedence
            ) or (
                lookahead_right_associative
                and lookahead_precedence >= current_precedence
            ):
                # A tighter-binding operator (or a right-associative one of
                # equal precedence) follows: let it build its subtree first.
                rhs = parse_expression(
                    rhs,
                    (
                        current_precedence
                        if current_precedence == lookahead_precedence
                        else current_precedence + 1
                    ),
                )
                lookahead = peek_token()
            else:
                break
        # Fold the finished right-hand side into the tree; the new node
        # becomes the left-hand side for the next iteration.
        result = Node(operator, lhs, rhs)
        lhs = result
    return lhs
```
Shunting Yard
The Shunting Yard algorithm is a well-known parsing technique, traditionally used to convert expressions from infix notation to Reverse Polish Notation (RPN). However, it also works exceptionally well for constructing Abstract Syntax Trees (ASTs). You can find more details on the Wikipedia page, but I'll provide a brief overview here.
When using the Shunting Yard algorithm for AST construction, as you advance through the tokens of an expression, operands are pushed onto an output stack, while operators are pushed onto an operator stack. A left parenthesis is simply pushed onto the operator stack to await further processing. Upon encountering an operator, if the operator at the top of the stack has greater precedence than the incoming one (or equal precedence and the incoming operator is left-associative), the previous expression is reduced first: the operands and the operator are popped from their respective stacks, formed into an AST node, and the node is pushed back onto the operand stack. When a right parenthesis is encountered, the algorithm builds nodes from the pending operators and operands until the matching left parenthesis is found.
Here is the code implementing this algorithm:
```python
from parse_node import *
from parse_token import *


def parse_expression() -> Node | None:
    operators: list[Token] = []
    operands: list[Node] = []

    def reduce_top() -> None:
        # Pop one operator and two operands, push the combined node back.
        op = operators.pop()
        if len(operands) < 2:
            raise ParsingError("missing operand")
        right = operands.pop()
        left = operands.pop()
        operands.append(Node(op, left, right))

    lookahead = peek_token()
    while lookahead.token_type != TokenType.EOF:
        match lookahead.token_type:
            case TokenType.INTEGER:
                operands.append(Node(lookahead))
            case TokenType.BINARY_OPERATOR:
                # Reduce while the operator on top of the stack binds tighter,
                # or equally tight and the incoming operator is left-associative.
                while (
                    operators
                    and operators[-1].token_type != TokenType.LEFT_PARENTHESIS
                ):
                    lookahead_precedence = get_precedence(lookahead.content)
                    last_operator_precedence = get_precedence(operators[-1].content)
                    lookahead_associativity = get_associativity(lookahead.content)
                    if last_operator_precedence > lookahead_precedence or (
                        last_operator_precedence == lookahead_precedence
                        and lookahead_associativity == Associativity.Left
                    ):
                        reduce_top()
                    else:
                        break
                operators.append(lookahead)
            case TokenType.LEFT_PARENTHESIS:
                operators.append(lookahead)
            case TokenType.RIGHT_PARENTHESIS:
                # Reduce until the matching '(' surfaces, then discard it.
                while (
                    operators
                    and operators[-1].token_type != TokenType.LEFT_PARENTHESIS
                ):
                    reduce_top()
                if not operators:
                    raise ParsingError("mismatched parenthesis")
                operators.pop()
        consume_token()
        lookahead = peek_token()

    # Reduce whatever remains on the stacks.
    while operators:
        if operators[-1].token_type == TokenType.LEFT_PARENTHESIS:
            raise ParsingError("mismatched parenthesis")
        reduce_top()
    return operands.pop() if operands else None
```
Precedence climbing is more commonly seen in modern compilers due to its greater flexibility: it can handle more sophisticated expressions, supports direct AST construction, and offers more granular error reporting.