The lexer hack
From Seo Wiki - Search Engine Optimization and Programming Languages
In parsing computer programming languages, the lexer hack (as opposed to "a lexer hack") is a common solution to the problem that arises when a lexer based on a regular grammar has to classify identifier tokens in ANSI C as either variable names or type names.
In a compiler, the lexer performs one of the earliest stages of converting the source code to a program. It scans the text to extract meaningful tokens, such as words, numbers, and strings. The parser analyzes sequences of tokens attempting to match them to syntax rules representing language structures, such as loops and variable declarations. A problem occurs here if a single sequence of tokens can ambiguously match more than one syntax rule.
For example, in the C expression
(A) * B
the lexer may find these tokens:
- left parenthesis
- identifier 'A'
- right parenthesis
- operator '*'
- identifier 'B'
The parser can interpret this either as the variable A multiplied by B, or as a cast of the dereferenced value of B to the type A. This is known as the "typedef-name: identifier" problem.
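The token stream above can be reproduced with a small regular-expression lexer. The following is a minimal sketch, not a real C lexer: the token names (LPAREN, IDENTIFIER and so on) and the function tokenize are assumptions made for illustration. It shows that a lexer working from character patterns alone has no basis for labelling A as anything other than a plain identifier.

```python
import re

# Hypothetical token categories; a real C lexer has many more.
TOKEN_SPEC = [
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("STAR", r"\*"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(source):
    """Split source into (kind, text) pairs using regular expressions only."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if match.lastgroup != "SKIP":  # whitespace is discarded
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


# A and B both come out as IDENTIFIER: their spelling carries no hint
# about whether either one names a variable or a type.
for kind, text in tokenize("(A) * B"):
    print(kind, text)
```

Whether A was introduced by a typedef or by a variable declaration, this lexer emits the same five tokens, so the choice between the two readings is left entirely to the parser.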
The hack solution
The lexer cannot distinguish type identifiers from other identifiers without extra context, because all identifiers have the same format. The solution generally consists of feeding information from the parser's symbol table back into the lexer. This commingling of the lexer and parser is generally regarded as inelegant, which is why it is called a "hack".
With the hack applied to the example above, when the lexer finds the identifier A it can look the name up in the symbol table and, if A has been declared as a type, classify the token as a type identifier. The grammar can then require a type identifier in typecasts, and the ambiguity disappears.
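A minimal sketch of this feedback loop is given below. The names SymbolTable, declare_type, TYPE_NAME and tokenize are hypothetical, and scoping is ignored entirely. The sketch assumes that the parser calls declare_type whenever it finishes processing a typedef declaration; the lexer is written as a generator so that each identifier is classified only when the parser asks for the next token.

```python
import re

IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
PUNCTUATION = {"(": "LPAREN", ")": "RPAREN", "*": "STAR", ";": "SEMI"}


class SymbolTable:
    """Names the parser has seen declared as types (simplified: no scopes)."""

    def __init__(self):
        self.type_names = set()

    def declare_type(self, name):
        # Assumed to be called by the parser after a typedef declaration.
        self.type_names.add(name)

    def is_type(self, name):
        return name in self.type_names


def tokenize(source, symbols):
    """Yield (kind, text) pairs, consulting the symbol table for identifiers."""
    pos = 0
    while pos < len(source):
        ch = source[pos]
        if ch.isspace():
            pos += 1
        elif ch in PUNCTUATION:
            yield (PUNCTUATION[ch], ch)
            pos += 1
        else:
            match = IDENT_RE.match(source, pos)
            if match is None:
                raise SyntaxError(f"unexpected character {ch!r} at {pos}")
            name = match.group()
            # The hack: the lexer asks the parser's symbol table what the name is.
            kind = "TYPE_NAME" if symbols.is_type(name) else "IDENTIFIER"
            yield (kind, name)
            pos = match.end()


symbols = SymbolTable()
before = list(tokenize("(A) * B", symbols))  # A is still an ordinary identifier
symbols.declare_type("A")                    # as if "typedef int A;" had been parsed
after = list(tokenize("(A) * B", symbols))   # A is now reported as TYPE_NAME
```

The same source text yields different token kinds depending on what has been declared so far, which is exactly the dependency of the lexer on parser state that earns the technique its name.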
This problem does not arise (and hence no "hack" is needed to solve it) when using lexerless parsing techniques.
The yacc-derived BtYacc ("Backtracking Yacc") gives the generated parser the ability to make multiple attempts at parsing the tokens. In the problem described here, if an attempt fails because of semantic information about the identifier, the parser can backtrack and try other rules.
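The general idea of backtracking on a semantic failure can be sketched by hand as follows. This is an illustration only and does not use BtYacc's grammar syntax or reflect its internals; the rule functions, the ParseFailure exception and the tuple-based result format are all assumptions. The cast rule is tried first, and if the name in parentheses is not a known type, the parser abandons that attempt and tries the multiplication rule on the same tokens.

```python
class ParseFailure(Exception):
    """Raised when a rule does not match or a semantic check rejects it."""


SHAPE = ["LPAREN", "IDENTIFIER", "RPAREN", "STAR", "IDENTIFIER"]


def parse_cast(tokens, type_names):
    """cast: '(' type-name ')' '*' identifier"""
    if [kind for kind, _ in tokens] != SHAPE:
        raise ParseFailure("not shaped like a cast")
    name = tokens[1][1]
    if name not in type_names:
        # Semantic check: the parenthesized name must be a declared type.
        raise ParseFailure(f"{name} is not a type")
    return ("cast", name, ("deref", tokens[4][1]))


def parse_multiplication(tokens, type_names):
    """multiplication: '(' identifier ')' '*' identifier"""
    if [kind for kind, _ in tokens] != SHAPE:
        raise ParseFailure("not shaped like a multiplication")
    return ("multiply", tokens[1][1], tokens[4][1])


def parse_expression(tokens, type_names):
    """Try each rule in turn, backtracking to the next one on failure."""
    for rule in (parse_cast, parse_multiplication):
        try:
            return rule(tokens, type_names)
        except ParseFailure:
            continue  # discard this attempt and try the next rule
    raise ParseFailure("no rule matched")


# Tokens for "(A) * B" as a context-free lexer would produce them.
TOKENS = [
    ("LPAREN", "("),
    ("IDENTIFIER", "A"),
    ("RPAREN", ")"),
    ("STAR", "*"),
    ("IDENTIFIER", "B"),
]

as_cast = parse_expression(TOKENS, type_names={"A"})
as_product = parse_expression(TOKENS, type_names=set())
```

Here the lexer stays context-free and the symbol information is consulted only inside the parser, at the cost of possibly parsing the same tokens more than once.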
- Roskind, James A. (1991-07-11). "A YACC-able C++ 2.1 GRAMMAR, AND THE RESULTING AMBIGUITIES". http://www.cs.utah.edu/research/projects/mso/goofie/grammar5.txt.
- Bendersky, Eli (2007-11-24). "The context sensitivity of C's grammar". http://eli.thegreenplace.net/2007/11/24/the-context-sensitivity-of-cs-grammar/.
- "BtYacc 3.0". http://www.siber.com/btyacc/. Based on yacc with modifications by Chris Dodd and Vadim Maslov.