Textbook Methods of Parsing

Among those who write compilers, there are standard techniques for parsing input text according to a formal grammar. Though special cases may call for special tweaks, often much of a compiler can be generated by automated tools.

A typical compiler may be divided into three components: the front end, the intermediate representation, and the back end.

The front end reads the source code and passes it through a series of transformations: lexical analysis, syntactical analysis, and semantic analysis. Finally it constructs the intermediate representation.

The intermediate representation is a complex data structure which embodies the meaning and intent of the source code. At this point the details of language and syntax no longer matter. Stripped of all aspects of mere physicality, the intermediate representation is a pristine abstraction, shimmering unseen in iridescent groves of silicon.

The back end transforms the intermediate representation back into a physical form, namely object code or assembly language. It must be specific for the target machine, but it need not be specific for the source language. Different front ends may share the same back end.

These pages will ignore the intermediate representation and the back end, concentrating instead on the three phases of the front end.

Lexical analysis

Lexical analysis examines successive characters to identify tokens -- atomic units of meaning such as keywords, variable names, and elements of punctuation. This phase is likely to be automated through the use of a tool such as lex. At some level, however, most lexical analyzers consist largely of one or more finite state machines (or "FSMs").

The output of lexical analysis is a stream of tokens.

Syntactical Analysis

Syntactical analysis examines successive tokens to identify larger units of meaning such as expressions and statements. For example, a COBOL compiler might see the following sequence of tokens: ...and conclude that it had found a subscripted variable with two subscripts. It might then discard the comma as irrelevant, incorporate the remaining tokens into some kind of data structure, and pass the whole package to the semantic analyzer.

A syntactical analyzer may use an FSM, whose events would represent tokens rather than individual characters.

For programming constructs which may nest arbitrarily, such as many kinds of expressions, a technique called recursive descent is popular. The analyzer calls itself recursively for each level of nesting.

To represent expressions, a typical syntactical analyzer will build a type of tree called a parse tree. Each internal node of the tree represents an operator, each of whose child nodes is an operand. For nested expressions, an operand may be another operator with operands of its own.

JOCKEY employs a more simplistic approach called predictive parsing. Most of the time we knows what kind of token to expect. If we see the tag '/TN', for example, then the very next token had better be a telephone number. So we call a specialized FSM designed to extract telephone numbers.

Semantic Analysis

Semantic analysis examines the fragments identified by syntactical analysis and attaches meaning to them, meanwhile enforcing the rules of the language.

Consider the example given above, where the syntax indicates a subscripted variable. The semantic analyzer would consult a list of declared variables, find the appropriate offset in memory, and verify that the variable has the right number of subscripts. It would also identify the variable named within the parentheses, and verify that it is suitable for use as a subscript. It might even issue an error or a warning if the literal subscript is not within bounds.

Finally the semantic analyzer integrates all the fragments of syntax to build the intermediate representation.

Caveats

This discussion pretends that the phases of the front end are neatly separated. In real life the boundaries may be fuzzier, and different compilers may draw the dotted lines in different places.

In particular, don't take the example too seriously. It's just a plausible guess about how a hypothetical COBOL compiler might work.


[block]Parsing [gears]Finite State Machines [books]Further reading
[home]COBOL Home