Compilers – The Basics

Pages: 1 2 3 4 5 6

The Parser (cont.)

Top-down Parsing

A common top-down parsing technique is “predictive parsing”. It can be used quite easily to parse C and Pascal. The more general method of “recursive descent” with backtracking can deal with C++. Predictive parsing is the most direct and elegant parsing method, and many languages, such as C, were developed with it in mind. To understand how this works I’ll explain how it is applied to a simple grammar.

expression -> term

		 | expression + term

		 | expression – term

This is shorthand for the three productions:

expression -> term

expression -> expression + term

expression -> expression – term

also:

term -> number

	 | term * number

	 | term / number

Number can be defined by a regular expression ( e.g. [9-0]+ ). It is a token.

This grammar understands simple arithmetic. Here I have called the component parts of the language expressions and terms, but I could have called them anything. To make a predictive parser, a subroutine ‘expression’ is written to parse expressions. This subroutine immediately calls ‘term’, a subroutine to understand terms. ‘Term’ then searches for a number token. After this it searches for a * or /. If it finds one it looks for another number. If it doesn’t find * or / its job is over and it hands back what it has found to ‘expression’, which attempts to finish the job.

This is the general pattern of a predictive parser. Go to the subroutine that understands the most basic phrase, when it can’t understand what it is given go to a higher level subroutine, and so on. In this way a top-down parser fits in nicely with the concept of top-down design, since the highest level subroutines call lower level routines to do their work for them. It is called a “predictive parser” because the higher level routine has to predict which of several lower level routines to call based only on the first token of the phrase that one of the routines must understand. The first token must predict the subroutine call. This is a serious limitation, but irrelevant if the language has been designed with predictive parsing in mind.

The parsing subroutines in the above example can do other things than recognize tokens; they can perform actions on the way. For instance when a term is recognized by the rule:

term -> term * number

The calculation of term * number can be performed and the result returned to the calling subroutine, expression. This could be written:

term -> term * number { return term * number }

similarly

expression -> expression + term { return expression + term }

Now you can see why I divided the arithmetic problem into terms and expressions. The terms are evaluated first, then their results are passed to expression, so the rules of “precedence” in arithmetic are treated correctly. Multiplication and division come before addition and subtraction. This is how precedence is dealt with in a recursive descent parser. In a compiler, the actions associated with each production could be used to generate assembly language, but generally they will be used to generate a data structure called an abstract syntax tree I alluded to earlier.

Bottom-up Parsing

In contrast to top-down parsing, bottom-up parsers are almost never written by hand, they are generated directly from grammars using parser generators such as Yacc and Bison. In fact, they are quite difficult to write, but once a parser generator is written that problem is solved. Yacc is an acronym for “Yet another compiler compiler”, while Bison is an extended version of Yacc. A bottom up parser starts by grouping together all the tokens on the right hand side of a production. When faced with:

8 + 7 / 5

it will group “7 / 5” then group that with “8 +”. This approach isn’t limited by the same problems as the predictive parser. Situations where the first token of a production doesn’t indicate how to parse it aren’t a problem. Bottom-up parsing has other problems of its own, specifically it isn’t always clear whether to add some more tokens to a group or start applying a new production. E.g.

statement -> if ( expression ) statement else statement

		 | if ( expression ) statement

In this case when the parser has seen “if ( expression ) statement” it doesn’t know whether to look for “else statement” or look for the next statement. This is a “shift-reduce” conflict. When specifying grammars in Yacc or Bison there are ways of eliminating these conflicts. Bottom up parsers also tend to have more problems with errors than top-down ones. They understand all the individual bits of a phrase before the highest level part, by which time the remaining bits make no sense. Although I haven’t mentioned it, when grammars are created for languages, generally more productions are added to deal with errors. e.g.

statement -> if !(

(where ! means not)

Will catch ‘if’ statements not followed by a bracket. The compiler can then give a useful error message such as “expression in if statement must be in brackets on line X”. This kind of stuff makes the grammar much more complex, but it is worth it to give sensible error messages.

For many languages bottom-up parsers are the only choice, because the languages have grown around a parser generator. I have often thought that the way humans understand language is a mixture of bottom-up and top-down methods. Particularly, human languages suffer from many shift-reduce conflicts, for example adding ‘not’ to the end of a sentence reverses its meaning (think “Wayne’s world”). Often, clever writing and script-writing depends on changing the meaning of words at the last minute it is possible to do so, relying on the readers mind to have already drawn the conclusion.


Pages: « Prev   1 2 3 4 5 6   Next »

Discuss (15 comments)