“Top reason why compilers are like women: Miss a period and they go crazy” – anon.
Parsing is the process of understanding the syntax of a language, to allow it to be represented by the compilers internal data structures. The most sophisticated ideas humans can relay to computers are communicated with programming languages. This has made programming languages compromises between the human thought process, the computer’s execution process and the computer’s capability to understand language. The parser deals with the last facet of the problem. As such a lot of research has gone into this area.
The are many ways of automatically recognizing language. In the field of programming languages there are only two significant methods: Top-down parsing and Bottom-up parsing. Top-down parsing partitions a program top-down, programs into modules, modules into subroutines, subroutines into blocks. Bottom-up techniques group tokens together into terms, then expressions, statements, then blocks and subroutines. The distinction will become clear as they are explained. Before explaining how a language is recognized it is useful to explain how its rules, its grammar, can be expressed.
The grammar of a language defines what comprises a meaningful statement of the language. It doesn’t deal with what it means, but that can be added as extra information. Grammars are specified using “productions.” Here is an example of a production:
statement -> if ( expression ) statement else statement
What does it mean? The part to the right of the “->” defines a phrase in a language. The part on the left is the class the phrase falls into. This production can be read “a kind of statement is the token ‘if’ followed by an expression in brackets followed by a statement, then the word ‘else’, ending in a statement.” There will be further rules defining what a statement is, such as:
statement -> while ( expression ) statement
By collecting many productions a grammar is formed. Many such grammar specifications are available on the net (ftp://ftp.iecc.com/pub/file has C, C++, Java & Delphi grammars). Notice that with production, more complex language can be described than with regular expressions. In particular, productions are recursive, so in the above a statement can be defined to include statements, which isn’t possible with regular expressions. Productions were invented by Emil Post and first used to specify a programming language (Fortran) by John Backus (hence Backus-Naur form). It is interesting to note that they were also invented between 200BC-400BC by Panini to specify the rules of Sanskrit grammar.
Discuss (15 comments)