Compilers – The Basics

Pages: 1 2 3 4 5 6

The Lexer (or lexical analyzer)

/bb|[^b]{2}/ – Hamlet

The lexer is the first small step in understanding a language. Its purpose is to decompose the stream of input characters into “tokens”. This is best understood by example. For instance in the C language the statement:

	char str[] = “A good line.”;

Decomposes into:

token 1: Keyword, specifically “char”

token 2: Identifier, specifically “str”

token 3: Left square bracket

token 4: Right square bracket

token 5: Equals sign

token 6: String, specifically “A good line.”

token 7: Semicolon

Whole strings are tokens, as are some individual characters, such as ‘=’. Each token is labeled as belonging to a certain type, such as “identifier”. Some have extra information associated with them; for example the string token contains the string in question. For most languages the lexer is simple, though there are exceptions such as Fortran and Perl. Languages exist that can generate lexers from specifications, such as ‘lex’ and ‘flex.’ These languages specify the form of a token using a regular expression.

A regular expression (RE) is a form of search. For instance ‘function’ is a valid regular expression that will search for the word ‘function’. The ‘f’ at the beginning searches for the character ‘f’, if I had written [Hh] it would mean look for ‘H’ in either case. [] defines a group of characters. Another example is [A-Z], which searches for a capital alphabetic letter. Following any such expression I could put ‘+’ meaning one or more. So, for instance [a-z]+ means one or more lower-case letters. A really useful RE is one to find an identifier, how about [A-Za-z_][A-Za-z0-9_]* this means one alphabetic letter followed by zero or more alphabetic letters or digits. [^b] means any letter but b (which may help you decipher the quote above), there are more symbols that can be used in REs I won‘t describe them because they are made to save typing, like [^b]. `lex’ and `flex’ are fed a list of regular expressions each with an associated action, which passes the token to the next stage, the parser.

Users of *nix systems will know all about REs since they are used for searching and replacing in many editors, utilities and text processing languages for similar purposes. Regular expressions were not always for processing text. They invented by S.C. Kleene in a paper called “Representing events in Nerve Nets” in 1956 as an attempt to understand nerve activity based on the flawed McCulloch-Pitts neuron model. Since then they have also been used in speech recognition and to detect flaws in printed circuit boards. Today they are heavily used in DNA sequencing. For an explanation of how they work see http://lambda.uta.edu/cse5317/notes/node1.html.


Pages: « Prev   1 2 3 4 5 6   Next »

Discuss (15 comments)