The term lexical analysis refers to the conversion of a raw character-by-character stream into a stream of what are termed tokens. A token is a simple abstraction that strives to decouple the lexical structure of something from its logical meaning.
It's sometimes a surprise to some that compilers and interpreters pay almost no attention to the raw text in source files; the bulk of a compiler works only with tokens. These are often represented by small classes or structures, sometimes with a small header of pointers so that they can be situated in a tree structure (the parse tree).
A token contains several pieces of information. One of them is the lexical text, but also present are a token ID (often represented as an enum) and other metadata such as the line and column on which it was found, and possibly the name or path of the file in which it was found (common for languages that support include/header files).
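As a rough illustration, a token along these lines might be modelled in F# as a record; the field and type names here are my own invention, not taken from any particular compiler:

```fsharp
// Hypothetical token shape: lexeme text, an ID enum and source-location
// metadata, as described above.
type TokenId =
    | Identifier
    | StringLiteral
    | IntegerLiteral
    | Symbol

type Token = {
    Id     : TokenId   // Which kind of token this is.
    Text   : string    // The raw lexical text.
    Line   : int       // Line on which the token was found.
    Column : int       // Column on which the token was found.
    File   : string    // Source file (useful with include/header files).
}
```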
Creating these tokens from the raw source file is the role of the lexical analyzer, sometimes called a scanner, recognizer or tokenizer. I've written lexical analyzers several times: once for a full compiler I worked on for the PL/I language, and another for the C language, which formed part of a toolset able to perform some powerful maintenance operations on C based projects.
In reality a software tool is almost always used to create lexical analyzer source code from a special input written in a token descriptor language. These meta-languages and tools take much of the drudgery out of creating lexical analyzers.
Nevertheless, writing a lexical analyzer from scratch is actually a very good way to study a new programming language, because it often requires creative use of the language and entails non-trivial language constructs and features. Once you've written a working lexical analyzer you'll have gained some insight into the mechanics of programming languages too, and you'll have no reliance on any third party tool either.
Also, once you have a working lexical analyzer you may find your appetite whetted and be tempted to look at the next major stage in programming language processing – grammars and parsing – but that's a whole other story!
Finite State Machines
My technique of choice for designing a lexical analyzer has always been a finite state machine, in which a character read from the input, together with the current state, selects a function from a 2D array. The function performs a few simple steps and then sets a new state, ready for the next character to be read and hence another function to be called to process it.
As characters are read, tokens are created. Sometimes the analyzer will read several characters before returning a token; other times it may read one character and immediately return a token. This all depends upon the length of the token text.
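As a minimal sketch of the dispatch idea (the states, classes and names here are invented for illustration, not part of the analyzer shown later), the 2D array holds one function per state/character-class pair, each returning the next state:

```fsharp
// Sketch of a table-driven FSM: the current state and the class of the
// incoming character select an action function from a 2D array.
type State =
    | Start  = 0
    | InWord = 1

// Classify a character: 0 = other, 1 = alphabetic.
let classify (c: char) = if System.Char.IsLetter c then 1 else 0

// One action per [state, class]; each action returns the new state.
let actions : (char -> State)[,] =
    array2D [
        //  other                    alphabetic
        [ (fun _ -> State.Start); (fun _ -> State.InWord) ]   // Start
        [ (fun _ -> State.Start); (fun _ -> State.InWord) ]   // InWord
    ]

// Process one character: look up the action for this state and class,
// then apply it to obtain the next state.
let step (state: State) (c: char) =
    actions.[int state, classify c] c
```

In a real analyzer each action would also append to a lexeme buffer or emit a token, exactly as described above.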
The longest tokens are often quoted string constants, followed by identifiers, keywords, numeric constants and finally the various symbol tokens used by the language. These are all examples of tokens from the C language:
|#|Used to begin a compiler directive.|
|"Socket was disconnected"|Some hard-coded text message.|
|1286|An integer literal.|
|[|A subscript or array bounds symbol.|
|record_count|A simple identifier.|
|12.087E-03|A floating point literal.|
If we assume the source file contains only a quoted text string like the socket error message above, and that the analyzer begins in state 0 say, then when the analyzer reads the opening quote character it enters a new state (just some numeric value like 1) and initializes a new empty string buffer. It then reads characters and, no matter what they are, appends them to the string buffer, continuing until it reads another quote character. At that point it creates a STRING token, attaches the string buffer to it, fills in a few other details, sets the state back to 0 and finally returns the new token.
A similar process takes place for other tokens – the point is that how the analyzer behaves upon reading some character depends wholly upon its state. Reading a semicolon in state 1 means the semicolon is just part of a string and it gets appended like any other character, but reading a semicolon in state 0 is treated very differently: it terminates any token currently being built.
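The semicolon example can be sketched as a single pattern match (a simplification with states as plain integers and hypothetical names, just to show the state-dependent behaviour):

```fsharp
// The same character means different things in different states: inside a
// string (state 1) a semicolon is ordinary text; outside (state 0) it
// terminates the token being built.
let handle (state: int) (c: char) (buffer: string) =
    match state, c with
    | 1, ch  -> 1, buffer + string ch   // in a string: append anything
    | 0, ';' -> 0, ""                   // terminator: token is complete
    | 0, ch  -> 0, buffer + string ch   // accumulate token text
    | s, _   -> s, buffer               // other states not modelled here
```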
In some cases we need to preserve a character we've just read because it forms part of the next token, so we must ensure we read it again when we are called again to get the next token. An example is the DOT after an IDENTIFIER; consider:

record.header
The analyzer will at some point read the first DOT, and since that is never part of an IDENTIFIER it creates an IDENTIFIER token, attaches the text "record", does whatever else is needed for that token, sets the state to 0 and then must preserve the DOT before it returns the IDENTIFIER token. If this is not done then the DOT will be lost and the next token seen will be the next IDENTIFIER "header". This is very easy to code: we simply have a previous_character variable and some flag previous_present, and embed this somewhere inside our character reading function.
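In an imperative setting the pushback mechanism might look something like this sketch, using the previous_character and previous_present names from the text (read_char and push_back are hypothetical helpers):

```fsharp
open System.IO

// One-character pushback: if a character has been preserved, return it
// before reading anything new from the underlying reader.
let mutable previous_character = ' '
let mutable previous_present = false

// Read the next character, honouring any pushed-back character first.
let read_char (reader: TextReader) =
    if previous_present then
        previous_present <- false
        previous_character
    else
        char (reader.Read ())

// Preserve a character so the next read_char call returns it again.
let push_back c =
    previous_character <- c
    previous_present <- true
```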
Any language that supports function pointers or delegates is well suited to writing an FSM based lexical analyzer, and once the core framework is in place supporting just a few tokens, it is fairly easy to incrementally extend and test until all the required tokens are in place.
This is fine for writing an analyzer in an imperative language like C, C++, C# or Java, but the iterative approach using a mutable state variable is the antithesis of a sound functional design. So I set myself a simple goal: write a simple lexical analyzer in F#, respecting the practices and principles that underpin the functional mindset.
A Functional Lexical Analyzer
I'm very new to these languages, having read about them but rarely written any actual code, and I always set myself real world coding problems when I want to learn a new language. This forces me to read, study and investigate with some degree of "pressure" – something that always helps when one wants results!
The core requirements that my lexical analyzer must meet (in addition to actually working!) are that it must not rely on any mutable variables and must rely on recursion rather than iteration. I'm learning that immutability is tied rather closely to recursion: the latter very much helps with the former.
You see, rather than modifying some variable (e.g. state) as we iterate in a loop, we can instead pass a new state into a recursive invocation of the same code. That invocation will then "see" only the new state – from an execution perspective the same code is running but "in" a new state, yet no state variable has been modified. This is rather subtle but profoundly important when striving to understand functional languages.
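A tiny example of the idea, unrelated to lexing: the running count below is never mutated; each recursive call simply receives a new value as an argument:

```fsharp
// Count vowels in a character list with no mutable state: the running
// count is passed into each recursive invocation rather than updated.
let rec countVowels chars count =
    match chars with
    | [] -> count
    | c :: rest when "aeiou".Contains (string (System.Char.ToLower c)) ->
        countVowels rest (count + 1)
    | _ :: rest -> countVowels rest count

// countVowels (List.ofSeq "immutable") 0 evaluates to 4
```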
Here is my first cut of a working lexical analyzer. It supports only two simple token types – alphabetic words (with optional underscores) and quoted strings – but includes the essential recursion and immutability and can readily be extended to support more kinds of tokens:
// Learn more about F# at http://fsharp.net
// See the 'F# Tutorial' project for more help.

open System.IO

type STATE =
    | START  = 0
    | IDENT  = 1  // Creating an identifier token
    | STRING = 2  // Creating a string token

[<EntryPoint>]
let main argv =
    printfn "%A" argv

    // These functions which identify the 'type' of a character are intended
    // simply to make the code more readable even if some appear as overkill.

    // Simple function that classifies a character as being alphabetic or not.
    let Alpha = function
        | X when X >= 'a' && X <= 'z' -> true
        | X when X >= 'A' && X <= 'Z' -> true
        | '_' -> true
        | _ -> false

    // Simple function that classifies a character as being white space or not.
    let Space = function
        | ' ' -> true
        | ';' -> true
        | _ -> false

    // Simple function that classifies a character as being a quote or not.
    let Quote = function
        | '"' -> true
        | _ -> false

    // Simple function that converts a Char to a String.
    let ToString C = sprintf "%c" C

    let Append S C = S + ToString C

    // Simple recursive state machine which consumes characters from the
    // supplied source and accumulates and returns complete tokens as white
    // spaces are encountered.
    // The design has very few states because it currently recognizes just two
    // kinds of tokens: a simple alphabetic string and a quoted string.
    // The function returns a 'token' quad and the returned state is always
    // START so that the next external invocation begins in the start state.
    // We can avoid this by wrapping the recursive function in a 'start'
    // function which is never part of any recursion.
    let rec tokenize ((source: char list), state, lexeme, is_string) =
        let C = source.Head        // Get the next char.
        let S = List.tail source   // Get the source with that char removed.
        match (C, state) with
        | (C, STATE.START)  when Space C -> tokenize (S, state, lexeme, false)
        | (C, STATE.START)  when Alpha C -> tokenize (S, STATE.IDENT, ToString C, false)
        | (C, STATE.START)  when Quote C -> tokenize (S, STATE.STRING, "", false)
        | (C, STATE.IDENT)  when Space C -> (S, STATE.START, lexeme, false)
        | (C, STATE.IDENT)  when Alpha C -> tokenize (S, state, Append lexeme C, false)
        | (C, STATE.STRING) when Quote C -> (S, STATE.START, lexeme, true)
        | (C, STATE.STRING)              -> tokenize (S, state, Append lexeme C, false)
        | _                              -> (S, STATE.START, lexeme, false)

    // Get a list that contains all source file contents.
    // This test file is situated in the same folder as the .fs source
    // files for this project.
    let source = List.ofSeq (File.ReadAllText ("..\\..\\sample.txt"))

    // Create an initial quad containing the source text, the start state and
    // an empty lexeme; the bool indicates if a lexeme is a quoted string.
    // In reality we'd support a token type enum that would avoid the
    // need for such a flag, but this is purely educational code.
    let context = (source, STATE.START, "", false)

    // Invoke the tokenizer five times; the constructed lexeme (a string)
    // is contained within the returned quad and that quad is passed in
    // again to retrieve the next token and so on...
    let result1 = tokenize context
    let result2 = tokenize result1
    let result3 = tokenize result2
    let result4 = tokenize result3
    let result5 = tokenize result4

    0 // return an integer exit code
The returned token quad contains the remaining unread part of the input stream, because characters are stripped from the head of that list as they are read (a very fast operation in F#). So by passing the returned quad into the next invocation, the process resumes where it "left off" so to speak.
If you create a new F# console app and simply paste the above code over the default code that Visual Studio creates in the project source file, you'll be able to run it under the debugger by putting a breakpoint on the last line of the source (the 0 // return… line above). You'll need to create a simple .txt file in the same folder as the F# source; if you name it 'sample.txt' you can leave the above source unmodified.
By examining the result1, result2, etc. values you'll see for yourself that the individual words from the input file have been tokenized correctly, and that text that was inside quotes has the bool set to true in the returned token quad.
When I get some more time I'm going to enhance the above code to support a richer set of tokens. Perhaps I'll strive to tokenize the full C# language; since I know that language well, it will be easier to test and debug such a tokenizer.