Lexical Analysis With F# – Part 3

Although the initial version of a lexical analyzer represents the input file as a List, it also exposes tokens in a crude “one at a time” manner, burdening the token consumer with the responsibility of passing the previous token into the tokenizer function when requesting the next token.

It would be more elegant to expose the tokens as a sequence – a term used in F# (and .Net too I should add) to represent a generator of items in which items are only created when requested by some consumer (like a foreach loop).

A sequence internally ‘’”tracks” state so that each request for an item is always created with respect to the previously created item and that state tracking is encapsulated, this relieves the consumer of any responsibility for doing this which reduces the likelihood of subtle coding errors.

Unfolding A Sequence

F# provides a wonderful mechanism that generalizes this concept and offers the perfect means to improve the lexical analyzer so it exposes a sequence. The mechanism is Seq.unfold – a powerful concept almost tailor made for this need.

The unfold operation requires us to provide a function that a) accepts a state value and b) returns a pair comprised of a generated value and a (usually new) state. The repetitive passing of that returned state into the next invocation of our function is handled automatically and invisibly by Seq.unfold.

The value returned by Seq.unfold is a true sequence (IEnumerable<T> under the hood) and can be enumerated with ease.

Seq.unfold is absolutely perfect for tokenizing because it’s a perfect fit for the existing tokenize function which accepts the previous token as the state value, here is a modified version of the analyzer which when run under debug prints each token from the file as a line on the console window:

// Learn more about F# at http://fsharp.net
// See the 'F# Tutorial' project for more help.
open System.IO

type STATE =
     | START
     | IDENT     // Creating an identifier token
     | STRING    // Creating a string token

let main argv =
    printfn "%A" argv

    // These functions which identify the 'type' of a character are intended
    // simply to make the code more readable even if some appear as overkill.

    // Simple function that classifies a character as being alphabetic or not.

    let Alpha = function
        | X when X >= 'a' && X <= 'z' -> true
        | X when X >= 'A' && X <= 'Z' -> true
        |'_' -> true
        | _  -> false

    // Simple function that classifies a character as being white space or not.

    let Space = function
        |' ' -> true
        |';' -> true
        | _  -> false

    // Simple function that classifies a character as being a quote or not

    let Quote = function
        |'"' -> true
        | _  -> false

    // Simple function that converts a Char to a String.

    let ToString C = sprintf "%c" C

    // Simple function that appends a char to a String.

    let Append S C = S + ToString C

    // Simple recursive state machine which consumes characters from the
    // supplied source and accumulates and returns complete tokens as white
    // spaces are encountered.
    // The design has very few states because it currently recognizes just two
    // kinds of tokens a simple alphabetic string and a quoted string.
    // The function returns a 'token' quad and the returned state is always 0
    // so that the next external invocation begins in the start state. We can
    // avoid this by wrapping the recursive function in a 'start' function
    // which is never part of any recursion.

    let rec tokenize ((source:List<char>), state, lexeme, is_string) =
        let C = source.Head;      // Get the next char.
        let S = List.tail source  // Get the source with that char removed.
        match (C, state) with
        // Actions associated with the start state
        | (C, STATE.START)  when Space C -> tokenize (S, state, "",false)
        | (C, STATE.START)  when Alpha C -> tokenize (S, STATE.IDENT, ToString C,false)
        | (C, STATE.START)  when Quote C -> tokenize (S, STATE.STRING, "",false)
        // Actions associated with the identifier state
        | (C, STATE.IDENT)  when Space C -> (S, STATE.START, lexeme,false)
        | (C, STATE.IDENT)  when Alpha C -> tokenize (S, state, Append lexeme C,false)
        | (C, STATE.STRING) when Quote C -> (S, STATE.START,lexeme,true)
        // Actions associated with the string state
        | (C, STATE.STRING)              -> tokenize (S, state, Append lexeme C,false)
        | _                              -> ([],STATE.START,"",false)

    // Get a list that contains all source file contents.
    // This test file is situated in the same folder as the .fs source
    // files for this project.
    let source = List.ofSeq(File.ReadAllText ("..\\..\\sample.txt"))

    // Helper function to get the input stream from the tuple used to represent a token.
    let the_stream (a,b,c,d) = a

    // Helper function to get the lexeme string from the tuple used to represen a token
    let the_lexeme (a,b,c,d) = c

    // Helper function to get the bool flag which indicates if a token was text
    // or quoted text.
    let is_string (a,b,c,d) = d

    // Helper function invoked by the seq unfold operation. This function is passed
    // an argument and expected to compute the 'next' item in the sequence. In this
    // design a returned token is passed back in as the argument and serves as a
    // context or state to be passed to each successive tokenize operation.
    let unfolder value =
        if List.isEmpty (the_stream value) then
           None // Denotes end of sequence, causes foreach etc to end.
           let token = tokenize value
           // In this simple design a token contains the unread input
           // stream and state so serves as both the returned value
           // (of the enumeration) and the state - hence we use the
           // same value for both args in 'Some'

    // Defines a sequence using the unfolder function and beginning with the
    // initialized start token value.
    let token_sequence = Seq.unfold (unfolder) (source,STATE.START,"",false)

    // Iterate over the sequence of tokens and print their lexeme.
    for token in token_sequence do
        if (is_string token) then
           printfn "Constant: '%s'" (the_lexeme token)
           printfn "Variable: %s" (the_lexeme token)

    0 // return an integer exit code

When you run this under debug (with a break on that last line) you’ll see a list of the tokens found within the input file – remember this simple example supports only identifiers or quoted strings – no checking or resilience is in place for unexpected or illegal tokens or file content.

Here is the output from one test:


and this was the input file from which the above was generated:

this is "just" some "text" to play with

Although this is a very simple lexical analyzer it is beginning to take on a solid shape with a neat approach to exposing tokens as a sequence and a neat recursive operation for creating the tokens. IN a future post I’ll restructure the code so that the input file is treated as a sequence rather than a list so that the on-demand sequence pattern propagates right the way down the chain.

In a real tokenizer we’d also probably represent a token with something more specific than a tuple and we’d support a token type enum too, this would eliminate the current crude bool ‘flag’ which is used to represent a quoted string.

One comment on “Lexical Analysis With F# – Part 3

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s