Friday, June 21, 2013

The JavaScript tokenizer, a finite state machine

As promised, I will start by talking about the tokenizer to explain the main concept. I also have to improve it as - today - there is at least one identified bug and a missing feature (JavaScript operator recognition).

The tokenizer

Initially, my very first parsing algorithm was based on regular expressions I found on the net. It was not bigger than 40 lines of code and worked fine.
However, the JavaScript regular expression object does not offer the necessary APIs to "inject" data asynchronously. Consequently, the whole text had to be loaded to start the parsing. Furthermore, from a performance perspective, it was slow and resource consuming.

Because of the different constraints I put on the tokenizer, I had to create an algorithm that was capable of parsing JavaScript by bits, i.e. with consecutive inputs. Considering the smallest item possible, I decided that the algorithm would process each character one by one and would know exactly:
  • What it has done before (i.e. the current state)
  • How to process the next character (i.e. what is the next step considering the current input)
It means that I defined a finite-state machine.

A simple example

Let's clarify this with a simple example: why don't we write a parser that would extract all numbers from a text stream?
You have two approaches:
  • The first one filters out only digits from the stream. For instance the string "abc123de4f" would output "1", "2", "3" and then "4"
  • The second one - the most interesting one - will try to group digits to output numbers. It means that "abc123de4f" would outut "123" and then "4".

First of all, let's define the "context" of the machine, it will be composed of one field named state that will allow us to know in which step we are (this will be detailled after). We will also need to store the numbers that have been already processed: we create an array named digits.

from a general manner, it is faster (and somehow memory safer) to add items to an array rather than concatenate a string with new characters. This is a trick that I use very often.
var state = 0; // Initial state var digits = []; Now let's detail how this machine works:

When I am in state 0 and I receive a character, it can be either a digit (0..9) or anything else. If a digit is received, I will change the state to 1 which will indicate that I started to receive one digit. Otherwise, I remain in state 0 (and I am waiting for a digit).

I will use the following notation to summarize:
(0) "0" → (1)
(0) "1" → (1)
   ...
(0) "9" → (1)
(0) ? → (0)

When I am in state 1, I will invert the logic: I will remain in state 1 until I receive a character that is not a digit: in that case, I output the number and go back to state 0.

(1) "0" → (1)
(1) "1" → (1)
    ...
(1) "9" → (1)
(1) ? → (0)

From a coding point of view, this generate the following method:
function filterNumbers( input ) { var idx, oneChar; for( idx = 0; idx < input.length; ++idx ) { // Read one char from the input oneChar = input.charAt( idx ); if( 0 === state ) { // Check if a digit if( 0 <= oneChar && oneChar <= "9" ) { digits = [ oneChar ]; state = 1; } // Else ignore and remain in state 0 } else if( 1 === state ) { // Check if a digit if( 0 <= oneChar && oneChar <= "9" ) { // Add to the collected digits digits.push( oneChar ); } else { // Digits over, output the result alert( digits.join( "" ) ); // And go back to state 0 state = 0; } } } }
Want to try? Enter the value to analyse and click test:
It looks great!

However there are two problems:
  1. In the code above, there is no way to finalize the machine; For instance, if you try with "abc123de4", the value "4" will not be outputted. This is because the machine remains in state 1 and is waiting for other digits. When the source has ended, we must provide a way to signal that no other character will be injected.
  2. As a consequence of the above, two consecutive calls use the same context. For instance, after trying with "abc123de4", another test with "f" will eventually output "4". This might not be a problem after all, but having a method to reset the machine can be useful. Ideally, having a way to save and provide the context to the machine is better: it allows complex manipulations such as switching.

Let's add symbols

The above example shows a way to sort numbers out of a string. Let say that we also want to extract the possible JavaScript symbols (such as +, -, |, &, :, ; ) and, ideally, identify operators which may be group of symbols (such as ++, +=, ==, ===).

The first action is to list the possible characters and operators. After digging the net (I have to admit that I never used some of them), I consolidated the following list:

* *= / /= % %= ^ ^= ~ ~=
 + ++ += - -- -= | || |= & && &=
 = == ===  ! != !==
> >> >= >>= >>> >>>= < << <= <<=
[ ] ( ) . , ; ? :

Hence a total of 47 allowed combinations of the characters: (){}[]<>|&?,.;:!=+-*/%^

Please note that //, /* and */ are not taken into account because already used & defined for comments.
The same way, " and ' are reserved for strings, this is why they don't appear in this list.

The very first approach - that is currently implemented in the tokenizer - consists in identifying any character in the above list and output it as a symbol.
However, this does not allow us to detect operators.

The same way, unlike for numbers, it is not possible to keep all symbol characters together and throw them when something different is found. For instance, with the string "abc123++==d" the tokenizer  should be able to output "123", "++" and "==" and not "++==".

This is why we must be able to determine when a new character may be concatenated with the previous ones or not.

One way is to associate states to each possibility. Focusing on "*" and "*=", This would lead to:

(0) "0" → (1)
   ...
(0) "9" → (1)
(0) "*" → (2)
(0) ? → (0)


(1) "0" → (1)
    ...
(1) "9" → (1)
(1) "*" → (2)
(1) ? → (0)

(2) "0" → (1)
    ...
(2) "9" → (1)
(2) "*" → (2)
(2) "=" → (3)
(2) ? → (0)

(3) "0" → (1)
    ...
(3) "9" → (1)
(3) "*" → (2)
(2) ? → (0)

Hum... ok so if we add a new state for each symbol, it means adding 47 states and handling probably at least the same amount of transitions between each state... sounds huge.
This surely can be automated (and generated) but I would like to try a different approach.

Let's consider a state (2) that handles all symbols. The idea is to be able to determine, while in step 2, if the next character can be the next one of the "current" symbol or not.

Let's redefine the context of the machine:
var state = 0; // Initial state var buffer = []; var _TOKEN_SYMBOL_LIST = "(){}[]<>|&?,.;:!=+-*/%^"; // This will generate an array containing all allowed symbols var _TOKEN_ALLOWED_SYMBOLS = ( "* *=" + " / /=" + " % %=" + " ^ ^=" + " ~ ~=" + " + ++ +=" + " - -- -=" + " | || |=" + " & && &=" + " = == ===" + " ! != !==" + " > >> >= >>= >>> >>>=" + " < << <= <<=" + " [ ] ( ) . , ; ? :" ).split( " " );
Now we need a method to determine if the next character can be concatenated to the ones we have in the buffer, the signature will be:
function filterNumbersAndSymbolsIsValidSymbol( chars, newChar ) { // Is newChar part of a sybmol that starts with chars (an array of chars) }

This leads the engine states to be:
(0) <digit> → (1)
(0) <symbol> → (2)(0) ? → (0)


(1) <digit> → (1)
(1) <symbol> → (2)(1) ? → (0)

(2) <digit> → (1)
(2) <symbol> → (2)
(2) ? → (0)

Considering that <digit> is "0"..."9", <symbol> is one char among _TOKEN_SYMBOL_LIST.
It also means that the following transition:
(2) <symbol> → (2)
May output symbols if the current character does not concatenate with the previous ones.

This leads to the new method:
function filterNumbersAndSymbols( input ) { var idx, oneChar; for( idx = 0; idx < input.length; ++idx ) { // Read one char from the input oneChar = input.charAt( idx ); if( 0 === state ) { // Check if a digit if( 0 <= oneChar && oneChar <= "9" ) { buffer = [ oneChar ]; filterNumbersAndSymbolsStateChange( 1 ); // Check if a symbol } else if( -1 < _TOKEN_SYMBOL_LIST.indexOf( oneChar ) ) { buffer = [ oneChar ]; filterNumbersAndSymbolsStateChange( 2 ); } // else ignore and remain in state 0 } else if( 1 === state ) { // Check if a digit if( 0 <= oneChar && oneChar <= "9" ) { // Add to the collected digits buffer.push( oneChar ); } else { // Digits over, output the result filterNumbersAndSymbolsOutput( "number", buffer.join( "" ) ); if( -1 < _TOKEN_SYMBOL_LIST.indexOf( oneChar ) ) { buffer = [ oneChar ]; filterNumbersAndSymbolsStateChange( 2 ); } else { // go back to state 0 filterNumbersAndSymbolsStateChange( 0 ); } } } else if( 2 === state ) { if( -1 < _TOKEN_SYMBOL_LIST.indexOf( oneChar ) ) { if( filterNumbersAndSymbolsIsValidSymbol( buffer, oneChar ) ) { buffer.push( oneChar ); } else { filterNumbersAndSymbolsOutput( "symbol", buffer.join("") ); filterNumbersAndSymbolsStateChange( 2 ); buffer = [ oneChar ]; } } else { // Symbol over, output the result filterNumbersAndSymbolsOutput( "symbol", buffer.join( "" ) ); if( 0 <= oneChar && oneChar <= "9" ) { buffer = [ oneChar ]; filterNumbersAndSymbolsStateChange( 1 ); } else { // go back to state 0 filterNumbersAndSymbolsStateChange( 0 ); } } } } }
Please note the use of these two functions that will be clarified later:
function filterNumbersAndSymbolsOutput( type, token ) { alert( type + ": " + token ); } function filterNumbersAndSymbolsStateChange( value ) { state = value; }

Defining filterNumbersAndSymbolsIsValidSymbol

This is where the challenge - and the fun - is. There are several ways to define this functions and I wanted to try each of them to evaluate their performance and ease of maintenance.

Version 0
I wanted to use the array _TOKEN_ALLOWED_SYMBOLS to verify if the new symbol composed of the chars and the new one would be valid. This leads to the following function:
function filterNumbersAndSymbolsIsValidSymbol_0( chars, newChar ) { var token = chars.join("") + newChar; for( var idx = 0; idx < _TOKEN_ALLOWED_SYMBOLS.length; ++idx ) if( _TOKEN_ALLOWED_SYMBOLS[ idx ] === token ) return true; return false; }

Version 1
Another version consists in defining a huge condition that would test all possible symbols at once, something like ( "*" === symbol || "*=" === symbol || ...)
However, writing this function would take time that's why I preferred to 'build' this function with the actual list of allowed symbols:
var filterNumbersAndSymbolsIsValidSymbol_1 = (function(){ var src = [], idx; for( idx = 0; idx < _TOKEN_ALLOWED_SYMBOLS.length; ++idx ) src.push( "\"" + _TOKEN_ALLOWED_SYMBOLS[ idx ] + "\" === token" ); return new Function( "var token = arguments[0].join(\"\") + arguments[1]; return " + src.join( "||" )+ ";" ); })();

Version 2
Well, version 2 is the equivalent of version 1 but 'copied' from its definition. Indeed, when evaluating the performances, I observed something interesting I would like to demonstrate here.
function filterNumbersAndSymbolsIsValidSymbol_2() { var token = arguments[0].join("") + arguments[1]; return "*" === token||"*=" === token||"/" === token||"/=" === token||"%" === token||"%=" === token||"^" === token||"^=" === token||"~" === token||"~=" === token||"+" === token||"++" === token||"+=" === token||"-" === token||"--" === token||"-=" === token||"|" === token||"||" === token||"|=" === token||"&" === token||"&&" === token||"&=" === token||"=" === token||"==" === token||"===" === token||"!" === token||"!=" === token||"!==" === token||">" === token||">>" === token||">=" === token||">>=" === token||">>>" === token||">>>=" === token||"<" === token||"<<" === token||"<=" === token||"<<=" === token||"[" === token||"]" === token||"(" === token||")" === token||"." === token||"," === token||";" === token||"?" === token||":" === token; }

Version 3
This last - and preferred - version has been developed to be the most efficient one. It tests each situation by starting with the length of the current buffer and I observed that knowing the first char is a good way to see what will be allowed next.
As a result, it leads to the following function:
function filterNumbersAndSymbolsIsValidSymbol_3( chars, newChar ) { var firstChar = chars[ 0 ]; if( 1 === chars.length ) { if( -1 < "(){}[].,;:?".indexOf( firstChar ) ) return false; else if( -1 < "!^~*/%".indexOf( firstChar ) ) return "=" === newChar; else return "=" === newChar || firstChar === newChar; } else if( 2 === chars.length ) { if( -1 < "+-|&".indexOf( firstChar ) ) return false; else if( "<" === firstChar ) { return "<" === chars[ 1 ] && "=" === newChar; } else if( -1 < "=!".indexOf( firstChar ) ) { return "=" === newChar; } else if( ">" === firstChar ) { return "=" !== chars[ 1 ] && ( "=" === newChar || ">" === newChar ); } } else if( 3 === chars.length ) { return ">" === firstChar && "=" !== chars[ 2 ] && "=" === newChar; } return false; }

Choosing the right one

To select (and verify) the best one, I had to write a decent testing function. That's why I decided to write one that lists all verified symbols.
function filterNumbersAndSymbolsDumpAllValidSymbols( estimate ) { var res, // Result array (will contain all generated symbols) // Method selection based on a SELECT control in the page method = document.getElementById( "filterNumbersAndSymbols_method" ).selectedIndex, filterNumbersAndSymbolsIsValidSymbol = this[ "filterNumbersAndSymbolsIsValidSymbol_" + method ], // test function (recursive) test = function( buffer ){ var insert = false; // Try to inject each symbol char in the current sequence for( var jdx = 0; jdx < _TOKEN_SYMBOL_LIST.length; ++jdx ) { var newChar = _TOKEN_SYMBOL_LIST.charAt( jdx ); if( filterNumbersAndSymbolsIsValidSymbol( buffer, newChar ) ) { // Valid, call the test function recursively var newBuffer = [].concat( buffer ); newBuffer.push( newChar ); test( newBuffer ); } else // Invalid (but the buffer is valid), dump the buffer insert = true; } if( insert ) res.push( buffer.join("") ); }, // time estimate helper count = 0, max = 1, dtStart, timeSpent, msg; if( estimate ) { max = 100; dtStart = new Date(); } for( count = 0; count < max; ++count ) { res = []; for( var idx = 0; idx < _TOKEN_SYMBOL_LIST.length; ++idx ) test( [ _TOKEN_SYMBOL_LIST.charAt( idx ) ] ); } msg = [ res.join( " " ), "Count: " + res.length + " / " + _TOKEN_ALLOWED_SYMBOLS.length ]; if( estimate ) { timeSpent = (new Date()) - dtStart; msg.push( "Time spent: " + timeSpent + "ms" ); msg.push( "Function:", filterNumbersAndSymbolsIsValidSymbol ); } document.getElementById( "filterNumbersAndSymbols_output" ).value = msg.join( "\r\n" ); }

This leads to:


Conclusions

First of all, it looks like the fastest function is the last one. It can be easily explained: instead of joining the buffer and the new character (which takes time) to finally test all possibilities (without any specific order), the last version tries to distinguish the current symbol and find the appropriate condition based on the current length and the list of possibilities. What is surprising anyway is the fact that a dynamically generated function is slower than the same one declared statically. I must document myself to understand why but this is important to remember for performance reasons.
There are still two things to discuss:
  • How do you finalize the token when the input string is empty?
  • How do you debug this?

Finalizing the token

This last step does not need any input, the idea is to consider what we already have and - if valid - output the result (in our case, this is really easy as any symbol built so far is valid). This will consequently depend on the current state of the engine.

Debugging

Well, recent browsers offer the most convenient way to debug the JavaScript code. However, when the bit of code you are looking for is called after several inputs, the task may quickly become tedious to get it. One way I use to debug the engine (that you will retrieve by checking the release page) is to dump the input as well as engine information mixed inside it.
function filterNumbersAndSymbolsStateChange( value ) { var span = document.createElement( "span" ); span.style = "color: red;" span.innerHTML = "(" + value + ")"; document.getElementById( "filterNumbersAndSymbols_debug" ).appendChild( span ); state = value; } function filterNumbersAndSymbols( input ) { // Clear the output var debug = document.getElementById( "filterNumbersAndSymbols_debug" ); debug.innerHTML =""; // Set the initial state filterNumbersAndSymbolsStateChange( 0 ); // Parse the input string var idx, oneChar; for( idx = 0; idx < input.length; ++idx ) { // Read one char from the input oneChar = input.charAt( idx ); // Dump the input debug.appendChild( document.createTextNode( oneChar ) ); if( 0 === state ) { // Check if a digit if( 0 <= oneChar && oneChar <= "9" ) { buffer = [ oneChar ]; filterNumbersAndSymbolsStateChange( 1 ); // Check if a symbol } else if( -1 < _TOKEN_SYMBOL_LIST.indexOf( oneChar ) ) { buffer = [ oneChar ]; filterNumbersAndSymbolsStateChange( 2 ); } // else ignore and remain in state 0 } else if( 1 === state ) { // Check if a digit if( 0 <= oneChar && oneChar <= "9" ) { // Add to the collected digits buffer.push( oneChar ); } else { // Digits over, output the result filterNumbersAndSymbolsOutput( "number", buffer.join( "" ) ); if( -1 < _TOKEN_SYMBOL_LIST.indexOf( oneChar ) ) { buffer = [ oneChar ]; filterNumbersAndSymbolsStateChange( 2 ); } else { // go back to state 0 filterNumbersAndSymbolsStateChange( 0 ); } } } else if( 2 === state ) { if( -1 < _TOKEN_SYMBOL_LIST.indexOf( oneChar ) ) { if( filterNumbersAndSymbolsIsValidSymbol_3( buffer, oneChar ) ) { buffer.push( oneChar ); } else { filterNumbersAndSymbolsOutput( "symbol", buffer.join("") ); filterNumbersAndSymbolsStateChange( 2 ); buffer = [ oneChar ]; } } else { // Symbol over, output the result filterNumbersAndSymbolsOutput( "symbol", buffer.join( "" ) ); if( 0 <= oneChar && oneChar <= "9" ) { buffer = [ oneChar ]; filterNumbersAndSymbolsStateChange( 1 ); } else { // go back to state 0 filterNumbersAndSymbolsStateChange( 0 ); } } } } // Finalize token if( 1 === state ) filterNumbersAndSymbolsOutput( "number", buffer.join( "" ) ); else if( 2 === state ) filterNumbersAndSymbolsOutput( "symbol", buffer.join( "" ) ); } Want to test it?

Tuesday, June 18, 2013

Blog and web site update

A quick update of the blog to insert tools used to:
  • reformat javascript snippets (using tokenizer)
  • add some colors (with a custom CSS)
A demonstration? look at the following snippet:

// This is an example of dynamically reformatted JavaScript code alert( "Hello world!" );
Everything has been uploaded to GitHub.
The same way, I updated the web site (http://buchholz.free.fr/gpf-js) to reflect the latest updates.

I am preparing a big article on the way the tokenizer works so stay tuned to the blog.

Friday, June 7, 2013

First version on GitHub

Welcome back,

After days of fine tuning, bug fixing and hard working, I finally finished a first version that demonstrates the concepts I exposed in my first posts.

To celebrate this, I created a GitHub account and each version will be published through it: https://github.com/ArnaudBuchholz/gpf-js. On top of that, the last version will always be published on my personnal website: http://buchholz.free.fr/gpf-js/

For instance, to test the release 'compiler', go to:
http://buchholz.free.fr/gpf-js/release.html
To test the debug version:
http://buchholz.free.fr/gpf-js/test.html
And to test the release version:
http://buchholz.free.fr/gpf-js/test.html?release

There are still some bugs that must be addressed, especially in the JavaScript tokenizer but they don't prevent the 'compiler' to work.

The next steps on this blog are:
- Add some colors and benefit from the tokenizer to add syntax highlighting to the javascript snippets
- Explain the technics used in this first version (dynamic loading of sources, uri creation, code rewriting)

And then, expand the library with new features.