Thursday, March 6, 2014

Goto fail;

Apple recently faced a security issue and provided a patch to fix it: About the security content of iOS 7.0.6
A web site is also available to check if your current OS must be updated: https://gotofail.com/

An interesting analysis of this - what appears to be - bug can be found in this article but please note that I am not sharing the point of view expressed in the conclusion.
I mean, whatever the way, bugs happen.
To make a long story short, it looks like a bad copy & paste duplicated a "goto fail;" instruction which, in the end, has no condition and is always evaluated.

As I recently talked about the advantages of using a lint-tool, I wanted to check if JShint would be capable of detecting this issue.
However - and fortunately - goto does not exist in JavaScript.

There are several ways to reproduce the same effect than a 'goto' instruction:

I usually embrace my if statements with brackets, I wanted to make my samples look similar to the initial problem and check how JSHint would react.
  • First example, the one-time do / while with break
  • function main(parameter) { var success = false; do { if (parameter === "condition1") break; if (parameter === "condition2") break; break; success = true; } while(0); if (success) { alert("Do"); } else { alert("Don't"); } } main("condition3"); JSHint produces a warning: 10 Unreachable 'success' after 'break'.
  • Second example, return in a separate function
  • function testCondition(parameter) { if (parameter === "condition1") return false; if (parameter === "condition2") return false; return false; return true; } function main(parameter) { if (testCondition(parameter)) { alert("Do"); } else { alert("Don't"); } } main("condition3"); JSHint produces a warning: 7 Unreachable 'return' after 'return'.
  • Last example, exceptions
  • function main(parameter) { try { if (parameter === "condition1") throw "fail"; if (parameter === "condition2") throw "fail"; throw "fail"; alert("Do"); } catch (e) { alert("Don't"); } } main("condition3"); JSHint produces a warning: 8 Unreachable 'alert' after 'throw'.

To conclude, JShint is capable of detecting an unreachable code and generates the appropriate warning.
It also means that all the warnings are meaningful and should be carefully considered.

If you want to check by yourself, go to http://www.jshint.com/ and copy & paste the samples directly where the JavaScript sample is written (upper left part of the homepage).

Wednesday, February 26, 2014

Mysteries of javascript objects, classes and inheritance (Part I)

What did I spend my time on during the last months?

I know that the blog activity does not reflect this reality but I have been working on my library during my - rare - spare time. For instance, I tried to organize the different features into namespaces to avoid creating one object containing all the methods.
Since I started the new job, I also rewrote some of it to fit most of JSLint validation rules and changed the way tests are made to support several hosts (cscript, nodeJS and a browser).

One interesting problem I had to deal with is related to the way attributes are handled in my implementation. Attributes could be compared to the Java annotations: they are information added to the source code that are retrieved at run-time using specific methods. They are used to generate code (such as object members' accessors) or customize behaviors (such as XML serialization).

I will write an article later about that very specific topic.

The problem I had to deal with is related to classes and inheritance: indeed, it sounds natural to inherit the attributes and - as a consequence - I needed the possibility to walk through the class hierarchy.
Also, I wanted to simplify the way attributes are created so I wrote aliases for constructors (let say object factories).

I learned a lot by implementing these methods and I would like to share with you my understanding of how objects are handled in JavaScript.

Classes and inheritance

First of all, I will not detail the class and inheritance concepts as they belong to the Object Oriented Programming languages. So I will take the assumption that you are familiar with these. If not, the previous link is a good starting point.

But, also, the main reason is that JavaScript is *not* an usual OOP language: it misses lots of interesting features such as the possibility to control members visibility (using private, protected and public), polymorphism or operators....

As a matter of fact, the class keyword is reserved but you can't do anything with it. You will also discover that the delete keyword exists but for a different usage.

That's why I would like to explain what is a JavaScript object, how it is declared and the way it can be manipulated in order to highlight what can be done and illustrate what can't.

Objects and members

My first object

In the JavaScript language, the simplest (and shortest) way to create an object is to write the following line of code:

var myObject1 = {};

The following syntax is strictly equivalent:

var myObject2 = new Object();

Those who are used to program with real OOP languages will recognize the use of the new operator.
I will come back later to this one, we will first focus on the result.

Both samples are generating an empty object and once created, it already exposes a number of predefined members (mostly methods), for instance:

  • toString a method that returns a string representation of the object
  • hasOwnProperty a method that will become helpful later in this article
  • constructor an object that will also be explained later

Members

To access the members (or call the methods), you can use either the "." operator or the "[]" one:

var string1 = myObject1.toString(); var string2 = myObject2["toString"]();
Calling a method is actually done in two steps: in JavaScript, functions are objects. When you access the member named "toString" on myObject1, it returns the function object. Then, the parenthesis are used to call the function and provide parameters.

Usually, the bracket operator is used whenever the member name is contained in a variable or if the name is an invalid identifier (just try to access the member named "my field" with the dot operator).

The real power of JavaScript compared to classical OOP languages is the fact that you can dynamically add any member to any object.
Again, there are several syntaxes you can use:

myObject1.additionalMember = "value"; myObject2["member"] = "value"; myObject1["member"] === myObject2.member;

Indeed, assigning a member on an object will either update the member (if already existing) or create it.

This simplicity is probably one of the most exciting feature of the JavaScript language. As a consequence, any object can be used as a map and there is virtually no limit to the number of members you can add to an object.
Very often, I used them as dictionary (to associate keys to values).

If you want to initialize the object with members, you can add them one by one:

var johnSmith = new Object(); johnSmith.firstName = "John"; johnSmith.lastName = "Smith"; johnSmith.age = 25; johnSmith.address = new Object(); johnSmith.address.streetAddress = "21 2nd Street"; johnSmith.address.city = "New York"; johnSmith.address.state = "NY"; johnSmith.address.postalCode = 10021;

Or you can make it simpler by using the JSON notation:

var johnSmith = { "firstName": "John", "lastName": "Smith", "age": 25, "address": { "streetAddress": "21 2nd Street", "city": "New York", "state": "NY", "postalCode": 10021 } };

This means that, unlike a strict OOP language, you can't be sure that your object will respect a given structure (or interface). To be more precise prototyping (that I will explain below) can help you ensure that the object has a minimum set of properties.

Object inspection

In any case, the JavaScript language also offers operators to inspect an object definition. Considering the fact that JavaScript allows you to check any function arity, it provides you a working concept of reflection.

Before going any further, accessing a member that is not defined does not generate any error. Instead, JavaScript returns a special value that represents this missing definition.

I hear some screams regarding the above assertion as most of us faced errors related to values that were not defined. However, please take a closer look at what really happened and you will realize that the error came from the use of the undefined value rather than its access.

The keyword undefined allows you to test this situation:

if (undefined === myObject1.notYetDefined) { myObject1.notYetDefined = "OK"; }

You may also want to use the typeof operator:
if ("undefined" === typeof myObject1.notYetDefined) { myObject1.notYetDefined = "OK"; }

I usually compare a value with undefined rather than using typeof. However, they are some situations where you have no choice. For instance, if you want to test if a global variable (so we are not really accessing a member), you have to use typeof.

So, those two operators allows you to verify if a member exists for a given object.

On the other hand, you might want to list all the members that are currently existing on the object.
This is where you will start using the operator in (this is also where the "[]" operator becomes really useful):

for (var member in myObject1) { alert(myObject1[member]); // Display the content of the member }

Another use of the in operator is to test if a member exists in an object:

if (!("notYetDefined" in myObject1)) { myObject1.notYetDefined = "OK"; }
There are lots of debate on the internet to know which method is the best to enumerate object members. For instance, JSLint does not allow the use of for in without using the hasOwnProperty.
My opinion is that it depends on what you really want to test and - right now - I just want to verify if myObject1 has a member named "notYetDefined" which is exactly what in is doing.
I hope things will become more clear in the next chapter.

Delete

To conclude with object inspection, there is one last operator that is not widely known and rarely used but it can be helpful when applied wisely (I will provide an example later): delete. As documented, it is used to remove a property from an object. It means that, unlike OOP languages, it does not 'free' objects.
An example of use:

if (!("notYetDefined" in myObject1)) { myObject1.notYetDefined = "OK"; } // Do something that relies on myObject1.notYetDefined console.log(myObject1.notYetDefined); // This will remove "notYetDefined" delete myObject1.notYetDefined; // Now "notYetDefined" is no more a property of in myObject1 console.log(typeof myObject1.notYetDefined); OK undefined

However, delete might seem to be dysfunctional in some cases.

Objects' own members vs prototype

The above sentence being confusing, here is an example to illustrate it:

var test = new Object(); if ("toString" in test) { console.log("test has \"toString\""); } else { console.log("test doesn't have \"toString\""); } delete test.toString; if ("toString" in test) { console.log("test still has \"toString\""); } else { console.log("test doesn't have \"toString\""); }

As a result, you will get:

test has "toString" test still has "toString"

According to everything that has been said before, we have two possibilities here:

  • Either the delete operator does not work as described
  • Or the toString member is not a property of the object

First, I confirm that delete operator works as described, however the sentence "remove a property from an object" has to be clarified.

hasOwnProperty

The operator hasOwnProperty has been introduced in the 3rd edition of the ECMAScript specification and is summarized as:
"returns a boolean indicating whether the object has the specified property"

But if you get a closer look to the documentation, the description provides the key to understand the reason why this operator was created: This method can be used to determine whether an object has the specified property as a direct property of that object; unlike the in operator, this method does not check down the object's prototype chain.

Applied to the previous example, here is what we obtain:

var test = new Object(); if ("toString" in test) { console.log("test has \"toString\""); } else { console.log("test doesn't have \"toString\""); } if (test.hasOwnProperty("toString") { console.log("\"toString\" is a direct property of test"); } else { console.log("\"toString\" is *not* a direct property of test"); } test has "toString" "toString" is *not* a direct property of test

So it means that the delete operator removes a direct property of an object.

Now... if "toString" is not a direct property of my object, where does it come from ?

Prototype

As a matter of fact, "toString" is defined on the Object prototype. This object defines all the properties and methods that are inherited by any instance of a JavaScript object.

This is the key concept in the language that allows modern JavaScript developers to achieve object oriented programming.

...to be continued...

Thursday, January 9, 2014

Huge code refactoring

After dealing with my move, new work and huge code re-factoring, I decided to push everything on GitHub.
Things are pretty much up to date on the repository and I feel like I am now going in the right direction.

Updates will be more frequent, I would like to publish at least one of each below per month:

  • JavaScript-oriented article: courses about various topics such as object programming, best practices ...
  • Article about some fun stuff I make: for instance, I recently developed a Yahtzee-like game score board for my iPhone as we needed it to play at home. Also I still need to publish my computer-craft related research and a SUDOKU solver... that I must rewrite as I lost it
  • Articles about my project progress: I am currently working an the XML serialization to rewrite my web service wrapper but I also think about writing a simplified PostScript-like engine for a game I would like to create...
To make a long story short, stay tuned!


Thursday, November 28, 2013

Why a lint-tool can reduce development time...

...and improve code quality.

Hi!

It's been a long time since the last article, lots of things happened recently and I will probably need more than one post to tell you more about this.
In any case, and to make a long story short, I recently moved to a new company (and a new country) in order to practice some JavaScript skills on a promising software and my first phase consisted in discovering the tools they put in place to assist the developments.

Among them, one has been really annoying at first glance but - as I progress with it - I realize how much time this tools is saving. Indeed, one of the requirement is to make sure that all common 'mistakes' are eliminated from the source code before checking it in in the source control software. Because of the JavaScript language, the tool JSHint is applied.

I won't detail all the features of the tool (that are deeply explained in the documentation page) but I'll rather highlight the benefits I see:

  • It identifies undeclared variables / unused parameters. For instance, when you mistyped a variable name, your JavaScript file will be executed until this variable is used (and replaced with undefined). Hence, the sooner you detect the issue, the better. Note that, in some cases, the variable name might be implicit (such as document or window in a browser): the tool can also be dynamically configured with comments in the source code.
  • It uses documented validation rules: forbids the use of eval, group var declarations, check code complexity...
  • It makes sure the code is 'formatted' properly: we all have our coding habits. However, when it comes to code review, it is better to have everybody format the code the same way...
This being said, I decided to adopt JSLint to my own developments.

I have developped some interesting ideas I would like to post in the blog in the near future:
  • An algorithm for computer-craft that fills one layer of blocks
  • A SUDOKU solver (what a big deal...)
  • Portability over browsers, operating systems & JavaScript hosts
So stay tuned!

Monday, July 1, 2013

Classes and Attributes

I published the last update on the tokenizer yesterday evening, I am now moving to another interesting topic meaning the way JavaScript handles Object Oriented Programming: this is called prototyping.
Please search for JavaScript prototyping, you will find really good explanations of the concept, for instance:
To make a long story short, classes do not exist in JavaScript. However, it is possible to create instances of objects where the members are defined through this prototyping. The most interesting fact is that inheritance is possible. However, real OOP concepts such as member visibility (private, protected, public), abstracts or method overriding are not built-in. There are ways to reach the same level of functionality but with development costs.


I will not pretend I invented the code you will see because the main part of it has been extracted from J.Reisig's excellent blog: http://ejohn.org/blog/simple-javascript-inheritance.

However, I tried to improve the idea by adding attributes, a concept that I discovered while coding C# (for instance, please check this intersting thread: http://stackoverflow.com/questions/144833/most-useful-attributes). Indeed, it allows to qualify the class members in order to alter the way they are handled.

This step is an important one for the library as it will structure its development.

Stay tuned!

Friday, June 21, 2013

The JavaScript tokenizer, a finite state machine

As promised, I will start by talking about the tokenizer to explain the main concept. I also have to improve it as - today - there is at least one identified bug and a missing feature (JavaScript operator recognition).

The tokenizer

Initially, my very first parsing algorithm was based on regular expressions I found on the net. It was not bigger than 40 lines of code and worked fine.
However, the JavaScript regular expression object does not offer the necessary APIs to "inject" data asynchronously. Consequently, the whole text had to be loaded to start the parsing. Furthermore, from a performance perspective, it was slow and resource consuming.

Because of the different constraints I put on the tokenizer, I had to create an algorithm that was capable of parsing JavaScript by bits, i.e. with consecutive inputs. Considering the smallest item possible, I decided that the algorithm would process each character one by one and would know exactly:
  • What it has done before (i.e. the current state)
  • How to process the next character (i.e. what is the next step considering the current input)
It means that I defined a finite-state machine.

A simple example

Let's clarify this with a simple example: why don't we write a parser that would extract all numbers from a text stream?
You have two approaches:
  • The first one filters out only digits from the stream. For instance the string "abc123de4f" would output "1", "2", "3" and then "4"
  • The second one - the most interesting one - will try to group digits to output numbers. It means that "abc123de4f" would outut "123" and then "4".

First of all, let's define the "context" of the machine, it will be composed of one field named state that will allow us to know in which step we are (this will be detailled after). We will also need to store the numbers that have been already processed: we create an array named digits.

from a general manner, it is faster (and somehow memory safer) to add items to an array rather than concatenate a string with new characters. This is a trick that I use very often.
var state = 0; // Initial state var digits = []; Now let's detail how this machine works:

When I am in state 0 and I receive a character, it can be either a digit (0..9) or anything else. If a digit is received, I will change the state to 1 which will indicate that I started to receive one digit. Otherwise, I remain in state 0 (and I am waiting for a digit).

I will use the following notation to summarize:
(0) "0" → (1)
(0) "1" → (1)
   ...
(0) "9" → (1)
(0) ? → (0)

When I am in state 1, I will invert the logic: I will remain in state 1 until I receive a character that is not a digit: in that case, I output the number and go back to state 0.

(1) "0" → (1)
(1) "1" → (1)
    ...
(1) "9" → (1)
(1) ? → (0)

From a coding point of view, this generate the following method:
function filterNumbers( input ) { var idx, oneChar; for( idx = 0; idx < input.length; ++idx ) { // Read one char from the input oneChar = input.charAt( idx ); if( 0 === state ) { // Check if a digit if( 0 <= oneChar && oneChar <= "9" ) { digits = [ oneChar ]; state = 1; } // Else ignore and remain in state 0 } else if( 1 === state ) { // Check if a digit if( 0 <= oneChar && oneChar <= "9" ) { // Add to the collected digits digits.push( oneChar ); } else { // Digits over, output the result alert( digits.join( "" ) ); // And go back to state 0 state = 0; } } } }
Want to try? Enter the value to analyse and click test:
It looks great!

However there are two problems:
  1. In the code above, there is no way to finalize the machine; For instance, if you try with "abc123de4", the value "4" will not be outputted. This is because the machine remains in state 1 and is waiting for other digits. When the source has ended, we must provide a way to signal that no other character will be injected.
  2. As a consequence of the above, two consecutive calls use the same context. For instance, after trying with "abc123de4", another test with "f" will eventually output "4". This might not be a problem after all, but having a method to reset the machine can be useful. Ideally, having a way to save and provide the context to the machine is better: it allows complex manipulations such as switching.

Let's add symbols

The above example shows a way to sort numbers out of a string. Let say that we also want to extract the possible JavaScript symbols (such as +, -, |, &, :, ; ) and, ideally, identify operators which may be group of symbols (such as ++, +=, ==, ===).

The first action is to list the possible characters and operators. After digging the net (I have to admit that I never used some of them), I consolidated the following list:

* *= / /= % %= ^ ^= ~ ~=
 + ++ += - -- -= | || |= & && &=
 = == ===  ! != !==
> >> >= >>= >>> >>>= < << <= <<=
[ ] ( ) . , ; ? :

Hence a total of 47 allowed combinations of the characters: (){}[]<>|&?,.;:!=+-*/%^

Please note that //, /* and */ are not taken into account because already used & defined for comments.
The same way, " and ' are reserved for strings, this is why they don't appear in this list.

The very first approach - that is currently implemented in the tokenizer - consists in identifying any character in the above list and output it as a symbol.
However, this does not allow us to detect operators.

The same way, unlike for numbers, it is not possible to keep all symbol characters together and throw them when something different is found. For instance, with the string "abc123++==d" the tokenizer  should be able to output "123", "++" and "==" and not "++==".

This is why we must be able to determine when a new character may be concatenated with the previous ones or not.

One way is to associate states to each possibility. Focusing on "*" and "*=", This would lead to:

(0) "0" → (1)
   ...
(0) "9" → (1)
(0) "*" → (2)
(0) ? → (0)


(1) "0" → (1)
    ...
(1) "9" → (1)
(1) "*" → (2)
(1) ? → (0)

(2) "0" → (1)
    ...
(2) "9" → (1)
(2) "*" → (2)
(2) "=" → (3)
(2) ? → (0)

(3) "0" → (1)
    ...
(3) "9" → (1)
(3) "*" → (2)
(2) ? → (0)

Hum... ok so if we add a new state for each symbol, it means adding 47 states and handling probably at least the same amount of transitions between each state... sounds huge.
This surely can be automated (and generated) but I would like to try a different approach.

Let's consider a state (2) that handles all symbols. The idea is to be able to determine, while in step 2, if the next character can be the next one of the "current" symbol or not.

Let's redefine the context of the machine:
var state = 0; // Initial state var buffer = []; var _TOKEN_SYMBOL_LIST = "(){}[]<>|&?,.;:!=+-*/%^"; // This will generate an array containing all allowed symbols var _TOKEN_ALLOWED_SYMBOLS = ( "* *=" + " / /=" + " % %=" + " ^ ^=" + " ~ ~=" + " + ++ +=" + " - -- -=" + " | || |=" + " & && &=" + " = == ===" + " ! != !==" + " > >> >= >>= >>> >>>=" + " < << <= <<=" + " [ ] ( ) . , ; ? :" ).split( " " );
Now we need a method to determine if the next character can be concatenated to the ones we have in the buffer, the signature will be:
function filterNumbersAndSymbolsIsValidSymbol( chars, newChar ) { // Is newChar part of a sybmol that starts with chars (an array of chars) }

This leads the engine states to be:
(0) <digit> → (1)
(0) <symbol> → (2)(0) ? → (0)


(1) <digit> → (1)
(1) <symbol> → (2)(1) ? → (0)

(2) <digit> → (1)
(2) <symbol> → (2)
(2) ? → (0)

Considering that <digit> is "0"..."9", <symbol> is one char among _TOKEN_SYMBOL_LIST.
It also means that the following transition:
(2) <symbol> → (2)
May output symbols if the current character does not concatenate with the previous ones.

This leads to the new method:
function filterNumbersAndSymbols( input ) { var idx, oneChar; for( idx = 0; idx < input.length; ++idx ) { // Read one char from the input oneChar = input.charAt( idx ); if( 0 === state ) { // Check if a digit if( 0 <= oneChar && oneChar <= "9" ) { buffer = [ oneChar ]; filterNumbersAndSymbolsStateChange( 1 ); // Check if a symbol } else if( -1 < _TOKEN_SYMBOL_LIST.indexOf( oneChar ) ) { buffer = [ oneChar ]; filterNumbersAndSymbolsStateChange( 2 ); } // else ignore and remain in state 0 } else if( 1 === state ) { // Check if a digit if( 0 <= oneChar && oneChar <= "9" ) { // Add to the collected digits buffer.push( oneChar ); } else { // Digits over, output the result filterNumbersAndSymbolsOutput( "number", buffer.join( "" ) ); if( -1 < _TOKEN_SYMBOL_LIST.indexOf( oneChar ) ) { buffer = [ oneChar ]; filterNumbersAndSymbolsStateChange( 2 ); } else { // go back to state 0 filterNumbersAndSymbolsStateChange( 0 ); } } } else if( 2 === state ) { if( -1 < _TOKEN_SYMBOL_LIST.indexOf( oneChar ) ) { if( filterNumbersAndSymbolsIsValidSymbol( buffer, oneChar ) ) { buffer.push( oneChar ); } else { filterNumbersAndSymbolsOutput( "symbol", buffer.join("") ); filterNumbersAndSymbolsStateChange( 2 ); buffer = [ oneChar ]; } } else { // Symbol over, output the result filterNumbersAndSymbolsOutput( "symbol", buffer.join( "" ) ); if( 0 <= oneChar && oneChar <= "9" ) { buffer = [ oneChar ]; filterNumbersAndSymbolsStateChange( 1 ); } else { // go back to state 0 filterNumbersAndSymbolsStateChange( 0 ); } } } } }
Please note the use of these two functions that will be clarified later:
function filterNumbersAndSymbolsOutput( type, token ) { alert( type + ": " + token ); } function filterNumbersAndSymbolsStateChange( value ) { state = value; }

Defining filterNumbersAndSymbolsIsValidSymbol

This is where the challenge - and the fun - is. There are several ways to define this functions and I wanted to try each of them to evaluate their performance and ease of maintenance.

Version 0
I wanted to use the array _TOKEN_ALLOWED_SYMBOLS to verify if the new symbol composed of the chars and the new one would be valid. This leads to the following function:
function filterNumbersAndSymbolsIsValidSymbol_0( chars, newChar ) { var token = chars.join("") + newChar; for( var idx = 0; idx < _TOKEN_ALLOWED_SYMBOLS.length; ++idx ) if( _TOKEN_ALLOWED_SYMBOLS[ idx ] === token ) return true; return false; }

Version 1
Another version consists in defining a huge condition that would test all possible symbols at once, something like ( "*" === symbol || "*=" === symbol || ...)
However, writing this function would take time that's why I preferred to 'build' this function with the actual list of allowed symbols:
var filterNumbersAndSymbolsIsValidSymbol_1 = (function(){ var src = [], idx; for( idx = 0; idx < _TOKEN_ALLOWED_SYMBOLS.length; ++idx ) src.push( "\"" + _TOKEN_ALLOWED_SYMBOLS[ idx ] + "\" === token" ); return new Function( "var token = arguments[0].join(\"\") + arguments[1]; return " + src.join( "||" )+ ";" ); })();

Version 2
Well, version 2 is the equivalent of version 1 but 'copied' from its definition. Indeed, when evaluating the performances, I observed something interesting I would like to demonstrate here.
function filterNumbersAndSymbolsIsValidSymbol_2() { var token = arguments[0].join("") + arguments[1]; return "*" === token||"*=" === token||"/" === token||"/=" === token||"%" === token||"%=" === token||"^" === token||"^=" === token||"~" === token||"~=" === token||"+" === token||"++" === token||"+=" === token||"-" === token||"--" === token||"-=" === token||"|" === token||"||" === token||"|=" === token||"&" === token||"&&" === token||"&=" === token||"=" === token||"==" === token||"===" === token||"!" === token||"!=" === token||"!==" === token||">" === token||">>" === token||">=" === token||">>=" === token||">>>" === token||">>>=" === token||"<" === token||"<<" === token||"<=" === token||"<<=" === token||"[" === token||"]" === token||"(" === token||")" === token||"." === token||"," === token||";" === token||"?" === token||":" === token; }

Version 3
This last - and preferred - version has been developed to be the most efficient one. It tests each situation by starting with the length of the current buffer and I observed that knowing the first char is a good way to see what will be allowed next.
As a result, it leads to the following function:
function filterNumbersAndSymbolsIsValidSymbol_3( chars, newChar ) { var firstChar = chars[ 0 ]; if( 1 === chars.length ) { if( -1 < "(){}[].,;:?".indexOf( firstChar ) ) return false; else if( -1 < "!^~*/%".indexOf( firstChar ) ) return "=" === newChar; else return "=" === newChar || firstChar === newChar; } else if( 2 === chars.length ) { if( -1 < "+-|&".indexOf( firstChar ) ) return false; else if( "<" === firstChar ) { return "<" === chars[ 1 ] && "=" === newChar; } else if( -1 < "=!".indexOf( firstChar ) ) { return "=" === newChar; } else if( ">" === firstChar ) { return "=" !== chars[ 1 ] && ( "=" === newChar || ">" === newChar ); } } else if( 3 === chars.length ) { return ">" === firstChar && "=" !== chars[ 2 ] && "=" === newChar; } return false; }

Choosing the right one

To select (and verify) the best one, I had to write a decent testing function. That's why I decided to write one that lists all verified symbols.
function filterNumbersAndSymbolsDumpAllValidSymbols( estimate ) { var res, // Result array (will contain all generated symbols) // Method selection based on a SELECT control in the page method = document.getElementById( "filterNumbersAndSymbols_method" ).selectedIndex, filterNumbersAndSymbolsIsValidSymbol = this[ "filterNumbersAndSymbolsIsValidSymbol_" + method ], // test function (recursive) test = function( buffer ){ var insert = false; // Try to inject each symbol char in the current sequence for( var jdx = 0; jdx < _TOKEN_SYMBOL_LIST.length; ++jdx ) { var newChar = _TOKEN_SYMBOL_LIST.charAt( jdx ); if( filterNumbersAndSymbolsIsValidSymbol( buffer, newChar ) ) { // Valid, call the test function recursively var newBuffer = [].concat( buffer ); newBuffer.push( newChar ); test( newBuffer ); } else // Invalid (but the buffer is valid), dump the buffer insert = true; } if( insert ) res.push( buffer.join("") ); }, // time estimate helper count = 0, max = 1, dtStart, timeSpent, msg; if( estimate ) { max = 100; dtStart = new Date(); } for( count = 0; count < max; ++count ) { res = []; for( var idx = 0; idx < _TOKEN_SYMBOL_LIST.length; ++idx ) test( [ _TOKEN_SYMBOL_LIST.charAt( idx ) ] ); } msg = [ res.join( " " ), "Count: " + res.length + " / " + _TOKEN_ALLOWED_SYMBOLS.length ]; if( estimate ) { timeSpent = (new Date()) - dtStart; msg.push( "Time spent: " + timeSpent + "ms" ); msg.push( "Function:", filterNumbersAndSymbolsIsValidSymbol ); } document.getElementById( "filterNumbersAndSymbols_output" ).value = msg.join( "\r\n" ); }

This leads to:


Conclusions

First of all, it looks like the fastest function is the last one. It can be easily explained: instead of joining the buffer and the new character (which takes time) to finally test all possibilities (without any specific order), the last version tries to distinguish the current symbol and find the appropriate condition based on the current length and the list of possibilities. What is surprising anyway is the fact that a dynamically generated function is slower than the same one declared statically. I must document myself to understand why but this is important to remember for performance reasons.
There are still two things to discuss:
  • How do you finalize the token when the input string is empty?
  • How do you debug this?

Finalizing the token

This last step does not need any input, the idea is to consider what we already have and - if valid - output the result (in our case, this is really easy as any symbol built so far is valid). This will consequently depend on the current state of the engine.

Debugging

Well, recent browsers offer the most convenient way to debug the JavaScript code. However, when the bit of code you are looking for is called after several inputs, the task may quickly become tedious to get it. One way I use to debug the engine (that you will retrieve by checking the release page) is to dump the input as well as engine information mixed inside it.
function filterNumbersAndSymbolsStateChange( value ) { var span = document.createElement( "span" ); span.style = "color: red;" span.innerHTML = "(" + value + ")"; document.getElementById( "filterNumbersAndSymbols_debug" ).appendChild( span ); state = value; } function filterNumbersAndSymbols( input ) { // Clear the output var debug = document.getElementById( "filterNumbersAndSymbols_debug" ); debug.innerHTML =""; // Set the initial state filterNumbersAndSymbolsStateChange( 0 ); // Parse the input string var idx, oneChar; for( idx = 0; idx < input.length; ++idx ) { // Read one char from the input oneChar = input.charAt( idx ); // Dump the input debug.appendChild( document.createTextNode( oneChar ) ); if( 0 === state ) { // Check if a digit if( 0 <= oneChar && oneChar <= "9" ) { buffer = [ oneChar ]; filterNumbersAndSymbolsStateChange( 1 ); // Check if a symbol } else if( -1 < _TOKEN_SYMBOL_LIST.indexOf( oneChar ) ) { buffer = [ oneChar ]; filterNumbersAndSymbolsStateChange( 2 ); } // else ignore and remain in state 0 } else if( 1 === state ) { // Check if a digit if( 0 <= oneChar && oneChar <= "9" ) { // Add to the collected digits buffer.push( oneChar ); } else { // Digits over, output the result filterNumbersAndSymbolsOutput( "number", buffer.join( "" ) ); if( -1 < _TOKEN_SYMBOL_LIST.indexOf( oneChar ) ) { buffer = [ oneChar ]; filterNumbersAndSymbolsStateChange( 2 ); } else { // go back to state 0 filterNumbersAndSymbolsStateChange( 0 ); } } } else if( 2 === state ) { if( -1 < _TOKEN_SYMBOL_LIST.indexOf( oneChar ) ) { if( filterNumbersAndSymbolsIsValidSymbol_3( buffer, oneChar ) ) { buffer.push( oneChar ); } else { filterNumbersAndSymbolsOutput( "symbol", buffer.join("") ); filterNumbersAndSymbolsStateChange( 2 ); buffer = [ oneChar ]; } } else { // Symbol over, output the result filterNumbersAndSymbolsOutput( "symbol", buffer.join( "" ) ); if( 0 <= oneChar && oneChar <= "9" ) { buffer = [ oneChar ]; filterNumbersAndSymbolsStateChange( 1 ); } else { // go back to state 0 filterNumbersAndSymbolsStateChange( 0 ); } } } } // Finalize token if( 1 === state ) filterNumbersAndSymbolsOutput( "number", buffer.join( "" ) ); else if( 2 === state ) filterNumbersAndSymbolsOutput( "symbol", buffer.join( "" ) ); } Want to test it?

Tuesday, June 18, 2013

Blog and web site update

A quick update of the blog to insert tools used to:
  • reformat javascript snippets (using tokenizer)
  • add some colors (with a custom CSS)
A demonstration? look at the following snippet:

// This is an example of dynamically reformatted JavaScript code alert( "Hello world!" );
Everything has been uploaded to GitHub.
The same way, I updated the web site (http://buchholz.free.fr/gpf-js) to reflect the latest updates.

I am preparing a big article on the way the tokenizer works so stay tuned to the blog.