Thursday, November 28, 2013

Why a lint-tool can reduce development time...

...and improve code quality.

Hi!

It's been a long time since the last article; a lot has happened recently and I will probably need more than one post to tell you about it.
In any case, and to make a long story short, I recently moved to a new company (and a new country) in order to practice my JavaScript skills on a promising piece of software, and my first task consisted in discovering the tools they put in place to assist development.

Among them, one was really annoying at first glance but - as I progress with it - I realize how much time this tool is saving. Indeed, one of the requirements is to make sure that all common 'mistakes' are eliminated from the source code before checking it in to source control. Since the project is written in JavaScript, the tool JSHint is applied.

I won't detail all the features of the tool (they are thoroughly explained in the documentation page) but rather highlight the benefits I see:

  • It identifies undeclared variables / unused parameters. For instance, when you mistype a variable name, your JavaScript file will run without error until the mistyped variable is actually used (it evaluates to undefined). Hence, the sooner you detect the issue, the better. Note that, in some cases, a variable may be implicitly available (such as document or window in a browser): the tool can also be configured dynamically with comments in the source code.
  • It applies documented validation rules: forbidding the use of eval, grouping var declarations, checking code complexity...
  • It makes sure the code is 'formatted' properly: we all have our coding habits. However, when it comes to code review, it is better to have everybody format the code the same way...
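As an illustration of the inline configuration mentioned above, JSHint reads special comments at the top of a file; a minimal sketch (option names taken from the JSHint documentation, the function is my own example):

```javascript
/*jshint undef: true, unused: true, eqeqeq: true */
/*global document */

// Without the "global" directive above, JSHint would report
// 'document' is not defined when the undef option is active.
function readTitle() {
    return document.title;
}
```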
This being said, I decided to adopt JSHint for my own developments.

I have developed some interesting ideas that I would like to post on the blog in the near future:
  • An algorithm for ComputerCraft that fills one layer of blocks
  • A Sudoku solver (what a big deal...)
  • Portability over browsers, operating systems & JavaScript hosts
So stay tuned!

Monday, July 1, 2013

Classes and Attributes

I published the last update on the tokenizer yesterday evening; I am now moving to another interesting topic: the way JavaScript handles Object Oriented Programming, which is called prototyping.
Please search for JavaScript prototyping, you will find really good explanations of the concept.
To make a long story short, classes do not exist in JavaScript. However, it is possible to create object instances whose members are defined through this prototyping mechanism. The most interesting fact is that inheritance is possible. However, real OOP concepts such as member visibility (private, protected, public), abstract classes or method overriding are not built in. There are ways to reach the same level of functionality, but at a development cost.
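To make this concrete, here is a minimal sketch of prototype-based inheritance with method overriding (my own illustration, not the library code discussed below):

```javascript
// Base 'class': a constructor function plus methods on its prototype
function Animal( name ) {
    this.name = name;
}
Animal.prototype.describe = function () {
    return this.name;
};

// Derived 'class': chain the prototypes to inherit members
function Dog( name ) {
    Animal.call( this, name ); // invoke the parent constructor
}
Dog.prototype = Object.create( Animal.prototype );
Dog.prototype.constructor = Dog;

// Overriding is done by hand: explicitly reuse the parent implementation
Dog.prototype.describe = function () {
    return Animal.prototype.describe.call( this ) + " (dog)";
};
```

With this in place, `new Dog( "Rex" ).describe()` returns "Rex (dog)" and the instance satisfies `instanceof Animal`.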


I will not pretend I invented the code you will see, because the main part of it has been extracted from John Resig's excellent blog: http://ejohn.org/blog/simple-javascript-inheritance.

However, I tried to improve the idea by adding attributes, a concept that I discovered while coding in C# (for instance, see this interesting thread: http://stackoverflow.com/questions/144833/most-useful-attributes). Indeed, attributes allow you to qualify class members in order to alter the way they are handled.

This step is an important one for the library as it will structure its development.

Stay tuned!

Friday, June 21, 2013

The JavaScript tokenizer, a finite state machine

As promised, I will start by talking about the tokenizer to explain the main concept. I also have to improve it, as - today - there is at least one identified bug and a missing feature (JavaScript operator recognition).

The tokenizer

Initially, my very first parsing algorithm was based on regular expressions I found on the net. It was no bigger than 40 lines of code and worked fine.
However, the JavaScript regular expression object does not offer the APIs necessary to "inject" data asynchronously. Consequently, the whole text had to be loaded before the parsing could start. Furthermore, from a performance perspective, it was slow and resource consuming.
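For reference, that kind of regular-expression approach could look like the following sketch (an assumption on my part: the original 40-line version is not reproduced here). Note how it needs the complete string up front:

```javascript
// Extract all runs of digits from a complete string in one call;
// this only works when the whole text is already available.
function extractNumbers( text ) {
    return text.match( /\d+/g ) || [];
}
```

For instance, `extractNumbers( "abc123de4f" )` returns [ "123", "4" ].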

Because of the different constraints I put on the tokenizer, I had to create an algorithm capable of parsing JavaScript in bits, i.e. with consecutive inputs. Considering the smallest item possible, I decided that the algorithm would process characters one by one and would know exactly:
  • What it has done before (i.e. the current state)
  • How to process the next character (i.e. what is the next step considering the current input)
It means that I defined a finite-state machine.

A simple example

Let's clarify this with a simple example: why don't we write a parser that would extract all numbers from a text stream?
You have two approaches:
  • The first one filters out only digits from the stream. For instance, the string "abc123de4f" would output "1", "2", "3" and then "4"
  • The second one - the most interesting one - will try to group digits to output numbers. It means that "abc123de4f" would output "123" and then "4".
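The first approach is trivial to write; a possible sketch (my own example, it does not need a state machine at all):

```javascript
// Keep only digit characters, one by one, ignoring everything else
function filterDigits( text ) {
    var result = [], idx, oneChar;
    for( idx = 0; idx < text.length; ++idx ) {
        oneChar = text.charAt( idx );
        if( "0" <= oneChar && oneChar <= "9" )
            result.push( oneChar );
    }
    return result;
}
```

Here, `filterDigits( "abc123de4f" )` returns [ "1", "2", "3", "4" ]. The second approach requires memory of what came before, which is where the state machine comes in.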

First of all, let's define the "context" of the machine. It will be composed of one field named state that will allow us to know which step we are in (this will be detailed later). We will also need to store the digits that have already been processed: we create an array named digits.

As a general rule, it is faster (and somewhat safer memory-wise) to add items to an array rather than concatenate new characters to a string. This is a trick that I use very often.
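A micro-example of the trick (my own illustration): collect the pieces in an array and build the string only once at the end.

```javascript
// String concatenation creates a new string on every step...
var asString = "";
asString += "1";
asString += "2";
asString += "3";

// ...whereas pushing into an array defers the string
// construction to a single final join operation.
var asArray = [];
asArray.push( "1" );
asArray.push( "2" );
asArray.push( "3" );
var joined = asArray.join( "" ); // same result: "123"
```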
var state = 0; // Initial state
var digits = [];

Now let's detail how this machine works:

When I am in state 0 and I receive a character, it can be either a digit (0..9) or anything else. If a digit is received, I will change the state to 1 which will indicate that I started to receive one digit. Otherwise, I remain in state 0 (and I am waiting for a digit).

I will use the following notation to summarize:
(0) "0" → (1)
(0) "1" → (1)
   ...
(0) "9" → (1)
(0) ? → (0)

When I am in state 1, I will invert the logic: I will remain in state 1 until I receive a character that is not a digit: in that case, I output the number and go back to state 0.

(1) "0" → (1)
(1) "1" → (1)
    ...
(1) "9" → (1)
(1) ? → (0)

From a coding point of view, this generates the following method:
function filterNumbers( input ) {
    var idx, oneChar;
    for( idx = 0; idx < input.length; ++idx ) {
        // Read one char from the input
        oneChar = input.charAt( idx );
        if( 0 === state ) {
            // Check if a digit
            if( "0" <= oneChar && oneChar <= "9" ) {
                digits = [ oneChar ];
                state = 1;
            }
            // Else ignore and remain in state 0
        } else if( 1 === state ) {
            // Check if a digit
            if( "0" <= oneChar && oneChar <= "9" ) {
                // Add to the collected digits
                digits.push( oneChar );
            } else {
                // Digits over, output the result
                alert( digits.join( "" ) );
                // And go back to state 0
                state = 0;
            }
        }
    }
}
It looks great!

However there are two problems:
  1. In the code above, there is no way to finalize the machine. For instance, if you try with "abc123de4", the value "4" will not be output. This is because the machine remains in state 1, waiting for more digits. When the source has ended, we must provide a way to signal that no other character will be injected.
  2. As a consequence of the above, two consecutive calls use the same context. For instance, after trying with "abc123de4", another test with "f" will eventually output "4". This might not be a problem after all, but having a method to reset the machine can be useful. Ideally, having a way to save the context and provide it back to the machine is even better: it allows complex manipulations such as switching between several parsing contexts.
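One possible way to address both points (a sketch of mine, not the tokenizer's actual implementation) is to wrap the context in an object that exposes an explicit finalize method:

```javascript
// Each call to createNumberFilter returns an independent machine:
// write() feeds characters in, finalize() flushes the pending number
// and resets the context for reuse.
function createNumberFilter( onNumber ) {
    var state = 0, digits = [];
    return {
        write: function ( input ) {
            var idx, oneChar;
            for ( idx = 0; idx < input.length; ++idx ) {
                oneChar = input.charAt( idx );
                if ( "0" <= oneChar && oneChar <= "9" ) {
                    if ( 0 === state ) {
                        digits = [ oneChar ];
                        state = 1;
                    } else {
                        digits.push( oneChar );
                    }
                } else if ( 1 === state ) {
                    onNumber( digits.join( "" ) );
                    state = 0;
                }
            }
        },
        finalize: function () {
            // No more input: flush the number being built, if any
            if ( 1 === state ) {
                onNumber( digits.join( "" ) );
            }
            state = 0;
            digits = [];
        }
    };
}
```

With this shape, feeding "abc123de4" and then calling finalize() produces "123" followed by "4", and the input can be split across several write() calls.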

Let's add symbols

The above example shows a way to sort numbers out of a string. Let's say that we also want to extract the possible JavaScript symbols (such as +, -, |, &, :, ;) and, ideally, identify operators, which may be groups of symbols (such as ++, +=, ==, ===).

The first action is to list the possible characters and operators. After digging through the net (I have to admit that I never used some of them), I consolidated the following list:

* *= / /= % %= ^ ^= ~ ~=
 + ++ += - -- -= | || |= & && &=
 = == ===  ! != !==
> >> >= >>= >>> >>>= < << <= <<=
[ ] ( ) . , ; ? :

Hence a total of 47 allowed combinations of the characters: (){}[]<>|&?,.;:!=+-*/%^

Please note that //, /* and */ are not taken into account because they are already reserved for comments.
In the same way, " and ' are reserved for strings, which is why they do not appear in this list.

The very first approach - the one currently implemented in the tokenizer - consists in identifying any character from the above list and outputting it as a symbol.
However, this does not allow us to detect operators.

In the same way, unlike for numbers, it is not possible to simply keep all symbol characters together and flush them when something different is found. For instance, with the string "abc123++==d", the tokenizer should be able to output "123", "++" and "==", not "++==".

This is why we must be able to determine when a new character may be concatenated with the previous ones or not.

One way is to associate a state with each possibility. Focusing on "*" and "*=", this would lead to:

(0) "0" → (1)
   ...
(0) "9" → (1)
(0) "*" → (2)
(0) ? → (0)


(1) "0" → (1)
    ...
(1) "9" → (1)
(1) "*" → (2)
(1) ? → (0)

(2) "0" → (1)
    ...
(2) "9" → (1)
(2) "*" → (2)
(2) "=" → (3)
(2) ? → (0)

(3) "0" → (1)
    ...
(3) "9" → (1)
(3) "*" → (2)
(3) ? → (0)

Hmm... OK, so if we add a new state for each symbol, it means adding 47 states and handling probably at least the same number of transitions between states... that sounds huge.
This surely could be automated (and generated) but I would like to try a different approach.

Let's consider a single state (2) that handles all symbols. The idea is to be able to determine, while in state 2, whether the next character can extend the "current" symbol or not.

Let's redefine the context of the machine:
var state = 0; // Initial state
var buffer = [];
var _TOKEN_SYMBOL_LIST = "(){}[]<>|&?,.;:!=+-*/%^";
// This will generate an array containing all allowed symbols
var _TOKEN_ALLOWED_SYMBOLS = (
      "* *="
    + " / /="
    + " % %="
    + " ^ ^="
    + " ~ ~="
    + " + ++ +="
    + " - -- -="
    + " | || |="
    + " & && &="
    + " = == ==="
    + " ! != !=="
    + " > >> >= >>= >>> >>>="
    + " < << <= <<="
    + " [ ] ( ) . , ; ? :"
).split( " " );
Now we need a method to determine whether the next character can be concatenated to the ones we have in the buffer; its signature will be:
function filterNumbersAndSymbolsIsValidSymbol( chars, newChar ) {
    // Is newChar part of a symbol that starts with chars (an array of chars)?
}

This leads the engine states to be:
(0) <digit> → (1)
(0) <symbol> → (2)
(0) ? → (0)


(1) <digit> → (1)
(1) <symbol> → (2)
(1) ? → (0)

(2) <digit> → (1)
(2) <symbol> → (2)
(2) ? → (0)

where <digit> is "0"..."9" and <symbol> is one char among _TOKEN_SYMBOL_LIST.
It also means that the following transition:
(2) <symbol> → (2)
may output a symbol when the current character cannot be concatenated with the previous ones.

This leads to the new method:
function filterNumbersAndSymbols( input ) {
    var idx, oneChar;
    for( idx = 0; idx < input.length; ++idx ) {
        // Read one char from the input
        oneChar = input.charAt( idx );
        if( 0 === state ) {
            // Check if a digit
            if( "0" <= oneChar && oneChar <= "9" ) {
                buffer = [ oneChar ];
                filterNumbersAndSymbolsStateChange( 1 );
            // Check if a symbol
            } else if( -1 < _TOKEN_SYMBOL_LIST.indexOf( oneChar ) ) {
                buffer = [ oneChar ];
                filterNumbersAndSymbolsStateChange( 2 );
            }
            // else ignore and remain in state 0
        } else if( 1 === state ) {
            // Check if a digit
            if( "0" <= oneChar && oneChar <= "9" ) {
                // Add to the collected digits
                buffer.push( oneChar );
            } else {
                // Digits over, output the result
                filterNumbersAndSymbolsOutput( "number", buffer.join( "" ) );
                if( -1 < _TOKEN_SYMBOL_LIST.indexOf( oneChar ) ) {
                    buffer = [ oneChar ];
                    filterNumbersAndSymbolsStateChange( 2 );
                } else {
                    // go back to state 0
                    filterNumbersAndSymbolsStateChange( 0 );
                }
            }
        } else if( 2 === state ) {
            if( -1 < _TOKEN_SYMBOL_LIST.indexOf( oneChar ) ) {
                if( filterNumbersAndSymbolsIsValidSymbol( buffer, oneChar ) ) {
                    buffer.push( oneChar );
                } else {
                    filterNumbersAndSymbolsOutput( "symbol", buffer.join( "" ) );
                    filterNumbersAndSymbolsStateChange( 2 );
                    buffer = [ oneChar ];
                }
            } else {
                // Symbol over, output the result
                filterNumbersAndSymbolsOutput( "symbol", buffer.join( "" ) );
                if( "0" <= oneChar && oneChar <= "9" ) {
                    buffer = [ oneChar ];
                    filterNumbersAndSymbolsStateChange( 1 );
                } else {
                    // go back to state 0
                    filterNumbersAndSymbolsStateChange( 0 );
                }
            }
        }
    }
}
Please note the use of these two functions, which will be clarified later:
function filterNumbersAndSymbolsOutput( type, token ) {
    alert( type + ": " + token );
}

function filterNumbersAndSymbolsStateChange( value ) {
    state = value;
}

Defining filterNumbersAndSymbolsIsValidSymbol

This is where the challenge - and the fun - is. There are several ways to define this function, and I wanted to try each of them to evaluate their performance and ease of maintenance.

Version 0
I wanted to use the array _TOKEN_ALLOWED_SYMBOLS to verify whether the new symbol, composed of the existing chars plus the new one, would be valid. This leads to the following function:
function filterNumbersAndSymbolsIsValidSymbol_0( chars, newChar ) {
    var token = chars.join( "" ) + newChar;
    for( var idx = 0; idx < _TOKEN_ALLOWED_SYMBOLS.length; ++idx )
        if( _TOKEN_ALLOWED_SYMBOLS[ idx ] === token )
            return true;
    return false;
}

Version 1
Another version consists in defining one huge condition that tests all possible symbols at once, something like ( "*" === symbol || "*=" === symbol || ...).
However, writing this function by hand would take time, which is why I preferred to 'build' it from the actual list of allowed symbols:
var filterNumbersAndSymbolsIsValidSymbol_1 = (function(){
    var src = [], idx;
    for( idx = 0; idx < _TOKEN_ALLOWED_SYMBOLS.length; ++idx )
        src.push( "\"" + _TOKEN_ALLOWED_SYMBOLS[ idx ] + "\" === token" );
    return new Function( "var token = arguments[0].join(\"\") + arguments[1];"
        + " return " + src.join( "||" ) + ";" );
})();

Version 2
Well, version 2 is the equivalent of version 1 but 'copied' from its generated definition. Indeed, when evaluating the performances, I observed something interesting that I would like to demonstrate here.
function filterNumbersAndSymbolsIsValidSymbol_2() {
    var token = arguments[0].join("") + arguments[1];
    return "*" === token || "*=" === token || "/" === token || "/=" === token
        || "%" === token || "%=" === token || "^" === token || "^=" === token
        || "~" === token || "~=" === token || "+" === token || "++" === token
        || "+=" === token || "-" === token || "--" === token || "-=" === token
        || "|" === token || "||" === token || "|=" === token || "&" === token
        || "&&" === token || "&=" === token || "=" === token || "==" === token
        || "===" === token || "!" === token || "!=" === token || "!==" === token
        || ">" === token || ">>" === token || ">=" === token || ">>=" === token
        || ">>>" === token || ">>>=" === token || "<" === token || "<<" === token
        || "<=" === token || "<<=" === token || "[" === token || "]" === token
        || "(" === token || ")" === token || "." === token || "," === token
        || ";" === token || "?" === token || ":" === token;
}

Version 3
This last - and preferred - version has been developed to be the most efficient one. It tests each situation starting with the length of the current buffer; I also observed that knowing the first char is a good way to narrow down what is allowed next.
As a result, it leads to the following function:
function filterNumbersAndSymbolsIsValidSymbol_3( chars, newChar ) {
    var firstChar = chars[ 0 ];
    if( 1 === chars.length ) {
        if( -1 < "(){}[].,;:?".indexOf( firstChar ) )
            return false;
        else if( -1 < "!^~*/%".indexOf( firstChar ) )
            return "=" === newChar;
        else
            return "=" === newChar || firstChar === newChar;
    } else if( 2 === chars.length ) {
        if( -1 < "+-|&".indexOf( firstChar ) )
            return false;
        else if( "<" === firstChar ) {
            return "<" === chars[ 1 ] && "=" === newChar;
        } else if( -1 < "=!".indexOf( firstChar ) ) {
            return "=" === newChar;
        } else if( ">" === firstChar ) {
            return "=" !== chars[ 1 ] && ( "=" === newChar || ">" === newChar );
        }
    } else if( 3 === chars.length ) {
        return ">" === firstChar && "=" !== chars[ 2 ] && "=" === newChar;
    }
    return false;
}

Choosing the right one

To select (and verify) the best one, I had to write a decent testing function. That's why I decided to write one that lists all symbols recognized as valid.
function filterNumbersAndSymbolsDumpAllValidSymbols( estimate ) {
    var res, // Result array (will contain all generated symbols)
        // Method selection based on a SELECT control in the page
        method = document.getElementById( "filterNumbersAndSymbols_method" ).selectedIndex,
        filterNumbersAndSymbolsIsValidSymbol =
            this[ "filterNumbersAndSymbolsIsValidSymbol_" + method ],
        // test function (recursive)
        test = function( buffer ){
            var insert = false;
            // Try to inject each symbol char in the current sequence
            for( var jdx = 0; jdx < _TOKEN_SYMBOL_LIST.length; ++jdx ) {
                var newChar = _TOKEN_SYMBOL_LIST.charAt( jdx );
                if( filterNumbersAndSymbolsIsValidSymbol( buffer, newChar ) ) {
                    // Valid, call the test function recursively
                    var newBuffer = [].concat( buffer );
                    newBuffer.push( newChar );
                    test( newBuffer );
                } else
                    // Invalid (but the buffer is valid), dump the buffer
                    insert = true;
            }
            if( insert )
                res.push( buffer.join( "" ) );
        },
        // time estimate helper
        count = 0,
        max = 1,
        dtStart, timeSpent, msg;
    if( estimate ) {
        max = 100;
        dtStart = new Date();
    }
    for( count = 0; count < max; ++count ) {
        res = [];
        for( var idx = 0; idx < _TOKEN_SYMBOL_LIST.length; ++idx )
            test( [ _TOKEN_SYMBOL_LIST.charAt( idx ) ] );
    }
    msg = [
        res.join( " " ),
        "Count: " + res.length + " / " + _TOKEN_ALLOWED_SYMBOLS.length
    ];
    if( estimate ) {
        timeSpent = (new Date()) - dtStart;
        msg.push( "Time spent: " + timeSpent + "ms" );
        msg.push( "Function:", filterNumbersAndSymbolsIsValidSymbol );
    }
    document.getElementById( "filterNumbersAndSymbols_output" ).value = msg.join( "\r\n" );
}


Conclusions

First of all, it looks like the fastest function is the last one. This can easily be explained: instead of joining the buffer and the new character (which takes time) and then testing all possibilities (in no particular order), the last version tries to distinguish the current symbol and find the appropriate condition based on the current length and the list of possibilities. What is surprising, however, is the fact that a dynamically generated function is slower than the same one declared statically. I need to research why, but this is important to remember for performance reasons.
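For the record, the timing comparison itself can be sketched along these lines (an assumption about the measurement method, mirroring the estimate mode used above):

```javascript
// Run a candidate function many times and measure the elapsed time;
// Date-based timing is coarse, so the iteration count must be large.
function timeIt( fn, iterations ) {
    var start = new Date(), count;
    for( count = 0; count < iterations; ++count )
        fn();
    return ( new Date() ) - start; // milliseconds
}
```

Calling `timeIt` once per candidate (with the same iteration count) gives comparable millisecond figures.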
There are still two things to discuss:
  • How do you finalize the token when the input string is empty?
  • How do you debug this?

Finalizing the token

This last step does not need any input; the idea is to consider what we already have and - if valid - output the result (in our case, this is really easy, as any symbol built so far is valid). How to finalize consequently depends on the current state of the engine.

Debugging

Well, recent browsers offer convenient ways to debug JavaScript code. However, when the bit of code you are looking for is only reached after several inputs, getting to it may quickly become tedious. One way I debug the engine (which you will find on the release page) is to dump the input with engine information mixed inside it.
function filterNumbersAndSymbolsStateChange( value ) {
    var span = document.createElement( "span" );
    span.style.color = "red";
    span.innerHTML = "(" + value + ")";
    document.getElementById( "filterNumbersAndSymbols_debug" ).appendChild( span );
    state = value;
}

function filterNumbersAndSymbols( input ) {
    // Clear the output
    var debug = document.getElementById( "filterNumbersAndSymbols_debug" );
    debug.innerHTML = "";
    // Set the initial state
    filterNumbersAndSymbolsStateChange( 0 );
    // Parse the input string
    var idx, oneChar;
    for( idx = 0; idx < input.length; ++idx ) {
        // Read one char from the input
        oneChar = input.charAt( idx );
        // Dump the input
        debug.appendChild( document.createTextNode( oneChar ) );
        if( 0 === state ) {
            // Check if a digit
            if( "0" <= oneChar && oneChar <= "9" ) {
                buffer = [ oneChar ];
                filterNumbersAndSymbolsStateChange( 1 );
            // Check if a symbol
            } else if( -1 < _TOKEN_SYMBOL_LIST.indexOf( oneChar ) ) {
                buffer = [ oneChar ];
                filterNumbersAndSymbolsStateChange( 2 );
            }
            // else ignore and remain in state 0
        } else if( 1 === state ) {
            // Check if a digit
            if( "0" <= oneChar && oneChar <= "9" ) {
                // Add to the collected digits
                buffer.push( oneChar );
            } else {
                // Digits over, output the result
                filterNumbersAndSymbolsOutput( "number", buffer.join( "" ) );
                if( -1 < _TOKEN_SYMBOL_LIST.indexOf( oneChar ) ) {
                    buffer = [ oneChar ];
                    filterNumbersAndSymbolsStateChange( 2 );
                } else {
                    // go back to state 0
                    filterNumbersAndSymbolsStateChange( 0 );
                }
            }
        } else if( 2 === state ) {
            if( -1 < _TOKEN_SYMBOL_LIST.indexOf( oneChar ) ) {
                if( filterNumbersAndSymbolsIsValidSymbol_3( buffer, oneChar ) ) {
                    buffer.push( oneChar );
                } else {
                    filterNumbersAndSymbolsOutput( "symbol", buffer.join( "" ) );
                    filterNumbersAndSymbolsStateChange( 2 );
                    buffer = [ oneChar ];
                }
            } else {
                // Symbol over, output the result
                filterNumbersAndSymbolsOutput( "symbol", buffer.join( "" ) );
                if( "0" <= oneChar && oneChar <= "9" ) {
                    buffer = [ oneChar ];
                    filterNumbersAndSymbolsStateChange( 1 );
                } else {
                    // go back to state 0
                    filterNumbersAndSymbolsStateChange( 0 );
                }
            }
        }
    }
    // Finalize token
    if( 1 === state )
        filterNumbersAndSymbolsOutput( "number", buffer.join( "" ) );
    else if( 2 === state )
        filterNumbersAndSymbolsOutput( "symbol", buffer.join( "" ) );
}

Tuesday, June 18, 2013

Blog and web site update

A quick update of the blog to insert tools used to:
  • reformat JavaScript snippets (using the tokenizer)
  • add some colors (with a custom CSS)
A demonstration? Look at the following snippet:

// This is an example of dynamically reformatted JavaScript code
alert( "Hello world!" );
Everything has been uploaded to GitHub.
The same way, I updated the web site (http://buchholz.free.fr/gpf-js) to reflect the latest updates.

I am preparing a big article on the way the tokenizer works so stay tuned to the blog.

Friday, June 7, 2013

First version on GitHub

Welcome back,

After days of fine tuning, bug fixing and hard work, I finally finished a first version that demonstrates the concepts I exposed in my first posts.

To celebrate this, I created a GitHub account and each version will be published through it: https://github.com/ArnaudBuchholz/gpf-js. On top of that, the latest version will always be published on my personal website: http://buchholz.free.fr/gpf-js/

For instance, to test the release 'compiler', go to:
http://buchholz.free.fr/gpf-js/release.html
To test the debug version:
http://buchholz.free.fr/gpf-js/test.html
And to test the release version:
http://buchholz.free.fr/gpf-js/test.html?release

There are still some bugs that must be addressed, especially in the JavaScript tokenizer, but they don't prevent the 'compiler' from working.

The next steps on this blog are:
- Add some colors and benefit from the tokenizer to add syntax highlighting to the JavaScript snippets
- Explain the techniques used in this first version (dynamic loading of sources, URI creation, code rewriting)

And then, expand the library with new features.

Tuesday, May 7, 2013

Working on...

Working on the tokenizer, I realized that I missed one critical part of it: as you may need to chain several calls to do the full parsing, it is important to have a 'finalization' call to flush any parsing buffers.

This complicated the function call and I was trying to figure out the simplest way to handle it.

To solve the issue, I decided to create two functions: one would be 'single shot' whereas the other would allow consecutive calls.

I am now working on generating a release version from the debug using a lite syntax parsing.

More news soon.

Friday, April 26, 2013

Development choices

It's been a while since I wrote the last post on this blog, but the project is still active and I am working on the very first implementation, which already demonstrates concepts highlighted earlier such as the debug and release versions.
I also opened a GitHub account in order to make the whole thing public and handle version management (please read the following introduction: http://learn.github.com/p/index.html).

However, I am currently struggling with an implementation choice and I thought it was the appropriate moment to explain the kind of problem I like to work on. It is important to keep in mind that I am talking about a personal project (not to say a hobby); consequently, delay & cost are not the most important criteria at that time. Indeed, I prefer to focus on maintenance and reusability which are the key elements of a successful library.

Here is the point: to generate the release version, I will need an efficient JavaScript parser that is capable of:
- Understanding the language syntax: I need to know what a variable declaration is, where the global scope is and that kind of thing...
- Generating events that will be used to parse and rewrite the code: here I found a keyword, then I have an identifier...

By extension, thinking about the future, this may also be used to validate JavaScript code and - ideally - I should not make any assumptions about the way the code will be made available.

To clarify, I initially thought that I should write something like:
function tokenize( jsSource, cbOnToken )
Where cbOnToken would have the following signature:
function cbOnToken( type, text, pos )
The parameter type would provide the recognized token type (i.e. either "separator", "keyword", "identifier", "number", "string", "symbol" or "error"). Then, the parameter text would contain the token's literal text ("if", "myVariable", "\"string\""). Finally, the last parameter pos would provide the position of the token relative to the beginning of the string.
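For instance, a hypothetical callback matching this signature could simply collect the reported tokens (tokenize itself is not shown here, so this is only an illustration of the contract):

```javascript
// Collect each reported token into an array for later processing
var tokens = [];
function cbOnToken( type, text, pos ) {
    tokens.push( { type: type, text: text, pos: pos } );
}
```

Parsing "return true;" would then be expected to produce entries such as { type: "keyword", text: "return", pos: 0 }.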

This looks easy to use and it will probably fit 98% of my needs.
However...

This takes two important (and impacting) assumptions:
1) The whole JavaScript source is available at once (and provided through parameter jsSource)
2) As a consequence of the above, it is not possible to call consecutively the function to chain the parsing (i.e. it is context-less)

Let me clarify what I am thinking of.
For instance, you have two strings: "return" and " true;". In that case, you have two ways to parse the whole content:
- Concatenate the two strings *before* calling the method
- Or call the method twice, provided doing so does not break the way the tokens are recognized
Now what would happen if the strings to parse are the following: "ret", "urn", " t", "ru", "e;"?

See the problem?

Some of you might wonder if I am slowly moving to the mad side of programming, seeing no need for such complex questioning. Again, delay and cost are not the key elements here.

So let's take a break: step back and take a new breath of fresh air; we will try to envision the whole thing.

JavaScript is the native programming language of browsers: we are talking about a world of internet, resources downloading and asynchronous processing.

Whatever the project or the question, I usually try to stick to very simple concepts:
- "KISS" for Keep It Short and Simple (a little bit contradictory with the above but you will see how)
- "The more it can do, the less it will" (which also implies the contrary concept: "if it can do only few, it will be harder to make it do more")
And the most famous one:
- "If it compiles, it works! "… Ok, this one is a private joke but I have seen people thinking like this.

Back to our business, I saw several ways to override the two assumptions:
1) The whole JavaScript source is available at once
We may decide to change the way the source is provided to the function. For instance, I was thinking of defining an interface that would contain a method "Read" used to get the characters from the source (whatever the implementation) and another method "EndOfStream" to know when the end has been reached.
However, how do you handle the situation when no characters are available yet and you have to wait for the next ones?
For instance:
function tokenize( iSourceReader, cbOnToken ) {
    while( !iSourceReader.EndOfStream() ) {
        var text = iSourceReader.Read();
        if( text.length ) {
            /* Parse text and maintain a context */
        } else {
            /* Nothing to do... wait */
        }
    }
}

The main issue I see with this design: when you have to implement a waiting logic, you can put it either in the loop (and then the implementation is fixed) or *in* the iSourceReader object (which moves the complexity into that object).
Another problem, typical of JavaScript in the browser, is that there are situations where you *must* break the execution sequence to make things happen.
For instance, when you use an AJAX request to get some information, you must release the JavaScript execution stack to let the browser send the request and receive the answer. In other terms, it is really difficult - and counter-productive - to actively wait for the AJAX answer: it is usually preferred to handle it through a callback (the browser calls you back whenever the answer is received).
Note that jQuery allows synchronous requests even though it states that this is *not* recommended: http://api.jquery.com/jQuery.ajax/
Hence, to remove this assumption, I will keep the function as simple as possible and make sure that, if the source is split over several strings, the calls can be chained with no problem. This way, I defer the waiting logic to the caller of the method and allow any kind of processing.

2) As a consequence of the above, it is not possible to call consecutively the function to chain the parsing (i.e. it is context-less)
This is the second assumption and one easy way to break it is to manage a parsing context that can be reused in a consecutive call.
Here again, we have several possibilities.
The first solution could be to store this context in a global variable *inside* the library. But then I take another assumption: the caller does only one parsing at a time. Most of the time, this will be true: by default (at least before HTML5), JavaScript is single-threaded and there is little chance that the program tries to parse several sources at the same time. However... :-)
Another solution is to expose this context and offer the possibility to reuse it in another call.
This would change the method signature to:
function tokenize( jsSource, cbOnToken, parsingContext )
where the parsingContext parameter is optional and represents the context of a previous parsing. When not provided, a new context is created.

How do you get it? We might use the function result to transmit the context.
So, coming back to the initial problematic example ("ret", "urn", " t", "ru", "e;"), the code would be:

var parsingContext = tokenize( "ret", cbOnToken );
tokenize( "urn", cbOnToken, parsingContext );
tokenize( " t", cbOnToken, parsingContext );
tokenize( "ru", cbOnToken, parsingContext );
tokenize( "e;", cbOnToken, parsingContext );

As a consequence, if you need to implement a waiting logic between each call, you can do whatever you want!

There are still remaining questions that are important here:
- How do you make sure that the parsing context provided by and to the function actually comes from a previous function call?
- How do you hide implementation details?

For those of you who have investigated the advanced parts of JavaScript, you should know that creating private members is complex (you may use closures or other advanced techniques) and that it is not possible to simply convert an object into a pointer (at least, from what I know today).
It means that the parsing context presented above will probably be a JavaScript object with several public members.
Consequently, the parsing context can be altered or even "simulated" by a custom object.
Is it that dangerous? What are the advantages & drawbacks of this?
Advantages:
- The first obvious advantage I see is the possibility to clone the context in order to back it up and reuse it later.
- Another advantage is the possibility to easily extend the method by offering customization options in this context.
Drawbacks:
- When things are public, they might be used. If a significant change is later implemented in the object, it has to remain backward compatible so that whoever used it (and for whatever reason) can still continue with the latest version.
- Another risk is to have people hack into the code and try to use a function to do more than expected.

Fortunately, there is an easy way to solve the drawbacks and keep the advantages: create a class and document it :-)
NOTE 1: as a first step, a simple object will be used with *no* documentation, meaning that the user of the function must make no assumption about the way the object is built.
NOTE 2: some might say that one easy way to hide the implementation details is to use a redirector to the object. To make a long story short, instead of returning the object, you return a 'pointer' to it (such as an index referring to a hidden table that contains all allocated parsing contexts). This works *but* you must introduce a method to release a previously allocated context (because these objects stay referenced until you call this method).
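To illustrate NOTE 2, the redirector idea could be sketched like this (hypothetical names, not part of the library):

```javascript
// Hypothetical sketch of the redirector: the contexts live in a hidden table
// and the caller only receives an index, never the object itself.
var tokenizerModule = (function(){
    var contexts = {};  // hidden table of all allocated contexts
    var lastId = 0;
    return {
        allocateContext: function() {
            var id = ++lastId;
            contexts[ id ] = { pending: "" };
            return id;  // the 'pointer' given to the caller
        },
        getContext: function( id ) {
            return contexts[ id ];  // for internal use by the library
        },
        releaseContext: function( id ) {
            delete contexts[ id ];  // mandatory cleanup, or the context leaks
        }
    };
})();
```

The caller cannot alter or simulate a context anymore, but forgetting to call releaseContext keeps the object alive forever: this is exactly the drawback mentioned above.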

To conclude, some might be surprised that I do not create a class to achieve this. The main reason is that I want to be able to use the function quickly, without being forced to create an object first and then call a method.
For instance:
var tokenizer = new Tokenizer();
tokenizer.setCallback( cbOnToken );
tokenizer.process( "return true;" );
tokenizer = null; // release the reference (delete does not work on variables)

I do like the idea of using only tokenizer( "return true;", cbOnToken );
To make a long story short, KISS.
And remember, when needed, I can still create a class to encapsulate this (where the contrary is more complex).

Friday, April 5, 2013

Release and debug versions

The jQuery example (again)

Did you try to open jQuery.js? If you do so - especially with the 'compressed' version - you will see that it is almost unreadable. The whole code stands on a few lines and everything looks encrypted. Fortunately, with the help of the Online JavaScript beautifier, one can make the source easier to read.

But why did the developers write the code this way?

Actually, the 'real' development version probably looks more like this one: indeed, it contains comments and the variable names are not shortened to one character.

So what happened and why?

The need for a "release" version

There are several advantages of publishing a condensed version: the most important one is the fact that it contains only what is necessary for the implementation. Indeed, all comments are removed; variables and non-exposed members can be renamed to reduce the file size. The release version may also be designed in a way that eliminates all the debugging stuff (asserts, traces, performance monitors...).

As a consequence, the source is smaller (which means it loads faster in the browser), and this may also slightly speed up the parsing and execution of the code.
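As an illustration (a made-up function, not actual jQuery code), here is the same routine in its debug form and in a hand-condensed release form:

```javascript
// Debug version: comments, readable names, whitespace.
function computeAverage( values ) {
    // Guard against an empty input
    if ( values.length === 0 ) {
        return 0;
    }
    var total = 0;
    for ( var index = 0; index < values.length; ++index ) {
        total += values[ index ];
    }
    return total / values.length;
}

// Release version: comments stripped, identifiers shortened, whitespace removed.
function computeAverage2(v){var t=0,i=0;if(!v.length)return 0;for(;i<v.length;++i)t+=v[i];return t/v.length}
```

Both behave identically; only the size (and the readability) differs.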

Another effect of this unreadable version is to make code intrusion harder. Indeed, most interpreted languages like JavaScript do not include any compilation step. It means that when a software component is developed with them, the only way to publish it is to provide the source. Hence, it can easily be analyzed, copied or tweaked.
By extension, programs written in any language - at least the ones I know - are published in a way that allows reverse engineering. Even C or C++ sources are compiled into executables that - in the end - are a list of processor instructions which can be translated back into an algorithm (but this is very difficult).

The need for a debug version

On the other hand, the debug version drops all the advantages of the release one. But, as a counterpart, it offers the possibility to troubleshoot with a readable version of the code. Additional tests and traces can be provided in order to validate the algorithms.

Functionally speaking, it is - at least - equivalent to the release version but with a bigger size and slower execution speed.

Sources and generation

Sources

There are many different ways to build both versions: they can be maintained independently, or one may be generated out of the other. In the latter case, the debug version is maintained by the developers (as it contains comments and readable code) and an action is required to produce the release version.

The debug version itself can be generated out of several files (sources) that are grouped together during another generation process. Usually, the bigger the project the more small units are required to simplify maintenance.

Generating the release version

As far as I can tell, it is simpler to automatically generate the release version from the debug one: it means that there is only one code base to maintain and the release version can be obtained quickly. In any case, the developers have to make sure that the release version behaves like the debug one: the generation process must not introduce any defects.

Different methods can be used to generate the release version, from the simplest ones (automated search and replace) to more advanced ones (source code analysis and rewriting).
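As an example of the simplest method, here is a naive comment stripper based on search and replace. This is a sketch only: a real tool must actually parse the source, since this version would, for instance, corrupt a string literal containing "//".

```javascript
// Naive release-generation step: remove block and line comments.
function stripComments( jsSource ) {
    return jsSource
        .replace( /\/\*[\s\S]*?\*\//g, "" )  // block comments
        .replace( /\/\/[^\n]*/g, "" );       // line comments
}
```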

Automating the tests

To guarantee that the release version is functionally equivalent to the debug one (and also because I am too lazy to document everything), one simple way is to provide a testing procedure that can be applied to each version and that generates a comprehensive report highlighting the defects.

This is also useful for the non-regression tests, i.e. to answer the following question: how do you verify that the new version of your code behaves like the previous one?
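Such a testing procedure could be sketched as follows (a hypothetical helper, assuming both versions expose the same entry point):

```javascript
// Apply every test case to both versions and report each difference:
// an empty report means the two versions are functionally equivalent.
function compareVersions( cases, debugFn, releaseFn ) {
    var report = [];
    cases.forEach( function( input ) {
        var debugResult = JSON.stringify( debugFn( input ) );
        var releaseResult = JSON.stringify( releaseFn( input ) );
        if ( debugResult !== releaseResult ) {
            report.push( "input " + JSON.stringify( input )
                + ": debug=" + debugResult
                + " release=" + releaseResult );
        }
    } );
    return report;
}
```

The same harness answers the non-regression question: run it with the previous version on one side and the new one on the other.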

My choices

I realize that if I wanted to be exhaustive, I would probably need more time than I really have and the articles would be far longer. I obviously plan to have a debug version composed of several files loaded at the same time, and the release version will be generated from the debug one.

I hope to have enough time and will to describe all these techniques in this blog.

Saturday, March 30, 2013

Hiding implementation details

Introduction

I started to seriously consider JavaScript as a development language three years ago, when I tried to implement a card game simulator (the Pokemon one). You will find some running code here: PokeMul. The project is neither finished nor polished, but a Flash version of the game has been officially released and it outshines my own realisation (Pokemon TCG). This is why I decided to focus on putting in common tools and techniques that could be reused in other projects.

I have been impressed by jQuery, which offers a rich, comprehensive and widely accepted API, and my goal is to achieve the same kind of success (let's be optimistic). More seriously, I have already learned a lot by doing this; I hope it will be helpful to someone.


Before going any further, I assume that you are familiar with the JavaScript language and some of its "advanced" concepts (objects, prototypes, closures...).
If not, I highly advise you to visit this well documented learning site: Codecademy.

Most of the libraries you will find on the net are delivered as a single JavaScript file (for instance: jquery.js). Once loaded and evaluated in your HTML page, it extends the execution context with its own API.
Very often, those libraries have a hidden part that contains the implementation details. For instance, some variables are kept as caches to store information. In the same way, some methods - known only by the library developers - are used to factorize code.

My goal is to understand how to control what is exposed and what is not.

Demonstration

I created a page to demonstrate the different ways to declare things and how they appear in the global context: demonstration. Don't hesitate to use "view source" on the page to see how it is coded.

How does it work?


The starting point is to enumerate all the members of the global context to check what becomes available to the user of a library.

As this enumeration is done several times, everything is stored in an object (used like a dictionary). It allows me to see what is different between two consecutive calls.

// Maintain the known properties inside a global map
var members = { _count: 0 };
// The following are "known" because declared below (they are not 'counted')
members[ "members" ] = 0;
members[ "echo" ] = 0;
members[ "enumMembers" ] = 0;
members[ "addInclude" ] = 0;

The enumeration function itself uses the for ... in syntax to enumerate members. By default, it is applied to the global object.
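The enumeration function could look like this (a hypothetical self-contained reconstruction, not the exact code of the demonstration page):

```javascript
// Maintain the known properties inside a global map
var members = { _count: 0 };

// Enumerate a scope and return the members that were not known before
function enumMembers( object ) {
    object = object || window; // by default, the browser's global object
    var newMembers = [];
    for ( var name in object ) {
        if ( !members.hasOwnProperty( name ) ) {
            members[ name ] = ++members._count; // remember when it appeared
            newMembers.push( name );
        }
    }
    return newMembers;
}
```

Calling it twice on the same scope returns the new names only the first time, which is exactly what makes the difference between two consecutive calls visible.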

You will also notice the use of an addInclude function that is capable of dynamically inserting a script tag and waiting for the loading to be done (based on the jQuery implementation). This will be detailed later.

What does it demonstrate?

In the first example, you see that the global variable isItPrivate1 appears before it has been declared. This is inherent to JavaScript, which declares the functions and variables of a scope before executing its code (a mechanism known as hoisting).
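The behaviour is easy to reproduce in isolation (a stand-alone illustration, not taken from the demonstration page):

```javascript
// The var declaration is hoisted to the top of the scope, so the name
// already exists - with the value undefined - before the assignment runs.
var typeBefore = typeof isItPrivate1; // "undefined"
var isItPrivate1 = "Yes!";
var typeAfter = typeof isItPrivate1;  // "string"
```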

On the contrary, by dividing the script tag in two parts, the second example shows that you can control when the variable will be declared: isItPrivate2 appears only on the third enumeration.

In any case, both examples are not relevant because they don't demonstrate the use of a separate javascript file.

I created the third example to show that if you declare variables in an external JavaScript file, they are merged with the current context once the script is loaded. It means that if the library needs global variables, the developer has to control their visibility: it has to be done differently.

I wanted to see how jQuery extends the context, so I loaded it. As a result, only two new members are inserted in the global context: $ (the famous jQuery shortcut) and jQuery.
On the other hand, enumerating the jQuery object content shows lots of members that are not necessarily the ones documented.

Finally, by creating a closure, the last example demonstrates how an object can be declared to expose a function that uses a variable that is not in the current context:


(function(){
    var isItPrivate4 = "Yes!";
    this[ "example4" ] = {
        isItPublic4: function(){
            return isItPrivate4;
        }
    };
})();

This will be the preferred solution to develop the library.

To conclude

I am not pretending that everything above is crystal clear, and it would probably require more editing to be fully understandable (do not hesitate to comment). My goal is to explain how I selected the implementation and why: by limiting what the library exposes, I can significantly reduce the size of the context (which is really helpful when debugging) and secure the code by creating variables that cannot be modified outside of the library.

Monday, March 25, 2013

The three W

As this is the first message of this new blog, I felt it was necessary to explain the three W: Who, What and Why.
I won't detail the Who too much: I started programming early and my first "workstation" was an 8-bit home computer that Santa brought me when I was 8. Since this first hardware, I have tried Basic (many different ones), Assembly (several ones), C, C++, Pascal and other languages. Naturally, I became a software developer, but not only: I was quickly involved in other responsibilities where programming was not the only required skill. Also, I should mention that English is not my native language, so... please excuse any mistake I may make.
The content of this blog (What) is mostly related to my developments and the thinking behind them: this is why you will find articles related to JavaScript and the things I am currently trying to achieve. I started a project some time ago which received some good feedback, but I quickly realized that I needed to formalize some concepts in order to progress efficiently towards scalable software.
Finally, the Why: I have recently been told that I was a little egotistic. I can't deny that I like to share my ideas and enjoy implementing them. However, I try to keep an eye on what's going on and I like to hear feedback that helps me progress. I can't predict whether I will have any kind of success - and this is not what I am looking for - but I would like to share my thoughts and get feedback about them.

This being said, I will try to post as often as possible in order to show you my progress and take into account all the comments (... if any ...).

Have a good reading,
- Arnaud