Friday, April 26, 2013

Development choices

It's been a while since I wrote the last post on this blog, but the project is still active: I am working on the very first implementation, which already demonstrates concepts highlighted earlier such as the debug and release versions.
I also opened a GitHub account in order to make the whole thing public and handle version management (please read the following introduction: http://learn.github.com/p/index.html).

However, I am currently struggling with an implementation choice and I thought it was the appropriate moment to explain the kind of problem I like to work on. It is important to keep in mind that I am talking about a personal project (not to say a hobby); consequently, delay and cost are not the most important criteria at this time. Indeed, I prefer to focus on maintainability and reusability, which are the key elements of a successful library.

Here is the point: to generate the release version, I will need an efficient JavaScript parser that is capable of:
- Understanding the language syntax: I need to know what a variable declaration is, where the global scope is and these kinds of things...
- Generating events that will be used to parse and rewrite the code: here I found a keyword, then I have an identifier…

By extension, thinking about the future, this may also be used to validate JavaScript code and - ideally - I should not make any assumption about the way the code will be made available.

To clarify, I initially thought that I should write something like:
function tokenize( jsSource, cbOnToken )
Where cbOnToken would have the following signature:
function cbOnToken( type, text, pos )
The parameter type would provide the recognized token type (i.e. either "separator", "keyword", "identifier", "number", "string", "symbol" or "error"). Then, the parameter text would contain the token literal text ("if", "myVariable", "\"string\""). Finally, the last parameter pos would provide the position of the token relative to the beginning of the string.
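
To make the signature concrete, here is a minimal sketch of such a function. It is only an illustration under several assumptions: the keyword list is shortened, "separator" is taken to mean whitespace, only double-quoted strings and simple decimal numbers are recognized, and the regular expression is my own shortcut, not a complete JavaScript lexer.

```javascript
// Hypothetical, shortened keyword list: enough for the example only.
var KEYWORDS = ["if", "else", "return", "var", "function", "while"];

function tokenize(jsSource, cbOnToken) {
    // One capture group per token type; the first alternative that matches wins.
    // 1: separator (whitespace)  2: word  3: number  4: string
    // 5: symbol (any other single character)  6: unterminated string quote
    var tokenPattern = /(\s+)|([A-Za-z_$][\w$]*)|(\d+(?:\.\d+)?)|("(?:[^"\\\n]|\\.)*")|([^\s\w$"])|(")/g;
    var match;
    while ((match = tokenPattern.exec(jsSource)) !== null) {
        var text = match[0];
        var type;
        if (match[1] !== undefined) {
            type = "separator";
        } else if (match[2] !== undefined) {
            type = KEYWORDS.indexOf(text) >= 0 ? "keyword" : "identifier";
        } else if (match[3] !== undefined) {
            type = "number";
        } else if (match[4] !== undefined) {
            type = "string";
        } else if (match[5] !== undefined) {
            type = "symbol";
        } else {
            type = "error";
        }
        cbOnToken(type, text, match.index);
    }
}

// Usage: collect every token except the separators.
var tokens = [];
tokenize('if (myVariable) return "string";', function (type, text, pos) {
    if (type !== "separator") {
        tokens.push(type + ":" + text + "@" + pos);
    }
});
console.log(tokens.join("\n"));
```

The callback receives exactly the three parameters described above, so the caller decides what to keep (here, everything but the separators).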

This looks easy to use and it will probably fit 98% of my needs.
However...

This makes two important (and consequential) assumptions:
1) The whole JavaScript source is available at once (and provided through parameter jsSource)
2) As a consequence of the above, it is not possible to call the function consecutively to chain the parsing (i.e. it is context-less)

Let me clarify what I am thinking of.
For instance, you have two strings: "return" and " true;". In that case, you have two ways to parse the whole content:
- Concatenate the two strings *before* calling the method
- Or call the method twice, which still works because no token is split across the two strings
Now what would happen if the strings to parse are the following: "ret", "urn", " t", "ru", "e;"?

See the problem?
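
The problem can be reproduced in a few lines. The sketch below uses a deliberately naive, stateless splitter (the names naiveTokenize and collect are hypothetical and the callback is reduced to the token text): it recognizes words and single non-space characters, and forgets everything between two calls.

```javascript
// Deliberately naive: no state is kept from one call to the next.
function naiveTokenize(jsSource, cbOnToken) {
    var pattern = /[A-Za-z_$][\w$]*|\S/g;
    var match;
    while ((match = pattern.exec(jsSource)) !== null) {
        cbOnToken(match[0]);
    }
}

// Calls naiveTokenize once per chunk and gathers all the token texts.
function collect(chunks) {
    var texts = [];
    chunks.forEach(function (chunk) {
        naiveTokenize(chunk, function (text) { texts.push(text); });
    });
    return texts;
}

var whole = collect(["return true;"]);                    // ["return", "true", ";"]
var friendly = collect(["return", " true;"]);             // ["return", "true", ";"]
var hostile = collect(["ret", "urn", " t", "ru", "e;"]);  // ["ret", "urn", "t", "ru", "e", ";"]
```

With the last split, the keyword is reported as two unrelated identifiers: the function needs to remember that a token was still in progress when the previous string ended.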

Some of you might wonder whether I am drifting toward the mad side of programming, and may not see the need for such complex questioning. Again, delay and cost are not the key elements here.

So let's take a break: step back and take a new breath of fresh air; we will try to envision the whole thing.

JavaScript is the native programming language of browsers: we are talking about a world of internet, resources downloading and asynchronous processing.

Whatever the project or the question, I usually try to stick to very simple concepts:
- "KISS" for Keep It Short and Simple (a little bit contradictory with the above but you will see how)
- "The more it can do, the less it will" (which also implies the contrary concept: "if it can do only a little, it will be harder to make it do more")
And the most famous one:
- "If it compiles, it works! "… Ok, this one is a private joke but I have seen people thinking like this.

Back to our business, I saw several ways to override the two assumptions:
1) The whole JavaScript source is available at once
We may decide to change the way the source is provided to the function. For instance, I was thinking of defining an interface that would contain a method "Read" used to get the characters from the source (whatever the implementation) and another method "EndOfStream" to know when the end has been reached.
However, how do you handle the situation when no characters are available yet and you have to wait for the next ones?
For instance:
function tokenize( iSourceReader, cbOnToken ) {
    while( !iSourceReader.EndOfStream() ) {
        var text = iSourceReader.Read();
        if( text.length ) {
            /* Parse text and maintain a context */
        } else {
            /* Nothing to do... wait */
        }
    }
}
The main issue I see with this design: when you have to implement a waiting logic, you can do it either in the loop (and then the implementation is fixed once and for all) or *in* the iSourceReader object (and then you push this complexity into the object).
Another problem that is typical of JavaScript in the browser is that you have situations where you *must* break the execution sequence to make things happen.
For instance, when you use an AJAX request to get some information, you must release the JavaScript execution stack to let the browser send the request and receive the answer. In other words, it is really difficult - and counter-productive - to wait 'actively' for the AJAX answer: it is usually preferable to handle it through a callback (the browser calls you back whenever the answer is received).
Note that jQuery allows it even if it says that it is *not* recommended: http://api.jquery.com/jQuery.ajax/
Hence, to remove this assumption, I will keep the function as simple as possible and make sure that if the source is split over several strings, the calls can be chained with no problem. This way, I defer the waiting logic to the caller of this method and I allow any kind of processing.

2) As a consequence of the above, it is not possible to call consecutively the function to chain the parsing (i.e. it is context-less)
This is the second assumption and one easy way to break it is to manage a parsing context that can be reused in a consecutive call.
Here again, we have several possibilities.
The first solution could be to have this context stored in a global variable *inside* the library. Again, I would be making another assumption: the caller does only one parsing at a time. Most of the time, it will be true: by default (at least before HTML5), JavaScript is single-threaded and there is little chance that the program tries to parse several sources at the same time. However… :-)
Another solution is to expose this context and offer the possibility to reuse it in another call.
This would change the method signature to:
function tokenize( jsSource, cbOnToken, parsingContext )
Where the parsingContext parameter is optional and represents the context of a previous parsing. When it is not provided, a new context is created.

How do you get it? We might use the function result to transmit the context.
So coming back to the initial problematic example ("ret", "urn", " t", "ru", "e;"), the code would be:
var parsingContext = tokenize( "ret", cbOnToken );
tokenize( "urn", cbOnToken, parsingContext );
tokenize( " t", cbOnToken, parsingContext );
tokenize( "ru", cbOnToken, parsingContext );
tokenize( "e;", cbOnToken, parsingContext );
As a consequence, if you need to implement a waiting logic in between each call, you can do whatever you want!
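
Here is one possible sketch of such a chainable function. It is an assumption-laden illustration and not the final implementation: it only handles words and single-character symbols (numbers and strings are left out), it skips separators, and the context members (word, wordPos, offset) as well as the use of null to signal the end of the source are hypothetical choices of mine.

```javascript
// Hypothetical, shortened keyword list.
var KEYWORDS = ["if", "else", "return", "var", "function", "while"];

function tokenize(jsSource, cbOnToken, parsingContext) {
    // Reuse the context of a previous call or create a new one.
    // word:    characters of a word that may continue in the next string
    // wordPos: position of that word from the very first string
    // offset:  number of characters consumed so far, all calls included
    var context = parsingContext || { word: "", wordPos: 0, offset: 0 };

    function flushWord() {
        if (context.word.length) {
            var type = KEYWORDS.indexOf(context.word) >= 0 ? "keyword" : "identifier";
            cbOnToken(type, context.word, context.wordPos);
            context.word = "";
        }
    }

    if (jsSource === null) { // assumption: null means "end of source"
        flushWord();
        return context;
    }
    for (var i = 0; i < jsSource.length; ++i) {
        var ch = jsSource.charAt(i);
        if (/[\w$]/.test(ch)) {
            if (!context.word.length) {
                context.wordPos = context.offset;
            }
            context.word += ch; // the word is not emitted yet: it may continue
        } else {
            flushWord();
            if (!/\s/.test(ch)) {
                cbOnToken("symbol", ch, context.offset);
            }
        }
        ++context.offset;
    }
    return context;
}

// Usage: the five strings give the same tokens as "return true;" in one call.
var tokens = [];
function cbOnToken(type, text, pos) {
    tokens.push(type + ":" + text + "@" + pos);
}
var parsingContext = tokenize("ret", cbOnToken);
tokenize("urn", cbOnToken, parsingContext);
tokenize(" t", cbOnToken, parsingContext);
tokenize("ru", cbOnToken, parsingContext);
tokenize("e;", cbOnToken, parsingContext);
tokenize(null, cbOnToken, parsingContext); // flush a possible trailing word
```

The price of chaining is visible here: a word can only be emitted once the character that follows it is known, hence the need for some way to tell the function that the source is finished.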

There are still remaining questions that are important here:
- How do you make sure that the parsing context provided by and to the function actually comes from a previous function call?
- How do you hide implementation details?

Those of you who have investigated the advanced parts of JavaScript should know that creating private members is complex (you may use closures or other advanced techniques) and that it is not possible to simply convert an object into a pointer (at least, from what I know today).
It means that the parsing context presented above will probably be a JavaScript object with several public members.
Consequently, the parsing context can be altered or even "simulated" by a custom object.
Is it that dangerous? What are the advantages & drawbacks of this?
Advantages:
- The first obvious advantage I see is the possibility to clone the context in order to back it up and reuse it later
- Another advantage is the possibility to easily extend the method by providing customization possibilities in this context
Drawbacks:
- When things are public, it means that they might be used. Later, if a significant change is implemented in the object, it has to be backward compatible to make sure that whoever used it (and whatever the reason) can still continue with the latest version
- Another risk is to have people hacking into the code and trying to use a function to do more than expected

Fortunately, there is an easy way to solve the drawbacks and keep the advantages: create a class and document it :-)
NOTE 1: as a first step, a simple object would be used with *no* documentation, meaning that the user of the function must not make any assumption about the way the object is built.
NOTE 2: some might say that one easy solution for hiding the implementation details is to use a redirector to the object. To make a long story short, instead of returning the object, you return a 'pointer' to it (such as an index referring to a hidden table that contains all allocated execution contexts). This works *but* you must introduce a method to release a previously allocated context (because these objects would still be referenced until you call this method).
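
The redirector idea could look like the sketch below. All the names (contextRegistry, allocate, resolve, release) and the content of the context object are hypothetical; the point is only to show the hidden table and the cleaning method it makes necessary.

```javascript
var contextRegistry = (function () {
    var contexts = {};   // hidden table: handle -> context object
    var nextHandle = 1;

    return {
        // Creates a context and returns an opaque handle instead of the object.
        allocate: function () {
            var handle = nextHandle++;
            contexts[handle] = { word: "", wordPos: 0, offset: 0 };
            return handle;
        },
        // Used inside the library only: a forged or released handle is rejected.
        resolve: function (handle) {
            if (!Object.prototype.hasOwnProperty.call(contexts, handle)) {
                throw new Error("Unknown parsing context handle: " + handle);
            }
            return contexts[handle];
        },
        // Mandatory clean-up: until this is called, the table keeps the
        // context alive and it can never be garbage collected.
        release: function (handle) {
            delete contexts[handle];
        }
    };
}());

// Usage
var handle = contextRegistry.allocate();
contextRegistry.resolve(handle).word = "ret"; // what the library would do
contextRegistry.release(handle);              // what the caller must not forget
```

This also answers the first question above (a handle that was not produced by the library is detected), at the cost of an explicit release step.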

To conclude, some might be surprised that I do not create a class to achieve this. The main reason is that I want to be able to use it quickly, without being forced to create an object first and then call a method.
For instance:
var tokenizer = new Tokenizer();
tokenizer.setCallback( cbOnToken );
tokenizer.process( "return true;" );
tokenizer = null; // release the reference (delete has no effect on a variable)
I much prefer the idea of using only:
tokenize( "return true;", cbOnToken );
To make a long story short, KISS.
And remember, when needed, I can still create a class to encapsulate this (whereas the reverse is more complex).

Friday, April 5, 2013

Release and debug versions

The jQuery example (again)

Did you try to open jQuery.js? If you do - especially the 'compressed' version - you'll see that it is almost unreadable. The whole code fits on a few lines and everything looks encrypted. Fortunately, with the help of the Online JavaScript beautifier, one can make the source easier to read.

But why did the developer write the code this way?

Actually, the 'real' development version probably looks more like this one: indeed, it contains comments and the variable names are not shortened to one character.

So what happened and why?

The need for a "release" version

There are several advantages of publishing a condensed version: the most important one is the fact that it contains only what is necessary for the implementation. Indeed, all comments are removed; variables and non-exposed members can be renamed to reduce the file size. The release version may also be designed in a way that eliminates all the debugging stuff (asserts, traces, performance monitors...).

As a consequence, the source is smaller (which means it loads faster in the browser) and this should speed up the execution of code.
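
As a small illustration (a hand-made example, not taken from jQuery), here is the same hypothetical library in both forms. The exposed member average keeps its name; comments, the debug-only check and the readable local names are gone from the release form.

```javascript
// Debug version: comments, descriptive names and a debug-only check.
var debugLib = {
    average: function (values) {
        // Debug-only check: refuse an empty array instead of returning NaN
        if (!values.length) {
            throw new Error("average: empty array");
        }
        var total = 0;
        for (var index = 0; index < values.length; ++index) {
            total += values[index];
        }
        return total / values.length;
    }
};

// Release version, condensed by hand for this example.
var releaseLib={average:function(a){for(var b=0,c=0;c<a.length;++c)b+=a[c];return b/a.length}};

console.log(debugLib.average([2, 4, 9]), releaseLib.average([2, 4, 9]));
```

Both versions return the same result for valid input; they differ only in what happens when the debug check would have fired.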

Another effect of this unreadable version is to make code intrusion harder. Indeed, most interpreted languages like JavaScript do not include any compiling process. It means that when a software component is developed with one of them, the only way to publish it is to provide the source. Hence, it can be easily analyzed, copied or tweaked.
By extension, programs written in any programming language - at least the ones I know - are published in a way that allows reverse engineering. Even C or C++ programs are compiled into executables that - in the end - are a list of processor instructions that can be translated back into an algorithm (but this is very difficult).

The need for a debug version

On the other hand, the debug version drops all the advantages of the release one. But, in return, it offers the possibility to troubleshoot the code using a readable version of it. Additional tests and traces could be provided in order to test the algorithms.

Functionally speaking, it is - at least - equivalent to the release version but with a bigger size and slower execution speed.

Sources and generation

Sources

There are many different ways to build both versions: they can be independent or one may be generated out of the other. In that case, the debug version is maintained by the developers (as it contains comments and readable code) and an action is required to output the release version.

The debug version itself can be generated out of several files (sources) that are grouped together during another generation process. Usually, the bigger the project the more small units are required to simplify maintenance.

Generating the release version

As far as I can tell, it is simpler to automatically generate the release version from the debug one: it means that there is only one code base to maintain and the release version can be obtained quickly. In any case, the developers have to make sure that the release version behaves like the debug one: the generation process must not introduce any defects.

Different methods can be used to generate the release version, from the simplest ones (automated search and replace) to the more advanced (source code analysis and rewriting).
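
A sketch of the simplest method could look like this. The function name makeRelease and the DEBUG_ASSERT marker are hypothetical, and the approach is knowingly fragile: a search and replace does not understand the language, so a string literal containing "/*" would be damaged. This is precisely why a real parser is the more advanced alternative.

```javascript
function makeRelease(debugSource) {
    return debugSource
        .replace(/\/\*[\s\S]*?\*\//g, "")                     // block comments
        .replace(/^[ \t]*\/\/.*$/gm, "")                      // whole-line // comments
        .replace(/^[ \t]*DEBUG_ASSERT\(.*\);[ \t]*$/gm, "")   // debug-only statements
        .replace(/\s*\n\s*/g, "\n")                           // indentation and empty lines
        .trim();
}

// Usage on a tiny debug source
var debugSource = [
    "/* Adds two numbers */",
    "function add(first, second) {",
    "    // Both operands must be numbers",
    "    DEBUG_ASSERT(typeof first === \"number\");",
    "    return first + second;",
    "}"
].join("\n");

var releaseSource = makeRelease(debugSource);
console.log(releaseSource);
```

Renaming variables or removing the remaining line breaks safely is out of reach for this method, since it requires knowing where identifiers and statements start and end.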

Automating the tests

To guarantee that the release version is functionally equivalent to the debug one (and also because I am too lazy to document everything) one simple way is to provide a testing procedure that can be applied on each version and that should generate a comprehensive report highlighting the defects.

This is also useful for the non-regression tests, i.e. to answer the following question: how do you verify that the new version of your code behaves like the previous one?
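
The idea can be sketched as follows: one testing procedure, applied unchanged to each version, that returns a report. Everything here is hypothetical (the clamp function, the test cases, the report layout), and the release version contains a defect on purpose to show what the report would highlight.

```javascript
var debugVersion = {
    // Restricts value to the [min, max] range
    clamp: function (value, min, max) {
        return Math.min(Math.max(value, min), max);
    }
};

// Release version with a defect introduced on purpose for this example:
// Math.min and Math.max have been swapped.
var releaseVersion = {clamp:function(a,b,c){return Math.max(Math.min(a,b),c)}};

// The same procedure is applied to any version of the library.
function runTests(versionName, library) {
    var cases = [
        { args: [5, 0, 10], expected: 5 },
        { args: [-3, 0, 10], expected: 0 },
        { args: [42, 0, 10], expected: 10 }
    ];
    var report = { version: versionName, passed: 0, defects: [] };
    cases.forEach(function (testCase) {
        var actual = library.clamp.apply(null, testCase.args);
        if (actual === testCase.expected) {
            ++report.passed;
        } else {
            report.defects.push("clamp(" + testCase.args.join(", ") + ") returned " +
                actual + ", expected " + testCase.expected);
        }
    });
    return report;
}

var debugReport = runTests("debug", debugVersion);
var releaseReport = runTests("release", releaseVersion);
console.log(debugReport, releaseReport);
```

Keeping the reports of a previous version and comparing them with the new ones is one way to cover the non-regression question as well.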

My choices

I realize that if I wanted to be exhaustive, I would probably need more time than I really have and the articles would be much longer. I obviously plan to have a debug version composed of several files loaded at the same time. Also, the release version would be generated from the debug version.

I hope to have enough time and willpower to describe all the techniques in this blog.