Tuesday, March 30, 2010

CommonJS Module Trade-offs

First of all: why should you care about module formats?

If you use JavaScript, particularly in the browser, more is being expected of you each day. Every site or webapp that you build will want to do more things over time, and browser engines are getting faster, making more complex, web-native experiences possible. Having modular code makes it much easier to build these experiences.

One wrinkle though, there is no standard module format for the browser. There is the very useful Module Pattern, that helps encapsulate code to define a module, but there is no standard way to indicate your module's dependencies.

I have been following some of the threads in the CommonJS mailing list about trying to come up with a require.async/ensure spec and a Transport spec. The reason those two specs are needed in addition to the basic module spec is because the CommonJS module spec decided to make some tradeoffs that were not browser-friendly.

This is my attempt to explain the trade-offs the CommonJS module spec has made, and why I believe they are not the right trade-offs. The trade-offs end up creating a bunch of extra work and gear that is needed in the browser case -- to me, the most important case to get right.

I do not expect this to influence or change the CommonJS spec -- the developers that make up most of the list seem to generally like the module format as written. At least they agreed on something. It is incredibly hard to get a group of people to code in a certain direction, and I believe they are doing it because they love coding and want to make it easier.

I want to point out the trade-offs made though, and suggest my own set of trade-offs. Hopefully by explicitly listing them out, other developers can make informed choices on what they want to use for their project.

Most importantly, just because "CommonJS" is used for the module spec, it should not be assumed that it is an optimal module spec for the browser, or that it should be the default choice for a module spec.

Disclosure: I have a horse in this race, RequireJS, and much of its design comes from a different set of tradeoffs that I will list further down. I am sure someone who prefers the CommonJS spec might have a different take on the trade-offs.

To the trade-offs:

1) No function for encapsulating a module.

A function around a module can seem like more boilerplate. Instead each module in the CommonJS spec is just a file. This means only one module per file. This is fine on the server or local disk, but not great in the browser if you want performance.

2) Referencing and loading dependencies synchronously is easier than asynchronous

In general, sync programming is easier to do. That does not work so well in the browser though.

3) exports

How do you define the module value that other modules can use? If a function was used around the module, a return value from that function could be used as the module definition. However, in the effort to avoid a function wrapper, it complicates setting up a return value. The CommonJS spec instead uses a free variable called "exports".

The value of exports is different for each module file, and it means that you can only attach properties to the exports module. Your module cannot assign a value to exports.

It means you cannot make a function as the module value. Some frameworks use constructor functions as the module values -- these will not be possible in CommonJS modules. Instead you will need to define a property on the exports object that holds the function. More typing for users of your module.

Using an exports object has an advantage: you can pass it to circular dependencies, and it reduces the probability of an error in a circular dependency case. However, it does not completely avoid circular dependency problems.

Instead, I favor these trade-offs:

1) Use a function to encapsulate the module.

This is basically the core of the previously-mentioned Module Pattern. It is in use today, it is an understood practice, and functions are at the core of JavaScript's built-in modularity.

While it is an extra function(){} to type, it is fairly standard to do this in JavaScript. It also means you can put more than one module in a file.

While you should avoid multiple modules in a file while developing, being able to concatenate a bunch of modules together for better performance in the browser is very desirable.

2) Assume async dependencies

Async performs better overall. While it may not help performance much in the server case, making sure a format performs well out of the box in the browser is very important.

This means module dependencies must be listed outside the function that defines the module, so they can be loaded before the module function is called.

3) Use return to define modules

Once a function is used to encapsulate the module, the function can return a value to define the module. No need for exports.

This fits more naturally with basic JavaScript syntax, and it allows returning functions as the module definition. Hooray!

There is a slightly higher chance of problems in circular dependency cases, but circular dependencies are rare, and usually a sign of bad design. There are valid cases for having circular dependencies, but the cases where a return value might be a problem for a circular dependency case is very small, and can be worked around.

If getting function return values means a slightly higher probability of a circular dependency error (which has a mitigation) then that is the good trade-off.

This avoids the need for the "exports" variable. This is fairly important to me, because exports has always looked odd to me, like it did not belong. It requires extra discovery to know its purpose.

Return values are more understandable, and allowing your module to return a function value, like a constructor function, seems like a basic requirement. It fits better with basic JavaScript.

4) Pass in dependencies to the module's function wrapper

This is done to decrease the amount of boilerplate needed with a function wrapped modules. If this is not done, you end up typing the dependency name twice (an opportunity for error), and it does not minify as well.

An example: let's define a module called "foo", which needs the "logger" module to work:

require.def("foo", ["logger"], function () {

//require("logger") can be a synchronous call here, since
//logger was specified in the dependency array outside
//the module function
require("logger").debug("starting foo's definition");

//Define the foo object
return {
name: "foo"
};
});
Compare with a version that passes in "logger" to the function:

require.def("foo", ["logger"], function (logger) {

//Once "logger" module is loaded it is passed
//to this function as the logger function arg
logger.debug("starting foo's definition");

//Define the foo object
return {
name: "foo"
};
});

Passing in the module has some circular dependency hazards -- logger may not be defined yet if it was a circular dependency. So the first style, using require() inside the function wrapper should still be allowed. For instance, require("logger") inside a method that is created on the foo object could be used to avoid the circular dependency problem.

So again, I am making a trade-off where the more common useful case is easier to code vs increasing the probability of circular dependency issues. Circular dependencies are rare, and the above has a mitigation via the use of require("modulename").

There is another hazard that can happen with naming args in the function for each dependency. You can get an off-by-one problem:

require.def("foo", ["one", "two", "three"], function (one, three) {
//In here, three is actually pointing to the "two" module
});
However, this is a standard coding hazard, not matching inputs args to a function. And there is mitigation, you could use require("three") inside the module if you wanted.

The convenience and less typing of having the argument be the module is useful. It also fits well with JSLint -- it can help catch spelling errors using the argument name inside the function.

5) Code the module name inside the module

To define the foo module, the name "foo" needs to be part of the module definition:

require.def("foo", ["logger"], function () {});
This is needed because we want the ability to combine multiple module definitions into one file for optimization. In addition, there is no good way to match a module definition to its name in the browser without it.

If script.onload fired exactly after the script is executed, not having the module name in the module definition might work, but this is not the case across browsers. And we still need to allow the name to be there for optimization case, where more than one module is in a file.

There is a legitimate concern that encoding the module name in the module definition makes it hard to move around code -- if you want to change the directory where the module is stored, it means touching the module source to change the names.

While that can be an issue, in Dojo we have found it is not a problem. I have not heard complaints of that specific issue. I am sure it happens, but the fix cost is not that onerous. This is not Java. And YUI 3 does something similar to Dojo, encode a name with the module definition.

I think the rate of occurrence of this issue, and the work it takes to fix are rarer and one time costs vs. forcing every browser developer taking extra, ongoing costs of using the CommonJS module format in the browser.

Conclusion

Those are the CommonJS trade-offs and my trade-offs. Some of them are not "more right" but just preferences, just like any language design. However, the lack of browser support in the basic module spec is very concerning to me.

In my eyes, the trade-offs CommonJS has made puts more work on browser developers to navigate more specs and need more gear to get it to work. Adding more specs that allow modules to be expressed in more than one way is not a good solution for me.

I see it as the CommonJS module spec making a specific bet: treating the browser as a second class module citizen will pay off in the long run and allow it to get a foothold in other environments where Ruby or Python might live.

Historically, and more importantly for the future, treating the browser as second class is a bad bet to make.

All that said, I wish the CommonJS group success, and there are lots of smart people on the list. I will try to support what I can of their specs in RequireJS, but I do feel the trade-offs in the basic module spec are not so great for browser developers.

RequireJS 0.9.0 Released

I just pushed a new release of RequireJS, 0.9.0.

The optimization tool has seen the most change in this release. It sports some CSS optimizations now and it is much more robust. It also includes command line options for optimizing just one JS file or one CSS file.

The other new feature is the support for relative module names for require.def() dependencies. So this kind of call works now:

require.def("my/project/module", ["./dependency1"], function(){});

It will load my/project/dependency1.js. This should help cut down the amount of typing for larger projects that have deep directories of modules.

This release has some backwards-incompatible changes. That was the reason for the bump to 0.9.0. The project is still not at 1.0, so backwards-incompatible changes may still be considered. I do not have any more changes like that planned, but I will be sure to give more notice in the RequireJS list before doing so in the future.

All the details are on the download page.

Raindrop has been updated to the latest RequireJS release, and it is working great. Give the new RequireJS build a spin!

Sunday, March 14, 2010

RequireJS, kicking some AST

RequireJS has an optimization tool that can combine and minify your scripts. It uses Google's Closure Compiler to do the minification. Recently, but after the RequireJS 0.8.0 release, I ported over the CSS optimizations from the Dojo build system, so the optimization tool now inlines @import calls and remove comments from CSS files.

The script combining still has some rough edges though, and it was mainly due to me trying to use suboptimal regexp calls to find require() and require.def() calls in the files, so the dependencies for a script could be traced.

So I finally took the dive into Abstract Syntax Trees (ASTs) to do the work. What is an AST? An analogy that works for me: an AST is to JavaScript source as the DOM API is to HTML source. The AST has methods for walking through the nodes in the JS code structure, and you can get properties on a node.

Figuring out how to generate an AST from scratch can be a bit of work, but since I was already using Closure Compiler, I just used an AST it can generate.

Since the optimization tool for RequireJS is written in JavaScript, which makes calls into Java-land to do file access and minification calls, I wanted the same approach for working with the AST -- do my work in JavaScript, but call the Java methods for the AST walking and source transform.

My task was fairly simple -- I just wanted to find require() or require.def() calls that used strings for module names and dependencies, pull those calls out of the file, then just execute those calls to work out the dependencies.

The end result was this file:
http://github.com/jrburke/requirejs/blob/master/build/jslib/parse.js

The basic idea of the script:
//Set up shortcut to long Java package name,
//and create a Compiler instance.
var jscomp = Packages.com.google.javascript.jscomp,
compiler = new jscomp.Compiler(),

//The parse method returns an AST.
//astRoot is a kind of Node for the AST.
//Comments are not present as nodes in the AST.
astRoot = compiler.parse(jsSourceFile),
node = astRoot.getChildAtIndex(0);

//Use Node methods to get child nodes, and their types.
if (node.getChildAtIndex(1).getFirstChild().getType() === CALL) {
//Convert this call node and its children to JS source.
//This generated source does not have comments and
//may not be space-formatted exactly the same as the input
//source
var codeBuilder = new jscomp.Compiler.CodeBuilder();
compiler.toSource(codeBuilder, 1, node);

//Return the JavaScript source.
//Need to use String() to convert the Java String
//to a JavaScript String.
return String(codeBuilder.toString());
}
Thanks to the Closure Compiler team for doing the hard work and open sourcing the code. It looks like Closure Compiler deals with two AST formats -- one is perhaps an older one generated by Rhino, while the other one is a more custom one? It seems like I was getting back the Rhino-based Nodes for the methods I called.

I was tempted to try to go direct to just use Rhino for the AST, but decompiling the AST into source looked harder to do, and from what I recall, Rhino has a newer AST API in the trunk code. I believe the one in Closure Compiler is the older one? All that added up to me being wary of that path.

Most of the time spent was trying to figure out the Java invocations to get the code parsed, understand the tree structure, deal with Java-to-JavaScript translation issues and then figure out the Java invocations to convert a subtree back into source.

I am glad I finally stepped into working with a real AST. While some of the AST calls are a bit awkward (at least for me as a JavaScript person), it is a lot better than trying to use regexps for it. I still need to do more testing, but I feel more confident in the robustness of the solution now.

If you see how I can do it better, point me in the right direction!