Parallelizing Firefox
One talk at the workshop, by Carmen Badea from a group at UC Irvine, presented work on parallelizing part of the layout rendering engine in Firefox. Many of the performance headlines regarding browsers today concern their JavaScript engines, with each new release touting SunSpider scores or other metrics. Aside from the questionable use of such metrics [5] as proxies for scripting performance, there is also the issue of the layout engine taking up an increasingly large share of overall browser runtime as JavaScript tracing improvements accelerate script execution.
The layout engine is responsible for taking the code and data that comprise a webpage and turning it into something readable and, hopefully, aesthetically pleasing: positioning the text and graphics appropriately, with the right sizing and other effects. In the context of JavaScript, this means both executing the actual JavaScript code and managing its interactions with documents and events. Common examples of layout engines include Gecko (used in the Mozilla Suite, Firefox, and SeaMonkey), WebKit (used in Chrome and Safari), and Trident (used in Internet Explorer).
Today, some browser vendors are investigating running render-intensive parts of the browser on the GPU for acceleration, complementing the CPU. While beneficial, this is an example of coarse-grained parallelism, where entire modules run concurrently on different resources. In contrast, the work presented sought to improve the performance of the serial CSS rule matching algorithm in Firefox, which runs on the CPU. The CSS rule matching algorithm can account for a third of the layout runtime, and the layout runtime can consume up to 40% of the execution time of the browser. These metrics and speedup potentials were ascertained using Intel's VTune and Mozilla's VProf instrumentation packages over Mozilla's own page load benchmark, comprising hundreds of webpages, as well as the JavaScript-intensive Zimbra Collaboration Suite.

Figure 1 – Example of a simple CSS rule structure
The CSS descendant rule selector was chosen for speedup since it accounted for 99% of the rules encountered. Rule matching in Firefox works by bottom-up iteration over the elements of the dynamic document tree (anchored by an initial match) until a complete match is found. In almost all cases, the iterations over the tree end in a non-match. Figure 2 below shows an example of this bottom-up iteration.

Figure 2 – Example bottom-up CSS descendant rule selectors
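The bottom-up iteration can be sketched roughly as follows. This is an illustrative Python model, not Gecko's actual C++ implementation; the `Element` class and function name are invented for the example.

```python
class Element:
    """Minimal stand-in for a DOM node (illustrative only)."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent

def matches_descendant_rule(element, selectors):
    """Match a descendant rule such as ('div', 'p', 'span') bottom-up.

    Matching is anchored at the rightmost simple selector: if `element`
    matches it, walk up the ancestor chain looking for the remaining
    selector parts in right-to-left order.
    """
    *rest, last = selectors
    if element.tag != last:              # the anchoring initial match
        return False
    node = element.parent
    for sel in reversed(rest):
        # climb until this selector part matches or the tree runs out
        while node is not None and node.tag != sel:
            node = node.parent
        if node is None:
            return False                 # the common case: a non-match
        node = node.parent
    return True
```

A rule like `div p span` would be tested against a `<span>` element by this upward walk, which is why most evaluations end in a non-match only after traversing a potentially long ancestor path.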
No alternative serial matching algorithm was attempted prior to parallelizing the current, highly optimized serial implementation. As shown in Figure 3 below, the parallelized version [6] divides the ancestor paths through the document tree between n worker threads, along the path length. In the unlikely event that one of the iterating threads finds a match, it signals the other thread(s) iterating over its part of the path to stop unnecessary execution. Naturally, if a descendant selector rule is rather long, dividing the work over multiple threads will substantially reduce latency and improve performance.

Figure 3 – Original and parallelized CSS descendant rule selectors
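The division of an ancestor path among worker threads can be illustrated with a loose sketch, here reduced to scanning the path for a single selector part. The function name, the `Node` class, and the use of Python's `threading.Event` for the early-stop signal are all assumptions for illustration; the actual Firefox work operates in C++ on the full matching algorithm.

```python
import threading

class Node:
    """Minimal stand-in for a DOM element (illustrative only)."""
    def __init__(self, tag):
        self.tag = tag

def parallel_ancestor_search(path, selector, n_threads=2):
    """Split an ancestor path into segments, one per worker thread.

    Each thread scans its segment; whichever thread finds a match sets
    an event so the other threads can stop their unnecessary work.
    """
    found = threading.Event()
    result = [None]

    def scan(segment):
        for node in segment:
            if found.is_set():           # another thread already matched
                return
            if node.tag == selector:
                result[0] = node
                found.set()              # signal the other threads to stop
                return

    step = max(1, len(path) // n_threads)
    segments = [path[i:i + step] for i in range(0, len(path), step)]
    threads = [threading.Thread(target=scan, args=(s,)) for s in segments]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

The sketch also shows why extra threads are not free: each lookup pays thread-coordination overhead, which is consistent with the result below that two threads outperformed four.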
Performance results from a 4-core Nehalem system demonstrated that the best performance was achieved with two threads; using more threads decreased performance due to the additional overhead. Performance increased in the majority of cases with the page load benchmark and in half of the Zimbra benchmarks. Across all cases, average performance improved by 8% with the page load benchmarks and by 2.37% with the Zimbra suite tests [7]. Note that performance is measured with respect to overall browser runtime, not merely the layout component's execution time.