Figure 1 – Example of a simple CSS rule structure
The CSS descendent rule selector was chosen for speedup since it accounted for 99% of the encountered rules. This rule matching in Firefox works by bottom-up iteration over the elements of the dynamic document tree (anchored by an initial match) until a complete match is found. In almost all cases the iterations over the tree result in a non-match. Figure 2 below shows an example of this bottom-up iteration.
Figure 2 – Example bottom-up CSS descendent rule selectors
A different serial algorithm to implement matching was not attempted prior to parallelizing the current highly optimized serial implementation. As shown in Figure 3 below, the parallelized version  divides the ancestor paths through the document tree between n worker threads, along the path length. In the unlikely event one of the iterating threads finds a match it signals the other thread(s) iterating over its part of the path to stop unnecessary execution. Naturally if a certain descendent selector rule is rather long, dividing the work over multiple threads will substantially reduce the latency and improve performance.
Figure 3 – Original and parallelized CSS descendent rule selectors
Performance results from a 4-core Nehalem system demonstrated that the best performance was achieved with two threads. More threading decreased performance as a result of unnecessary overhead. Performance increased in the majority of cases with the page load benchmark and in half of the Zimbra benchmarks. Over all cases, average performance improved by 8% with the page load benchmarks, and 2.37% with the Zimbra suite tests . Performance is measured with respect to overall browser runtime, not merely the layout component’s execution time.