What does this all mean?
Our implementations of the barrel shifter appear to incur a logic depth of approximately 16 to 20 FO4 delays through the logic circuits, in addition to the interconnect delays. More advanced circuits, or a re-ordering of the array structure to reduce the impact of signal propagation delay, could further reduce the FO4 depth of this circuit. However, it seems clear that a 32-bit barrel shifter will not fit within a single-cycle timing budget, given the earlier estimate of 12 to 16 FO4 delays per pipeline stage for the Willamette processor.
The logical next question is whether this array structure can be pipelined so that a shift completes in two cycles while still sustaining a throughput of one shift per cycle. However, the upper end of the circuit delay, about 20 FO4, is close to twice the lower bound given for the Willamette processor (2 × 12 = 24 FO4), and the interconnect delay in this interconnect-dominated array adds further uncertainty. The question is therefore essentially unanswerable without further analysis. In other words, the long answer here is essentially the same as the short answer given above, but with a slightly larger basis to substantiate it.
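To make the arithmetic behind this comparison concrete, here is a minimal Python sketch. It uses only the figures quoted above (16 to 20 FO4 of shifter logic, 12 to 16 FO4 per pipeline stage). The function names are hypothetical, and the model deliberately ignores interconnect delay and any per-stage latch overhead, which is exactly the part that remains uncertain.

```python
# Figures quoted in the text, in FO4 delays. Interconnect delay is NOT included.
SHIFTER_LOGIC_FO4 = (16, 20)  # estimated logic depth of the 32-bit barrel shifter
STAGE_BUDGET_FO4 = (12, 16)   # estimated Willamette pipeline stage budget


def slack(logic_fo4, stage_budget_fo4, cycles):
    """FO4 delays left over after fitting the logic into `cycles` stages.

    Assumption: the budget simply scales with the number of cycles, with
    no extra latch or clock-skew overhead per added stage.
    """
    return cycles * stage_budget_fo4 - logic_fo4


def slack_range(cycles):
    """Return (worst case, best case) slack for the quoted estimate ranges."""
    worst = slack(max(SHIFTER_LOGIC_FO4), min(STAGE_BUDGET_FO4), cycles)
    best = slack(min(SHIFTER_LOGIC_FO4), max(STAGE_BUDGET_FO4), cycles)
    return worst, best


for cycles in (1, 2):
    worst, best = slack_range(cycles)
    print(f"{cycles} cycle(s): slack between {worst} and {best} FO4")
```

Under these assumptions, a single cycle leaves between -8 and 0 FO4 of slack before any interconnect delay is counted, while two cycles leave between 4 and 16 FO4. Whether that margin covers the interconnect and pipelining overhead is what cannot be settled without further analysis.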
In this article, we’ve provided a cursory review of the FO4 metric as it applies to the delay depth of a logic block, and we used the barrel shifter as an example to illustrate its use. The barrel shifter also illustrated that the FO4 depth of a logic block can be reduced through more advanced circuits or a better architectural arrangement of logic and arrays that minimizes signal delay paths.
What we have not mentioned is that even more engineering resources may be applied to specific logic paths, especially if a path happens to be the critical path that limits the operating frequency of a processor. In a design that uses standard cell libraries, the 2-input mux may itself be a standard cell, or smaller bit vector barrel shifters may be created as standard cells from which larger bit vector shifters are constructed. Processors or ASICs that use a standard cell design methodology require less design effort than a full custom design. In a full custom design effort, logic circuits may be specifically tailored to a given process technology at the cost of engineering resources. This point is well covered in the article by Chinnery and Keutzer.
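As an illustration of the standard cell composition mentioned above, the following Python sketch builds a 32-bit rotator from rows of 2-input muxes. It is a behavioral model only, and it assumes a logarithmic arrangement (five rows, rotating by 1, 2, 4, 8 and 16 bits), which is one common way to organize such a shifter and not necessarily the array structure analyzed in this article. The names `mux2`, `rotate_stage` and `barrel_rotate_left` are hypothetical.

```python
WIDTH = 32  # bit-vector width discussed in the text


def mux2(select, a, b):
    """Hypothetical 2-input mux cell: returns b when select is 1, else a."""
    return b if select else a


def rotate_stage(bits, distance, select):
    """One row of mux2 cells: pass through, or rotate left by `distance`.

    `bits[i]` holds bit position i, with the least significant bit at index 0.
    """
    n = len(bits)
    return [mux2(select, bits[i], bits[(i - distance) % n]) for i in range(n)]


def barrel_rotate_left(value, amount):
    """Rotate a WIDTH-bit value left using log2(WIDTH) rows of 2-input muxes."""
    bits = [(value >> i) & 1 for i in range(WIDTH)]
    for stage in range(WIDTH.bit_length() - 1):  # 5 stages for 32 bits
        # Each bit of the shift amount drives the select line of one row.
        bits = rotate_stage(bits, 1 << stage, (amount >> stage) & 1)
    return sum(bit << i for i, bit in enumerate(bits))


print(hex(barrel_rotate_left(0x12345678, 8)))
```

In this model each row corresponds to one mux delay on the logic path, so the row count gives a rough sense of why logic depth grows with the width of the bit vector. It says nothing about interconnect delay, which a behavioral sketch cannot capture.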
Physical implementation of a processor using a full custom design methodology requires detailed simulation and careful consideration of each functional unit’s area, power, and latency. The optimal design point of each logic block depends on the assumptions made in the overall microarchitecture of the processor. However, such detailed simulation is not needed when a processor’s architects wish to perform preliminary research into a novel architecture. The FO4 metric provides an abstraction from physical circuit implementations and allows architects to project, design, and simulate a processor without knowing the details of the implementation. As a result, the FO4 metric is a useful and popular metric among architects and engineers as a basis for exploring architectural concepts at an abstract level.
Dhanesha, H., Falakshahi, K., Horowitz, M., “Array-of-arrays Architecture for Parallel Floating Point Multiplication”, Center for Integrated Systems, Stanford University. http://mos.stanford.edu/papers/hk_arvlsi_95.pdf
Chinnery, D., Keutzer, K., “Closing the Gap between ASIC and Custom: An ASIC Perspective”, Proceedings, DAC 2000. http://www.sigda.org/Archives/ProceedingArchives/Dac/Dac2000/papers/2000/dac00/pdffiles/39_1.pdf
Sprangle, E., Carmean, D., “Increasing Processor Performance by Implementing Deeper Pipelines”, ISCA 29 Proceedings. http://www.cs.cmu.edu/afs/cs/academic/class/15740-f03/public/doc/discussions/uniprocessors/technology/deep-pipelines-isca02.pdf
Hrishikesh, M., Jouppi, N., Farkas, K., Burger, D., Keckler, S., Shivakumar, P., “The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays”, ISCA 29 Proceedings. http://www.eecs.harvard.edu/~dbrooks/cs246/deep-pipes.pdf
MPC7450 RISC Microprocessor Family Technical Summary. Motorola Inc.
Weste, N., Eshraghian, K., “Principles of CMOS VLSI Design”, Addison-Wesley 1999.
Agarwal, V., Hrishikesh, M., Keckler, S., Burger, D., “Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures”, Proceedings of ISCA 27. http://www.cs.utexas.edu/users/cart/publications/isca00.pdf
“The IA-32 Intel® Architecture Software Developer’s Manual, Volume 2: Instruction Set Reference”, Intel. http://developer.intel.com/design/pentium4/manuals/24547107.pdf