Branch Prediction Research in the LAVA Lab (including hybrid predictors, speculative update, return address stack, and alloying)

Photo courtesy of The Hawaii Center for Volcanology

Research on Branch Prediction

Improving branch prediction, although already a well-studied subject, remains critical because delivering very high branch-prediction rates is crucial to further performance gains in high-performance computing. For example, a single misprediction in the Alpha 21264 results in a minimum of 7 wasted cycles; in the Pentium 4, in a minimum of 17 wasted cycles. Because most processors today are out-of-order processors, branches can take an arbitrarily long time to resolve, and so the average misprediction penalty is often much larger, 10-20 cycles. It takes a "mere" 5% misprediction rate to cut performance by as much as 20-30% in today's wide-issue processors.

Our prior work has explored how to improve branch-prediction accuracy by updating the predictor speculatively. Mis-speculated updates require repair, and we have shown that some simple mechanisms can make this repair possible at low cost. A particularly elegant example is speculative update and repair of the return-address stack. It is sufficient to simply save and restore the top-of-stack pointer and top-of-stack contents. We have also shown that it makes sense to combine local and global history in the same Schematic of an alloyed (MAs) predictor branch predictor structure–a pseudo-hybrid branch predictor that we call an alloyed predictor. Using both global and local history is important. Not only do some branches in a program benefit from global history while others benefit from local history–a well known fact that suggests use of a hybrid predictor–but in fact individual branches vary between benefiting from global history and benefiting from local history. A large, dynamic-selection (McFarling-style) hybrid predictor can also give good performance. But a hybrid predictor is only effective with a large hardware budget. An alloyed predictor is equally effective for large budgets, but is more robust, giving good performance even down to 1 Kbits. An alloyed predictor consistently outperforms conventional 2-level organizations, and also outperforms hybrid predictors for all but the largest hardware budgets. The key is that the alloyed organization is combining global and local history. In a
related vein, we have explored the impact of context switching on branch-prediction accuracy and found the impact to be negligible except for very small context-switch intervals.

In the power domain, as mentioned above, we have also analyzed the power characteristics of branch predictors. We find that, for overall dynamic power savings, it is important to obtain the highest possible branch-prediction accuracy, even if this entails extra power dissipation in the fetch engine. But some techniques, like the use of a prediction probe detector (PPD), can save power without sacrificing prediction accuracy. For static or leakage power savings, one can apply decay techniques or use a novel four-transistor RAM cell design. The dynamic power work is joint with Dharmesh Parikh, Yan Zhang, Marco Barcella, and Mircea Stan; the leakage power work is joint with Zhigang Hu, Philo Juang, Margaret Martonosi, and Doug Clark from Princeton, and Stefanos Kaxiras from Agere.

New work is characterizing and proving the limits of predicatability–work with Karthik Sankaranarayanan. We are also looking for new benchmarks that exhibit interesting branch-prediction behavior, especially programs with poor predictability and/or a large branch footprint. Take the benchmark challenge!

Future work for which we need new group members includes (1) other ways to improve branch prediction by improving compiler-hardware synergy: we want to understand how aggressive predication and hardware branch-prediction interact, and how compile-time data-, control-, and dependence analysis can convey useful information to the runtime hardware without the need for tedious profiling and feedback; and (2) and a variety of projects to better characterize branch predictor behavior and the sources of branch mispredictions.

Selected Publications

Z. Lu, J. Lach, M. Stan, and K. Skadron. “Alloyed Branch History: Combining Global and Local Branch History for Robust Performance,” International Journal of Parallel Programming, Kluwer, volume 31, number 2, Apr. 2003. (abstract)
Z. Lu, J. Lach, M.R. Stan, and K. Skadron. “Alloyed Branch History: Combining Global and Local Branch History for Robust Performance,” Tech Report CS-2002-21, Univ. of Virginia Dept. of Computer Science, July 2002. (postscript | abstract)
D. Parikh, K. Skadron, Y. Zhang, M. Barcella, and M. Stan. "Power Issues Related to Branch Prediction." In Proceedings of the Eighth International Symposium on High-Performance Computer Architecture, Feb. 2002, to appear. (postscript | pdf | abstract)
M. Co and K. Skadron. "The Effects of Context Switching on Branch Predictor Performance." In Proceedings of the 2001 IEEE International Symposium on Performance Analysis of Systems and Software, pp. 77-84, Nov. 2001. (postscript | pdf | abstract)
K. Sankaranarayanan and K. Skadron. "A Scheme for Selective Squash and Re-issue for Single-Sided Branch Hammocks." Tech Report CS-2001-14, Univ. of Virginia Dept. of Computer Science, July, 2001. (pdf | abstract)
M. Co. "The Effects of Context Switching on Branch Predictor Performance." MCS project, Univ. of Virginia School of Engineering and Applied Science, June 2001. (postscript)
K. Skadron and P.S. Ahuja. "HydraScalar: A Multipath-Capable Simulator." In the Newsletter of the IEEE Technical Committee on Computer Architecture, pp. 65-70, Jan. 2001. (postscript | pdf | abstract)
K. Skadron, M. Martonosi, and D.W. Clark. "A Taxonomy of Branch Mispredictions, and Alloyed Prediction as a Robust Solution to Wrong-History Mispredictions." In Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques, pp. 199-206, Oct. 2000. (postscript | pdf | abstract)
K. Skadron, M. Martonosi, and D.W. Clark. "Speculative Updates of Local and Global Branch History: A Quantitative Analysis." Journal of Instruction-Level Parallelism, vol. 2, Jan. 2000 (http://www.jilp.org/vol2). (postscript | abstract)
K. Skadron, P.S. Ahuja, M. Martonosi, and D.W. Clark. "Branch Prediction, Instruction-Window Size, and Cache Size: Performance Tradeoffs and Simulation Techniques." IEEE Transactions on Computers, 48(11):1260-81, Nov. 1999. (postscript | pdf | abstract)
K. Skadron, M. Martonosi, and D.W. Clark. "Alloying Global and Local Branch History: A Robust Solution to Wrong-History Mispredictions." Tech Report TR-606-99, Princeton Dept. of Computer Science, Oct. 1999.

postscript

abstract

K. Skadron, P.S. Ahuja, M. Martonosi, and D.W. Clark. "Improving Prediction for Procedure Returns with Return-Address-Stack Repair Mechanisms." In Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, pp. 259-71, December 1998.

postscript

pdf

abstract

Back to Skadron's home page
Back to the LAVA home page
Last updated: 7 Mar., 2003

Photo courtesy of the USGS/Cascades Volcano Observatory

Department of Computer Science
School of Engineering and Applied Science
University of Virginia, Charlottesville, Virginia 22904
(434) 982-2200, FAX: (434) 982-2214