AN EFFICIENT METHODOLOGY FOR BUILDING CUSTOM PROCESSORS
BASED ON INSTRUCTION SET EXTENSIONS
by
NAGARAJU POTHINENI
Department of Computer Science & Engineering
Submitted
in fulfillment of the requirements of the degree of Doctor of Philiosophy
to the
Indian Institute of Technology Delhi
October, 2008
I.T. DELHI
DD ~tl ~'`~f D.. Ti
Acc. No.. r :. .3G ,
Y1-E.
Certificate
This is to certify that the thesis titled An Efficient Methodology for Building Custom Processors based on Instruction Set Extensions being submitted by Nagaraju Pothineni for the award of Doctor of Philosophy in Computer Sci- ence & Engg. is a record of bona fide work carried out by him under our guidance and supervision at the Dept. of Computer Science & Engg., Indian Institute of Technology Delhi. The work presented in this thesis has not been submitted else- where, either in part or full, for the award of any other degree or diploma.
Anshu Kumar Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Delhi
olin Paul Assistant Professor
Dept. of Computer Science & Engg.
Indian Institute of Technology Delhi
Acknowledgments
First and foremost, I would like to express my deep gratitude to Prof. Anshul Ku- mar for his invaluable guidance from the early stages of my PhD till to date. He has spent several intense hours on my thesis and provided constant and constructive feed- back. He has supported and nurtured innovative thinking and shown the path of doing research in a dedicated manner. I can not forget the numerous times he spent long night hours working on my reports and providing the comments with impeccable patience. It has been very memorable and enjoying to have worked with him, as I have learnt not just how to develop and explore new ideas, but how to be professional, ethical and kind to others. He is undoubtedly the one person who has improved my technical abilities and writing skills most.
I would also deeply thank my co-advisor Dr. Kolin Paul for his guidance throughout my PhD. He has always injected many innovative and non-conventional ideas in me and encouraged to think beyond one narrow field. He is very prompt in any help I had asked whether it is evaluating the reports or providing feedback on my ideas etc. During the interaction with him, he used to ask very thought provoking questions that helped me to shape my work.
I had carried my work on datapath merging at EPFL, Switzerland. I thank Prof.
Paolo Ienne for giving me an opportunity to work in his lab and for technical guidance.
He has kindly given me access to MIMOSYS tool. I also would be greatly indebted to Dr. Philip Brisk for his guidance and help while I was at EPFL.
I thank Prof. Balakrishnan, Prof. Preeti Ranjan Panda, and Prof. Ranjan Bose for serving on my SRC committee and for their feedback on every presentation I gave before them.They had given several useful and constructive feedback on my work.
I would not have have been able to work without the cooperation and help from the laboratory staff, Mr. Somdutt Sharma and Ms. Vandana. They are very prompt in providing the hardware equipment or software whenever I had asked for.
Working for intense four years would have very painful without the pleasing and entertaining company of my colleagues. I need to acknowledge the personal and moral support from Anant Vishnoi, Aryabartta Sahu, Neeraj Goel, Lava Bhargav, Krishnaiah
and Murali during my stay at IITD.
Last but not least, I thank my parents for their support and encouragement during the course of my PhD.
Nagaraju Pothineni
Abstract
In today's consumer electronics market, device vendors have been constantly facing the challenge of delivering more energy efficient, feature rich mobile platforms in a very short span of time. With the advantage of growing number of transistors packaged in a single chip, we are now able to build all the components of a system on a single chip. It is a challenging task of the designer to exploit the large number of transistors and deliver an innovative platform that meets the user's requirements and has less aggregated design cost.
In the recent past, researchers have proposed paradigm of extensible processors, wherein designers augment the instruction set of the processor with custom instructions that implement the critical portions of the application on dedicated hardware units called Custom Functional Units (CFUs) that are tightly integrated into the processor pipeline. CFUs can significantly enhance the performance of the application and also reduce the energy consumption, which is a key requirement in today's design.
We believe that for extensible processors to become successfully employed in main- stream design, there must be automation framework that can assist the designer starting from profiling till the hardware generation. Though there are quite a few commercial tools available in the industry for the design of extensible processor, their use is still limited and designers are required to manually make decisions about the choice of CFUs and their use in the application. This puts a heavy burden on the designer and conse- quently, prolongs the design cycle and has direct effect on the design cost. To overcome these gaps, in this thesis, we have developed an efficient methodology for building cus- tomized processors in a short amount of time.
The methodology is built around the principle of thorough and accurate analysis of the design space. To handle the potentially large design space, we present efficient algorithms for exhaustive enumeration of custom instruction candidates and efficient spatial reuse computation based on preprocessing step known as Recurring Pattern Computation (RPC).
Also, we show that when the processor supports serialization of operands, largest legal arithmetic clusters (MaxMIM0s) yield maximum performance. We present an efficient method of identifying and evaluating the MaxMIMO patterns.
Finally, we present a methodology called as Data Path Merging (DPM), for the area efficient synthesis of generated custom instructions. The DPM is based on the principle of sharing resources among the mutually exclusive operations. When the number of inputs/outputs of the custom instruction are larger than the available register ports, the proposed synthesis methodology automatically pipelines and serializes the operands before performing the resource sharing.
Contents
List of Figures v
List of Tables ix
1 Introduction 1
1.1 Issues in Automatic Customization . . . 2
1.2 Limitations of Previous Work . . . 3
1.3 Overview of our Approach . . . 5
1.4 Main Contributions . . . 7
1.5 Layout of the Thesis . . . 7
2 Custom Processor Architectures 9 2.1 Tensilica LX2 . . . 10
2.2 ARC ARCompact . . . 12
2.3 MIPS CorExtend . . . 13
2.4 NIOS II ... 14
2.5 Influence of Base Processor Architecture on ISE . . . . . . 16
2.5.1 Architectural Constraints . . . 16
2.6 Execution Model . . . 17
3 Custom Instruction Generation 19 3.1 Introduction . . . 19
3.2 Related Work . . . 21
3.3 Problem Formulation . . . 26
3.4 Generating Convex Subgraphs . . . 27
3.5 Useful k-node Subgraphs . . . .. . . 34 i
ii CONTENTS
3.5.1 Useful k-node subgraphs . . . 34
3.6 Our Algorithm . . . 35
3.6.1 Generating set of k + 1 node useful subgraphs, UF[k + 1] . . . 36
3.7 Experimental Results . . . 44
3.7.1 Experiment Setup . . . 44
3.7.2 Results . . . 44
3.8 Summary . . . 46
4 Reuse Driven Customization 49 4.1 Introduction . . . 49
4.2 Outline of Pattern Analysis & Selection . . . . . . . 50
4.3 Related Work . . . 52
4.4 Recurring Pattern Computation . . . 54
4.4.1 Problem Formulation . . . 54
4.4.2 A Simple Exhaustive Approach . . . . . . 56
4.4.3 Our Algorithm . . . . . . . 58
4.4.4 Recurrence Extraction . . . 65
4.5 Instruction Selection . . . 68
4.5.1 Pattern Characterization . . . 68
4.5.2 Pattern Selection . . . 70
4.6 Experiments & Results . . . 72
4.7 Summary . . . 74
5 Flexible I/O FUs 81 5.1 Introduction . . . 81
5.2 Related Work . . . .. 83
5.3 Pattern Size and Performance Gain . . . 84
5.4 Our Methodology . . . .. . . . 86
5.5 Convex MaxMIMOs Generation . . . 88
5.5.1 Convex MaxMIMO Identification . . . . . . . . . . . . . . 88
5.5.2 Conditions for Convexity & Maximality . . . . . . 89
5.5.3 Algorithm based on Groupwise Incompatibility . . . . . . 91
5.6 Timeshaping of Patterns . . . . . . . 92
CONTENTS iii
5.7 Iterative Selection Algorithm . . . 98
5.8 Experiments . . . 99
5.9 Summary . . . 103
6 Custom Instruction Synthesis 105 6.1 Introduction . . . 105
6.2 Combinational Datapath Merging . . . 106
6.2.1 Problem Statement . . . 106
6.2.2 Related Work . . . 108
6.2.3 Problem Formulation . . . 109
6.2.4 Our Solution . . . 111
6.2.5 Evaluating the Supergraph . . . 112
6.2.6 Example . . . 113
6.2.7 ILP Formultaion . . . 113
6.2.8 Heuristic Solution . . . 116
6.2.9 Experiments & Results . . . 120
6.3 Multicycle Datapath Merging . . . 124
6.3.1 Timeshaping . . . 127
6.3.2 Timeshape Refinement . . . 130
6.3.3 Operation Merging . . . 133
6.3.4 Register Allocation . . . 137
6.3.5 Experiments & Results . . . 141
6.4 Summary . . . 145
7 Conclusion 147 7.1 Future Directions . . . 148
Bibliography 151