Exploring Cost-Performance Optimal Designs of Raw Microprocessors
Abstract
The semiconductor industry roadmap projects that advances in VLSI technology will permit more than one billion transistors on a chip by the year 2010. The MIT Raw microprocessor is a proposed architecture that strives to exploit these chip-level resources by implementing thousands of tiles, each comprising a processing element and a small amount of memory, coupled by a static two-dimensional interconnect. A compiler partitions fine-grain instruction-level parallelism across the tiles and statically schedules inter-tile communication over the interconnect. Because Raw microprocessors fully expose their internal hardware structure to the software, they can be viewed as a gigantic FPGA with coarse-grained tiles, in which software orchestrates communication over static interconnections.

One open challenge in Raw architectures is to determine their optimal grain size and balance. The grain size is the area of each tile, and the balance is the proportion of area in each tile devoted to memory, processing, communication, and I/O. If the total chip area is fixed, devoting more area to processing yields higher processing power per node but fewer tiles. This paper presents an analytical framework with which designers can reason about the design space of Raw microprocessors. Based on an architectural model and a VLSI cost analysis, the framework computes the performance of applications and uses an optimization process to identify designs that will execute these applications most cost-effectively.

Although the optimal machine configurations obtained vary for different applications, problem sizes, and budgets, the general trends for various applications are similar.
Accordingly, for the applications studied, assuming a 1 billion logic transistor equivalent area, we recommend building a Raw chip with approximately 1000 tiles, 30 words/cycle global I/O, 20 Kbytes of local memory per node, 3-4 words/cycle local communication bandwidth, and single-issue processors. This configuration will give performance near the global optimum for most applications.
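
The grain-size trade-off described in the abstract can be illustrated with a toy area-budget calculation. This is only a sketch under simplifying assumptions and is not the paper's architectural model or VLSI cost analysis: it treats tile area as the plain sum of processing, memory, and communication area, ignores global I/O area, and uses hypothetical function names (`tile_count`, `sweep_processing`) and placeholder per-tile areas. Only the 1 billion logic-transistor-equivalent budget comes from the abstract.

```python
CHIP_BUDGET = 1_000_000_000  # logic-transistor-equivalent area, from the abstract


def tile_count(chip_budget, processing, memory, communication):
    """Return how many whole tiles fit in a fixed area budget.

    All arguments share the same (assumed) area unit. Tile area is
    modeled as a simple sum, which is an illustrative assumption.
    """
    tile_area = processing + memory + communication
    if tile_area <= 0:
        raise ValueError("tile area must be positive")
    return chip_budget // tile_area


def sweep_processing(chip_budget, memory, communication, processing_options):
    """Tile counts for several per-tile processing areas, other areas held fixed."""
    return [
        (p, tile_count(chip_budget, p, memory, communication))
        for p in processing_options
    ]


if __name__ == "__main__":
    # Per-tile areas below are illustrative placeholders, not values from the paper.
    for processing, tiles in sweep_processing(
        CHIP_BUDGET,
        memory=400_000,
        communication=100_000,
        processing_options=[250_000, 500_000, 1_000_000, 2_000_000],
    ):
        print(f"processing area {processing:>9,} -> {tiles:>5,} tiles")
```

Under this simplified model, growing the processing area per tile monotonically reduces the number of tiles, which is the tension the optimization framework has to resolve. A budget of 1 billion units split across roughly 1000 tiles corresponds to about 1 million units per tile.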