SIGDA, Super Compendium, ISLPED 2002, Abstracts

ISLPED 2002 ABSTRACTS


Keynote Speech

Session Chair: Mary Jane Irwin (Penn State University)

Low-Voltage Memories for Power Aware Systems [p. 1]

Kiyoo Itoh (Hitachi Ltd.)

This paper describes low-voltage RAM designs for stand-alone and embedded memories in terms of signal-to-noise-ratio designs of RAM cells and subthreshold-current reduction. First, structures and areas of current DRAM and SRAM cells are discussed. Next, low-voltage peripheral circuits that have been proposed so far are reviewed with focus on subthreshold-current reduction, speed variation, on-chip voltage conversion, and testing. Finally, based on the above discussion, a perspective is given with emphasis on needs for high-speed simple non-volatile RAMs, new devices/circuits for reducing active-mode leakage currents, and memory-rich SOC architectures.
Keywords
subthreshold current, DRAM and SRAM cells, gain cells, peripheral circuits, gate-source/substrate-source back-biasing, multi-VT, on-chip voltage converters, testing, non-volatile RAMs, memory-rich architectures.

Session 1: Low Power Modeling and Systems

Session Chair: Nestoras Tzartzanis (CSEM, EPFL, TPC)
Session Organizer: Bill Athas (Apple)

1.1 Standby Power Management for a 0.18 um Microprocessor [p. 7]

L.T. Clark, S. Demmons, N. Deutscher, F. Ricci (Intel Corporation)

Static power dissipation is a concern for battery powered handheld devices since it can substantially impact the battery life. Here, the use of reverse body bias to limit I_off on the high performance, low power XScale^TM microprocessor core is described. The scheme utilized is amenable to implementation on a low-cost (non-triple well) process and has limited regulation requirements. The regulation requirements and circuits are described, as is the performance of the method. A measured current reduction factor of over 25 is achieved with this method of reverse body bias. Implications of the use of body bias leakage control for active power and performance, as well as system level implications are also discussed.
General Terms
Measurement, Performance, Design
Keywords
Low power, microprocessors, body effect

1.2 Physical Insight into Fractional Power Dependence of Saturation Current on Gate Voltage in Advanced Short Channel MOSFETs [p. 13]

H. Im (University of Tokyo, Dongguk University), M. Song (Dongguk University), T. Hiramoto (University of Tokyo, VLSI Design and Education Center), T. Sakurai (University of Tokyo)

The physical origin of the fractional power dependence of MOSFET drain current on gate voltage, namely the alpha-power law model, which has been considered a fully empirical model, is analytically investigated. For this purpose, we have developed a new physics-based analytical drain current model. Using this model, we prove that the saturation current can be simplified to the form B·(V_g - V_TH)^alpha, the alpha-power law model. The physical interpretations of alpha, B, and V_TH are elucidated, and their analytical expressions are given in terms of the MOSFET's parameters. Since the alpha-power model is compact and physics-based, it allows circuit designers to easily estimate the power dissipation and the gate delay time in a predictable manner.
Categories & Subject Descriptors:
I.6.5 Model Development.
General Term:
Theory, Verification.
Keywords:
MOSFET modeling, Saturation current, alpha-power model.
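
As a rough illustration of the alpha-power form named in the abstract above, the sketch below evaluates I_Dsat = B·(V_g - V_TH)^alpha. The function name and the numeric values of B, V_TH, and alpha are hypothetical placeholders chosen for illustration; they are not taken from the paper, and the paper's analytical expressions for these parameters are not reproduced here.

```python
def alpha_power_idsat(v_g, v_th, b, alpha):
    """Saturation current from the alpha-power law: B * (V_g - V_TH)**alpha.

    Returns 0.0 at or below threshold; subthreshold conduction is ignored,
    which is a simplifying assumption of this sketch.
    """
    overdrive = v_g - v_th
    if overdrive <= 0.0:
        return 0.0
    return b * overdrive ** alpha


# Hypothetical parameters, for illustration only.
i_sat = alpha_power_idsat(v_g=1.8, v_th=0.4, b=1e-3, alpha=1.3)
```

With alpha = 2 the expression reduces to the classical square-law dependence; values of alpha between 1 and 2 give the fractional power dependence the abstract refers to.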

1.3 Full-chip Sub-threshold Leakage Power Prediction Model for sub-0.18um CMOS [p. 19]

S. Narendra (Massachusetts Institute of Technology, Intel Laboratories), V. De, S. Borkar (Intel Laboratories), D. Antoniadis, A. Chandrakasan (Massachusetts Institute of Technology)

The driving force for the semiconductor industry growth has been the elegant scaling nature of CMOS technology. In future CMOS technology generations, supply and threshold voltages will have to continually scale to sustain performance increase, control switching power dissipation, and maintain reliability. These continual scaling requirements on supply and threshold voltages pose several technology and circuit design challenges. With threshold voltage scaling, sub-threshold leakage power is expected to become a significant portion of the total power in future CMOS systems. Therefore, it becomes crucial to predict the sub-threshold leakage power of such systems. In this paper, we present a sub-threshold leakage power prediction model that takes into account within-die threshold voltage variation. Statistical measurements of 32-bit microprocessors in 0.18 µm CMOS confirm that the mean error of the model is 4%. Comparisons of this model to two other existing models that do not take within-die threshold voltage variation into account are also presented.
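
To see why within-die threshold voltage variation matters for a leakage estimate, the following worked equation may help. It is a generic textbook-style sketch, not the model from the paper: it assumes that sub-threshold leakage depends exponentially on threshold voltage and that the threshold voltage is normally distributed across the die. The symbols I_0, n, \phi_t (thermal voltage), \mu, and \sigma are assumptions introduced here.

```latex
% Assumed device model: I(V_T) = I_0 \exp\!\left(-\frac{V_T}{n\,\phi_t}\right),
% with V_T \sim \mathcal{N}(\mu, \sigma^2) across the die.
\begin{aligned}
\mathbb{E}\,[I]
  &= I_0\,\mathbb{E}\!\left[\exp\!\left(-\frac{V_T}{n\,\phi_t}\right)\right] \\
  &= I_0 \exp\!\left(-\frac{\mu}{n\,\phi_t}\right)
     \exp\!\left(\frac{\sigma^2}{2\,n^2\phi_t^2}\right)
   \;=\; I(\mu)\,\exp\!\left(\frac{\sigma^2}{2\,n^2\phi_t^2}\right).
\end{aligned}
```

Under these assumptions the mean leakage exceeds the leakage at the nominal threshold voltage by the factor \exp(\sigma^2 / (2 n^2 \phi_t^2)), so an estimate that ignores the variation underestimates the total.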

1.4 Power-Conscious Interconnect Buffer Optimization with Improved Modeling of Driver MOSFET and its Implications to Bulk and SOI CMOS Technology [p. 24]

K. Nose, T. Sakurai (University of Tokyo)

Closed-form formulas for optimum buffer insertion where the junction capacitance is taken into account are proposed. In order to use the derived formulas, an appropriate choice of the effective linear resistance of the driving transistor is also clarified. Using the proposed formulas, the optimum interconnect delay and power comparison between bulk and SOI CMOS technology are discussed. The calculation results show that both the optimum delay and power with SOI can be reduced by 15% compared with the bulk MOSFET whose junction capacitance is assumed to be equal to the gate capacitance.
Categories & Subject / General Terms
B.7.1 Integrated circuits / Performance, design

Session 2: Energy Efficient Communications

Session Chair: Teresa Meng (Stanford University)
Session Organizer: Vijaykrishnan Narayanan (Penn State University)

2.1 E²WFQ: An Energy-Efficient Fair Scheduling Policy for Wireless Systems [p. 30]

V. Raghunathan, S. Ganeriwal, C. Schurgers, M. Srivastava (University of California, Los Angeles)

As embedded systems are being networked, often wirelessly, an increasingly larger share of their total energy budget is due to the communication. This necessitates the development of power management techniques that address communication subsystems, such as radios, as opposed to computation subsystems, such as embedded processors, to which most of the research effort thus far has been devoted. In this paper, we present E²WFQ, an energy efficient version of the Weighted Fair Queuing (WFQ) algorithm for packet scheduling in communication systems. We employ a recently proposed radio power management technique, Dynamic Modulation Scaling (DMS), as a control knob to enable energy-latency tradeoffs during wireless packet scheduling. The use of E²WFQ results in an energy aware packet scheduler, which exploits the statistics of the input arrival pattern as well as the variability in packet lengths. Simulation results show that large savings in energy consumption can be obtained through the use of our scheduling scheme, compared to conventional WFQ, with only a small, bounded increase in worst case packet latency.
Categories and Subject Descriptors
C.2.1 [Computer-Communication Networks]: Network Architecture and Design - Network communications, Wireless communication; C.2.6 [Computer-Communication Networks]: Internetworking - Routers
General Terms
Algorithms, Design
Keywords
Energy Efficient Design, Power Management, Wireless Communications, Fair Scheduling

2.2 A Framework for Energy-Scalable Communication in High-Density Wireless Networks [p. 36]

R. Min, A. Chandrakasan (Massachusetts Institute of Technology)

Power-aware communication is essential for maximizing the lifetime of energy-constrained wireless devices. Applications running on such devices can cooperatively reduce communication energy by trading communication latency, reliability, or range for energy savings. We introduce a framework that exposes these high level trade-offs to a power-aware communication subsystem featuring variable-strength convolutional coding, an adjustable power amplifier, and a voltage-scaled processor. An application programming interface (API) exposes an application's minimum quality constraints on the communication. These constraints are translated into energy-efficient parameter settings for the communication hardware. We apply our framework to improved communication energy models and measurements from a wireless microsensor node to effect over an order of magnitude of energy scalability.
Categories and Subject Descriptors
C.2.1 [Network Architecture and Design] - wireless communication, network communications. C.2.3 [Network Operations]. C.4 [PERFORMANCE OF SYSTEMS]. D.2.2 [Design Tools and Techniques]
General Terms
Performance, Design, Reliability
Keywords
power awareness, energy scalability, wireless sensor networks, distributed microsensors, uAMPS, API design, dynamic voltage scaling, forward error correction, transmit power, macromodels, energy models

2.3 Contents Provider-Assisted Dynamic Voltage Scaling for Low Energy Multimedia Applications [p. 42]

E.-Y. Chung (CSL Stanford University), L. Benini (University of Bologna), G. De Micheli (CSL Stanford University)

This paper presents a new concept of DVS (Dynamic Voltage Scaling) for multimedia applications. Many multimedia applications have a periodic property, but each period shows a large variation in terms of its execution time. Exact estimation of such variation is a crucial factor for low energy software execution with the DVS technique. Previous DVS techniques focused only on end users (client sites), and their quality depends heavily on the accuracy of the worst case execution time estimation. This paper proposes that contents providers (server sites) supply information on the execution time variations in addition to the content itself. This makes it possible to perform DVS independent of worst case execution time estimation. The extra work required of the contents provider for this purpose is fully compensated by the benefits for the end users, because a single content is often provided to many users. Experimental results show that our method greatly reduces the energy consumption of client systems compared to previous DVS techniques.
Categories and Subject Descriptors
J.6 [Computer Applications]: Computer-Aided Engineering
General Terms
Algorithms, Management
Keywords
DVS(Dynamic Voltage Scaling), contents provider, low-power, worst case execution time, characterization, multimedia

Poster Session 1: Technology and Circuits

Session Chair: R.V. Joshi (IBM), Lars Svensson (Chalmers University)

P1.1 Low-Leakage Asymmetric-Cell SRAM [p. 48]

N. Azizi, A. Moshovos, F.N. Najm (University of Toronto)

We introduce a novel family of asymmetric dual-Vt SRAM cell designs that reduce leakage power in caches while maintaining low access latency. Our designs exploit the strong bias towards zero at the bit level exhibited by the memory value stream of ordinary programs. Compared to conventional symmetric high-performance cells, our cells offer significant leakage reduction in the zero state and in some cases also in the one state, albeit to a lesser extent. A novel sense-amplifier, in coordination with dummy bitlines, allows for read times to be on par with conventional symmetric cells. With one cell design, leakage is reduced by 7X (in the zero state) with no performance degradation. An alternative cell design reduces leakage by 40X (in the zero state) with a performance degradation of 5%.
Categories and Subject Descriptors
B.3.1 [Memory Structures]: Semiconductor memories
General Terms
Design
Keywords
SRAM, Low-leakage, Low-power, Dual-Vt

P1.2 Managing Leakage for Transient Data: Decay and Quasi-Static 4T Memory Cells [p. 52]

Z. Hu, P. Juang (Princeton University), P. Diodato, S. Kaxiras (Agere Systems), K. Skadron (University of Virginia), M. Martonosi, D. W. Clark (Princeton University)

Much of on-chip storage is devoted to transient, often short-lived, data. Despite this, virtually all on-chip array structures use six transistor (6T) static RAM cells that store data indefinitely. In this paper we propose the use of quasi-static four-transistor (4T) RAM cells. Quasi-static 4T cells provide both energy and area savings. These cells have no connection to Vdd and thus inherently provide decay functionality: values are refreshed upon access but discharge over time without use. This makes 4T cells uniquely well-suited for predictive structures like branch predictors and BTBs where data integrity is not essential. We use quantitative evaluations (both circuit-level and cycle-level) to explore the design space and quantify the opportunities. Overall, 4T-based branch predictors offer 12-33% area savings and 60-80% leakage savings with minimal performance impact. More broadly, this paper suggests a new view of how to support transient data in power-aware processors.
Categories and Subject Descriptors
B.7.1 [Hardware]: Integrated Circuits - Types and Design Styles
General Terms
Design, Measurement
Keywords
Leakage power, transient data, decay, quasi-static, 4T, memory cell

P1.3 Conditional Pre-Charge Techniques for Power-Efficient Dual-Edge Clocking [p. 56]

N. Nedovic, M. Aleksic, V.G. Oklobdzija (University of California, Davis)

A new dual edge-triggered flip-flop that saves power by inhibiting transitions of the nodes that are not used to change the state is presented. The proposed flip-flop is 12% faster with 10% lower Energy-Delay Product for 50% data activity, as compared to the previously published dual edge-triggered storage elements. This was confirmed by simulation using 0.18um process, 1.8V power supply, and clock frequency of 250MHz. This flip-flop is particularly suitable for low-power applications.
Categories and Subject Descriptors
B.6.1 [Logic Design]: Design Styles - sequential circuits.
General Terms
Performance, Design.
Keywords
Dual edge-triggered flip-flop, clocked storage elements, clocking, clock distribution, power consumption.

P1.4 Circuit-Level Techniques to Control Gate Leakage for sub-100nm CMOS [p. 60]

F. Hamzaoglu, M.R. Stan (University of Virginia)

Although still negligible for state-of-the-art CMOS, gate leakage will become significant in the future for sub-100nm technologies, due to the scaling of oxide thickness. We propose several circuit techniques to control gate leakage based on the fact that PMOS transistors with SiO2 gate oxide have an order of magnitude smaller gate leakage than NMOS transistors in the same technology. First, we compare n-type domino with p-type domino circuits in terms of performance, leakage and switching power, and explore the different tradeoffs between performance and power. Second, we compare n-type with p-type gating for MTCMOS to control the leakage during sleep. The proposed circuits are simulated for a predictive 70nm CMOS technology with 10 Å gate oxide thickness and 1.2V supply voltage.
Categories and Subject Descriptors
B.7.1 [Hardware]: Integrated Circuits - types and design styles.
General Terms
Algorithms, Performance, Design, Reliability.
Keywords
Gate leakage, low power, domino circuits, MTCMOS.

P1.5 Modeling and Analysis of Leakage Power Considering Within-Die Process Variation [p. 64]

A. Srivastava, R. Bai, D. Blaauw, D. Sylvester (University of Michigan)

We describe the impact of process variation on leakage power for a 0.18 µm CMOS technology. We show that variability, manifested in L_drawn, T_ox, and N_sub, can drastically affect the leakage current. We first present Monte Carlo-based simulation results for leakage current in various CMOS gates when the process parameters are varied both individually and concurrently. We then derive an analytical model to estimate the mean and standard deviation of the leakage current as a function of the process parameter distributions. We demonstrate that the results of the analytical model match well with Monte Carlo simulations and also show that the statistical mean leakage current is significantly different from the leakage predicted using a nominal case file.

Poster Session 2: System and Software Design

Session Chair: Vamsi Krishna (Agilent Technologies)

P2.1 Low-Power Approach for Decoding Convolutional Codes with Adaptive Viterbi Algorithm Approximations [p. 68]

R. Henning, C. Chakrabarti (Arizona State University)

Significant power reduction can be achieved by exploiting real time variation in system characteristics while decoding convolutional codes. The approach proposed herein adaptively approximates Viterbi decoding by varying truncation length and pruning threshold of the T-algorithm while employing trace-back memory management. Adaptation is performed according to variations in signal-to-noise ratio, code rate, and maximum acceptable bit error rate. Potential energy reduction of 70 to 97.5% compared to Viterbi decoding is demonstrated. Superiority of adaptive T-algorithm decoding compared to fixed T-algorithm decoding is studied. General conclusions about when applications can particularly benefit from this approach are given.
Categories and Subject Descriptors
C.3 [Special-Purpose and Application-Based Systems]: Signal processing systems.
General Terms
Algorithms, Performance, Experimentation.
Keywords
Low Power, Viterbi Algorithm, Adaptive T-algorithm Decoding, Convolutional Codes.
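
The pruning step at the heart of the T-algorithm mentioned in the abstract above can be sketched as follows: at each trellis stage, only survivor paths whose accumulated metric lies within a threshold of the best metric are kept. This is a minimal sketch of that generic pruning rule only. The function name and the dictionary representation are hypothetical, and the paper's adaptive selection of the threshold and truncation length is not shown.

```python
def prune_survivors(path_metrics, threshold):
    """Keep states whose path metric is within `threshold` of the best.

    path_metrics: dict mapping trellis state -> accumulated path metric,
    where lower is better (as with Hamming or Euclidean distances).
    Returns a new dict containing only the surviving states.
    """
    best = min(path_metrics.values())
    return {s: m for s, m in path_metrics.items() if m - best <= threshold}
```

A smaller threshold discards more paths per stage, which reduces the work done at the next stage at the cost of a higher risk of discarding the correct path; this is the trade-off that an adaptive scheme can tune as channel conditions change.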

P2.2 Power-Aware Source Routing Protocol for Mobile Ad Hoc Networks [p. 72]

M. Maleki, K. Dantu, M. Pedram (University of Southern California)

Ad hoc wireless networks are power constrained since nodes operate with limited battery energy. To maximize the lifetime of these networks (defined by the condition that a fixed percentage of the nodes in the network "die out" due to lack of energy), network-related transactions through each mobile node must be controlled such that the power dissipation rates of all nodes are nearly the same. Assuming that all nodes start with a finite amount of battery capacity and that the energy dissipation per bit of data and control packet transmission or reception is known, this paper presents a new source-initiated (on-demand) routing protocol for mobile ad hoc networks that increases the network lifetime. Simulation results show that the proposed power-aware source routing protocol has a higher performance than other source initiated routing protocols in terms of the network lifetime.
Categories and Subject Descriptor
C.2.2 [Computer-Systems Organization]: Network Protocols
General Terms
Algorithms

P2.3 Analyzing Energy Friendly Steady State Phases of Dynamic Application Execution in Terms of Sparse Data Structures [p. 76]

E. G. Daylight (IMEC vzw, Katholieke University Leuven), S. Wuytack, C. Ykman-Couvreur (IMEC vzw), F. Catthoor (IMEC vzw, Katholieke University Leuven)

In the past decades, data structure analysis was mainly done at a high level of abstraction in the computer science community. For instance, choosing a linked list as a data structure as opposed to an array for a specific situation was mainly motivated from a performance point of view, under the implicit assumption that the computer platform (that had to run the software) consisted of one monolithic, physical memory. In the context of mobile, embedded devices, energy consumption is as important as performance. In addition, the assumption of one monolithic memory is outdated for many (if not all) current-day platforms! Clearly, there is a need to improve the choices that are made during data structure analysis given specific knowledge of the memory hierarchy of the platform under investigation.
We show how memory-related energy consumption can be heavily reduced by taking into account the access behaviour of the application on the one hand and the available on-chip and off-chip memory space on the other hand. We do this by exploiting the sparseness that is present in one steady state of the data structure under investigation. Analytical results show that energy reductions of a factor of 8.7 are feasible in comparison to common data structure implementations. We trade these gains off with on-chip memory space consumption of a custom memory architecture.
Categories and Subject Descriptors
E.2 [Data Storage Representations]: [composite structures, linked representations]; C.4 [Computer Systems Organization]: Performance of Systems - performance attributes
Keywords
Energy consumption, on-chip memory footprint, partitioned data structure

P2.4 Odd/Even Bus Invert with Two-Phase Transfer for Buses with Coupling [p. 80]

Y. Zhang, J. Lach, K. Skadron, M.R. Stan (University of Virginia)

The coupling capacitances between on-chip bus lines become dominant in deep-submicron technologies. Coding to reduce the switching activity of the individual lines was enough to reduce power on buses in older technologies, but new coding techniques that reduce the coupling activity between lines are needed for deep-submicron buses. One such coding technique uses the simple observation that coupling capacitances are always charged and discharged by activity on neighboring bus lines, where one line has an odd number and the other has an even number (if bus lines are numbered "in-order"). We thus propose to reduce the coupling activity by independently controlling the odd and even bus lines with two separate lines, the Odd Invert and Even Invert lines, respectively. We obtain significant reductions in power simply by comparing the coupling activity for the four possible cases of the Odd and Even Invert lines (00, 01, 10, 11), and then choosing the value with the smallest coupling activity to transmit on the bus. Even after encoding, the coupling activity for a pair of bus lines is still strongly dependent on the data. In particular, the toggling sequences 01→10 and 10→01 result in 4 times more coupling energy dissipation than other coupling events. We thus propose a targeted Two-Phase transfer in order to reduce total power only on the pairs of lines that carry such toggling events.
Categories and Subject Descriptors
B.7.1 [Hardware]: Integrated Circuits - types and design styles.
General Terms
Algorithms, Performance, Design.
Keywords
Coding for low-power I/O, Bus Invert, buses with coupling.
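
The selection step described in the abstract above (compare the four Odd/Even Invert cases and transmit the one with the least coupling activity) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the cost function, which charges (delta_i - delta_j)^2 per adjacent pair so that opposite toggles cost 4 and a single toggle costs 1, is an assumed model consistent with the 4x figure quoted in the abstract. The function names are hypothetical, lines are indexed from 0, and the coupling contributed by the two invert lines themselves is ignored. The Two-Phase transfer is not shown.

```python
from itertools import product


def coupling_cost(prev, curr):
    """Sum over adjacent line pairs of (delta_i - delta_{i+1})**2.

    delta is each line's transition (-1, 0, or +1). Opposite toggles on
    neighbours (01->10 or 10->01) cost 4; a single toggle costs 1.
    This cost model is an assumption of the sketch.
    """
    deltas = [c - p for p, c in zip(prev, curr)]
    return sum((a - b) ** 2 for a, b in zip(deltas, deltas[1:]))


def odd_even_invert(prev_bus, data):
    """Try the four (odd_invert, even_invert) cases and keep the cheapest.

    prev_bus: bits currently on the data lines (list of 0/1).
    data: the new word to send (list of 0/1, same length).
    Returns (odd_invert, even_invert, encoded_word).
    """
    best = None
    for odd_inv, even_inv in product((0, 1), repeat=2):
        # Odd-indexed lines follow odd_inv, even-indexed lines follow even_inv.
        encoded = [
            bit ^ (odd_inv if i % 2 else even_inv)
            for i, bit in enumerate(data)
        ]
        cost = coupling_cost(prev_bus, encoded)
        if best is None or cost < best[0]:
            best = (cost, odd_inv, even_inv, encoded)
    return best[1], best[2], best[3]
```

For example, if the bus currently holds 0101 and the next word is 1010, sending it unencoded makes every adjacent pair toggle in opposite directions, whereas inverting both the odd and the even lines leaves the data lines unchanged.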

P2.5 An Intra-Task Dynamic Voltage Scaling Method for SoC Design with Hierarchical FSM and Synchronous Dataflow Model [p. 84]

S. Lee (Seoul National University), S. Yoo (TIMA Lab), K. Choi (Seoul National University)

This paper presents a method of intra-task dynamic voltage scaling (DVS) for SoC design with hierarchical FSM and synchronous dataflow model (in short, HFSM-SDF model). To have an optimal intra-task DVS, exact execution paths need to be determined at compile time or runtime. In general programs, since determining exact execution paths at compile time or runtime is not possible, existing methods assume worst/average-case execution paths and take static voltage scaling approaches. In our work, we exploit a property of the HFSM-SDF model to calculate exact execution paths at runtime. With the information of exact execution paths, our DVS method can calculate the exact remaining workload. The exact workload makes it possible to calculate the optimal voltage level, which gives optimal energy consumption while satisfying the given timing constraint. Experiments show the effectiveness of the presented method in low power design of an MPEG4 decoder system.
Categories & Subject Descriptors:
J.6 [Computer-Aided Engineering]: Computer Aided Design (CAD)
General Terms:
Design, Performance
Keywords:
Low power, dynamic voltage scaling, variable supply voltage, formal model, finite state machine, synchronous dataflow
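
The idea of choosing an operating point from the exact remaining workload, as described in the abstract above, can be sketched as follows. This is a simplified illustration and not the paper's method: it assumes a hypothetical platform with a small set of discrete clock frequencies (each implying a supply voltage) and picks the slowest one that still meets the deadline. The function name, parameters, and frequency values are all assumptions.

```python
def pick_frequency(remaining_cycles, time_left_s, levels_hz):
    """Return the lowest available clock frequency that finishes on time.

    remaining_cycles: exact remaining workload in clock cycles.
    time_left_s: seconds until the deadline.
    levels_hz: discrete frequencies the (hypothetical) platform supports.
    Falls back to the fastest level if no level meets the deadline.
    """
    if time_left_s <= 0:
        return max(levels_hz)
    required_hz = remaining_cycles / time_left_s
    feasible = [f for f in levels_hz if f >= required_hz]
    return min(feasible) if feasible else max(levels_hz)


# Hypothetical levels: 3 million cycles left, 50 ms to the deadline.
chosen = pick_frequency(3_000_000, 0.05, [50e6, 100e6, 200e6])
```

Because the workload figure is exact rather than a worst-case bound, the chosen level carries no built-in pessimism, which is where the energy saving over worst-case-based scaling would come from.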

P2.6 Reducing Access Energy of On-Chip Data Memory Considering Active Data Bitwidth [p. 88]

T. Okuma, Y. Cao, M. Muroyama, H. Yasuura (Kyushu University)

This paper presents a new concept called active data bitwidth, which is the effective data length of the data bus. By profiling the active data bitwidth dynamically, we present a novel low-energy memory access technique for on-chip data memory design. By reducing the redundant access energy of data memory, our experimental results for two real applications show that we can achieve significant energy reduction. Compared to a monolithic memory, energy reductions of 52.2% for JPEG and 84.2% for MPEG-2 are reported. Compared to the memory banking technique, energy reductions of 12.3% for JPEG and 65.9% for MPEG-2 are reported.
Categories and Subject Descriptors
C.3 [Computer Systems Organization]: Special-Purpose and Application-Based Systems
General Terms
Design

Session 3: Low Power Circuit Techniques

Session Chair: Ram Krishnamurthy (Intel)
Session Organizer: Kaushik Roy (Purdue University)

3.1 Energy Recovering Static Memory [p. 92]

J. Kim, C.H. Ziesler, M.C. Papaefthymiou (University of Michigan)

This paper proposes an energy-recovering (a.k.a. adiabatic) static RAM with a novel driver that reduces power dissipation by efficiently recovering energy from the bit/word line capacitors. Powered by a single-phase sinusoidal power-clock, our SRAM delivers read and write operations with single-cycle latency. To that end, a precharge-low scheme is employed along with a modified sense amplifier design that achieves high efficiency at differential voltages near V_SS. A simple control circuit is used to maintain driver operation in synchrony with the power-clock waveform. Feedback circuitry from the driver output to the control circuit ensures that our driver remains efficient, independent of the access pattern. Our energy recovering SRAM functions correctly while achieving substantial energy savings over a wide range of supply voltages and operating frequencies. Hspice simulations of a simple full custom adiabatic 256x256 SRAM, that includes the energy recovering bit/word line drivers, the cell array, and the sense amplifiers, show over 2.6x energy savings at 3V, 300MHz in comparison with its conventional counterpart.
Categories and Subject Descriptors
B.3.1 [Memory Structures]: Semiconductor Memories - Static memory (SRAM)
General Terms
Design, Performance
Keywords
Adiabatic circuitry, charge recovery, cache memories, on-chip memories, low-energy design, low-power computing.

3.2 Low Power Integrated Scan-Retention Mechanism [p. 98]

V. Zyuban, S.V. Kosonocky (IBM T.J. Watson Research Center)

This paper presents a methodology for unifying the scan mechanism and data retention in latches which leads to scannable latches with the data retention capability achieved at a very low power overhead during the active mode. A detailed analysis of power and area overhead is presented, with layout examples for various common latch styles. Implications of using different power gating techniques for reducing leakage during sleep mode on the design of retention latches are considered, including well biasing for leakage control and sharing wells between gated logic and retention latch devices.
Categories and Subject Descriptors
B.2.1 [Design Styles]: Pipeline; B.6.1 [Design Styles]: Sequential circuits; B.7.1 [Types and Design Styles]: VLSI
General Terms
Design
Keywords
data retention, MTCMOS, subthreshold, leakage, low power, latch, scan, balloon latch

3.3 Closed Loop Adaptive Voltage Scaling Controller for Standard-Cell ASICs [p. 103]

S. Dhar, D. Maksimovic (University of Colorado), B. Kranzen (National Semiconductor)

The paper describes a closed-loop controller for adaptive voltage scaling (AVS) where the supply voltage to a standard-cell ASIC is dynamically adjusted to the minimum value required for the desired system speed. The controller includes a clock generator that provides a low-jitter clock to the ASIC at all steady-state operating points and through transients. To speed up the voltage transient response to step changes in clock frequency, the controller is based on a multiple-tap resettable delay line. A chip including the AVS controller and a dual 16-bit MAC application has been fabricated in a standard 0.5 µm CMOS process. The area taken by the AVS controller is 0.12 mm^2. Experimental results demonstrate operation over the application clock frequency range from 80 kHz to 20 MHz, and a 38 µs transient response for a step change in speed from standby to maximum throughput operation.
Categories and Subject Descriptors
B.7 [Hardware]: Integrated Circuits; B.5 [Hardware]: Register-Transfer-Level Implementation; B.8 [Hardware]: Performance and Reliability
General Terms
Design, Performance, Experimentation
Keywords
circuit design, design methodology, delay-line, low-power, energy efficient, voltage scaling, standard-cell, DC-DC converter

3.4 Design of a Branch-Based 64-bit Carry-Select Adder in 0.18um Partially Depleted SOI CMOS [p. 108]

A. Nève, D. Flandre (Universite Catholique de Louvain), H. Schettler, T. Ludwig, G. Hellner (IBM Entwicklung GmbH)

The paper presents the design of a 64-bit carry-select adder in Branch-Based Logic, a static design style that minimizes the internal node capacitances. This feature is used to lower the VZ dynamic power dissipation, while maintaining good speed performance. The experimental realization of the adder demonstrates an overall delay of 720 ps while only dissipating 96 mW at 1 GHz. The fabrication is based on the 0.18 µm IBM CMOS8S2 SOI technology, which uses partially depleted transistors and copper metallization.
Categories and Subject Descriptors
B.6.1 [Logic Design]: Design Styles - Combinational Logic.
General Terms
Performance, Design.
Keywords
Circuit Design, SOI technology, Logic design styles.

Session 4: Energy Efficient System Design

Session Chair: N. Ranganathan (University of S. Florida, Tampa, FL)
Session Organizer: Mahmut Kandemir (Penn State University)

4.1 Low-Power Color TFT LCD Display for Hand-Held Embedded Systems [p. 112]

I. Choi, H. Shim, N. Chang (Seoul National University)

An LCD (Liquid Crystal Display) is a standard display device for hand-held embedded systems. Today, color TFT (Thin-Film Transistor) LCDs are common even in cost-effective equipment. An LCD display system is composed of an LCD panel, a frame buffer memory, an LCD and frame buffer controller, and a backlight inverter and lamp. All of them are heavy power consumers, and their portion becomes much more dominant when running interactive applications. This is because interactive applications are often triggered by human inputs and thus result in a lot of slack time in the CPU and memory system, which can be effectively used for dynamic power management. In this paper, we introduce low-power LCD display schemes as a system-level approach. We accurately characterize the energy consumption at the component level and minimize the energy consumption of each component without appreciable display quality degradation. We develop several techniques such as variable-duty-ratio refresh, dynamic-color-depth control, and backlight luminance dimming with brightness compensation or contrast enhancement. These methods exhibit power reductions of 260mW, 250mW and 480mW, respectively. The aggregate energy reduction ratio is 28% out of total energy consumption including the CPU and the main memory system when we execute a document viewer. We also demonstrate that we can extend the battery life by about 38% and 20% for a text editor and an MPEG4 player, respectively.
Categories and Subject Descriptors
C.5 [Computer Systems Organization]: Computer System Implementation; I.3.1 [Computer Graphics]: Hardware Architecture; B.4.2 [Input/Output And Data Communications]: Input/Output Devices - Image Display
General Terms
Design
Keywords
low power, low energy, LCD, embedded system

4.2 Discharge Current Steering for Battery Lifetime Optimization [p. 118]

L. Benini (Universita di Bologna), A. Macii, E. Macii (Politecnico di Torino), M. Poncino (Universita di Verona)

Recent work on battery-driven power management has demonstrated that sequential discharge is suboptimal in multi-battery systems, and lifetime can be maximized by distributing (steering) the current load on the available batteries, thereby discharging them in a partially concurrent fashion. Based on these observations, we formulate multi-battery lifetime maximization as a continuous, constrained optimization problem, which can be efficiently solved by non-linear optimizers. We show that great lifetime extensions can be obtained with respect to standard sequential discharge, as well as to previously proposed battery allocation schemes.
Categories and Subject Descriptors
J.6 [Computer Applications]: Computer-Aided Engineering; C.4 [Computer Systems Organization]: Performance of Systems; G.1 [Numerical Analysis]: Optimization
General Terms
Design, Performance
Keywords
Energy consumption, battery lifetime optimization

4.3 Towards Energy-Aware Software-Based Fault Tolerance in Real-Time Systems [p. 124]

O.S. Unsal, I. Koren, C.M. Krishna (University of Massachusetts, Amherst)

Many real-time systems employed in defense, space, and consumer applications have power constraints and high reliability requirements. In this paper, we focus on the relationship between fault tolerance techniques and energy consumption. In particular, we establish the energy efficiency of Application Level Fault Tolerance (ALFT) over other software-based fault tolerance methods. We then develop sensible energy-aware heuristics for ALFT schemes. The heuristics yield up to 40% energy savings.

Session 5: Memory Subsystem

Session Chair: Lawrence Clark (Intel)
Session Organizer: Peter Kogge (University of Notre Dame)

5.1 Fine-Grain CAM-Tag Cache Resizing Using Miss Tags [p. 130]

M. Zhang, K. Asanovic (MIT Laboratory for Computer Science)

A new dynamic cache resizing scheme for low-power CAM-tag caches is introduced. A control algorithm that is only activated on cache misses uses a duplicate set of tags, the miss tags, to minimize active cache size while sustaining close to the same hit rate as a full size cache. The cache partitioning mechanism saves both switching and leakage energy in unused partitions with little impact on cycle time. Simulation results show that the scheme saves 28-56% of data cache energy and 34-49% of instruction cache energy with minimal performance impact.
Categories and Subject Descriptors
B.3.2 [Memory Structures]: Design Styles - Associative Memory, Cache Memory, Primary Memory
General Terms
Design
Keywords
Content-Addressable-Memory, Low-Power, Cache Resizing, Energy Efficiency, Leakage Current

5.2 An Adaptive Serial-Parallel CAM Architecture for Low-Power Cache Blocks [p. 136]

A. Efthymiou, J.D. Garside (University of Manchester)

There is an on-going debate about which consumes less energy: a RAM-tagged associative cache with an intelligent order of accessing its tags and ways (e.g. way prediction), or a CAM-tagged high associativity cache. If a CAM search can consume less than twice the energy of reading a tag RAM, it would probably be the preferred option for low-power applications. Based on memory traces, which usually cause tag mismatch within the lower four bits, a new serial CAM organisation is proposed which consumes just 45% more than a single tag RAM read and is only 25% slower than the conventional, parallel CAM. Furthermore, it can optionally be operated as a parallel CAM, at no speed penalty, and still reduce energy consumption.
Categories and Subject Descriptors
B.3.2 [Memory Structures]: Design Styles - Associative memories; B.3.2 [Memory Structures]: Design Styles - Cache memories; B.7.1 [Types and Design Styles]: VLSI
General Terms
Design, Performance
Keywords
CAM, cache design, VLSI, low power, low energy, asynchronous circuits
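
The serial search idea in abstract 5.2 can be illustrated with a rough sketch. It assumes a two-stage search (the low four tag bits first, the remaining bits only for entries that survive) and uses the number of bit comparisons as a crude stand-in for energy. The function names, the 8-bit tag width and the sample tags are hypothetical and are not taken from the paper; the real design is a circuit-level organisation whose energy figures cannot be derived from this count.

```python
def parallel_search(tags, key, width):
    """Compare every stored tag against the key on all bits at once."""
    bit_compares = len(tags) * width
    hits = [i for i, t in enumerate(tags) if t == key]
    return hits, bit_compares


def serial_search(tags, key, width, first_stage_bits=4):
    """Compare the low bits first; check the rest only for surviving entries."""
    low_mask = (1 << first_stage_bits) - 1
    # Stage 1: every entry compares only its low bits.
    survivors = [i for i, t in enumerate(tags)
                 if (t & low_mask) == (key & low_mask)]
    bit_compares = len(tags) * first_stage_bits
    # Stage 2: only the survivors compare their remaining high bits.
    bit_compares += len(survivors) * (width - first_stage_bits)
    hits = [i for i in survivors
            if (tags[i] >> first_stage_bits) == (key >> first_stage_bits)]
    return hits, bit_compares


# Hypothetical 8-bit tags whose low four bits all differ.
tags = [0x10, 0x21, 0x32, 0x43]
par_hits, par_cost = parallel_search(tags, 0x32, width=8)
ser_hits, ser_cost = serial_search(tags, 0x32, width=8)
```

Both searches return the same matching entry; the serial variant does fewer bit comparisons whenever most entries mismatch in the first stage, which is the trace property the abstract relies on.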

5.3 Reducing Energy Consumption of Video Memory by Bit-Width Compression [p. 142]

V.G. Moshnyaga, K. Inoue, M. Fukagawa (Fukuoka University)

A new architectural technique to reduce energy dissipation of video memory is proposed. Unlike existing approaches, the technique exploits the pixel correlation in video sequences, dynamically adjusting the memory bit-width to the number of bits changed per pixel. Instead of treating the data bits independently, we group the most significant bits together, activating the corresponding group of bit-lines adaptively to data variation. The method is neither restricted to specific bit patterns nor dependent on the storage phase. It works equally well on read and write accesses, as well as during precharging. Simulation results show that using this method we can reduce the total energy consumption of video memory by 20% without affecting the picture quality.
Categories and Subject Descriptors
B.3 [Hardware]: Memory Structures; B.5.1 [Hardware]: Design - memory design
General Terms
Design
Keywords
bitwidth-compression, frame memory, low-power design

5.4 A History-Based I-Cache for Low-Energy Multimedia Applications [p. 148]

K. Inoue, V.G. Moshnyaga (Fukuoka University), K. Murakami (Kyushu University)

This paper proposes a history-based tag-comparison scheme for reducing energy consumption of direct-mapped instruction caches. The proposed cache efficiently exploits program-execution footprints recorded in the Branch Target Buffer (BTB), and attempts to detect and eliminate unnecessary tag checks at run time. Simulation results show that our approach can eliminate up to 95% of tag checks, saving the cache energy by 17%, while affecting the processor performance by only 0.2%.
Categories and Subject Descriptors
B.3 [Hardware]: Memory Structures; C.1 [Computer Systems Organization]: Processor Architectures
General Terms
Design

Session 6: Modeling and Design Issues

Session Chair: Anand Raghunathan (NEC)
Session Organizer: Joerg Henkel (NEC)

6.1 Battery Lifetime Prediction for Energy-Aware Computing [p. 154]

D. Rakhmatov, S. Vrudhula (University of Arizona), D.A. Wallach (Hewlett-Packard Western Research Laboratory)

Predicting the time of full discharge of a finite-capacity energy source, such as a battery, is important for the design of portable electronic systems and applications. In this paper we present a novel analytical model of a battery that not only can be used to predict battery lifetime, but also can serve as a cost function for optimization of the energy usage in battery-powered systems. The model is physically justified, and involves only two parameters, which are easily estimated. The paper includes the results of extensive experimental evaluation of the model with respect to numerical simulations of the electrochemical cell, as well as measurements taken on a real battery. The model was tested using constant, interrupted, periodic and non-periodic discharge profiles, which were derived from standard applications run on a pocket computer.
Categories and Subject Descriptors
C.4.5 [Performance of Systems]: Performance Attributes
General Terms
Performance, Experimentation
Keywords
Battery, modeling, low-power design

6.2 Early Evaluation Techniques for Low Power Binding [p. 160]

E. Kursun, A. Srivastava, S.O. Memik, M. Sarrafzadeh (University of California Los Angeles)

This paper presents effective metrics to evaluate the power dissipation of scheduled data flow graphs (DFGs). This enables early evaluation of schedules without performing the computationally expensive resource-binding step. Our metrics correlate heavily (as high as 0.95 and > 0.75 for most test cases) with power dissipation values obtained after resource binding and rescheduling for power optimization steps. An experimental flow that integrates path-based scheduling, power optimal binding and power driven iterative rescheduling stages is constructed. The flow integrates commercial tools like Synopsys, VSS and academic compilers like SUIF in a common optimization framework. Experimental results on DFGs from the MediaBench suite also demonstrate that metric evaluation is on average 42.6 times faster than performing optimal binding and iterative power improvement. Hence metric based evaluation enables fast design exploration at early stages.
Categories & Subject Descriptors:
[Design] High Level Synthesis, Power Optimization, Scheduling, Resource Binding.
General Terms:
Design
Keywords:
Low Power Design, Scheduling, Resource Binding, Metric Evaluation.

6.3 Unified Methodology for Resolving Power-Performance Tradeoffs at the Microarchitectural and Circuit Levels [p. 166]

V. Zyuban, P. Strenski (IBM T.J. Watson Research Center)

Evaluation of architectural tradeoffs is complicated by implications in the circuit domain which are typically not captured in the analysis but substantially affect the results. We propose a metric of hardware intensity (h), which is useful for evaluating issues that affect both circuits and architecture. Analyzing data for actual designs we show how to measure the introduced parameters and discuss variations between observed results and common theoretical assumptions. For a power-efficient design we derive relations for h and supply voltage V under progressively more general situations, and incorporate h into a prior art architectural energy-efficiency criterion. Then, a more general relation is derived for the optimal balance between the architectural complexity, hardware intensity and power supply. Modified forms for these relations are obtained in special cases where the supply voltage is constrained or when clock gating is disallowed.
Categories and Subject Descriptors
B.2.4 [High-Speed Arithmetic]: Cost/performance; B.2.1 [Design Styles]: Pipeline; B.6.1 [Design Styles]: Combinational logic, Parallel circuits; B.6.3 [Design Aids]: Optimization; B.7.1 [Types and Design Styles]: Microprocessors and microcomputers,VLSI; C.5.3 [Microcomputers]: Microprocessors; C.0 [General]: Modeling of computer architecture
General Terms
Design, Performance
Keywords
Energy, power, energy efficiency, hardware intensity, metric

Invited Talk

Session Chair: Ingrid Verbauwhede (UCLA)

Is Nanoelectronics the Future of Microelectronics? [p. 172]

M. Lundstrom (Purdue University)

We examine current research in nanoelectronics and discuss the role it may play in future electronic systems.
Categories and Subject Descriptors
B.7.1 [Integrated Circuits]: Types and Design Styles - advanced technologies, memory technologies, VLSI
General Terms
Design, Performance, Theory
Keywords
nanoelectronics, Moore's Law, molecular electronics

Session 7: Microarchitecture Techniques

Session Chair: David Brooks (IBM T.J. Watson)
Session Organizer: Lea Hwang Lee (Motorola)

7.1 Saving Energy with Just In Time Instruction Delivery [p. 178]

T. Karkhanis, J.E. Smith (University of Wisconsin-Madison), P. Bose (IBM T.J. Watson Research Center)

Just-In-Time instruction delivery is a general method for saving energy in a microprocessor by dynamically limiting the number of in-flight instructions. The goal is to save energy by 1) fetching valid instructions no sooner than necessary, avoiding cycles stalled in the pipeline -- especially the issue queue, and 2) reducing the number of fetches and subsequent processing of mis-speculated instructions. A simple algorithm monitors performance and adjusts the maximum number of in-flight instructions at fairly long intervals, 100K instructions in this study. The proposed JIT instruction delivery scheme provides the combined benefits of more targeted schemes proposed previously. With only a 3% performance degradation, energy savings in the fetch, decode pipe, and issue queue are 10%, 12%, and 40%, respectively.
Categories and Subject Descriptors
C.1.3 [Processor Architectures]: Other Architecture Styles - adaptable architectures, pipeline processors.
General Terms
Performance, Design
Keywords
Low-power, adaptive processor, instruction delivery

7.2 Tradeoffs in Power-Efficient Issue Queue Design [p. 184]

A. Buyuktosunoglu, D.H. Albonesi (University of Rochester), P. Bose, P.W. Cook, S.E. Schuster (IBM T.J. Watson Research Center)

A major consumer of microprocessor power is the issue queue. Several microprocessors, including the Alpha 21264 and POWER4^TM, use a compacting latch-based issue queue design which has the advantage of simplicity of design and verification. The disadvantage of this structure, however, is its high power dissipation. In this paper, we explore different issue queue power optimization techniques that vary not only in their performance and power characteristics, but in how much they deviate from the baseline implementation. By developing and comparing techniques that build incrementally on the baseline design, as well as those that achieve higher power savings through a more significant redesign effort, we quantify the extra benefit the higher design cost techniques provide over their more straightforward counterparts.
Categories and Subject Descriptors
C.1 [Processor Architectures]: C.1.3 Other Architecture Styles - Adaptable architectures
General Terms
Performance, Design
Keywords
Low-power, microarchitecture, issue queue, banking, adaptation, compacting, non-compacting

7.3 Reducing Transitions on Memory Buses using Sector-based Encoding Technique [p. 190]

Y. Aghaghiri (University of Southern California), F. Fallah (Fujitsu Laboratories of America), M. Pedram (University of Southern California)

In this paper, we introduce a class of irredundant low power encoding techniques for memory address buses. The basic idea is to partition the memory space into a number of sectors. These sectors can, for example, represent address spaces for the code, heap, and stack segments of one or more application programs. Each address is first dynamically mapped to the appropriate sector and then is encoded with respect to the sector head. Each sector head is updated based on the last accessed address in that sector. The result of this sector-based encoding technique is a reduction in the number of bus transitions when encoding consecutive addresses that access different sectors. Our proposed techniques have small power and delay overhead when compared with many of the existing methods in the literature. One of our proposed techniques is very suitable for encoding addresses that are sent from an on-chip cache to the main memory when multiple application programs are executing on the processor on a time-sharing basis. For a computer system without an on-chip cache, the proposed techniques decrease the switching activity of data address and multiplexed address buses by an average of 55% and 67%, respectively. For a system with on-chip cache, up to 55% transition reduction is achieved on a multiplexed address bus between the internal cache and the external memory. Assuming a 10pF per line bus capacitance, we show that power reduction of up to 52% for an external data address bus and 42% for the multiplexed bus between cache and main memory is achieved using our methods.
Categories and Subject Descriptors:
B.4.3. [Input/output and data communications]: Interconnections, Interfaces.
General Terms:
Algorithms and Design.
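
The sector-head idea in abstract 7.3 can be sketched in a few lines. This is a simplified illustration, not the paper's encoding: it assumes two fixed sectors selected by the top address bit (the paper maps addresses to sectors dynamically), and it assumes the offset is XORed with the sector head before being driven onto the bus. The class name, the 16-bit bus width and the sample trace are hypothetical.

```python
ADDR_BITS = 16
SECTOR_BITS = 1                      # assumption: two fixed sectors
OFFSET_BITS = ADDR_BITS - SECTOR_BITS
OFFSET_MASK = (1 << OFFSET_BITS) - 1


class SectorCodec:
    """Keeps one head (last accessed offset) per sector."""

    def __init__(self):
        self.heads = [0] * (1 << SECTOR_BITS)

    def encode(self, addr):
        sector = addr >> OFFSET_BITS
        offset = addr & OFFSET_MASK
        # Send the sector bits as-is and the offset relative to the head.
        word = (sector << OFFSET_BITS) | (offset ^ self.heads[sector])
        self.heads[sector] = offset
        return word

    def decode(self, word):
        sector = word >> OFFSET_BITS
        offset = (word & OFFSET_MASK) ^ self.heads[sector]
        self.heads[sector] = offset
        return (sector << OFFSET_BITS) | offset


def transitions(words):
    """Total number of bus lines that toggle between consecutive words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


# Hypothetical trace: two sequential streams interleaved on one bus.
trace = [0x1230, 0xFEC0, 0x1231, 0xFEC1, 0x1232, 0xFEC2]
encoder, decoder = SectorCodec(), SectorCodec()
encoded = [encoder.encode(a) for a in trace]
decoded = [decoder.decode(w) for w in encoded]
```

The code is irredundant in the sense that the encoded word is as wide as the address, and the decoder recovers the trace by tracking the same heads. On this interleaved trace the encoded words toggle fewer bus lines than the raw addresses, because each stream is compared with its own last address rather than with the other stream's.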

7.4 Energy-Efficient Hybrid Wakeup Logic [p. 196]

M. Huang, J. Renau, J. Torrellas (University of Illinois at Urbana-Champaign)

The instruction window is a critical component and a major energy consumer in out-of-order superscalar processors. An important source of energy consumption in the instruction window is the instruction wakeup: a completing instruction broadcasts its result register tag and an associative comparison is performed with all the entries in the window. This paper shows that a very large fraction of the completing instructions have to wake up no more than a single instruction currently in the window. Consequently, we propose to save energy by using indexing to only enable the comparator at the single instruction to wake up. Only in the rare case when more than one instruction needs to wake up, our scheme reverts to enabling all the comparators or a subset of them. For this reason, we call our scheme Hybrid. Overall, our scheme is very effective: for a processor with a 96-entry window, the number of comparisons performed by the average completing instruction with a destination register is reduced to 0.8. The exact magnitude of the energy savings will depend on the specific instruction window implementation. Furthermore, the application suffers no performance penalty.
Categories & Subject Descriptors:
C.0 Computer System Organization: System Architectures. C.1.1 Single Data Stream Architectures: RISC/CISC,VLIW Architectures C.5.3 Microcomputers: Microprocessors.
General Terms:
Design, Experimentation, Performance
Keywords:
Low Power, Wakeup Logic, Issue Logic

Session 8: Technology-Driven Power Optimization

Session Chair: Unni Narayanan (Intel)
Session Organizer: G. Stamoulis (Technical University of Crete)

8.1 Automated Selective Multi-Threshold Design for Ultra-Low Standby Applications [p. 202]

K. Usami, N. Kawabe, M. Koizumi, K. Seta (Toshiba Corporation Semiconductor Company), T. Furusawa (Toshiba Microelectronics Corporation)

This paper describes an automated design technique to selectively use multi-threshold CMOS (MTCMOS) in a cell-by-cell fashion. MT cells consisting of low-Vth transistors and high-Vth sleep transistors are assigned to critical paths, while high-Vth cells are assigned to non-critical paths. Compared to the conventional MTCMOS, the gate delay is not affected by the discharge patterns of other gates because there is no virtual ground to be shared. We applied this technique to a test chip of a DSP core. The worst path-delay was improved by 14% over the single high-Vth design without increasing standby leakage at 10% area overhead.
Categories and Subject Descriptors
B.7.1 [Integrated Circuits]: Types and Design Styles - VLSI, DSP.
General Terms
Performance, Design, Experimentation.
Keywords
Automated design, Multi-Threshold, standby leakage current.

8.2 HA²TSD: Hierarchical Time Slack Distribution for Ultra-Low Power CMOS VLSI [p. 207]

K.-w. Choi, A. Chatterjee (Georgia Institute of Technology)

This paper describes an efficient hierarchical design and optimization approach for ultra-low power CMOS logic circuits. We introduce the Hierarchical Activity-Aware Time Slack Distribution (HA2TSD) algorithm, which distributes the surplus time slack into the most power-hungry modules hierarchically. HA2TSD ensures that the total slack budget is maximal and the total power is near-minimal. Based on these time slacks, we have optimized technology parameters (supply voltage, threshold voltage, and device width) through a gate level power optimizer and have tested the algorithm on a set of benchmark example circuits and building blocks of a synthesizable ARM core. The experimental results show that our strategy delivers over an order of magnitude savings in total (static and dynamic) power and reduces the optimization run-time significantly.
Categories and Subject Descriptors
B.7.2 [Integrated Circuits]: Design Aids-simulation.
General Terms
Algorithms.
Keywords
Low-power design, time slack distribution, and gate-level power optimization.

8.3 Runtime Mechanisms for Leakage Current Reduction in CMOS VLSI Circuits [p. 213]

A. Abdollahi (University of Southern California), F. Fallah (Fujitsu Laboratories of America), M. Pedram (University of Southern California)

This paper describes two runtime mechanisms for reducing the leakage current of a CMOS circuit. In both cases, it is assumed that the system or environment produces a "sleep" signal that can be used to indicate that the circuit is in a standby mode. In the first method, the "sleep" signal is used to shift in a new set of external inputs and pre-selected internal signals into the circuit with the goal of setting the logic values of all of the internal signals so as to minimize the total leakage current in the circuit. This minimization is possible because the leakage current of a CMOS gate is a strong function of the input combination applied to its inputs. In the second method, NMOS and PMOS transistors are added to some of the gates in the circuit to increase the controllability of the internal signals of the circuit and decrease the leakage current of the gates using the "stack effect". This is, however, done carefully so that the minimum leakage is achieved subject to a delay constraint for all input-output paths in the circuit. In both cases, Boolean satisfiability is used to formulate the problems, which are subsequently solved by employing a highly efficient SAT solver. Experimental results on the circuits in the MCNC91 benchmark suite demonstrate that it is possible to reduce the leakage current by up to 70% in VLSI circuits at the expense of a very small overhead.
Categories and Subject Descriptors:
B.7.1. [Integrated Circuits]: Types and Design Styles, VLSI
General Terms:
Algorithms and Design

Embedded Tutorial 1

Session Chair: Christian Piguet (CSEM & EPFL, Switzerland)

Future Directions in Clocking Multi-GHz Systems [p. 219]

V.G. Oklobdzija (University of California), J. Sparso (Technical University of Denmark)

This tutorial addresses the problems and possible solutions of clocking digital systems operating at multi-GHz frequencies. The first part of the tutorial will address techniques for managing clock uncertainties and clock power in synchronous circuits. There are two trends that are disturbing: (a) the power taken by the clock distribution network and clocked storage elements (flip-flops and latches) is increasing relative to the rest of the logic, and (b) clock uncertainties are taking a significant portion of the cycle away from useful logic operations. There are no radical solutions in sight. We present ways of designing clocked storage elements that are capable of absorbing a significant portion of clock uncertainties and passing delay from one logic stage to the other. At multi-GHz frequencies of operation it will be difficult to precisely control the timing boundaries between the logic stages. Thus the ability to extend the operation into the time period allocated for the next pipeline stage is important. This is known as time borrowing. Also, the ability to incorporate logic into the clocked storage elements is of critical importance given that the number of logic stages in a pipeline running at multi-GHz frequencies is decreasing to less than ten.

Embedded Tutorial 2

Session Chair: Mary Jane Irwin (Penn State University)

Compilers for Power and Energy Management [p. 220]

U. Kremer (Rutgers University)

Optimizing compilers perform program analyses and transformations at different levels of program abstraction, ranging from source code, intermediate code such as three address code, to assembly and machine code. Analyses and transformations can have different scopes. They can be performed within a single basic block (local), across basic blocks but within a procedure (global), or across procedure boundaries (interprocedural). Traditionally, optimizing compilers try to reduce overall program execution time or resource usage such as memory. The compilation process itself can be done before program execution (static compilation), or during program execution (dynamic compilation). This large design space is the main challenge for compiler writers. Many tradeoffs have to be considered in order to justify the development and implementation of a particular optimization pass or strategy. However, every compiler optimization needs to address the following three issues:
1. opportunity: When can the optimization be applied?
2. safety: Does the optimization preserve program semantics?
3. profitability: When applied, how much performance improvement can be expected?

Session 9: Analog Electronics

Session Chair: Paul Hurst (UC Davis)
Session Organizer: Satyen Mukherjee (Philips)

9.1 Oversampled Gain-Boosting [p. 221]

O. Oliaei (Motorola Labs)

A dynamic gain-enhancement technique suitable for low voltage low-power oversampling circuits, particularly sigma-delta converters, is presented. This method makes use of a discrete-time integrator to improve gradually the output resistance of the main amplifier over successive clocks.
Categories and Subject Descriptors
B.7.1 [Hardware]: Integrated Circuits
General Terms
Circuit Design
Keywords
Switched-Capacitor, MOS amplifier, bootstrapping, ADC, DAC, sigma-delta, gain boosting, gain enhancement, OTA.

9.2 ±0.5V ~ ±1.5V UHF CMOS LV/LP Four-Quadrant Analog Multiplier in Modified Bridged-Triode Scheme [p. 227]

S.C. Li, J.C. Cha (National Yunlin Univ. of Science and Technology)

A new LV/LP CMOS four-quadrant analog multiplier designed in a modified bridged-triode scheme (MBTS) is presented. It offers benefits in terms of linearity, power consumption, frequency response and total harmonic distortion (THD). The chip, fabricated in TSMC 0.35 um n-well SPQM CMOS technology, has a nonlinearity error of less than 0.8% over a ±0.5V input range under a nominal supply voltage of ±1.5V, and consumes only 2.7 mW in total.
Categories & Subject Descriptors
B.7.1 [Integrated Circuits]: Types and Design Styles - Algorithms implemented in hardware, Input/output circuits.
General Terms
Design, Performance, Measurement.
Keywords
Analog multiplier, Modified Bridged-Triode Scheme (MBTS).

9.3 A Power and Resolution Adaptive Flash Analog-to-Digital Converter [p. 233]

J. Yoo, D. Lee, K. Choi, J. Kim (Pennsylvania State University)

A new power and resolution adaptive flash ADC, named PRA-ADC, is proposed. The PRA-ADC enables exponential power reduction with linear resolution reduction. Unused parallel voltage comparators are switched to standby mode. The voltage comparators consume only the leakage power during the standby mode. The PRA-ADC, capable of operating at 5-bit, 6-bit, 7-bit, and 8-bit precision, dissipates 69 mW at 5-bit and 435 mW at 8-bit. The PRA-ADC was designed and simulated with 0.18 um CMOS technology. The PRA-ADC design is applicable to RF portable communication devices, allowing tighter management of power and efficiency.
Categories and Subject Descriptors
B.7.1 [Integrated Circuits]: Types and Design Styles - VLSI
General Terms
Design
Keywords
Analog-to-Digital Converter, Flash ADC, Threshold Inverter Quantization, TIQ Comparator, Adaptive
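
The "exponential power reduction with linear resolution reduction" claim in abstract 9.3 follows from how flash converters scale. The sketch below assumes the textbook figure of 2^n - 1 comparators for an n-bit flash ADC and takes the two power numbers from the abstract; the function name is hypothetical and the paper's own comparator count and power breakdown are not given in the abstract.

```python
def flash_comparators(bits):
    """Comparators in a conventional n-bit flash ADC (textbook 2**n - 1)."""
    return 2 ** bits - 1


# Comparators that must stay active at each resolution the abstract lists.
active = {bits: flash_comparators(bits) for bits in (5, 6, 7, 8)}

# Each bit of resolution dropped roughly halves the active comparators.
comparator_ratio_8_to_5 = active[8] / active[5]

# Power figures reported in the abstract (mW).
reported_power_mw = {5: 69, 8: 435}
power_ratio_8_to_5 = reported_power_mw[8] / reported_power_mw[5]
```

Under this assumption, dropping from 8 to 5 bits leaves 31 of 255 comparators active, about an 8x reduction, against a reported power ratio of roughly 6.3x.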

9.4 Design Techniques for Low Power High Bandwidth Upconversion in CMOS [p. 237]

C. De Ranter, M. Steyaert (Katholieke Universiteit Leuven)

An upconvertor topology for low power, high bandwidth applications is presented. Using specific circuit techniques and local circuit-level optimization, the power consumption of the total system comprising an on-chip LC-type VCO, a polyphase network quadrature generator, a linear mixer block and an RF-current buffer, has been minimized. A chip has been designed and manufactured in a 0.25 um CMOS technology. The VCO oscillates between 1.68 GHz and 2 GHz. Driven by an external LO, the transmitter operates from 900 MHz up to 2 GHz. At 2 GHz, the upconvertor transmits -12 dBm into 50 Ω with a linearity of more than -35 dBc for base band signals up to 33 MHz.
Categories and Subject Descriptors
B.7.m [Integrated Circuits]: Miscellaneous - Analog RF CMOS Design
General Terms
Design
Keywords
Low power, Analog, Upconversion, Oscillators, RF Design, CMOS

Session 10: Design Contest Presentation

Session Chair: Vivek Tiwari (Intel)

Poster Session 3: Logic and Microarchitecture

Session Chair: Vojin G. Oklobdzija (UC Davis)

P3.1 TLB and Snoop Energy-Reduction using Virtual Caches in Low-Power Chip Multiprocessors [p. 243]

M. Ekman (Chalmers University of Technology), F. Dahlgren (Ericsson Mobile Platforms), P. Stenström (Chalmers University of Technology)

In our quest to bring down the power consumption in low-power chip-multiprocessors, we have found that TLB and snoop accesses account for about 40% of the energy wasted by all L1 data-cache accesses. We have investigated the prospects of using virtual caches to bring down the number of TLB accesses. A key observation is that while the energy wasted in the TLBs is cut, the energy associated with snoop accesses becomes higher. We then contribute two techniques to reduce the number of snoop accesses and their energy cost. Virtual caches together with the proposed techniques are shown to reduce the energy wasted in the L1 caches and the TLBs by about 30%.
Categories and Subject Descriptors
C.5.3 Microcomputers---Microprocessors
General Terms
Performance, Design
Keywords
low-power, CMP, snoop, virtual caches

P3.2 A Preactivating Mechanism for a VT-CMOS Cache using Address Prediction [p. 247]

R. Fujioka, K. Katayama, R. Kobayashi, H. Ando, T. Shimada (Nagoya University)

It has become an important requirement to achieve high performance and low-power consumption at the same time. The dynamic leakage cut-off (DLC) scheme, which controls transistors' threshold voltage by the line on demand, is a technique that potentially satisfies that requirement for a cache. Yet, conventional DLC causes access time to significantly lengthen, and consequently processor performance is unacceptably degraded. This paper proposes a mechanism that suppresses the performance degradation by preactivating cache lines using address prediction before access requests. Our evaluation results show significant performance improvements are achieved with little increase of power consumption.
Keywords
leakage current, L1 data cache, address prediction

P3.3 Dynamic Vt SRAM: A Leakage Tolerant Cache Memory for Low Voltage Microprocessors [p. 251]

C.H. Kim, K. Roy (Purdue University)

This paper presents a Dynamic Vt SRAM (DTSRAM) architecture to reduce the subthreshold leakage in cache memories. The Vt of each cache line is controlled separately by means of body biasing. In order to minimize the energy and delay overhead, a cache line is switched to high Vt only when it is not likely to be accessed anymore. Simulation results from the SimpleScalar framework show that even after considering the energy overhead, the DTSRAM can save 72% of the cache leakage with a performance loss of less than 1%. Layout of the DTSRAM shows that the area penalty is minimal.

P3.4 Asymmetric-Frequency Clustering: A Power-Aware Back-End for High-Performance Processors [p. 255]

A. Baniasadi (Northwestern University), A. Moshovos (University of Toronto)

We introduce asymmetric frequency clustering (AFC), a micro-architectural technique that reduces the dynamic power dissipated by a processor's back-end while maintaining high performance. We present a dual-cluster, dual-frequency machine comprising a performance oriented cluster and a power-aware one. The power-aware cluster operates at half the frequency of the performance oriented cluster and uses a lower voltage supply. We show that this organization significantly reduces back-end power dissipation by executing non-performance-critical instructions in the power-aware cluster. AFC localizes the two frequency/voltage domains. Consequently, it mitigates many of the complexities associated with maintaining multiple supply voltage and frequency domains on the same chip. Key to the success of this technique are methods that assign as many instructions as possible to the slower/lower-power cluster without impacting overall performance. We evaluate our techniques using a subset of SPEC2000 and SPEC95. AFC provides a 16% back-end power reduction with 1.5% performance loss compared to a conventional, dual-clustered processor where each cluster has schedulers of the same width and length.
Categories and Subject Descriptors
C.1.1 [Single Data Stream Architectures] Pipeline processors.
General Terms
Design
Keywords
Power-Aware Architectures, Processor Back-End, Instruction Criticality, Asymmetric Frequency Clustering, High-Performance Processors.

Poster Session 4: Analysis, Estimation and Optimization

Session Chair: Ed Cheng (Synopsys)

P4.1 Power Analysis Techniques for SoC with Improved Wiring Models [p. 259]

T. Sakamoto, T. Yamada, M. Mukuno, Y. Matsushita, Y. Harada (Sanyo Electric), H. Yasuura (Kyushu University)

This paper proposes two techniques for improving the accuracy of gate-level power analysis for system-on-a-chip (SoC): (1) the creation of custom wire load models for clock nets, and (2) the use of layout information (actual net capacitance and input signal transition time). The analysis time is reduced to less than one three-hundredth of the transistor-level power analysis time. The error is within 5% of that of a real chip (the same level as in transistor-level power analysis) if technique (2) is used. The analytical error between techniques (1) and (2) is within 1%.
Categories and Subject Descriptors
B.7.2 [Integrated Circuits]: Design Aids - Simulation, Verification, Placement and routing, Layout.
General Terms
Verification, Experimentation, Design
Keywords
SoC, power analysis, gate-level, custom wire load model

P4.2 A Microarchitectural-Level Step-Power Analysis Tool [p. 263]

W. El-Essawy, D.H. Albonesi (University of Rochester), B. Sinharoy (IBM Corporation)

Clock gating is an effective means for reducing average power consumption. However, clock gating can exacerbate maximum cycle-to-cycle current swings, or the step-power (Ldi/dt) problem. We present a microarchitecture-level step-power simulator and demonstrate its use in exploring how design alternatives impact relative step-power levels. We show how the tool can be used to identify major sources of high microprocessor step-power events. Our experiments indicate that branch mispredictions are a major cause of high step-power occurrences. We also show that high step-power events are infrequent, which suggests that architectural techniques may limit step-power at potentially low performance cost.
Categories and Subject Descriptors
C.5.3 [Computer Systems Organization]: COMPUTER SYSTEM IMPLEMENTATION Microcomputers; I.6.5 [Computing Methodology]: SIMULATION AND MODELING Model Development
General Terms
Reliability, Design
Keywords
step-power, Ldi/dt, inductive noise, microprocessors, clock-gating, architectural simulation
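
As a rough illustration of the quantity such a simulator tracks, the sketch below computes cycle-to-cycle power steps from a per-cycle power trace and flags the cycles whose step exceeds a threshold. The trace values, the threshold, and the function names are hypothetical; they are not taken from the tool described in the abstract.

```python
def step_power(trace):
    """Return the absolute cycle-to-cycle change of a per-cycle power trace."""
    return [abs(b - a) for a, b in zip(trace, trace[1:])]


def high_step_events(trace, threshold):
    """Return the cycle indices whose step from the previous cycle exceeds threshold."""
    return [i + 1 for i, s in enumerate(step_power(trace)) if s > threshold]


# Hypothetical per-cycle power in arbitrary units. The drop at cycle 3 mimics
# clock-gated units going idle, and the rise at cycle 5 mimics them waking up.
trace = [10, 11, 10, 3, 4, 12]
steps = step_power(trace)
events = high_step_events(trace, threshold=5)
print(steps, events)
```

Counting how often such events occur over a long trace is one simple way to judge whether they are frequent enough to matter.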

P4.3 Power Estimation of Sequential Circuits using Hierarchical Colored Hardware Petri Net Modeling [p. 267]

A.K. Murugavel, N. Ranganathan (University of South Florida)

A Hierarchical Colored Hardware Petri net (HCHPN) based model was proposed in [8] for estimating switching activity in combinational circuits. In this paper, we model sequential circuits as HCHPNs incorporating real delays for both gates and interconnects. Thus, the given sequential circuit is first modeled as a HCHPN and simulated for switching activity estimation in the Petri net domain, which leads to better accuracy and faster simulation. Experimental results for ISCAS'89 benchmark circuits show that the proposed HCHPN model yields accuracy on average within 4.4% of that of PowerMill. The per-pattern simulation time for HCHPNs is about 2.4 times less than that of PowerMill.
Categories and Subject Descriptors
B.7 [Hardware]: Integrated Circuits; B.7.2 [Integrated Circuits]: Design Aids - simulation

P4.4 High-Level Area Estimation [p. 271]

K.M. Büyüksahin (University of Illinois at Urbana-Champaign), F. N. Najm (University of Toronto)

Early power estimation requires one to estimate the area (gate count) of a design from a high-level description. We propose a method to do this that makes use of the concept of Boolean networks (BN) and introduces an invariant area complexity measure which captures the gate-count requirement of a design. The method can be adapted to be used at different points on the area/delay tradeoff curve, with different synthesizer/mapper tools, and different target gate libraries. The area model is experimentally verified and tested using a number of ISCAS and MCNC benchmark circuits and two different target cell libraries, on two different synthesis systems.
Categories and Subject Descriptors
B.5.2 [RTL Implementation]: Design aids
General Terms
Design
Keywords
Area estimation, Boolean networks

P4.5 Retiming-Based Logic Synthesis for Low-Power [p. 275]

Y.-L. Hsu, S.-J. Wang (National Chung-Hsing University)

Power management has become a great concern in VLSI design in recent years. In this paper, we consider a logic-level design technique for low power applications. We present a retiming-based optimization method, in which part of the circuit is selected and moved so that it produces logic signals one clock cycle before they are actually applied. If these values can solely determine the output logic level, then the other part of the circuit can be turned off to save power. We explore acceptable retimed circuit structures, in which circuit function is not changed. An algorithm is proposed to select the optimal logic block to be retimed. We evaluate the low-power circuit structure on several MCNC benchmark circuits, and the results indicate an improvement over previous methods. Our method achieves a significant reduction in switching activity, and the reduction can be more than 70% in some cases. The required area overhead is very small.
Categories and Subject Descriptors
B.6.3 [Logic Design]: Design Aids - automatic synthesis, optimization, switching theory.
General Terms
Algorithms, Design.
Keywords
Low-power, retiming, logic design, switching activity.

P4.6 Activity-Sensitive Clock Tree Construction for Low Power [p. 279]

C. Chen, C. Kang (University of Windsor), M. Sarrafzadeh (University of California at Los Angeles)

This paper presents an activity-sensitive clock tree construction technique for low power design of VLSI clock networks. We introduce the notion of node difference based on module activity information, and show its relationship with the power consumption. A binary clock tree is built using the node difference between different modules to optimize the power consumption due to the interconnections (i.e., clock gating signals and clock edges). We also develop a method to determine gating signals with the minimum number of transitions. After the clock tree is constructed, the gating signals are optimized for further power savings.
Categories and Subject Descriptors
B.7.1 [Integrated Circuits]: Types and Design Styles - VLSI.
General Terms
Algorithms.
Keywords
Clock tree, low power, clock gating, activity pattern.

Invited Talk

Session Chair: Vivek De (Intel)

Session 11: Signal Processing

Session Chair: Wanda Gass (Texas Instruments)
Session Organizer: Sanjive Agarwala (Texas Instruments)

11.1 Low-Power VLSI Decoder Architectures for LDPC Codes [p. 284]

M.M. Mansour, N.R. Shanbhag (University of Illinois at Urbana-Champaign)

Iterative decoding of low-density parity check (LDPC) codes using the message-passing algorithm has proved to be extraordinarily effective compared to conventional maximum-likelihood decoding. However, the lack of any structural regularity in these essentially random codes is a major challenge for building a practical low-power LDPC decoder. In this paper, we jointly design the code and the decoder to induce the structural regularity needed for a reduced-complexity parallel decoder architecture. This interconnect-driven code design approach eliminates the need for a complex interconnection network while still retaining the algorithmic performance promised by random codes. Moreover, we propose a new approach for computing reliability metrics based on the BCJR algorithm that reduces the message switching activity in the decoder compared to existing approaches. Simulations show that the proposed approach results in power savings of up to 85.64% over conventional implementations.
Categories and Subject Descriptors
B.7.1 [Types and Design Styles]: VLSI; E.4 [Coding and Information Theory]: Error control codes
General Terms
Design
Keywords
LDPC codes, low power architectures, BCJR algorithm.

11.2 A Low Power Normalized-LMS Decision Feedback Equalizer for a Wireless Packet Modem [p. 290]

D. Garrett, C. Nicol (Lucent Technologies), A. Blanksby, C. Howland (Agere Systems)

This paper presents a decision feedback equalizer (DFE) for a high-speed packet modem utilizing the normalized least mean squared (NLMS) tap update algorithm. The equalizer supports up to 43.2 Mbps uncoded data over a wireless channel with a 10% training preamble (48 Mbps with no training). In this work the rapid convergence of the NLMS algorithm is combined with a technique for early termination of the tap training process to yield a low power DFE implementation. The low power techniques result in a 43% power reduction over a baseline design. Furthermore, low power synthesis techniques result in an additional 30% power savings on top of the algorithmic power savings.
Categories and Subject Descriptors
B.2.4 [Arithmetic and Logic Structures]: High-speed arithmetic - algorithms, cost/performance.
General Terms
Algorithms, Performance, Design.
Keywords
Low power, NLMS, equalization, early termination.
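
For readers unfamiliar with the tap update, the following is a minimal sketch of NLMS training with an early-termination check. It identifies a toy two-tap channel rather than implementing a DFE, and the stopping rule (halt once the error stays below `stop_error` for `stop_count` consecutive samples) is an assumption made for illustration; the abstract does not state the criterion used in the paper. All names and parameter values are hypothetical.

```python
def nlms_train(inputs, desired, num_taps, mu=0.5, eps=1e-8,
               stop_error=1e-3, stop_count=4):
    """Train FIR taps with normalized LMS; stop early once the error stays small."""
    taps = [0.0] * num_taps
    window = [0.0] * num_taps  # most recent input sample first
    quiet = 0                  # consecutive samples with a small error
    updates = 0                # number of tap updates actually performed
    for x, d in zip(inputs, desired):
        window = [x] + window[:-1]
        y = sum(w * v for w, v in zip(taps, window))
        err = d - y
        quiet = quiet + 1 if abs(err) < stop_error else 0
        if quiet >= stop_count:
            break  # assumed early-termination rule: skip all remaining updates
        energy = sum(v * v for v in window) + eps  # input power normalization
        taps = [w + mu * err * v / energy for w, v in zip(taps, window)]
        updates += 1
    return taps, updates


# Toy training sequence through a known channel d[n] = 0.5*x[n] + 0.25*x[n-1].
channel = [0.5, 0.25]
inputs = [1.0, 1.0, -1.0, -1.0] * 50
desired = [channel[0] * x + channel[1] * (inputs[n - 1] if n else 0.0)
           for n, x in enumerate(inputs)]
taps, updates = nlms_train(inputs, desired, num_taps=2)
print(taps, updates)
```

Every update that is skipped after convergence is arithmetic that a hardware implementation would not need to perform, which is the intuition behind terminating training early.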

11.3 High Performance and Low Power FIR Filter Design based on Sharing Multiplication [p. 295]

J. Park, W. Jeong, H. Choo, H. Mahmoodi-Meimand, Y. Wang, K. Roy (Purdue University)

We present a high performance and low power FIR filter design, which is based on a computation sharing multiplier (CSHM). CSHM specifically targets computation re-use in vector-scalar products and is effectively used in our FIR filter design. Efficient circuit-level techniques, namely a new carry select adder and a conditional capture flip-flop (CCFF), are also used to further improve power and performance. The proposed FIR filter architecture was implemented in 0.25 um technology. Experimental results on a 10 tap low pass CSHM FIR filter show speed and power improvement of 19% and 17%, respectively, with respect to an FIR filter based on a Wallace tree multiplier.
Keywords
Computation sharing, FIR filter design, high performance and low power carry select adder, conditional capture flip-flop
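
The idea of sharing computation across the products of one input sample with many coefficients can be sketched as follows. This is an illustrative model only: it assumes that the odd multiples 1x through 15x are precomputed once per sample and that each non-negative coefficient is processed in 4-bit chunks with shifts and adds. The chunk width, the set of precomputed multiples, and the function names are assumptions, not details given in the abstract.

```python
def precompute_alphabets(x):
    """Compute the odd multiples 1x, 3x, ..., 15x once per input sample."""
    return {k: k * x for k in range(1, 16, 2)}


def shared_multiply(alphabets, coeff):
    """Form coeff * x (coeff >= 0) from the shared odd multiples via shifts and adds."""
    result = 0
    shift = 0
    while coeff:
        nibble = coeff & 0xF
        if nibble:
            tz = (nibble & -nibble).bit_length() - 1  # trailing zero bits
            odd = nibble >> tz                        # odd factor, one of 1..15
            result += alphabets[odd] << (shift + tz)
        coeff >>= 4
        shift += 4
    return result


# One input sample multiplied by several hypothetical filter coefficients.
x = 7
coeffs = [3, 10, 255, 0, 1234]
alphabets = precompute_alphabets(x)
products = [shared_multiply(alphabets, c) for c in coeffs]
print(products)
```

The precomputation cost is paid once per sample and amortized over all taps, which is where the re-use in a vector-scalar product comes from.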

11.4 A Low-Power Digital Matched Filter for Spread-Spectrum Systems [p. 301]

S. Goto, T. Yamada, N. Takayama, Y. Matsushita, Y. Harada (Sanyo Electric, Co., Ltd.), H. Yasuura (Kyushu University)

A Digital Matched Filter (DMF) is an essential device for Direct-Sequence Spread-Spectrum (DS-SS) communication systems. Reducing the power consumption of a DMF is especially critical for battery-powered terminals. The reception registers and the correlation-calculating unit dissipate the majority of the power in a DMF. In this paper we discuss this problem and propose a low power architectural approach to a DMF. The total switching activity factor and the switched capacitance are reduced. As a result of power analysis at the gate level, the implementation of the proposed architecture in a standard 0.18-um CMOS technology achieved a reduction in the power consumption of more than 70%.
Categories and Subject Descriptors
B.5.1 [Register-Transfer-Level Implementation]: Design - arithmetic and logic units, control design, styles.
General Terms
Algorithms, Management, Design, Experimentation.
Keywords
matched filter, spread-spectrum, CDMA, VLSI, low power.

Session 12: Simulation and Estimation Techniques

Session Chair: Pai Chou (UCI)
Session Organizer: Wolfgang Nebel (OFFIS, Oldenburg University)

12.1 Parametric Timing and Power Macromodels for High Level Simulation of Low Swing Interconnects [p. 307]

D. Bertozzi, L. Benini, B. Ricco (University of Bologna)

The impact of global on-chip interconnections on power consumption and speed of integrated circuits is becoming a serious concern. Designers therefore need to quickly estimate how performance and power are affected by a given choice of the interconnection parameters (length, voltage swing, driver and receiver schematics and sizing). This work focuses on the entire communication channel (driver, interconnect, receiver), and provides high level parametric VHDL simulation models for low-swing signaling schemes. These SPICE-derived power and timing macromodels transfer electrical-level information to the RTL simulation in an event-driven fashion, as transitions occur at the input of the interconnect driver. The accuracy reached by this back annotation technique is within 5% with respect to SPICE results, with only a 4% simulation speed penalty in the worst case.

12.2 Compact Models for Estimating Microprocessor Frequency and Power [p. 313]

W. Athas, L. Youngs (Apple Computer), A. Reinhart (Motorola Labs)

This paper describes compact mathematical models for estimating the frequency performance and power dissipation of a microprocessor as a function of the supply voltage. The objective is to estimate the frequency and/or power performance across a wide range of supply voltages and operating frequencies using only a small number of configurable parameters and equations. These compact equations are amenable to hand calculations and spreadsheet manipulation. The configurable parameters are derived from actual measurements of microprocessor chips and are calculated using the least-squares curve-fitting method.
Categories and Subject Descriptors
C.4 [Performance of Systems], B.7 [Integrated Circuits], I.6 [Simulation and Modeling], G.4 [Mathematical Software], J.6 [Computer-Aided Engineering]
General Terms
Algorithms, Design, Experimentation, Performance
Keywords
Low-power, microprocessors, VLSI, ASIC, curve-fitting, delay modeling, power estimation
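
As a small illustration of deriving model parameters by least-squares curve fitting, the sketch below fits a straight line to frequency-versus-voltage points. The abstract does not give the form of the paper's equations, so the linear model, the data points (synthetic, generated from f = 600*V - 200), and the function name are assumptions made purely for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic "measurements": supply voltage in volts, frequency in MHz.
volts = [1.0, 1.2, 1.4, 1.6, 1.8]
freqs_mhz = [400.0, 520.0, 640.0, 760.0, 880.0]
slope, intercept = fit_line(volts, freqs_mhz)
print(slope, intercept)
```

A closed-form fit like this one is simple enough to reproduce in a spreadsheet, in keeping with the stated goal of equations that are amenable to hand calculation.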

12.3 Efficient Estimation of Signal Transition Activity in MAC Architectures [p. 319]

A. Garcia, L.D. Kabulepa, M. Glesner (Darmstadt University of Technology)

Because of the increasing demand for portable digital systems, it is of great interest to extend the existing high-level power estimation techniques to handle architectures with nonlinear components, as they appear in relevant practical applications. In this paper we focus on the estimation of the transition activity in MAC structures implementing FIR filters. Based on a divide and conquer approach, an accurate yet efficient estimation procedure is developed. The technique has been evaluated for different synthetic and real data sets. In all cases, our results show only very slight discrepancies with respect to precise bit level simulations.
Categories and Subject Descriptors
B.8.2 [Hardware]: Performance Analysis & Design Aids
General Terms
Design, Performance
Keywords
Low power, power estimation, transition activity, MAC

12.4 Novel Modeling Techniques for RTL Power Estimation [p. 323]

M. Eiermann, W. Stechele (Technical University of Munich)

In this work, we propose efficient macromodeling techniques for RTL power estimation, based only on word and bit level switching information of the module inputs. We present practicable combinations of these two properties for the construction of power macromodels. We demonstrate that our models reduce the estimation error by at least 64% compared to the Hamming-distance model. The total average errors (compared to PowerMill) achieved over a wide range of test modules and input stimuli are less than 4.6%. This is comparable to complex models, which, however, have to make use of several more signal properties.
Categories and Subject Descriptors
I.6.5 [Simulation and Modeling]: Model Development - modeling methodologies.
General Terms
Design, Experimentation, Verification.
Keywords
Power estimation, power modeling, RTL macromodels, low power.
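
The Hamming-distance model used as the baseline above can be sketched in a few lines: power per input transition is taken to be proportional to the number of input bits that toggle. The code below shows that baseline only, not the macromodels proposed in the paper; the input words, the reference power values, and the function names are hypothetical.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two non-negative integers differ."""
    return bin(a ^ b).count("1")


def input_activity(words):
    """Hamming distances between consecutive input words."""
    return [hamming_distance(a, b) for a, b in zip(words, words[1:])]


def fit_coefficient(distances, powers):
    """Least-squares fit of power = k * distance (a line through the origin)."""
    num = sum(d * p for d, p in zip(distances, powers))
    den = sum(d * d for d in distances)
    return num / den


# Hypothetical 4-bit input stream and reference power per transition.
words = [0b0000, 0b1111, 0b1010, 0b1011]
dist = input_activity(words)
powers = [8.0, 4.0, 2.0]
k = fit_coefficient(dist, powers)
estimate = [k * d for d in dist]
print(dist, k, estimate)
```

Because this baseline sees only the count of toggling bits, it cannot tell which bits toggled, which is one reason models using additional switching information can lower the estimation error.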