ISLPED 2001 Abstracts



Keynote Speech

Session Chair: Vivek De (Intel)
Wireless Beyond the Third Generation: Facing the Energy Challenge [p. 1]
Jan Rabaey (University of California, Berkeley)

After a stellar growth over the last decade driven by voice as the killer app, wireless communications is now rapidly moving into a new era propelled by data networking. For a wide host of devices, wireless will serve as the "last interconnection hop" to the high datarate wired networks. The basic trends in these devices can be best summarized under the following two headers: "ubiquity" and "more bits/sec". Both of these have some important ramifications on energy dissipation. In this paper and accompanying presentation, we will outline the predominant trends in wireless, analyze the energy challenge of those, and examine a number of emerging solutions.
GENERAL TERMS - Design
KEYWORDS - Wireless, communications, energy.


Session 1: Energy Reduction in Processor Pipelines

Session Chair: Pradip Bose (IBM)
Session Organizer: Steve Kosonocky (IBM)
Micro-Operation Cache: A Power Aware Frontend for Variable Instruction Length ISA [p. 4]
Baruch Solomon, Avi Mendelson, Doron Orenstien, Yoav Almog, Ronny Ronen (Intel Corporation)

We introduce the Micro-Operation Cache (Uop Cache - UC), designed to reduce the processor's frontend power and energy consumption without performance degradation. The UC caches basic blocks of instructions - pre-decoded into micro-operations (uops). The UC fetches a single basic-block worth of uops per cycle. Fetching complete pre-decoded basic-blocks eliminates the need to repeatedly decode variable length instructions and simplifies the process of predicting, fetching, rotating and aligning fetched instructions. The UC design enables even a small structure to be quite effective. Results: a moderate-sized UC eliminates about 75% of instruction decodes across a broad range of benchmarks and over 90% in multimedia applications and high-power tests. For existing Intel P6 family processors, the eliminated work may save about 10% of the full-chip power consumption with no performance degradation.
General Terms: Performance, Design
Keywords: instruction fetch, instruction cache, microoperation cache, power reduction.

L1 Data Cache Decomposition for Energy Efficiency [p. 10]
Michael Huang, Jose Renau, Seung-Moon Yoo, Josep Torrellas (University of Illinois at Urbana-Champaign)

The L1 data cache is a time-critical module and, at the same time, a major consumer of energy. To reduce its energy-delay product, we apply two principles of low-power design: specialize part of the cache structure and break the cache down into smaller caches. To this end, we propose a new L1 data cache structure that combines a Specialized Stack Cache (SSC) and a Pseudo Set-Associative Cache (PSAC). Individually, our SSC and PSAC designs have a lower energy-delay product than previously-proposed related designs. In addition, their combined operation is very effective. Relative to a conventional 2-way 32 KB data cache, a design containing a 4-way 32 KB PSAC and a 512 B SSC reduces the energy-delay product of several applications by an average of 44%.

Instruction Flow-Based Front-end Throttling for Power-Aware High-Performance Processors [p. 16]
Amirali Baniasadi (Northwestern University), Andreas Moshovos (University of Toronto)

We present a number of power-aware instruction front-end (fetch/decode) throttling methods for high-performance dynamically-scheduled superscalar processors. Our methods reduce power dissipation by selectively turning on and off instruction fetch and decode. Moreover, they have a negligible impact on performance as they deliver instructions just in time for exploiting the available parallelism. Previously proposed front-end throttling methods rely on branch prediction confidence estimation. We introduce a new class of methods that exploit information about instruction flow (rate of instructions passing through stages). We show that our methods can boost power savings over previously proposed methods. In particular, for an 8-way processor a combined method reduces traffic by 14%, 20%, 6% and 6% for the fetch, decode, issue and complete stages respectively while performance remains mostly unaffected. The best previously proposed method reduces traffic by 10%, 15%, 4% and 4% respectively.

Energy Reduction in Queues and Stacks by Adaptive Bitwidth Compression [p. 22]
Vasily G. Moshnyaga (Fukuoka University)

A new micro-architectural technique to reduce the energy dissipated by queues and stacks is proposed. Like related research that targets the transition activity in bit-lines, the technique is based on bitwidth compression. Unlike that work, however, it utilizes the fixed accessing order embodied in queues and stacks to exploit input data correlation. The technique dynamically adjusts the required bitwidth to the number of bits that changed in comparison to the last access. It is not restricted to specific bit patterns such as zero bytes or the precharging value, and it works efficiently on reads and writes without large area, timing or power overhead. Simulations show that this technique can reduce the energy of an instruction queue by up to 30% and the energy of a video data queue by 20%.
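
As a rough illustration of the idea in this abstract, the sketch below stores each queue entry as its XOR difference from the previous entry and counts only the bits needed to hold that difference. The function names, the 32-bit word width, and the rule "width = bit length of the XOR difference" are assumptions made for illustration; the paper's actual hardware scheme may differ.

```python
def compress(words, width=32):
    """Encode each word as (nbits, delta) relative to the previous word.

    delta is the XOR with the previous word; nbits is the number of
    low-order bits needed to represent delta (an illustrative rule).
    """
    mask = (1 << width) - 1
    prev = 0
    pairs = []
    for w in words:
        delta = (w ^ prev) & mask
        pairs.append((delta.bit_length(), delta))
        prev = w
    return pairs


def decompress(pairs):
    """Rebuild the original words by replaying the XOR differences in order."""
    prev = 0
    words = []
    for _nbits, delta in pairs:
        prev ^= delta
        words.append(prev)
    return words


def bits_written(pairs):
    """Total number of data bits that would be written under this rule."""
    return sum(nbits for nbits, _delta in pairs)
```

For a correlated sequence such as consecutive addresses, most entries need only a few bits, which is the property a fixed access order makes exploitable.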


Session 2: Voltage and Instruction Scheduling

Session Chair: TBD
Session Organizer: Luca Benini (University of Bologna)
Energy Priority Scheduling for Variable Voltage Processors [p. 28]
Johan Pouwelse, Koen Langendoen, Henk Sips (Delft University of Technology)

Clock (and voltage) scheduling is an important technique to reduce energy consumption of variable-voltage processors. It is difficult, however, to achieve good results at the OS and hardware level when applications show bursty behavior. We take the approach that such applications must be made power aware and specify their future demands to a central scheduler controlling the clock speed and processor voltage. This paper describes our energy priority scheduling (EPS) heuristic that orders tasks according to how tight their deadlines are and how often tasks overlap. We schedule low-priority tasks first, since they can be easily preempted to accommodate high-priority tasks later. The EPS heuristic does not always yield the optimal schedule, but has low complexity and can be used as an incremental on-line algorithm. We implemented EPS on a StrongARM-based variable-voltage platform. Measurements show that EPS reduces energy consumption by 50% for a bursty video decoding application without missing any frame deadlines.

Dynamic Voltage Scheduling Technique for Low-Power Multimedia Applications Using Buffers [p. 34]
Chaeseok Im, Huiseok Kim, Soonhoi Ha (Seoul National University)

As multimedia applications are used increasingly in many embedded systems, power efficient design for the applications becomes more important than ever. This paper proposes a simple dynamic voltage scheduling technique, which suits the multimedia applications well. The proposed technique fully utilizes the idle intervals with buffers in a variable speed processor. The main theme of this paper is to determine the minimum buffer size to achieve the maximum energy saving in three cases: single-task, multiple subtasks, and multi-task. Experimental results show that the proposed technique is expected to obtain significant power reduction for several real-world multimedia applications.

Power-Aware Modulo Scheduling for High-Performance VLIW Processors [p. 40]
Han-Saem Yun, Jihong Kim (Seoul National University)

For high-performance processors, the step power and peak power, which are closely related to the chip reliability, are important design constraints, often more than the average power. In VLIW processors where a single instruction may contain a variable number of operations, the step power and peak power vary significantly depending on the parallel schedule generated by a parallelizing compiler. In this paper, we propose a power-aware modulo scheduling algorithm for high-performance VLIW processors. The proposed algorithm reduces both the step power and peak power by producing a more balanced parallel schedule while not compromising performance. Experimental results show that the proposed scheduling technique significantly improves the power characteristics of high-performance processors over an existing power-unaware modulo scheduling technique.

Hard Real-Time Scheduling for Low-Energy Using Stochastic Data and DVS Processors [p. 46]
Flavius Gruian (Lund University)

The work presented in this paper addresses scheduling for reduced energy of hard real-time tasks with fixed priorities assigned in a rate monotonic or deadline monotonic manner. The approach we describe can be exclusively implemented in the RTOS. It targets energy consumption reduction by using both on-line and off-line decisions, taken both at task level and at task-set level. We consider sets of independent tasks running on processors with dynamic voltage supplies (DVS). Taking into account the real behavior of a real-time system, which is often better than the worst case, our methods employ stochastic data to derive energy efficient schedules. The experimental results show that our approach achieves larger energy reductions than other policies from the same class.
Keywords
Low-energy, hard real-time, RTOS, scheduling


Poster Session 1

Session Chair: TBD
Analysis and Design of Low-Energy Flip-Flops [p. 52]
Dejan Marković, Borivoje Nikolić, Robert W. Brodersen (University of California, Berkeley)

This paper develops a methodology for selecting and optimizing flip-flops for low-energy systems with constant throughput. Characterization metrics relevant to low-energy systems are discussed, providing insight into timing and energy parameters at both the circuit and system levels. Transistor sizes are optimized for minimal delay under constrained energy consumption. This methodology is applied to characterization of various flip-flop styles and their comparison in 0.25µm CMOS technology under scaled supply voltages. A transmission-gate master-slave latch pair has the largest internal race margin and the lowest energy consumption, and its energy-delay product is comparable to that of much faster pulse-triggered latches.
Keywords
VLSI, Digital CMOS, flip-flops, low-power design, low-voltage.

Analysis of Clocked Timing Elements for Dynamic Voltage Scaling Effects over Process Parameter Variation [p. 56]
Hoang Q. Dao (University of California, Davis), Kevin Nowka (IBM Austin Research Lab), Vojin G. Oklobdzija (University of California, Davis)

In power-constrained systems, the power efficiency of latches and flip-flops is pivotal. Characteristics of three selected latches and FFs were analyzed for their behavior under voltage scaling and different process corners in a 0.18µm CMOS technology. The relative performance amongst the latches/FFs was consistent across the different supply voltages. At low voltage, the power-delay product was degraded by about 25%. The energy-delay product approximately doubled at low voltage for all latches/FFs over all process corners. This result was smaller in comparison to the ideal voltage scaling characteristics mainly because the effects of velocity saturation were less severe at low voltage. All three designs suffered more due to process variation under low-voltage conditions.
Categories and Subject Descriptors
Digital circuit: clocked-timing elements
General Terms
Measurement, Performance, Reliability
Keywords
Clocked timing elements, voltage scaling, process variation

A Low-Power Motion Estimation Block for Low Bit-Rate Wireless Video [p. 60]
R. Steven Richmond, Dong Sam Ha (Virginia Tech)

This paper presents a low-power design of a motion estimation block targeting the low-bit-rate video codec H.263. The block is based on the Four-Step Search algorithm. The proposed design offers up to 38% power reduction for logic blocks alone over a "baseline" implementation of the Four-Step Search (4SS) algorithm and up to 58% power reduction over a baseline model of the Three-Step Search (TSS) algorithm. In addition, our design reduces power dissipation of an on-chip memory by up to 32% over the 4SS and 27% over the TSS.

Power-aware Partitioned Cache Architectures [p. 64]
S. Kim, N. Vijaykrishnan, M. Kandemir, A. Sivasubramaniam, M. J. Irwin, E. Geethanjali (Pennsylvania State University)

This paper focuses on partitioning the cache resources architecturally for energy and energy-delay optimizations. Specifically, we investigate ways of splitting the cache into several smaller units, each of which is a cache by itself (called subcache). Subcache architectures not only reduce the per-access energy costs but can potentially improve the locality behavior as well. We present a unified framework for designing, implementing and evaluating different subcache architectures. Different techniques for data placement, subcache prediction, and selective probing are proposed and evaluated using a diverse set of applications. The results show that intelligent subcache mechanisms proposed in this paper are effective.

A Low-Leakage Dynamic Multi-Ported Register File in 0.13µm CMOS [p. 68]
Atila Alvandpour, Ram Krishnamurthy, K. Soumyanath, Shekhar Borkar (Intel Corporation)

Increasing leakage currents combined with reduced noise margins are seriously degrading the robustness of dynamic circuits. This paper describes a dynamic implementation of a 256X32b 4-read/write-port Register-File for ~6GHz operation at 1.2V in a 0.13µm technology. The pre-charged local bit-lines utilize an efficient conditional keeper-technique, where a large fraction of the keeper is turned ON only if the dynamic output remains High in the evaluation phase. Using this technique, we are able to improve upon all-low-Vt performance by 4%, while maintaining Dual-Vt usage. Thus, the robustness is improved by 96% and the active leakage power is reduced by 5X.

Energy-Efficient Load and Store Reuse [p. 72]
Jun Yang, Rajiv Gupta (The University of Arizona)

A load and store reuse mechanism can be used for filtering memory references to reduce memory activity including on-chip cache activity. The challenging aspect of this task is to ensure that energy savings achieved in memory are not offset by energy used by the reuse hardware. In this paper we present the design of a reuse mechanism which has been carefully tuned to achieve net energy savings. In contrast to traditional filter cache designs, which trade off energy reductions against higher execution times, our approach reduces both energy and execution time.


Poster Session 2

Session Chair: TBD
Compiler Support for Block Buffering [p. 76]
Mahmut Kandemir (The Pennsylvania State University), J. Ramanujam (Louisiana State University), Uger Sezer (University of Wisconsin-Madison)

On-chip caches consume a significant fraction of energy in current microprocessors. Hence, hardware techniques such as block buffering have been developed and shown to be effective in reducing on-chip cache energy consumption. We are not aware of any software solutions to exploit block buffering. This paper presents a compiler-based approach that modifies both code and variable layout to effectively exploit block buffering, and is aimed at the class of embedded codes that make heavy use of scalar variables. Unlike previous work that uses only storage pattern optimization, our solution integrates both code restructuring and storage pattern optimization. Experimental results on a set of complete programs demonstrate that our solution leads to significant energy savings.

Automatic Source Code Specialization for Energy Reduction [p. 80]
Eui-Young Chung (Stanford University), Luca Benini (University of Bologna), Giovanni De Micheli (Stanford University)

This paper presents a framework to reduce the computational effort of software programs, using value profiling and partial evaluation. Our tool reduces computational effort by specializing a program for highly expected situations, and such a reduction translates into both energy and performance improvement. Procedure calls executed frequently with the same parameter values are defined as highly expected situations (common cases). The choice of the best transformation of common cases is achieved by solving three search problems. The first identifies effective common cases to be specialized, the second searches for an optimal solution for each effective common case, and the third examines the interplay among the specialized cases. Our technique improves both energy consumption and performance of the source code by up to more than a factor of two, and on average by about 25%, over the original program. Also, our pruning techniques reduce the searching time by 80% compared to an exhaustive approach.

FV Encoding for Low-Power Data I/O [p. 84]
Jun Yang, Rajiv Gupta (The University of Arizona)

The power consumed by I/O pins of a CPU is significant due to high capacitances associated with the pins. While highly effective techniques for reducing address bus switching exist [1], similarly effective techniques for the data bus have not been developed. We have discovered a characteristic of values transmitted over the data bus according to which a small number of distinct values, called frequent values, account for 58-68% of transmissions over the external data bus. To exploit this characteristic we have developed a method for dynamic identification of frequent values and their use in encoding data values using the FV (frequent value) encoding scheme. Our experiments show that FV encoding of 32 frequent values yields an average reduction of 42.7% (with on-chip data cache) and 67.63% (without on-chip data cache) in data bus switching activity for SPEC95 benchmarks.
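
The following minimal sketch shows why replacing frequent values with low-weight codes can cut bus switching. It assumes a fixed (not dynamically identified) frequent-value table, a one-hot code per table entry, and one extra flag line that marks coded words; these details and all function names are illustrative assumptions, not the authors' exact scheme.

```python
def hamming(a, b):
    """Number of bus lines that toggle between two consecutive words."""
    return bin(a ^ b).count("1")


def switching(words):
    """Total toggles over a sequence of bus words."""
    return sum(hamming(a, b) for a, b in zip(words, words[1:]))


def fv_encode(values, frequent):
    """Replace each frequent value by a one-hot code; pass others through.

    Returns (flag, word) pairs, where flag=1 marks a coded word.
    """
    code = {v: 1 << i for i, v in enumerate(frequent)}
    return [(1, code[v]) if v in code else (0, v) for v in values]


def fv_decode(encoded, frequent):
    """Invert fv_encode using the same frequent-value table."""
    return [frequent[word.bit_length() - 1] if flag else word
            for flag, word in encoded]


def bus_words(encoded, width=32):
    """Place the flag on an extra line above the data lines for toggle counting."""
    return [(flag << width) | word for flag, word in encoded]
```

With a table such as `[0x00000000, 0xFFFFFFFF]`, a stream that alternates between the two values toggles all 32 data lines on every transfer when sent raw, but only two lines per transfer when sent as one-hot codes.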

Time-to-Failure Estimation for Batteries in Portable Electronic Systems [p. 88]
Daler Rakhmatov, Sarma B. K. Vrudhula (University of Arizona)

Nonlinearity of the energy source behavior in portable systems needs to be modeled in order for the system software to make energy-conscious decisions. We describe an analytical battery model for predicting the battery time-to-failure under variable discharge conditions. Our model can be used to estimate the impact of various system load profiles on the energy source lifetime. The quality of our model is evaluated based on the simulation of a lithium-ion battery.

Architecture Strategies for Energy-Efficient Packet Forwarding in Wireless Sensor Networks [p. 92]
Vlasios Tsiatsis, Scott A. Zimbeck, Mani B. Srivastava (University of California, Los Angeles)

The energy-efficient communication among wireless sensor nodes determines the lifetime of a sensor network and exhibits patterns highly dependent on the sensor application and networking software. This software is responsible for processing the sensor data and disseminating the data to other nodes or a central repository. In this paper we propose a node architecture that takes advantage of both the intelligence of the radio hardware and the needs of applications to efficiently handle the packet forwarding. It exploits principles widely used in modern firewall network architectures and, as our analysis shows, achieves considerable energy savings.
Keywords
Energy-efficient packet forwarding, sensor networks

Modulation Scaling for Energy Aware Communication Systems [p. 96]
Curt Schurgers, Olivier Aberthorne, Mani B. Srivastava (University of California, Los Angeles)

In systems that require low energy consumption, voltage scaling is an invaluable circuit technique. It also offers energy awareness, trading off energy and performance. In wireless handheld devices, the communication portion of the system is a major power hog. We introduce a new technique, called modulation scaling, which exhibits benefits similar to those of voltage scaling. It allows us to trade off energy against transmission delay and as such introduces the notion of energy awareness in communications. Throughout our discussion, we emphasize the analogy with voltage scaling. As an example application, we present an energy aware wireless packet scheduling system.
Keywords
energy awareness, adaptive modulation, scaling


Invited Talk 1

Session Chair: Murli Tirumala (Intel)
Cooling and Power Considerations for Semiconductors into the Next Century [p. 100]
Christian Belady (Hewlett Packard)

With the insatiable desire for higher computer or switch performance comes the undesirable side effect of higher power, especially with the pervasiveness of CMOS technology. As a result, cooling and power delivery have become integral in the design of electronics. Figure 1 shows the National/International Technology Roadmap for Semiconductors' projection for processor chip power. Note that between 2000 and 2005 the total power of the chip is expected to increase 60%, which will put additional emphasis on the power and cooling systems of our electronics. Further inspection of this figure also shows that the heat flux will more than double during this period. The increases in power and heat flux are driven by two factors, higher frequency and reduced feature sizes.


Session 3: Low Power RF Circuits and Systems

Session Chair: Frank Chang (UCLA)
Session Organizer: Satyen Mukherjee (Philips)
Energy Efficient Modulation and MAC for Asymmetric RF Microsensor Systems [p. 106]
Andrew Y. Wang, SeongHwan Cho, Charles G. Sodini, Anantha P. Chandrakasan (Massachusetts Institute of Technology)

Wireless microsensor systems are used in a variety of civil and military applications. Such microsensors are required to operate for years from a small energy source. To minimize the energy dissipation of the sensor node, RF front-end circuitry must be designed based on system level optimization of the entire network. This paper presents several energy minimization techniques derived from the unique properties of a practical short range asymmetric microsensor system. These include energy efficient modulation schemes, appropriate multiple access protocols, and a fast turn-on transmitter architecture.

A 1 V, 1.9 GHz Mixer Using a Lateral Bipolar Transistor in CMOS [p. 112]
Song Ye (University of Toronto), Koji Yano (Yamanashi University), C. Andre T. Salama (University of Toronto)

This paper describes a low power mixer implemented in a standard 0.25 µm CMOS process. The mixer uses lateral bipolar transistors in CMOS to form the core of the circuit. No additional processing steps are needed to obtain the BJT when the MOSFET is properly designed. The mixer exhibits 6.5 dB gain, operating at 1.9 GHz from a 1 V supply and a power dissipation of 1.3 mW. Such a mixer is a likely candidate for low power portable wireless applications.
Categories and Subject Descriptors
1.3 [Analog, MEMS and Mixed Signal Electronics]: RF circuits, Wireless systems, MEMS circuits, AD/DA Converters, Mixed-signal circuits, DC-DC conversion.
General Terms
Measurement, Design, Experimentation.
Keywords
RF, CMOS, mixer, lateral bipolar transistor, low power.

A 60dB, 246MHz CMOS Variable Gain Amplifier for Subsampling GSM Receivers [p. 117]
Mohamed A. I. Mostafa (Texas A&M University), Sherif H. K. Embabi (Texas Instruments Inc.), Mostafa A. I. Elmala (Texas A&M University)

This VGA is designed for a GSM subsampling receiver. It operates at an IF frequency of 246MHz. The VGA provides a 60dB digitally controlled gain range in 2dB steps. The VGA is implemented in a 0.35µm CMOS process. The current is 9mA@3V. The overall gain accuracy is less than 0.3dB. The noise figure at maximum gain is 8.7dB. The IIP3 is -4dBm at minimum gain.
Categories and Subject Descriptors
1.3 [Analog, MEMS and Mixed Signal Electronics]: RF circuits, Wireless systems, and mixed-signal circuits.
General Terms
Performance, Design, Experimentation, Standardization.
Keywords
VGA, CMOS, subsampling, GSM, IF, receiver.


Session 4: Modeling and Estimation Techniques

Session Chair: Wolfgang Nebel (Univ. Oldenburg)
Session Organizer: Radu Marculescu (Carnegie Mellon University)
VTCMOS Characteristics and Its Optimum Conditions Predicted by a Compact Analytical Model [p. 123]
Hyunsik Im, T. Inukai, H. Gomyo, T. Hiramoto, T. Sakurai (University of Tokyo)

A very compact analytical model of variable threshold voltage CMOS (VTCMOS) is proposed to study the active on-current, linking it with the stand-by off-current characteristics. Comparisons of modeled results to numerical simulations and experimental data are made with an excellent agreement. It is clearly demonstrated using the model that speed degradation due to low supply voltage can be compensated by the VTCMOS scheme with even smaller power. Influence of the short channel effect (SCE) on the performance of VTCMOS is investigated in terms of a new parameter, dS/dγ, both qualitatively and quantitatively. It is found that the SCE degrades the VTCMOS performance. Issues on the optimum conditions of VTCMOS are discussed.
Keywords
Body Effect, Variable threshold voltage CMOS (VTCMOS), Substrate bias, Low power, and Analytical model

Memory Controller Policies for DRAM Power Management [p. 129]
Xiaobo Fan, Carla S. Ellis, Alvin R. Lebeck (Duke University)

The increasing importance of energy efficiency has produced a multitude of hardware devices with various power management features. This paper investigates memory controller policies for manipulating DRAM power states in cache-based systems. We develop an analytic model that approximates the idle time of DRAM chips using an exponential distribution, and validate our model against trace-driven simulations. Our results show that, for our benchmarks, the simple policy of immediately transitioning a DRAM chip to a lower power state when it becomes idle is superior to more sophisticated policies that try to predict DRAM chip idle time.
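
One way to see why an immediate power-down policy can win under an exponential idle-time model is the closed-form sketch below. It is an illustrative calculation under stated assumptions, not the authors' model: idle time is exponential with mean `mean_idle`, the chip waits `tau` before dropping to the low-power state, and a fixed resynchronization energy `e_resync` is paid whenever it does. All parameter names and values are hypothetical, and wake-up latency is ignored.

```python
import math


def expected_idle_energy(tau, mean_idle, p_active, p_low, e_resync):
    """Expected energy per idle period for a 'wait tau, then power down' policy.

    Uses E[min(T, tau)] = m * (1 - x), E[max(T - tau, 0)] = m * x and
    P(T > tau) = x, where T is exponential with mean m and x = exp(-tau / m).
    """
    x = math.exp(-tau / mean_idle)
    # Energy spent idling in the active state before the threshold expires.
    waiting = p_active * mean_idle * (1.0 - x)
    # Low-power energy plus wake-up cost, weighted by the chance of powering down.
    powered_down = (p_low * mean_idle + e_resync) * x
    return waiting + powered_down
```

The result is linear in `x`, so the minimum lies at one of the extremes: power down immediately (`tau = 0`) or never. Under this simplified model, waiting for a threshold never helps, because an exponential idle period carries no memory of how long it has already lasted.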

Run-Time Power Estimation in High Performance Microprocessors [p. 135]
Russ Joseph, Margaret Martonosi (Princeton University)

Power concerns are becoming increasingly pressing in high-performance processors. Building power-aware and even power-adaptive computer architectures requires being able to track power consumption and attribute energy consumption to the portions of the chip that are responsible for it. This paper presents the Castle project which aims to deduce the actual runtime power dissipated by different processor units on the CPU chip by leveraging existing hardware. Namely, we examine the use of hardware performance counters as proxies for power meters. We discuss which performance counters count power-relevant events, and how to estimate event counts for power-relevant events not well supported by current, commonly available performance counters. We also discuss sampling-based approaches for estimating signal transition activity within the processor. Overall, we find that these performance counters can be quite useful in providing good power apportionment estimates for programs as they run.

Fast, Flexible, Cycle-Accurate Energy Estimation [p. 141]
Phillip Stanley-Marbell, Michael S. Hsiao (Rutgers University)

Designing energy efficient hardware and software systems demands different tools at various levels in the design hierarchy. There is however a dearth of tools to enable investigation and implementation of energy efficient software and hardware architectures. Presented is a fast, flexible, cycle-accurate architectural simulator, Myrmigki, that models a commercial microcontroller and microprocessor family, and enables cycle-accurate power dissipation analyses through a combination of instruction level power analysis and circuit activity estimation. Myrmigki is intended to be used to study the effect of microarchitectural features on the energy efficiency of hardware and software systems. It provides facilities for dynamic voltage scaling, clock speed setting and per-cycle architecture reconfiguration, and is easily extended to add new microarchitectural features and model new instruction set architectures. The simulator provides over an order of magnitude speedup over a contemporary state-of-the-art power estimating simulator, while providing estimates within 10% of measurements from prototype hardware that it models.


Session 5: Low Power Digital Circuits

Session Chair: Borivoje Nikolic (University of California, Berkeley)
Session Organizer: Tadahiro Kuroda (Keio University)
Comparative Delay and Energy of Single Edge-Triggered & Dual Edge-Triggered Pulsed Flip-Flops for High-Performance Microprocessors [p. 147]
James Tschanz, Siva Narendra, Zhanping Chen, Shekhar Borkar, Vivek De (Intel Corporation), Manoj Sachdev (University of Waterloo)

Flip-flops and latches are crucial elements of a design from both a delay and energy standpoint. We compare several styles of single edge-triggered flip-flops, including semidynamic and static with both implicit and explicit pulse generation. We present an implicit-pulsed, semidynamic flip-flop (ip-DCO) which has the fastest delay of any flip-flop considered, along with a large amount of negative setup time. However, an explicit-pulsed static flip-flop (ep-SFF) is the most energy-efficient and is ideal for the majority of critical paths in the design. In order to further reduce the power consumption, dual edge-triggered flip-flops are evaluated. It is shown that classic dual edge-triggered designs suffer from a large area penalty and reduced performance, prohibiting their use in critical paths. A new explicit-pulsed dual edge-triggered flip-flop is presented which provides the same performance as the single edge-triggered version with significantly less energy consumption in the flip-flop as well as in the clock distribution network.
Keywords
Flip-flops, latches, clocking, dual edge-triggered, low power.

Theory and Practical Implementation of Harmonic Resonant Rail Driver [p. 153]
Joong-Seok Moon, Peter A. Beerel (University of Southern California), William C. Athas (Apple Computer)

This paper presents a new algorithm for designing efficient harmonic resonant rail drivers. The circuit solution is coupled to a standard pulse source and uses only discrete passive components. It can thus be externally tuned to minimize the consumed power in the target IC. A new efficient algorithm based on current-fed pulse-forming network theory is proposed to find the value of each discrete component for a target frequency and a given load capacitance. The proposed driver topology can be used to generate any desired periodic 50% duty-cycle waveform by superimposing multiple harmonics of the desired waveform; however, this paper focuses on the generation of square-wave clock signals. We have tested the driver with a capacitive load between 38.3pF and 97.8pF. The overall dissipation for our second-order harmonic rail driver is 19% of fCV² at 15MHz and 97.8pF load.
Keywords
Harmonic-resonant rail driver, energy-recovery circuit, pulse-forming network, clock generation.

A Resonant Clock Generator for Single-Phase Adiabatic Systems [p. 159]
Conrad H. Ziesler, Marios C. Papaefthymiou (University of Michigan), Suhwan Kim (IBM T.J. Watson Research Center)

Recently discovered high-speed single-phase adiabatic logic families require efficient sinusoidal power-clock generators. In this paper we propose a low-power resonant clock-generator built around a zero-voltage switching push-pull power conversion topology. We describe a novel energy-efficient control circuit for this power converter, based on an asynchronous CMOS state machine. We also describe an integrated sub-micron CMOS implementation of our power converter and control circuits. Simulation results show efficiencies in excess of 90%, even under suboptimal tuning conditions, for frequencies over 200MHz. We have fabricated our clock generator in a 0.5µm standard CMOS process. Using an external surface-mount inductor as the resonant element, we have verified the correct operation of the clock generator when driving a single-phase adiabatic 8-bit multiplier.
Categories and Subject Descriptors
B.0 [Hardware]: General Keywords
Adiabatic logic, Clock generator, CMOS, Low energy, Resonant, Single phase, VLSI, Dynamic circuitry, SCAL, SCAL-D, TSEL.

Enhanced Multi-Threshold (MTCMOS) Circuits Using Variable Well Bias [p. 165]
Stephen V. Kosonocky, Mike Immediato (IBM T.J. Watson Research Center), Peter Cottrell, Terence Hook, Randy Mann, Jeff Brown (IBM Microelectronics)

Advanced CMOS technology can enable high levels of performance with reduced active power at the expense of increased standby leakage. MTCMOS has previously been described as a method of reducing leakage in standby modes, by addition of a power supply interrupt switch. Enhancements using variable well bias and layout techniques are described and demonstrate increased performance and reduced leakage over conventional MTCMOS circuits.
Keywords
MTCMOS, multi-threshold, variable well bias, leakage control, low power digital circuit design.


Session 6: Bus Encoding

Session Chair: Masahiro Asada (Univ. of Tokyo)
Session Organizer: Renu Mehra (Synopsys)
Encodings for High-Performance Energy-Efficient Signaling [p. 170]
Alessandro Bogliolo (University of Ferrara)

Energy efficiency, performance and signal integrity are conflicting critical requirements for on-chip signaling. We propose a code-based solution that improves bit rate while reducing communication energy and preserving noise margins. Our technique is based on the observation that RC lines can be used at twice their limiting bit rate to transmit bit streams with no isolated bits. We propose new encodings (called minimum run-length guaranteed codes, MRLG) that eliminate isolated bits, thus enabling double-bit-rate signaling. We show that our encodings can be combined with any low-power code to achieve both energy reduction and performance improvement.
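The property these codes guarantee, a bit stream with no isolated bits, can be checked mechanically. The following is a minimal sketch, not the MRLG encoder itself; the function names are hypothetical, and treating the first and last bits as never isolated is an assumption of this illustration.

```python
def isolated_bit_positions(bits):
    """Return indices of bits that differ from both of their neighbours.

    Assumption of this sketch: the stream boundaries are unknown, so the
    first and last bits are never reported as isolated.
    """
    return [
        i
        for i in range(1, len(bits) - 1)
        if bits[i] != bits[i - 1] and bits[i] != bits[i + 1]
    ]


def has_min_run_length_two(bits):
    """True when the stream contains no isolated bits (every inner run >= 2)."""
    return not isolated_bit_positions(bits)


if __name__ == "__main__":
    print(isolated_bit_positions([0, 0, 1, 0, 0]))     # the lone 1 is isolated
    print(has_min_run_length_two([0, 0, 1, 1, 0, 0]))  # runs of two or more
```

A stream that passes this check is the kind of input the abstract says an RC line can carry at twice its limiting bit rate.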

Low-Energy Encoding for Deep-Submicron Address Buses [p. 176]
Luca Macchiarulo, Enrico Macii, Massimo Poncino (Politecnico di Torino)

In this paper, we introduce a new encoding scheme that explicitly targets the minimization of the bus energy due to the crosstalk capacitances between adjacent bus lines. The key transformation operated by the code consists of a permutation of the bus lines, implemented directly during physical design; as a desirable consequence, no additional encoding/decoding logic is required at the bus boundaries, thus implying that no latency penalty is introduced on the processor-memory path. An additional feature of the permutation-based code is that the encoding function can be determined without any knowledge of the binary stream being transmitted. Therefore, the code can be effectively exploited in general-purpose computing systems. The proposed code works best on address buses; savings obtained for different address traces generated by two different processors are in the order of 26% with respect to the unencoded streams.
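To illustrate why a fixed permutation of bus lines can change crosstalk energy, the sketch below scores a physical line ordering against an address trace. The cost model is an assumption of this example rather than the paper's: each pair of physically adjacent lines contributes the squared difference of their transitions (+1 rising, -1 falling, 0 idle). The function name and the trace are hypothetical.

```python
def coupling_cost(words, order):
    """Sum a simplified coupling cost over consecutive bus words.

    words: sequence of integers placed on the bus, one per cycle.
    order: order[p] is the logical bit index routed at physical position p.
    Assumed model: adjacent physical lines cost (delta_a - delta_b) ** 2,
    where delta is +1 for a rising line, -1 for a falling one, 0 otherwise.
    """
    cost = 0
    for prev, cur in zip(words, words[1:]):
        delta = [((cur >> b) & 1) - ((prev >> b) & 1) for b in order]
        cost += sum((a - b) ** 2 for a, b in zip(delta, delta[1:]))
    return cost


if __name__ == "__main__":
    trace = [0b001, 0b010]  # bit 0 falls while bit 1 rises
    print(coupling_cost(trace, [0, 1, 2]))  # toggling lines are neighbours
    print(coupling_cost(trace, [0, 2, 1]))  # an idle line separates them
```

Because the ordering is fixed at layout time, a score like this needs no encoder or decoder at run time, which matches the abstract's point about avoiding latency on the processor-memory path.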

Irredundant Address Bus Encoding for Low Power [p. 182]
Yazdan Aghaghiri, Massoud Pedram (University of Southern California), Farzan Fallah (Fujitsu Laboratories of America)

This paper proposes efficient encoding techniques for decreasing power dissipation on global buses. The best target for these techniques is a wide and highly capacitive memory bus. Building on T0 and Offset-Xor encoding techniques, we present three irredundant bus-encoding techniques. Our methods decrease switching activity up to 83% without the need for redundant bus lines. The power dissipation of encoder and decoder circuitry has also been calculated and shown to be small in comparison with the power savings on the memory address bus itself.

Low Power Address Encoding using Self-Organizing Lists [p. 188]
Mahesh Mamidipaka, Dan Hirschberg, Nikil Dutt (University of California, Irvine)

Off-chip bus transitions are a major source of power dissipation for embedded systems. In this paper, new adaptive encoding schemes are proposed that significantly reduce transition activity on data and multiplexed address buses, that do not add redundancy in space or time and which have minimal delay overhead. These adaptive techniques are based on self-organising lists to achieve reduction in transition activity by exploiting the spatial and temporal locality of the addresses. Unlike previous approaches that focus on instruction address buses, experiments demonstrate significant reduction in transition activity of up to 54% in data address buses and up to 59% in multiplexed address buses. The average reductions are twice those obtained using current schemes on a data address bus and more than twice those obtained on a multiplexed address bus.
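A self-organizing list keeps recently used items near the front, so addresses with temporal locality map to small indices. The sketch below uses the classic move-to-front heuristic as one example of such a list; the abstract does not say which heuristic the paper uses. The hit/miss tag, the table size, and the function names are simplifications of this illustration, and the tag in particular is extra information that the paper's irredundant schemes are stated to avoid.

```python
def mtf_encode(addresses, table_size=4):
    """Encode addresses as list indices on a hit, raw values on a miss.

    Assumption: move-to-front stands in for the self-organizing list.
    """
    table, out = [], []
    for addr in addresses:
        if addr in table:
            idx = table.index(addr)
            out.append(("hit", idx))
            table.pop(idx)
        else:
            out.append(("miss", addr))
            if len(table) == table_size:
                table.pop()  # drop the least recently used entry
        table.insert(0, addr)  # most recent address moves to the front
    return out


def mtf_decode(symbols, table_size=4):
    """Invert mtf_encode by replaying the same list updates."""
    table, out = [], []
    for kind, value in symbols:
        if kind == "hit":
            addr = table.pop(value)
        else:
            addr = value
            if len(table) == table_size:
                table.pop()
        table.insert(0, addr)
        out.append(addr)
    return out


if __name__ == "__main__":
    trace = [0x10, 0x20, 0x10, 0x10]
    print(mtf_encode(trace))
```

Repeated addresses collapse to small indices, which toggle fewer bus lines than the full address would.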


Invited Talk 2

Session Chair: Ingrid Verbauwhede (UCLA)
Wireless Sensor Networks: Application Driver for Low Power Distributed Systems [p. 194]
Deborah Estrin (University of California, Los Angeles)

Wireless sensor networks allow deployment of sensing elements close to the phenomena of interest. Sensing close to the signal generation point should lead to improved SNR in general, and enable detection in otherwise obstructed environments. This fundamental benefit of local sensing, combined with the decreasing cost and increasing availability of low cost microsensors/actuators and processors, suggests that effective systems will exploit densely distributed elements. However, dense sensing capability is only scalable if the elements are networked to support collaborative processing near the sensory inputs. [1] Therefore, in many contexts low-power wireless communication is a critical enabler of these systems because it overcomes the logistical infeasibility of deploying wires in remote, dynamic, and mobile-node contexts.


Session 7: Technology for Low Power

Session Chair: Fari Assaderaghi (SiliconWave)
Session Organizer: Rajiv Joshi (IBM)
Scaling of Stack Effect and its Application for Leakage Reduction [p. 195]
Siva Narendra (Massachusetts Institute of Technology & Intel), Shekhar Borkar, Vivek De, Dimitri Antoniadis (Intel), Anantha Chandrakasan (Massachusetts Institute of Technology)

Technology scaling demands a decrease in both Vdd and Vt to sustain historical delay reduction, while restraining active power dissipation. Scaling of Vt however leads to substantial increase in the sub-threshold leakage power and is expected to become a considerable constituent of the total dissipated power. It has been observed that the stacking of two off devices has smaller leakage current than one off device. In this paper we present a model that predicts the scaling nature of this leakage reduction effect. Device measurements are presented to prove the model's accuracy. Use of stack effect for leakage reduction and other implications of this effect are discussed.

Variable Threshold Voltage CMOS (VTCMOS) in Series Connected Circuits [p. 201]
Takashi Inukai, Toshiro Hiramoto, Takayasu Sakurai (University of Tokyo)

Characteristics of variable threshold voltage CMOS (VTCMOS) in the series connected circuits are investigated by means of device simulation. It is newly found that the performance degradation due to the body effect in series connected circuit is suppressed by utilizing VTCMOS. Lowering the threshold voltage (Vth) enhances the drive current and alleviates the degradation due to the series connected configuration. Therefore, larger body effect factor (γ) results in lower Vth and higher on-current even in the series connected circuits. These characteristics are attributed to the velocity saturation phenomenon which reduces the drain saturation voltage (Vdsat).
Keywords
variable threshold voltage CMOS, series connected circuits, degradation factor, body effect factor, substrate bias, velocity saturation

Effectiveness of Reverse Body Bias for Leakage Control in Scaled Dual Vt CMOS ICs [p. 207]
A. Keshavarzi, S. Ma, S. Narendra, B. Bloechel, K. Mistry, T. Ghani, S. Borkar, V. De (Intel Corporation)

We examine the effectiveness of opportunistic use of reverse body bias (RBB) to reduce leakage power during active operation, burn-in, and standby in 0.18µm single-Vt and 0.13µm dual-Vt logic process technologies. We investigate its dependencies on channel length, target Vt, temperature and technology generation. We show that RBB becomes less effective for leakage reduction at shorter channel lengths and lower Vt at both high and room temperatures, especially when target intrinsic leakage currents are high. RBB effectiveness also diminishes with technology scaling primarily because of worsening short-channel effects (SCE), particularly when target Vt values are low. We present a model that relates different transistor leakage components to full-chip leakage current, and validate the model through testchip measurements across a range of RBB values.

Double-Gate Fully-Depleted SOI Transistors for Low-Power High-Performance Nano-Scale Circuit Design [p. 213]
Rongtian Zhang, Kaushik Roy, David B. Janes (Purdue University)

Double-gate fully-depleted (DGFD) SOI circuits are regarded as the next generation VLSI circuits. This paper investigates the impact of scaling on the demand and challenges of DGFD SOI circuit design for low power and high performance. We study how the added back-gate capacitance affects the circuit power and performance; how to trade off the enhanced short-channel effect immunity with the added back-channel leakage; and how the coupling between the front- and back-gates affects circuit reliability. Our analyses over different technology generations using the MEDICI device simulator show that DGFD SOI circuits have significant advantages in driving high output load. DGFD SOI circuits also show excellent ability in controlling leakage current. However, for low output load, no gain is obtained for DGFD SOI circuits. Also, it is necessary to optimize the back-gate oxide thickness for best leakage control. Moreover, threshold variation may cause reliability problems for thin back-gate oxide DGFD SOI circuits operated at low power supply voltage.


Session 8: Architectural Techniques

Session Chair: Sumit Roy (Cadence)
Session Organizer: M. Poncino (Politecnico di Torino)
A Self-Optimizing Embedded Microprocessor using a Loop Table for Low Power [p. 219]
Frank Vahid, Ann Gordon-Ross (University of California, Riverside)

We describe an approach for a microprocessor to tune itself to its fixed application to reduce power in an embedded system. We define a basic architecture and methodology supporting a microprocessor self-optimizing mode. We also introduce a loop table as a tunable component, although self-optimization can be done for other tunable components too. We highlight experimental results illustrating good power reductions with no performance penalty.
Keywords
System-on-a-chip, self-optimizing architecture, embedded systems, parameterized architectures, cores, low-power, tuning, platforms.

Low Power Pipelining of Linear Systems: A Common Operand Centric Approach [p. 225]
Daehong Kim, Kiyoung Choi (Seoul National University), Dongwan Shin (University of California, Irvine)

In this paper, we propose a systematic pipelining method for a linear system to minimize power and maximize throughput, given a constraint on the number of pipeline stages and a set of resource constraints. The method first retimes operations such that as many operations as possible take common operands as their inputs, and then performs the operand sharing based on the list scheduling. Experimental results show that the proposed approach reduces the power consumption of the functional units by up to more than 20%, compared to the state-of-the-art pipelining and operand sharing techniques.
Keywords
Low power, pipelining, operand sharing, common operand

A System-level Energy Minimization Approach using Datapath Width Optimization [p. 231]
Yun Cao, Hiroto Yasuura (Kyushu University)

This paper presents a novel system-level approach that minimizes the energy consumption of embedded core-based systems through datapath width optimization. It is based on the idea of minimizing energy consumed by redundant bits, which are unused during execution of programs, by means of optimizing the datapath width of processors. To minimize the redundant bits of variables in a given application program, the effective size of each variable is determined by variable size analysis, and the Valen-C language is used to preserve the precision of computation. Analysis results of variables show that on average 39% of the bits are redundant in the C source program of an MPEG-2 video decoder. In our experiments for several embedded applications, the reported energy savings without performance penalty range from about 10.8% to 48.3%.
Keywords
System-level energy minimization, variable size analysis, datapath optimization

Energy-Efficient Instruction Dispatch Buffer Design for Superscalar Processors [p. 237]
Gurhan Kucuk, Kanad Ghose, Dmitry V. Ponomarev (State University of New York), Peter M. Kogge (University of Notre Dame)

The instruction dispatch buffer (DB, also known as an issue queue) used in modern superscalar processors is a considerable source of energy dissipation. We consider design alternatives that result in significant reductions in the power dissipation of the DB (by as much as 60%) through the use of: (a) fast comparators that dissipate energy mainly on a tag match, (b) zero byte encoding of operands to imply the presence of bytes with all zeros and, (c) bitline segmentation. Our results are validated by the execution of SPEC 95 benchmarks on a true hardware level, cycle-by-cycle simulator for a superscalar processor and SPICE measurements for actual layouts of the DB and its variants in a 0.5 micron CMOS process.
Keywords: Low-power superscalar datapath, low power comparator, low power instruction scheduling, bitline segmentation
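Item (b), zero byte encoding, can be illustrated in software: a per-byte flag records which operand bytes are nonzero, so all-zero bytes are implied rather than stored or driven. This is a minimal sketch under assumed parameters (32-bit operands, one flag bit per byte); the function names are hypothetical and the hardware organization in the paper may differ.

```python
def zero_byte_encode(value, width_bytes=4):
    """Split an operand into a nonzero-byte mask and the nonzero bytes.

    Bit i of the mask is set when byte i (least significant first) is
    nonzero; zero bytes are implied by a cleared mask bit.
    """
    mask, payload = 0, []
    for i in range(width_bytes):
        byte = (value >> (8 * i)) & 0xFF
        if byte:
            mask |= 1 << i
            payload.append(byte)
    return mask, payload


def zero_byte_decode(mask, payload, width_bytes=4):
    """Rebuild the operand from the mask and the stored nonzero bytes."""
    value = 0
    stored = iter(payload)
    for i in range(width_bytes):
        if mask & (1 << i):
            value |= next(stored) << (8 * i)
    return value


if __name__ == "__main__":
    mask, payload = zero_byte_encode(0x00001200)
    print(bin(mask), [hex(b) for b in payload])
    print(hex(zero_byte_decode(mask, payload)))
```

Small operands, which are common in integer code, keep only one or two bytes active, which is where the energy saving would come from.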


Poster Session 3

Session Chair: Sudhir Gowda (IBM)
High Density Capacitance Structures in Submicron CMOS for Low Power RF Applications [p. 243]
Tirdad Sowlati, Vickram Vathulya, Domine Leenaerts (Philips Research)

This paper presents four novel interconnect based capacitors with 2 to 3 times the capacitance density of a conventional metal sandwich capacitor and with self-resonant frequencies above 20 GHz, suitable for low power RF applications. Unlike the conventional capacitor, the capacitance density of these structures increases with the scaling of the technology. The structures have been fabricated in both 0.25 µm and 0.18 µm CMOS technologies, measured and an equivalent circuit presented.
Categories and Subject Descriptors
1.3. Analog, MEMS and Mixed Signal Electronics
General Terms
Measurement, Documentation, Experimentation.
Keywords
CMOS, interconnect, RF passives, Bluetooth, HiPerLAN.

A CMOS VCO Architecture Suitable for Sub-1 Volt High-Frequency (8.7-10 GHz) RF Applications [p. 247]
Ahmed H. Mostafa, Mourad N. El-Gamal (McGill University)

This paper proposes an LC-based oscillator structure which enables operation from a supply voltage as low as 0.85V, while being suitable for high-frequency RF applications. Two VCO prototypes were fabricated in a standard 0.18 µm CMOS process. The 8.7 GHz VCO operates from a supply voltage of 0.85 V, consumes 6 mW, and exhibits -100 dBc/Hz phase noise at 600 kHz offset. The 10 GHz prototype operates from a supply voltage of 1 V, consumes 9 mW, and has -98 dBc/Hz phase noise at 600 kHz offset. A tuning range of 400-450 MHz is achieved without using varactors.

Low-Power Direct Sequence Spread-Spectrum Modem Architecture [p. 251]
Charles Chien, Igor Elgorria (Rockwell Research), Charles McConaghy (Livermore National Lab)

Emerging CMOS and MEMS technologies enable the implementation of a large number of wireless distributed microsensors that can be easily and rapidly deployed to form highly redundant, self-configuring, and ad hoc sensor networks. To facilitate ease of deployment, these sensors should operate on battery for extended periods of time. A particular challenge in maintaining extended battery lifetime lies in achieving communications with low power. This paper presents a direct-sequence spread-spectrum modem architecture that provides robust communications for wireless sensor networks while dissipating very low power. The modem architecture has been verified in an FPGA implementation that dissipates only 33 mW for both transmission and reception. The implementation can be easily mapped to an ASIC technology with an estimated power performance of less than 1 mW.
Keywords
Low power, modem, spread spectrum, direct sequence, sensor.

Effects of Elevated Temperature on Tunable Near-Zero Threshold CMOS [p. 255]
Vjekoslav Svilan, G. Leonard Tyler (Stanford University), James B. Burr (Sun Microsystems)

This paper explores functionality, performance, and energy efficiency of an 80,000 transistor, 0.35um, back-bias tunable, near-zero Vth, 32 x 32-bit multiplier operating at 100 deg C. Compared to operation at 28 deg C, performance at Vdd=2.0 V degrades 14 percent from 188MHz to 162MHz. At lower supply voltages, back bias is adjusted to minimize power dissipation as a function of operating frequency similarly to what we reported last year at 28 deg C. Comparing the operating points, the same performance at 100 deg C requires about 1.5 times the power measured at 28 deg C. It also requires about 1.2 V additional back bias and about a 20 percent increase in Vdd. The fraction of total power dissipated as leakage increases by about 1.5 times.

A Sub-1V Dual-Threshold Domino Circuit Using Product-of-Sum Logic [p. 259]
Koji Fujii, Takakuni Douseki, Yuichi Kado (NTT Telecommunications Energy Laboratories)

A sub-1 V dual-threshold Domino circuit is proposed to accelerate the operation of CMOS digital circuits at below 1 V. The circuit combines a low and high threshold-voltage (Vt) MOSFET with standby control to make it possible to achieve high-speed evaluation and low standby leakage current. A low-Vt foot nMOSFET is used to shorten precharge time and increase throughput. A product-of-sum logic form is used for implementation of a pull-down logic to increase the noise margin. An experimental 64-bit carry-look-ahead (CLA) adder demonstrated a 0.6-V operation with a standby power of 0.4 µW and a delay time of 4.8 ns.

Mixed Multi-Threshold Differential Cascode Voltage Switch (MT-DCVS) Circuit Styles and Strategies for Low Power VLSI Design [p. 263]
W. Chen, W. Hwang, P. Kudva, G. D. Gristede, S. Kosonocky, R. V. Joshi (IBM T. J. Watson Research Center)

This paper presents mixed multi-threshold differential cascode voltage switch (MT-DCVS) circuits for low-power, high performance and deep-submicron VLSI design. These logic circuits incorporate two different sets of CMOS devices, low-Vt and regular high-Vt CMOS devices. By appropriately selecting the low-Vt and high-Vt devices and configurations in a circuit, we can gain circuit performance while keeping the leakage current and power low. The key approaches are using low-Vt devices to gain performance, using high-Vt devices to cut off the leakage path and also using the reverse-biased low-Vt devices in their standby state. The methodology and algorithm are developed and simulated. The applications of such multi-Vt circuit techniques to the static, domino NORA DCVS and delayed reset circuits are described. The use of footer / header devices, gated-Vdd and a mixture of low-Vt and high-Vt devices to reduce power dissipation and subthreshold leakage current during standby and active modes, and the global design issues are also discussed.

Selectively Clocked Skewed Logic (SCSL): A Robust Low-Power Logic Style for High-Performance Applications [p. 267]
Naran Sirisantana, Aiqun Cao, Shawn Davidson, Cheng-Kok Koh, Kaushik Roy (Purdue University)

In very high performance designs, dynamic circuits, such as Domino Logic, are used because of their high speed. Skewed logic circuits can be used to achieve designs having performance comparable to that of Domino but with better scalability. Moreover, a selective clocking scheme may be applied to enhance the power savings for skewed logic circuits. This paper proposes Selectively Clocked Skewed Logic (SCSL), a new circuit style based on skewed logic aiming for low clock power consumption. The results on ISCAS benchmark circuits implemented with this circuit design style show that the total power consumption can be reduced by 52.05% when compared to that of Domino circuit with comparable performance.
Categories and Subject Descriptors
1.3 [Logic and Microarchitecture Design]: Logic and RTL design.


Poster Session 4

Session Chair: Giovanni DeMicheli (Stanford)
A Profile-Based Energy-Efficient Intra-Task Voltage Scheduling Algorithm for Hard Real-Time Applications [p. 271]
Dongkun Shin, Jihong Kim (Seoul National University)

Intra-task voltage scheduling (IntraVS), which adjusts the supply voltage within an individual task boundary, is an effective technique for developing low-power applications. In this paper, we propose a novel intra-task voltage scheduling algorithm for hard real-time applications based on average-case execution information. Unlike the original IntraVS algorithm where voltage scaling decisions are based on the worst-case execution cycles, the proposed algorithm improves the energy efficiency by controlling the execution speed based on average-case execution cycles while still meeting the real-time constraints. The experimental results using an MPEG-4 decoder program show that the proposed algorithm reduces the energy consumption by up to 34% over the original IntraVS algorithm.

Compiler-Directed Dynamic Voltage/Frequency Scheduling for Energy Reduction in Microprocessors [p. 275]
Chung-Hsing Hsu, Ulrich Kremer, Michael Hsiao (Rutgers University)

Dynamic voltage and frequency scaling of the CPU has been identified as one of the most effective ways to reduce energy consumption of a program. This paper discusses a compilation strategy that identifies scaling opportunities without significant overall performance penalty. Simulation results show CPU energy savings of 3.97%-23.75% for the SPECfp95 benchmark suite with a performance penalty of at most 2.53%.

Variable Voltage Task Scheduling Algorithms for Minimizing Energy [p. 279]
Ali Manzak, Chaitali Chakrabarti (Arizona State University)

In this paper we propose variable voltage task scheduling algorithms (periodic as well as aperiodic) that minimize energy. We first apply the existing task scheduling algorithms to obtain a feasible schedule and then distribute the available slack using an iterative algorithm that satisfies the theoretically obtained relation for minimum energy. We show experimentally that the voltage assignment obtained by our algorithm is very close (0.1% error) to that of the optimal assignment.

Design Methodology and Optimization Strategy for Dual-VTH Scheme Using Commercially Available Tools [p. 283]
Masayuki Hirabayashi, Koichi Nose, Takayasu Sakurai (University of Tokyo)

Design methodology for dual-VTH scheme using commercially available tools is presented and optimization strategy for the dual-VTH scheme is discussed. In order to suppress the power consumption, it is shown that using library cells that have various combinations of VTH's is not needed. The cell library, which contains logic gates with all high VTH transistors and all low VTH transistors, is sufficient to reduce leakage power. 0.1V is shown to be the optimum value for VTH difference between VTH,HIGH and VTH,LOW in terms of power reduction.

Synthesis of Low-Leakage PD-SOI Circuits with Body-Biasing [p. 287]
Mario R. Casu, Gianluca Piccinini, Guido Masera, Maurizio Zamboni (Politecnico di Torino)

In this work we propose a methodology for the reduction of leakage power dissipation through the use of smart body contacts in a partially depleted Silicon-on-Insulator (PD-SOI) technology. Reverse body biasing is used to increase threshold voltage in standby while in active mode PD-SOI gates switch with nominal Vth. As opposed to standard dual-Vth techniques used in CMOS bulk circuits, PD-SOI enables the application of body-bias to all gates, including those in critical paths, without delay penalties. Results are reported for the ISCAS85 combinational benchmarks.

Low-Power Technology Mapping for Mixed-Swing Logic [p. 291]
Nicola Dragone (Carnegie Mellon University & PDF Solutions), Rob A. Rutenbar, L. Richard Carley (Carnegie Mellon University), Roberto Zafalon (Carnegie Mellon University & STMicroelectronics)

Mixed-swing logic employs multiple power supply rails and device threshold voltages and allows us to create richer cell libraries with a wider range of power/speed tradeoffs. However, mapping onto such a library with a conventional technology mapper will not exploit the full potential of a mixed-swing methodology. To remedy this, we have developed a new technology mapping tool that specifically targets mixed-swing logic. Our approach combines (1) efficient clustering and cluster-level delay budgeting for the uncommitted logic, with (2) an exhaustive search for the optimal cover that is rendered practical by the clustering process. Power savings up to 3X have been demonstrated with our mixed-swing solutions versus single power supply implementations.

Frequency-Domain Supply Current Macro-Model [p. 295]
Srinivas Bodapati (University of Illinois at Urbana-Champaign), Farid N. Najm (University of Toronto)

In order to perform block level analysis of the on-chip power distribution network, a high-level model is required that captures the dependence of the current waveform drawn by a logic block, per cycle, on its input vector pair. We present a frequency domain macro-modeling technique for capturing this dependence. The macro-model is based on estimating the Discrete Cosine Transform (DCT) of the current waveform and then taking the inverse transform to estimate the time domain current waveform.
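The final step of such a macro-model, recovering a time-domain waveform from a reduced set of DCT coefficients, can be sketched as follows. The sketch covers only the transform and its inverse; how the paper estimates the coefficients from the input vector pair is not shown here. The orthonormal DCT-II scaling, the sample waveform, and the function names are assumptions of this illustration.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a sampled waveform."""
    n = len(x)
    return [
        math.sqrt((1 if k == 0 else 2) / n)
        * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]


def idct(coeffs):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(coeffs)
    return [
        sum(
            math.sqrt((1 if k == 0 else 2) / n)
            * coeffs[k]
            * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(n)
        )
        for i in range(n)
    ]


def reconstruct(x, keep):
    """Keep only the first `keep` DCT coefficients and invert."""
    c = dct(x)
    return idct(c[:keep] + [0.0] * (len(c) - keep))


if __name__ == "__main__":
    current = [0.0, 1.0, 4.0, 2.0, 0.5, 0.0, 0.0, 0.0]  # hypothetical samples
    print([round(v, 3) for v in reconstruct(current, keep=4)])
```

Keeping fewer coefficients gives a smoother approximation of the current pulse, which is the usual trade-off between model size and waveform accuracy.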


Session 9: Low Power Analog Techniques

Session Chair: Sudhir Gowda (IBM)
Session Organizer: Ken Yang (University of California, Los Angeles)
A Low-Power, 5-70MHz, 7th-Order Filter with Programmable Boost, Group Delay, and Gain Using Instantaneous Companding [p. 299]
Rola A. Baki, Mourad N. El-Gamal (McGill University)

A seventh-order 0.05° equiripple linear-phase continuous-time filter employing, for the first time, instantaneous companding, was designed and integrated in a mature bipolar process. The amount of boost (up to 13dB) and group-delay adjustment (30%) are digitally programmable. The DC gain is controllable up to 10dB, and the -3dB frequency (fc) is tunable from 5 to 70MHz. The output swing for 1% THD is higher than 100mVpp, with a 1.5V supply. The filter consumes very low power (5-13mW for fc= 70MHz) compared to conventional implementations (e.g. 120mW for fc= 100MHz [1]).

Optimizing Bias-circuit Design of Cascode Operational Amplifier for Wide Dynamic Range Operations [p. 305]
Takeshi Fukumoto, Hiroyuki Okada, Kazuyuki Nakamura (NEC Corporation)

Proposed here is a bias circuit for use in a cascode operational amplifier to provide a wide output dynamic range. The bias circuit has been designed so that the drain-source voltage of each MOS transistor used in the gain stage is minimized to Vdsat automatically, making it possible to widen the output dynamic range.
Keywords
Amplifier, CMOS, Analog, Low voltage, Dynamic range, Cascode, Bias-circuit.

Leakage Current Cancellation Technique for Low Power Switched-Capacitor Circuits [p. 310]
Louis S. Y. Wong, Shohan Hossain, Andre Walker (St. Jude Medical)

In this paper, we describe a circuit technique to implement low power switched-capacitor circuits for low frequency operation. Low power consumption is crucial for medical implant devices. Reducing supply voltage is well known to minimize power dissipation. To facilitate low voltage operation, transistor Vth values are becoming lower and lower. Low Vth transistors have high leakage currents which impact the performance of switched-capacitor circuits, sample-and-hold amplifiers and many more. A new circuit technique is presented here to largely minimize the effective leakage current when the CMOS switch is turned off. It employs an active feedback loop to automatically cancel both junction and sub-threshold channel leakage. By reducing the effective leakage current, the capacitors used in the circuit can be significantly reduced, hence lowering the overall power consumption. This is a general technique and can be used in various circuit applications.
Keywords
Low power, analog, leakage current, switched-capacitor circuit, sample and hold, amplifier.

A 3-Pin 1.5 V 550 µW 176 x 144 Self-Clocked CMOS Active Pixel Image Sensor [p. 316]
Kwang-Bo Cho, Alexander Krymski, Eric R. Fossum (Photobit Technology Corporation)

This paper addresses the development of a micropower 176 x 144 self-clocked CMOS active pixel image sensor that dissipates one-to-two orders of magnitude less power than current state of the art CMOS image sensors. The chip operates from a 1.5 V voltage source and the power consumption measured for the chip running from an internal 25.2 MHz clock yielding 30 frames per second is about 550 µW. This amount enables the sensor to be run from a watch battery. It is believed that this chip is the world's lowest power image sensor and the first image sensor designed for a watch battery operation. The camera-on-a-chip operates as a self-clocked 3-pin sensor (GND, VDD (1.2 - 1.7 V), and DATAOUT). The die occupies 4 mm² of silicon.
Keywords
Active Pixel Sensor, Image Sensor, CMOS, Low-Power, Low-Voltage, Self-Clocked.


Session 10: Algorithmic Transformations and Caching

Session Chair: T.N. Vijaykumar (Purdue)
Session Organizer: Babak Falsafi (Carnegie Mellon University)
Cached-Code Compression for Energy Minimization in Embedded Processors [p. 322]
Luca Benini (Universita di Bologna), Alberto Macii (Politecnico di Torino), Alberto Nannarelli (Universita di Roma)

This paper contributes a novel approach for reducing static code size and instruction fetch energy for cache-based core processors running embedded applications. Our implementation of the decompression unit guarantees fast and low-energy, on-the-fly instruction decompression at each cache lookup. The decompressor is placed outside the core boundaries; therefore, processor architecture does not need any modification, making the proposed compression approach suitable to IP-based designs. Viability of our solution is assessed through extensive benchmarking performed on a number of typical embedded programs.

Energy Efficient Turbo Decoding for 3G Mobile [p. 328]
David Garrett, Bing Xu, Chris Nicol (Lucent Technologies)

The requirement of turbo decoding in 3G wireless standards has forced handset designers to consider power consumption issues in their implementations. The phenomenal performance of turbo codes comes at the expense of computation. Primarily this paper looks at methods of substantially reducing the power consumption for the decoding operation, making it feasible to integrate turbo decoders into a low power handset. The techniques presented include early termination of the turbo process, encoding of extrinsic information to reduce the memory size, and disabling portions of the MAP algorithm when the results will not affect the decoded output. The net result of these techniques is almost a 70% reduction in power over a fixed 6 iteration, 8-state baseline turbo decoder at 2 dB of signal to noise ratio (SNR).
Keywords
Turbo coding, low power, early termination, extrinsics.

Low-Power AEC-Based MIMO Signal Processing for Gigabit Ethernet 1000Base-T Transceivers [p. 334]
Lei Wang, Naresh R. Shanbhag (University of Illinois at Urbana-Champaign)

Presented in this paper is a low-power technique, denoted as MIMO-AEC, to reduce energy dissipation in multi-input-multi-output (MIMO) signal processing systems. The proposed technique extends a previously proposed adaptive error cancellation (AEC) technique to MIMO systems by employing an algorithm transformation denoted as MIMO-DECOR. The purpose of MIMO-DECOR is to reduce complexity by exploiting correlations inherent in MIMO systems, thereby improving the effectiveness of AEC. We employ the MIMO-AEC in the design of a low-power Gigabit Ethernet 1000Base-T device. Simulation results demonstrate 44.3% - 25.2% overhead reduction due to MIMO-DECOR and 69.1% - 64.2% energy savings over conventional implementation with no loss in algorithmic performance.

Power Reduction through Work Reuse [p. 340]
Emil Talpes, Diana Marculescu (Carnegie Mellon University)

Power consumption has become one of the big challenges in designing high performance processors. The rapid increase in complexity and speed that comes with each new CPU generation causes greater problems with power consumption and heat dissipation. Traditionally, these concerns are addressed through semiconductor technology improvements such as voltage reduction and technology scaling. This work proposes an alternative solution to this problem, by dealing with the power consumption in the very early stage of the microarchitecture design. More precisely, we show that by modifying the well-established out-of-order, superscalar processor architecture, significant gains can be achieved in terms of power requirements without performance penalty. Our proposed approach relies on reusing as much as possible of the work done by the front-end of a typical pipelined, superscalar out-of-order processor via the use of a cache nested deeply into the processor structure. Experimental results show up to 52% (20% on average) savings in average energy per committed instruction for two different pipeline structures.


Session 11: Low Power Digital Building Blocks

Session Chair: David Garrett (Lucent Technologies)
Session Organizer: Donald Steiss (Mindspring)
Clocking Strategies and Scannable Latches for Low Power Applications [p. 346]
V. Zyuban, D. Meltzer (IBM T. J. Watson Research Center)

This paper covers a range of issues in the design of clocking schemes for low-power applications. First, we revisit, extend and improve the power-performance optimization methodology for latches, attempting to make it more formal and comprehensive. Data switching factor and glitching activity are taken into consideration using a formal analytical approach. A notion of an energy-efficient family of configurations is then introduced to make the comparison of different latch styles in the power-performance space fairer, and the power of the clock distribution is also taken into account. Practical issues of building a low overhead scan mechanism are considered, and the power overhead of the scannable design is analyzed. A low-power LSSD extension to single-phase latches is proposed, and results of a comparative study of LSSD-scannable latches are shown, supported by experimental data measured on a 0.18u test chip.

Ultra-Low Power DLMS Adaptive Filter for Hearing Aid Applications [p. 352]
Hyung-il Kim, Kaushik Roy (Purdue University)

We present an ultra-low power DLMS (delayed least mean square) adaptive filter working in the sub-threshold region for hearing aid applications. Sub-threshold operation was accomplished by using a parallel architecture with pseudo NMOS logic style. The parallel architecture enabled us to run the system at a lower clock rate with a reduced supply voltage, while maintaining the same throughput. Pseudo NMOS logic operating in the sub-threshold region (Sub-Pseudo NMOS) provided better power-delay product than subthreshold CMOS (Sub-CMOS) logic. Simulation results show that the system can process voice signals at a throughput of 22kHz with a supply voltage of 400mV and achieve 91% improvement in energy compared to the non-parallel architecture using standard CMOS logic.
Keywords
DLMS adaptive filter, sub-threshold operation, parallel architecture, Sub-Pseudo NMOS, Sub-CMOS

A Dynamic-SDRAM-Mode-Control Scheme for Low-Power Systems with a 32-bit RISC CPU [p. 358]
Seiji Miura, Kazushige Ayukawa, Takao Watanabe (Hitachi, Ltd.)

We have developed a dynamic-SDRAM-mode-control scheme for low-power systems with a 32-bit RISC CPU. The scheme is based on two dynamic changes of SDRAM modes: from active standby to standby and from standby to active standby. It reduces both the operating current and the latency of an SDRAM. An analysis using benchmark programs shows that the developed scheme reduces the SDRAM operating current by 40% and latency by 38% compared to those of standby mode. An SDRAM controller was developed based on this scheme and 0.18-um CMOS technology. The area of the controller is 0.28 mm² and its operating current is 2.5 mA at 1.8 V and 100 MHz.
Keywords:
SDRAM controller, standby mode, active-standby mode

Analysis and Implementation of Charge Recycling for Deep Sub-micron Buses [p. 364]
Paul P. Sotiriadis, Theodoros Konstantakopoulos, Anantha Chandrakasan (Massachusetts Institute of Technology)

Charge recycling has been proposed as a strategy to reduce the power dissipation in data buses. Previous work in this area was based on simplified bus models that ignored the coupling between the lines. Here we propose a new Charge Recycling Technique (CRT) appropriate for sub-micron technologies. CRT is analyzed mathematically using a bus energy model that captures the energy loss due to strong line-to-line capacitive coupling. In theory, CRT can result in an energy reduction by a factor of 2. It becomes even more energy efficient when combined with Bus Invert coding (Stan '97, [6]). A circuit has been designed and simulated with all parasitic elements extracted from the layout. Taking into account the circuit energy overhead, the net energy saving can be up to 32%.
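
For readers unfamiliar with the Bus Invert coding mentioned above, the following is a minimal sketch of a simplified variant, not the authors' CRT circuit. It assumes the bus data lines start at all zeros, compares only the data lines (the invert line is left out of the Hamming-distance test), and counts plain line transitions, ignoring the line-to-line coupling that the paper's energy model captures. All function names are illustrative.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bus_invert_encode(words, width):
    """Encode words for a bus of `width` data lines plus one invert line.

    Returns a list of (value_on_data_lines, invert_bit) pairs.
    """
    mask = (1 << width) - 1
    prev = 0  # assumed initial state of the data lines
    encoded = []
    for w in words:
        w &= mask
        # If more than half of the lines would toggle, send the complement.
        if hamming(prev, w) > width / 2:
            sent, inv = ~w & mask, 1
        else:
            sent, inv = w, 0
        encoded.append((sent, inv))
        prev = sent
    return encoded


def bus_invert_decode(encoded, width):
    """Recover the original words from (value, invert_bit) pairs."""
    mask = (1 << width) - 1
    return [(~v & mask) if inv else v for v, inv in encoded]


def count_transitions(values, start=0):
    """Total number of line toggles when `values` are driven in sequence."""
    total, prev = 0, start
    for v in values:
        total += hamming(prev, v)
        prev = v
    return total


words = [0x00, 0xFF, 0x00, 0xFF]
enc = bus_invert_encode(words, 8)
print(count_transitions(words), count_transitions([v for v, _ in enc]))
```

In this sketch, each transfer toggles at most half of the data lines, at the cost of one extra invert line that the receiver uses to undo the complement.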


Session 12: Power Supply and Delivery

Session Chair: Farid Najm (University of Toronto)
Session Organizer: Ed Huijbregts (Magma)
Estimation of Power Distribution in VLSI Interconnects [p. 370]
Youngsoo Shin, Takayasu Sakurai (University of Tokyo)

The analysis and simulation of effects induced by VLSI interconnects become increasingly important as the scale of process technologies steadily shrinks. While most analyses focus on the timing aspects of interconnects, power consumption is also important. In this paper, the power distribution estimation of interconnects is studied using a reduced-order model. The relation between power consumption and the poles and residues of a transfer function is derived, and an appropriate driver model is developed, allowing power consumption to be computed efficiently. Application of the proposed method to RC networks is demonstrated using a prototype tool.

Maximum Voltage Variation in the Power Distribution Network of VLSI Circuits with RLC Models [p. 376]
Sudhakar Bobba (Sun Microsystems Inc.), Ibrahim N. Hajj (American University of Beirut)

In this paper, we present a frequency-domain technique to estimate the worst-case time-domain voltage variation using RLC models for the power distribution network. The proposed method, unlike existing simulation-based techniques, can handle frequency-dependent RLC parameters and generate an upperbound on the maximum voltage drop over all possible input excitations. Pattern independent maximum envelope currents are used to estimate the upperbound on the maximum magnitude of the frequency components for the current waveform. These values are used to formulate a nonlinear optimization problem for the maximum voltage drop at nodes in the power distribution network. We then present a method to solve the nonlinear optimization problem using Lagrange multipliers. Comparisons with SPICE simulations are presented to validate the techniques presented in the paper.

Battery Capacity Measurement and Analysis using Lithium Coin Cell Battery [p. 382]
Sung Park, Andreas Savvides, Mani B. Srivastava (University of California, Los Angeles)

In this paper, we look at different battery capacity models that have been introduced in the literature. These models describe the battery capacity utilization based on how the battery is discharged by the circuits that consume power. In an attempt to validate these models, we characterize a commercially available lithium coin cell battery through careful measurements of the current and the voltage output of the battery under different load profiles applied by a micro sensor node. In the results, we show how the capacity of the battery is affected by the different load profiles and provide analysis on whether the conventional battery models are applicable in the real world. One of the most significant findings of our work is that the DC/DC converter plays a significant role in determining the battery capacity, and that the true capacity of the battery may only be found by careful measurements.
Keywords
Embedded System, Battery, Power Estimation, Energy Estimation, DC/DC Converter, Coin Cell, Data Acquisition

On the Interaction of Power Distribution Network with Substrate [p. 388]
Rajendran Panda, Savithri Sundareswaran, David Blaauw (Motorola, Inc.)

In this paper, we investigate the interaction between a chip's power distribution network and its substrate to understand its impact on power supply noise and substrate-coupled noise. The study is set in the context of low-voltage, low-power, mixed signal chip designs based on low resistance, epitaxial process, substrate technology. We believe the findings of this study are significant to both the chip integration engineer and the analog circuit designer. We attempt here to answer two important questions: (1) To what extent can substrate modify the power supply noise, and what parameters of substrate design, if any, are salient? (2) What is the extent of coupling from the noisy digital power supply to the analog circuits through the substrate? We propose a method to simulate the power grid along with the substrate and present findings of case studies conducted on three low-power processor designs.
Keywords: substrate analysis, power grid analysis, substrate noise, substrate coupled noise