DAC'99 ABSTRACTS



Session 1: Advances in Model Reduction

Chair: Joel R. Phillips
Organizers: Hidetoshi Onodera, Joel Phillips
1.1 An Efficient Lyapunov Equation-Based Approach for Generating Reduced-Order Models of Interconnect [p. 1]
Jing-Rebecca Li, Frank Wang, Jacob K. White

In this paper we present a new algorithm for computing reduced-order models of interconnect which utilizes the dominant controllable subspace of the system. The dominant controllable modes are computed via a new iterative Lyapunov equation solver, Vector ADI. This new algorithm is as inexpensive as Krylov subspace-based moment matching methods, and often produces a better approximation over a wide frequency range. A spiral inductor and a transmission line example show this new method can be much more accurate than moment matching via Arnoldi.
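
For reference, the controllability Gramian P that such dominant-subspace methods target is defined, for a stable linear system dx/dt = Ax + Bu, by the standard Lyapunov equation; this is textbook background rather than a statement of the paper's Vector ADI formulation:

\[ A P + P A^{T} + B B^{T} = 0 \]

The dominant controllable subspace is spanned by the eigenvectors of P associated with its largest eigenvalues, and a reduced-order model can then be obtained by projecting the original system onto that subspace.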

1.2 Error Bounded Padé Approximation via Bilinear Conformal Transformation [p. 7]
Chung-Ping Chen, D.F. Wong

Since Asymptotic Waveform Evaluation (AWE) was introduced in [5], many interconnect model order reduction methods via Padé approximation have been proposed. Although the stability and precision of model reduction methods have been greatly improved, the following important question has not been answered: "What is the error bound in the time domain?". This problem is mainly caused by the "gap" between the frequency domain and the time domain, i.e., a transfer function that is well approximated in the frequency domain may not be well approximated in the time domain. All of the existing methods approximate the transfer function directly in the frequency domain and hence cannot provide error bounds in the time domain. In this paper, we present new moment matching methods which can provide guaranteed error bounds in the time domain. Our methods are based on the classic work by Teasdale in [1], which performs Padé approximation in a transformed domain obtained by the bilinear conformal transformation s = (1-z)/(1+z).
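
For context, the bilinear conformal transformation mentioned above and its inverse are (standard identities, stated here only to fix notation):

\[ s = \frac{1-z}{1+z}, \qquad z = \frac{1-s}{1+s} \]

This map carries the frequency axis s = j\omega onto the unit circle |z| = 1, so an approximation constructed on the unit circle in the z-domain corresponds to the whole frequency axis in the s-domain.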

1.3 Model-Reduction of Nonlinear Circuits Using Krylov-Space Techniques [p. 13]
Pavan K. Gunupudi, Michel S. Nakhla

A new algorithm based on Krylov subspace methods is proposed for model-reduction of large nonlinear circuits. Reduction is obtained by projecting the original system described by nonlinear differential equations into a subspace of a lower dimension. The reduced model can be simulated using conventional numerical integration techniques. Significant reduction in computational expense is achieved as the size of the reduced equations is much less than that of the original system.
Keywords: Model-reduction, nonlinear circuits, Krylov-subspace

1.4 ENOR: Model Order Reduction of RLC Circuits Using Nodal Equations for Efficient Factorization [p. 17]
Bernard N. Sheehan

ENOR is an innovative way to produce provably-passive, reciprocal, and compact representations of RLC circuits. Beginning with the nodal equations, ENOR formulates recurrence relations for the moments that involve factorizing a symmetric, positive definite matrix; this contrasts with other RLC order reduction algorithms that require expensive LU factorization. It handles floating capacitors, inductor loops, and resistor links in a uniform way. It distinguishes between active and passive ports, performs Gram-Schmidt orthogonalization on the fly, and controls error in the time domain. ENOR is a superbly simple, flexible, and well-conditioned algorithm for lightning-fast reduction of mega-sized RLC trees, meshes, and coupled interconnects, all with excellent accuracy.


Session 2: Combinatorial Problems

Chair: Fabio Somenzi
Organizers: Sharad Malik, Andrew Kahng
2.1 Why is ATPG easy? [p. 22]
Mukul R. Prasad, Philip Chong, Kurt Keutzer

Empirical observation shows that practically encountered instances of ATPG are efficiently solvable. However, it has been known for more than two decades that ATPG is an NP-complete problem. This work is one of the first attempts to reconcile these seemingly disparate results. We introduce the concept of circuit cut-width and characterize the complexity of ATPG in terms of this property. We provide theoretical and empirical results to argue that an interestingly large class of practical circuits has cut-width characteristics that ensure a provably efficient solution of ATPG.

2.2 Using Lower Bounds during Dynamic BDD Minimization [p. 29]
Rolf Drechsler, Wolfgang Günther

Ordered Binary Decision Diagrams (BDDs) are a data structure for the representation and manipulation of Boolean functions that is often applied in VLSI CAD. The choice of the variable ordering largely influences the size of the BDD; its size may vary from linear to exponential. The most successful methods for finding good orderings are based on dynamic variable reordering, i.e., the exchange of neighboring variables. This basic operation has been used in various variants, like sifting and window permutation. In this paper we show that lower bounds computed during the minimization process can speed up the computation significantly. First, lower bounds are studied from a theoretical point of view. Then these techniques are incorporated in dynamic minimization algorithms. By computing good lower bounds, large parts of the search space can be pruned, resulting in very fast computations. Experimental results are given to demonstrate the efficiency of our approach.

2.3 Optimization-Intensive Watermarking Techniques for Decision Problems [p. 33]
Gang Qu, Jennifer L. Wong, Miodrag Potkonjak

Recently, a number of watermarking-based intellectual property protection techniques have been proposed. Although they have been applied to different stages in the design process and have a great variety of technical and theoretical features, all of them share two common properties: they have all been applied solely to optimization problems and do not involve any optimization during the watermarking process. In this paper, we propose the first set of optimization-intensive watermarking techniques for decision problems. In particular, we demonstrate how one can select a subset of superimposed watermarking constraints so that the uniqueness of the signature and the likelihood of satisfying an instance of the satisfiability problem are simultaneously maximized. We have developed three watermarking SAT techniques: adding clauses, deleting literals, and push-out and pull-back. Each technique targets a different type of signature-induced constraint superimposition on an instance of the SAT problem. In addition to comprehensive experimental validation, we theoretically analyze the potentials and limitations of the proposed watermarking techniques. Furthermore, we analyze the three proposed optimization-intensive watermarking SAT techniques in terms of their suitability for copy detection.

2.4 Efficient Algorithms for Optimum Cycle Mean and Optimum Cost to Time Ratio Problems [p. 37]
Ali Dasdan, Sandy S. Irani, Rajesh K. Gupta

The goal of this paper is to identify the most efficient algorithms for the optimum mean cycle and optimum cost to time ratio problems and compare them with the popular ones in the CAD community. These problems have numerous important applications in CAD, graph theory, discrete event system theory, and manufacturing systems. In particular, they are fundamental to the performance analysis of digital systems such as synchronous, asynchronous, data flow, and embedded real-time systems. For instance, algorithms for these problems are used to compute the cycle period of any cyclic digital system. Without loss of generality, we discuss these algorithms in the context of the minimum mean cycle problem (MCMP). We performed a comprehensive experimental study of ten leading algorithms for MCMP. We programmed these algorithms uniformly and efficiently. We systematically compared them on a test suite composed of random graphs as well as benchmark circuits. Above all, our results provide important insight into the performance of these algorithms in practice. One of the most surprising results of this paper is that Howard's algorithm, known primarily in the stochastic control community, is by far the fastest algorithm on our test suite although the only known bound on its running time is exponential. We provide two stronger bounds on its running time.
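
As background, the minimum cycle mean of a weighted digraph is the smallest average edge weight over all directed cycles. The following self-contained Python sketch computes it with Karp's classical recurrence; it is offered purely as an illustration of the problem, not as the Howard's-algorithm implementation or the experimental code discussed in the paper, and it assumes the graph is strongly connected.

    # Minimum cycle mean via Karp's recurrence (illustration only).
    # Vertices are 0..n-1; edges is a list of (u, v, weight) tuples.
    def min_cycle_mean(n, edges):
        INF = float("inf")
        # D[k][v] = minimum weight of a walk of exactly k edges from vertex 0 to v
        D = [[INF] * n for _ in range(n + 1)]
        D[0][0] = 0.0
        for k in range(1, n + 1):
            for u, v, w in edges:
                if D[k - 1][u] < INF and D[k - 1][u] + w < D[k][v]:
                    D[k][v] = D[k - 1][u] + w
        # Karp: lambda* = min over v of max over k of (D[n][v] - D[k][v]) / (n - k)
        best = INF
        for v in range(n):
            if D[n][v] == INF:
                continue
            worst = max((D[n][v] - D[k][v]) / (n - k)
                        for k in range(n) if D[k][v] < INF)
            best = min(best, worst)
        return best

    # Cycle 0->1->0 has mean (2+4)/2 = 3; cycle 1->2->1 has mean (1+1)/2 = 1.
    print(min_cycle_mean(3, [(0, 1, 2), (1, 0, 4), (1, 2, 1), (2, 1, 1)]))  # prints 1.0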


Session 3: IP-Based Design

Chair: David Ku
Organizers: Rajesh Gupta, David Ku
3.1 IP-Based Design Methodology [p. 43]
Daniel D. Gajski

3.2 IPCHINOOK: An Integrated IP-based Design Framework for Distributed Embedded Systems [p. 44]
Pai Chou, Ross Ortega, Ken Hines, Kurt Partridge, Gaetano Borriello

IPCHINOOK is a design tool for distributed embedded systems. It gains leverage from the use of a carefully chosen set of design abstractions that raise the level of designer interaction during the specification, synthesis, and simulation of the design. IPCHINOOK focuses on a component-based approach to system building that enhances the ability to reuse existing software modules. This is accomplished through a new model for constructing components that enables composition of control-flow as well as data-flow. The designer then maps the elements of the specification to a target architecture: a set of processing elements and communication channels. IPCHINOOK synthesizes all of the detailed communication and synchronization instructions. Designers get feedback via a co-simulation engine that permits rapid evaluation. By shortening the design cycle, designers are able to more completely explore the design space of possible architectures and/or improve time-to-market. IPCHINOOK is embodied in a system development environment that supports the design methodology by integrating a user interface for system specification, simulation, and synthesis tools. By raising the level of abstraction of specifications above the low-level target-specific implementation, and by automating the generation of these difficult and error-prone details, IPCHINOOK lets designers focus on global architectural and functionality decisions.

3.3 Virtual Simulation of Distributed IP-based Designs [p. 50]
Marcello Dalpasso, Alessandro Bogliolo, Luca Benini

One key issue in design flows based on reuse of third-party intellectual property (IP) components is the need to estimate the impact of component instantiation within complex designs. In this paper we introduce JavaCAD, an internet-based EDA tool built on a secure client-server architecture that enables designers to perform simulation and cost estimation of circuits containing IP components without actually purchasing them. At the same time, the tool ensures intellectual property protection for the vendors of IP components, and for the IP-users as well. Moreover, JavaCAD supports negotiation of the amount of information and the accuracy of cost estimates, thereby providing seamless transition between IP evaluation and purchase.


Session 4: Low-Power Design Using Voltage Scaling

Chair: Tadahiro Kuroda
Organizers: Anantha Chandrakasan, Mojy Chian
4.1 Common-Case Computation: A High-Level Technique for Power and Performance Optimization [p. 56]
Ganesh Lakshminarayana, Anand Raghunathan, Kamal S. Khouri, Niraj K. Jha, Sujit Dey

This paper presents a design methodology, called common-case computation (CCC), and new design automation algorithms for optimizing power consumption or performance. The proposed techniques are applicable in conjunction with any high-level design methodology where a structural register-transfer level (RTL) description and its corresponding scheduled behavioral (cycle-accurate functional RTL) description are available. It is a well-known fact that in behavioral descriptions of hardware (also in software), a small set of computations (CCCs) often accounts for most of the computational complexity. However, in hardware implementations (structural RTL or lower level), CCCs and the remaining computations are typically treated alike. This paper shows that identifying and exploiting CCCs during the design process can lead to implementations that are much more efficient in terms of power consumption or performance. We propose a CCC-based high-level design methodology with the following steps: extraction of common-case behaviors and execution conditions from the scheduled description, simplification of the common-case behaviors in a stand-alone manner, synthesis of common-case detection and execution circuits from the common-case behaviors, and composing the original design with the common-case circuits, resulting in a CCC-optimized design. We demonstrate that CCC-optimized designs reduce power consumption by up to 91.5%, or improve performance by up to 76.6% compared to designs derived without special regard for CCCs.

4.2 Layout Techniques Supporting the Use of Dual Supply Voltages for Cell-Based Designs [p. 62]
Chingwei Yeh, Yin-Shuin Kang, Shan-Jih Shieh, Jinn-Shyan Wang

Gate-level voltage scaling is an approach that allows different supply voltages for different gates in order to achieve power reduction. Previous research focused on determining the voltage level for each gate and ascertaining the power saving capability of the approach via logic-level power estimation. In this paper, we present layout techniques that make the approach feasible in a cell-based design environment. A new block layout style is proposed to support voltage scaling with conventional standard cell libraries. The block layout can be automatically generated via a simulated-annealing-based placement algorithm. In addition, we propose a new cell layout style with built-in multiple supply rails. Using this cell layout, gate-level voltage scaling can be immediately embedded in a typical cell-based design flow. Experimental results show that the proposed techniques produce very promising results.

4.3 Gate-Level Design Exploiting Dual Supply Voltages for Power-Driven Applications [p. 68]
Chingwei Yeh, Min-Cheng Chang, Shih-Chieh Chang, Wen-Bone Jone

The advent of portable and high-density devices has made power consumption a critical design concern. In this paper, we address the problem of reducing power consumption via gate-level voltage scaling for those designs that are not under the strictest timing budget. We first use a maximum-weighted independent set formulation for voltage reduction on the non-critical part of the circuit. Then, we use a minimum-weighted separator set formulation to do gate sizing and integrate the sizing procedure with a voltage scaling procedure to enhance power saving on the whole circuit. The proposed methods are evaluated using the MCNC benchmark circuits, and an average of 19.12% power reduction over circuits having only one supply voltage has been achieved.

4.4 Synthesis of Low Power CMOS VLSI Circuits Using Dual Supply Voltages [p. 72]
Vijay Sundararajan, Keshab K. Parhi

Dynamic power consumed in CMOS gates goes down quadratically with the supply voltage. By maintaining a high supply voltage for gates on the critical path and by using a low supply voltage for gates off the critical path, it is possible to dramatically reduce power consumption in CMOS VLSI circuits without performance degradation. Interfacing gates operating under multiple supply voltages, however, requires the use of level converters, which makes the problem modeling difficult. In this paper we develop a formal model and an efficient heuristic for the use of two supply voltages in low power CMOS VLSI circuits without performance degradation. Power consumption savings of up to 25% over and above the best known existing heuristics are demonstrated for combinational circuits in the ISCAS85 benchmark suite.
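
For orientation, the quadratic dependence referred to above is the standard CMOS dynamic-power relation; the voltage values below are illustrative only and are not taken from the paper:

\[ P_{dyn} = \alpha\, C_{L}\, V_{dd}^{2}\, f \]

Running an off-critical-path gate at V_dd = 2.4 V instead of 3.3 V therefore scales its dynamic power by (2.4/3.3)^2 ≈ 0.53, roughly a 47% reduction for that gate, before accounting for the overhead of the level converters needed at voltage-domain boundaries.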


Session 5: Panel: HW and SW in Embedded System Design: Loveboat, Shipwreck, or Ships Passing in the Night?

Moderator: Kurt Keutzer
Organizers: Raul Camposano, Kurt Keutzer
Panel Members: Jerry Fiddler, Raul Camposano, Alberto Sangiovanni-Vincentelli, Jim Lansford [p. 76]

The merging of hardware and software on a single integrated circuit is causing many to rethink their approach to embedded system design, and some are forecasting significant changes in the dynamics of the associated electronic-design automation (EDA) and embedded software industries as well. For some, the ocean of silicon ahead is sure to host a loveboat for fully integrated hardware/software systems. In this scenario a unifying system-design environment provides implementation-independent modeling that stands above the particulars of hardware and software implementation issues. At the implementation level, embedded software design tools and electronic-design automation tools work seamlessly together, providing a variety of architectural targets for the functionality of high-level descriptions. Others see a shipwreck on the horizon, as hardheaded hardware designers and softheaded software developers clash. In this scenario the system-on-chip becomes a battleground in which the two respective communities attempt to dominate both the solution space and the system-design budgets of their common customer. There's an old adage that the most boring scenario is the most likely. Following this adage, some predict that the future holds more 'no show' than 'showdown'. The holders of this view note that as severe power constraints cause hardware designers to eschew software solutions, and as the world of ubiquitous computing significantly broadens the market for embedded system software, the hardware and software communities will simply be chips passing in the night.

Session 6: Routing for Deep Submicron

Chair: Patrick Groeneveld
Organizers: Patrick Groeneveld, Lou Scheffer
6.1 Reliability-Constrained Area Optimization of VLSI Power/Ground Networks Via Sequence of Linear Programmings [p. 78]
Xiang-Dong Tan, C.-J. Richard Shi, Dragos Lungeanu, Jyh-Chwen Lee, Li-Pen Yuan

This paper presents a new method for determining the widths of the power and ground routes in integrated circuits so that the area required by the routes is minimized subject to reliability constraints. The basic idea is to transform the resulting constrained nonlinear programming problem into a sequence of linear programs. Theoretically, we show that the sequence of linear programs always converges to the optimum solution of the relaxed convex problem. Experimental results demonstrate that the sequence-of-linear-programming method is orders of magnitude faster than the best-known method based on conjugate gradients, with consistently better optimization solutions.

6.2 FAR-DS: Full-plane AWE Routing with Driver Sizing [p. 84]
Jiang Hu, Sachin S. Sapatnekar

We propose a Full-plane AWE Routing with Driver Sizing (FAR-DS) algorithm for performance-driven routing in deep sub-micron technology. We employ a fourth-order AWE delay model in the full plane, including both Hanan and non-Hanan points. Optimizing the driver size simultaneously extends our work into a two-dimensional space, enabling us to achieve the desired balance between wire and driver cost reduction while satisfying the timing constraints. Experimental results show that, compared to SERT, our algorithm provides an average reduction of 23% in wire cost and 50% in driver cost under stringent timing constraints.

6.3 Noise-Constrained Performance Optimization by Simultaneous Gate and Wire Sizing Based on Lagrangian Relaxation [p. 90]
Hui-Ru Jiang, Jing-Yang Jou, Yao-Wen Chang

Noise, as well as area, delay, and power, is one of the most important concerns in the design of deep submicron ICs. Existing algorithms cannot handle simultaneous switching conditions of signals for noise minimization. In this paper, we model not only physical coupling capacitance, but also simultaneous switching behavior for noise optimization. Based on Lagrangian relaxation, we present an algorithm that can optimally solve the simultaneous noise, area, delay, and power optimization problem by sizing circuit components. Our algorithm, with a linear overall memory requirement and linear runtime per iteration, is very effective and efficient. For example, for a circuit of 6144 wires and 3512 gates, our algorithm solves the simultaneous optimization problem using only 2.1 MB of memory and 47 minutes of runtime to achieve precision within 1% error on a SUN UltraSPARC-I workstation.

6.4 Simultaneous Routing and Buffer Insertion with Restrictions on Buffer Locations [p. 96]
Hai Zhou, D. F. Wong, I-Min Liu, Adnan Aziz

During the routing of global interconnects, macro blocks form useful routing regions which allow wires to go through but forbid buffers to be inserted. They give restrictions on buffer locations. In this paper, we take these buffer location restrictions into consideration and solve the simultaneous maze routing and buffer insertion problem. Given a block placement defining buffer location restrictions and a pair of pins (a source and a sink), we give a polynomial time exact algorithm to find a buffered route from the source to the sink with minimum Elmore delay.
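
As a reminder of the delay model being minimized, the Elmore delay of one buffered stage consisting of a driver with output resistance R_b, a wire with total resistance R_w and capacitance C_w, and a load C_L is (a textbook expression, independent of the routing algorithm above):

\[ T_{Elmore} = R_{b}\,(C_{w} + C_{L}) + R_{w}\left(\frac{C_{w}}{2} + C_{L}\right) \]

The algorithm then chooses a route and buffer positions, restricted to locations outside the macro blocks, so that the total source-to-sink delay accumulated over such stages is minimized.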

6.5 Crosstalk Minimization using Wire Perturbations [p. 100]
Prashant Saxena, C.L. Liu

We study the variation of the crosstalk in a net and its neighbors when one of its trunks is perturbed, showing that the trunk's perturbation range can be efficiently divided into subintervals having monotonic or unimodal crosstalk variation. We can therefore determine the optimum trunk location without solving any non-linear equations. Using this, we construct and experimentally verify an algorithm to minimize the peak net crosstalk in a gridless channel.


Session 7: Asynchronous Logic Synthesis

Chair: Prabhakar Kudva
Organizers: Timothy Kam, Luciano Lavagno
7.1 Practical Advances in Asynchronous Design and in Asynchronous/Synchronous Interfaces [p. 104]
Erik Brunvand, Steven Nowick, Kenneth Yun

Asynchronous systems are being viewed as an increasingly viable alternative to purely synchronous systems. This paper gives an overview of the current state of the art in practical asynchronous circuit and system design in four areas: controllers, datapaths, processors, and the design of asynchronous/synchronous interfaces.

7.2 Automatic Synthesis and Optimization of Partially Specified Asynchronous Systems [p. 110]
Alex Kondratyev, Jordi Cortadella, Michael Kishinevsky, Luciano Lavagno, Alexander Yakovlev

A method for automating the synthesis of asynchronous control circuits from high level (CSP-like) and/or partial STG (involving only functionally critical events) specifications is presented. The method solves two key subtasks in this new, more flexible, design flow: handshake expansion, i.e. inserting reset events with maximum concurrency, and event reshuffling under interface and concurrency constraints, by means of concurrency reduction. In doing so, the algorithm optimizes the circuit both for size and performance. Experimental results show a significant increase in the solution space explored when compared to existing CSP-based or STG-based synthesis tools.

7.3 CAD Directions for High Performance Asynchronous Circuits [p. 116]
Ken Stevens, Shai Rotem, Steven M. Burns, Jordi Cortadella, Ran Ginosar, Michael Kishinevsky, Marly Roncken

This paper describes a novel methodology for high performance asynchronous design based on timed circuits and on CAD support for their synthesis using Relative Timing. This methodology was developed for a prototype iA32 instruction length decoding and steering unit called RAPPID ("Revolving Asynchronous Pentium(R) Processor Instruction Decoder") that was fabricated and tested successfully. Silicon results show significant advantages - in particular, performance of 2.5-4.5 instructions per ns - with manageable risks using this design technology. RAPPID achieves three times the performance and half the latency of a comparable 400MHz clocked circuit, while dissipating only half the power and incurring only a minor area penalty. Relative Timing is based on user-defined and automatically extracted relative timing assumptions between signal transitions in a circuit and its environment. It supports the specification, synthesis, and verification of high-performance asynchronous circuits, such as pulse-mode circuits, that can be derived from an initial speed-independent specification. Relative timing presents a "middle ground" between clocked and asynchronous circuits, and is a fertile area for CAD development. We discuss possible directions for future CAD development.


Session 8: System-Level Power Optimization

Chair: Rajesh K. Gupta
Organizers: David Ku, Rajesh Gupta
8.1 A Low Power Hardware/Software Partitioning Approach for Core-based Embedded Systems [p. 122]
Jörg Henkel

We present a novel approach that minimizes the power consumption of embedded core-based systems through hardware/software partitioning. Our approach is based on the idea of mapping clusters of operations/instructions to a core that yields a high utilization rate of the involved resources (ALUs, multipliers, shifters, ...), thus minimizing power consumption. Our approach is comprehensive since it takes into consideration the power consumption of a whole embedded system comprising a microprocessor core, application-specific (ASIC) core(s), cache cores, and a memory core. We report high reductions in power consumption, between 35% and 94%, at the cost of a relatively small additional hardware overhead of less than 16k cells, while maintaining or even slightly increasing the performance compared to the initial design.

8.2 Synthesis of Low-Overhead Interfaces for Power-Efficient Communication over Wide Buses [p. 128]
L. Benini, A. Macii, E. Macii, M. Poncino, R. Scarsi

In this paper we present algorithms for the synthesis of encoding and decoding interface logic that minimizes the average number of transitions on heavily-loaded global bus lines. The approach automatically constructs low-transition-activity codes and hardware implementations of encoders and decoders, given information on word-level statistics. We present an accurate method that is applicable to low-width buses, as well as approximate methods that scale well with bus width. Furthermore, we introduce an adaptive architecture that automatically adjusts its encoding to reduce transition activity on buses whose word-level statistics are not known a priori. Experimental results demonstrate that our approach significantly outperforms low-power encoding schemes presented in the past.

8.3 Power Conscious Fixed Priority Scheduling for Hard Real-Time Systems [p. 134]
Youngsoo Shin, Kiyoung Choi

Power efficient design of real-time systems based on programmable processors becomes more important as system functionality is increasingly realized through software. This paper presents a power-efficient version of a widely used fixed priority scheduling method. The method yields a power reduction by exploiting slack times, both those inherent in the system schedule and those arising from variations of execution times. The proposed run-time mechanism is simple enough to be implemented in most kernels. Experimental results show that the proposed scheduling method obtains a significant power reduction across several kinds of applications.

8.4 Memory Exploration for Low Power, Embedded Systems [p. 140]
Wen-Tsong Shiue, Chaitali Chakrabarti

In embedded system design, the designer has to choose an on-chip memory configuration that is suitable for a specific application. To aid in this design choice, we present a memory exploration strategy based on three performance metrics, namely, cache size, the number of processor cycles and the energy consumption. We show how the performance is affected by cache parameters such as cache size, line size, set associativity and tiling, and the off-chip data organization. We show the importance of including energy in the performance metrics, since an increase in the cache line size, cache size, tiling and set associativity reduces the number of cycles but does not necessarily reduce the energy consumption. These performance metrics help us find the minimum energy cache configuration if time is the hard constraint, or the minimum time cache configuration if energy is the hard constraint.
Keywords: Design automation, Low power design, Memory hierarchy, Low power embedded systems, Memory exploration and optimization, Cache simulator, Off-chip data assignment.


Session 9: Networked Consumer Applications

Chair: Asawaree Kalavade
Organizers: Raul Camposano, Asawaree Kalavade
9.1 Distributed Application Development with Inferno [p. 146]
Ravi Sharma

Distributed computing has taken on a new importance in order to meet the requirements of users demanding information "anytime, anywhere." Inferno facilitates the creation and support of distributed services in the new and emerging world of network environments. These environments include a world of varied terminals, network hardware, and protocols. The Namespace is a critical Inferno concept that enables the participants in this network environment to deliver resources that meet the many needs of diverse users. This paper discusses the elements of the Namespace technology. Its simple programming model and network transparency are demonstrated through the design of an application that can have components in several different nodes of a network. The simplicity and flexibility of the solution are highlighted.
Keywords: Inferno, InfernoSpaces, distributed applications, Styx, networking protocols.

9.2 Embedded Application Design Using a Real-Time OS [p. 151]
David Stepner, Nagarajan Rajan, David Hui

You read about it everywhere: distributed computing is the next revolution, perhaps relegating our desktop computers to the museum. But in fact the age of distributed computing has been around for quite a while. Every time we withdraw money from an ATM, start our car, use our cell phone, or microwave our dinner, microprocessors are at work performing dedicated functions. These are examples of just a very few of the thousands of "embedded systems." Until recently the vast majority of these embedded systems used 8- and 16-bit microprocessors, requiring little in the way of sophisticated software development tools, including an Operating System (OS). But the breaking of the $5 threshold for 32-bit processors is now driving an explosion in high-volume embedded applications. And a new trend towards integrating a full system-on-a-chip (SOC) promises a further dramatic expansion for 32-bit embedded applications as we head into the 21st century.

9.3 The JiniTM Architecture: Dynamic Services in a Flexible Network [p. 157]
Ken Arnold

This paper gives an overview of the JiniTM architecture, which provides a federated infrastructure for dynamic services in a network. Services may be large or small.
Keywords: Jini, Java, networks, distribution, distributed computing


Session 10: Functional Verification of Microprocessors

Chair: Rajesh Raina
Organizers: Raj Raina, Vivek Tiwari
10.1 Verifying Large-Scale Multiprocessors Using an Abstract Verification Environment [p. 163]
Dennis Abts, Mike Roberts

The complexity of large-scale multiprocessors has burdened the design and verification process making complexity-effective functional verification an elusive goal. We propose a solution to the verification of complex systems by introducing an abstracted verification environment called Raven. We show how Raven uses standard C/C++ to extend the capability of contemporary discrete-event logic simulators. We introduce new data types and a diagnostic programming interface (DPI) that provide the basis for Raven. Finally, we show results from an interconnect router ASIC used in a large-scale multiprocessor.

10.2 Functional Verification of the Equator MAP1000 Microprocessor [p. 169]
Jian Shen, Jacob Abraham, Dave Baker, Tony Hurson, Martin Kinkade, Gregorio Gervasio, Chen-Chau Chu, Guanghui Hu

The Advanced VLIW architecture of the Equator MAP1000 processor has many features that present significant verification challenges. We describe a functional verification methodology to address this complexity. In particular, we present an efficient method to generate directed assembly tests and a novel technique using the processor itself to control self-tests and check the results at speed using native instructions only. We also describe the use of emulation in both pre-silicon and post-silicon verification stages.

10.3 Micro Architecture Coverage Directed Generation of Test Programs [p. 175]
Shmuel Ur, Yoav Yadin

In this paper, we demonstrate a method for the generation of assembler test programs that systematically probe the micro architecture of a PowerPC superscalar processor. We show innovations such as ways to make small models for large designs, to predict, with cycle accuracy, the movement of instructions through the pipes (taking into account stalls and dependencies), and to generate test programs such that each reaches a new micro architectural state. We compare our method to the established practice of massive random generation and show that the quality of our tests, as measured by transition coverage, is much higher. The main contribution of this paper is not in theory, as the theory has been discussed in previous papers, but in describing how to translate this theory into practice, a task that was far from trivial.

10.4 Verification of a Microprocessor Using Real World Applications [p. 181]
You-Sung Chang, Seungjong Lee, In-Cheol Park, Chong-Min Kyung

In this paper, we describe a fast and convenient verification methodology for microprocessors that uses large, real application programs as test vectors. The verification environment is based on automatic consistency checking between the golden behavioral reference model and the target HDL model, which are run in a hand-shaking fashion. In conjunction with the automatic comparison facility, a new HDL saver is proposed to accelerate the verification process. The proposed saver allows 'restart' from the nearest checkpoint before the point of inconsistency detection, regardless of whether any modification to the source code has been made. This is in contrast to conventional savers, which do not allow restart after a design change or debugging. We have proved the effectiveness of the environment by applying it to a real-world example, i.e., a Pentium-compatible processor design process. It was shown that HDL verification with the proposed saver can be faster and more flexible than the hardware emulation approach. In short, it was demonstrated that restartability with source code modification capability is very important in obtaining short debugging turnaround time by eliminating a large number of redundant simulations.

10.5 High-Level Test Generation for Design Verification of Pipelined Microprocessors [p. 185]
David Van Campenhout, Trevor Mudge, John P. Hayes

This paper addresses test generation for design verification of pipelined microprocessors. To handle the complexity of these designs, our algorithm integrates high-level treatment of the datapath with low-level treatment of the controller, and employs a novel "pipe-frame" organization that exploits high-level knowledge about the operation of pipelines. We have implemented the proposed algorithm and used it to generate verification tests for design errors in a representative pipelined microprocessor.
Keywords: design verification, sequential test generation, high-level test generation, pipelined microprocessors.

10.6 Developing an Architecture Validation Suite - Application to the PowerPC Architecture [p. 189]
Laurent Fournier, Anatoly Koyfman, Moshe Levinger

This paper describes the efforts made and the results of creating an Architecture Validation Suite for the PowerPC architecture. Although many functional test suites are available for multiple architectures, little has been published on how these suites are developed and how their quality should be measured. This work provides some insights for approaching the difficult problem of building a high quality functional test suite for a given architecture. By defining a set of generic coverage models that combine program-based, specification-based, and sequential bug-driven models, it establishes the groundwork for the development of architecture validation suites for any architecture.


Session 11: Interconnect and Passive Component Modeling

Chair: Alan Mantooth
Organizers: Hidetoshi Onodera, Alan Mantooth
11.1 Passive Reduced-Order Models for Interconnect Simulation and their Computation via Krylov-Subspace Algorithms [p. 195]
Roland W. Freund

This paper studies a projection technique based on block Krylov subspaces for the computation of reduced-order models of multiport RLC circuits. We show that these models are always passive, yet they still match at least half as many moments as the corresponding reduced-order models based on matrix-Padé approximation. For RC, RL, and LC circuits, the reduced-order models obtained by projection and matrix-Padé approximation are identical. For general RLC circuits, we show how the projection technique can easily be incorporated into the SyMPVL algorithm to obtain passive reduced-order models, in addition to the high-accuracy matrix-Padé approximations that characterize SyMPVL, at essentially no extra computational costs. Connections between SyMPVL and the recently proposed reduced-order modeling algorithm PRIMA are also discussed. Numerical results for interconnect simulation problems are reported.

11.2 Model Order-Reduction of RC(L) Interconnect including Variational Analysis [p. 201]
Ying Liu, Lawrence T. Pileggi, Andrzej J. Strojwas

As interconnect feature sizes continue to scale to smaller dimensions, long interconnect can dominate the IC timing performance, but the interconnect parameter variations make it difficult to predict these dominant delay extremes. This paper presents a model order-reduction technique for RLC interconnect circuits that includes variational analysis to capture manufacturing variations. Matrix perturbation theory is combined with dominant-pole-analysis and Krylov-subspace-analysis methods to produce reduced-order models with direct inclusion of statistically independent manufacturing variations. The accuracy of the resulting variational reduced-order models is demonstrated on several industrial examples.

11.3 Robust Rational Function Approximation Algorithm for Model Generation [p. 207]
Carlos P. Coelho, Joel R. Phillips, L. Miguel Silveira

The problem of computing rational function approximations to tabulated frequency data is of paramount importance in the modeling arena. In this paper we present a method for generating a state space model from tabular data in the frequency domain that solves some of the numerical difficulties associated with the traditional fitting techniques used in linear least squares approximations. An extension to the MIMO case is also derived.


Session 12: New Concepts in Behavioral and Logic Synthesis

Chair: Timothy Y. Kam
Organizers: Kazutoshi Wakabayashi, Timothy Kam
12.1 Behavioral Network Graph: Unifying the Domains of High-Level and Logic Synthesis [p. 213]
Reinaldo A. Bergamaschi

High-level synthesis operates on internal models known as control/data flow graphs (CDFG) and produces a register-transfer-level (RTL) model of the hardware implementation for a given schedule. For high-level synthesis to be efficient it has to estimate the effect that a given algorithmic decision (e.g., scheduling, allocation) will have on the final hardware implementation (after logic synthesis). Currently, this effect cannot be measured accurately because the CDFGs are very distinct from the RTL/gate-level models used by logic synthesis, precluding interaction between high-level and logic synthesis. This paper presents a solution to this problem consisting of a novel internal model for synthesis which spans the domains of high-level and logic synthesis. This model is an RTL/gate-level network capable of representing all possible schedules that a given behavior may assume. This representation allows high-level synthesis algorithms to be formulated as logic transformations and effectively interleaved with logic synthesis.

12.2 Soft Scheduling in High Level Synthesis [p. 219]
Jianwen Zhu, Daniel D. Gajski

In this paper, we establish a theoretical framework for a new concept of scheduling called soft scheduling. In contrast to traditional schedulers, referred to as hard schedulers, soft schedulers make soft decisions, that is, decisions that can be adjusted later. Soft scheduling has the potential to alleviate the phase coupling problem that has plagued traditional high level synthesis (HLS), HLS for deep submicron design, and VLIW code generation. We then develop a specific soft scheduling formulation, called threaded schedule, for which a linear-time, optimal (in the sense of online optimality) algorithm is guaranteed.

12.3 Graph Coloring Algorithms for Fast Evaluation of Curtis Decompositions [p. 225]
Marek Perkowski, Rahul Malvi, Stan Grygiel, Mike Burns, Alan Mishchenko

Finding the minimum column multiplicity for a bound set of variables is an important problem in Curtis decomposition. To investigate this problem, we compared two graph-coloring programs: one exact, and one based on heuristics which can, however, give provably exact results on some types of graphs. These programs were incorporated into the multi-valued decomposer MVGUD. We demonstrate that exact graph coloring is not necessary for high-quality functional decomposers. Thus we improved the speed of solving the column multiplicity problem by orders of magnitude, with very little or no sacrifice in decomposition quality. Comparison of our experimental results with competing decomposers shows that for nearly all benchmarks our solutions are the best, and runtimes are usually not excessive.
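
To make the graph-coloring connection concrete: for incompletely specified functions, the minimum column multiplicity of a bound set equals the number of colors needed for the column incompatibility graph. The short self-contained Python sketch below shows a generic greedy coloring heuristic of the kind such decomposers can use; it is an illustration only, not either of the programs compared in the paper.

    # Greedy graph coloring in largest-degree-first order (illustration only).
    # graph: dict mapping each vertex to the set of its neighbors.
    def greedy_coloring(graph):
        order = sorted(graph, key=lambda v: len(graph[v]), reverse=True)
        color = {}
        for v in order:
            used = {color[u] for u in graph[v] if u in color}
            c = 0
            while c in used:
                c += 1
            color[v] = c
        return color

    # A 5-cycle needs 3 colors; greedy finds the optimum here, but on harder
    # graphs it may not -- exactly the exact-versus-heuristic trade-off studied.
    g = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}
    coloring = greedy_coloring(g)
    print(coloring, "colors used:", len(set(coloring.values())))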


Session 13: Performance and Power Optimization

Chair: Narendra V Shenoy
Organizers: Timothy Kam, Luciano Lavagno
13.1 Maximizing Performance by Retiming and Clock Skew Scheduling [p. 231]
Xun Liu, Marios C. Papaefthymiou, Eby G. Friedman

The application of retiming and clock skew scheduling for improving the operating speed of synchronous circuits under setup and hold constraints is investigated in this paper. It is shown that when both long and short paths are considered, circuits optimized by the simultaneous application of retiming and clock scheduling can achieve shorter clock periods than optimized circuits generated by applying either of the two techniques separately. A mixed-integer linear programming formulation and an efficient heuristic are given for the problem of simultaneous retiming and clock skew scheduling under setup and hold constraints. Experiments with benchmark circuits demonstrate the efficiency of this heuristic and the effectiveness of the combined optimization. All of the test circuits show improvement. For more than half of them, the maximum operating speed increases by more than 21% over the optimized circuits obtained by applying retiming or clock skew scheduling separately.
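
For readers unfamiliar with the constraint structure, a typical linear form of the constraints involved (written generically here, not as the paper's exact mixed-integer formulation) is, for a combinational path from register i to register j with maximum delay D_ij, minimum delay d_ij, clock skews s_i and s_j, and clock period T:

\[ s_i + D_{ij} + t_{setup} \le s_j + T \quad\text{(setup)} \qquad\qquad s_i + d_{ij} \ge s_j + t_{hold} \quad\text{(hold)} \]

Retiming additionally changes which register pairs (i, j) are connected by combinational paths, which is what couples the two optimizations and motivates solving them simultaneously.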

13.2 A Practical Approach to Multiple-Class Retiming [p. 237]
Klaus Eckl, Jean Christophe Madre, Peter Zepter, Christian Legl

Retiming is an optimization technique for synchronous circuits introduced by Leiserson and Saxe in 1983. Although powerful, retiming is not very widely used because it does not handle in a satisfying way circuits whose registers have load enable, synchronous and asynchronous set/clear inputs. We propose an extension of retiming whose basis is the characterization of registers into register classes. The new approach called multiple-class retiming handles circuits with an arbitrary number of register classes. We present results on a set of industrial FPGA designs showing the effectiveness and efficiency of multiple-class retiming.

13.3 Performance-driven Integration of Retiming and Resynthesis [p. 243]
Peichen Pan

We present a novel approach to performance optimization by integrating retiming and resynthesis. The approach is oblivious of register boundaries during resynthesis. In addition, it guides resynthesis by a criterion that is directly tied to the performance target. The proposed approach obtains provable results. Experimental results further demonstrate the effectiveness of our approach.

13.4 Kernel-Based Power Optimization of RTL Components: Exact and Approximate Extraction Algorithms [p. 247]
L. Benini, G. De Micheli, E. Macii, G. Odasso, M. Poncino

Sequential logic optimization based on the extraction of computational kernels has proved to be very promising when the target is power minimization. Efficient extraction of the kernels is at the basis of the optimization paradigm; the extraction procedures proposed so far exploit common logic synthesis transformations, and thus assume the availability of a gate-level description of the circuit being optimized. In this paper we present exact and approximate algorithms for the automatic extraction of computational kernels directly from the functional specification of a RTL component. We show the effectiveness of such algorithms by reporting the results of an extensive experimentation we have carried out on a large set of standard benchmarks, as well as on some designs with known functionality.


Session 14: Software-Driven Architecture

Chair: Patrick Scaglia
Organizers: Patrick Scaglia, Jim Rowson
14.1 Customized Instruction-Sets For Embedded Processors [p. 253]
Joseph A. Fisher

It is generally believed that there will be little more variety in CPU architectures, and thus the design of Instruction-set Architectures (ISAs) will have no role in the future of embedded CPU design. Nonetheless, it is argued in this paper that architectural variety will soon again become an important topic, with the major motivation being increased performance due to the customization of CPUs to their intended use. Five major barriers that could hinder customization are described, including the problems of existing binaries, toolchain development and maintenance costs, lost savings/higher chip cost due to the lower volumes of customized processors, added hardware development costs, and some factors related to the product development cycle for embedded products. Each is discussed, along with potential, sometimes surprising, solutions.
Keywords: Embedded processors, custom processors, instruction-level parallelism, VLIW, mass customization of toolchains

14.2 System-Level Hardware/Software Trade-offs [p. 258]
Samuel P. Harbison

Operating systems and development tools can impose overly general requirements that prevent an embedded system from achieving its hardware performance entitlement. It is time for embedded processor designers to become more involved with system software and tools.
Keywords: Digital signal processors, instruction set architecture, compiler, real-time operating system, software configuration.


Session 15: Panel: Functional Verification: Real Users, Real Problems, Real Opportunities

Chair: Jonah Mcleod
Organizers: Ghulam Nurie, Mike Murray
Panel Members: Nozar Azarakhsh, Glen Ewing, Paul Gingras, Scott Reedstrom, Chris Rowen [p. 260]

Achieving timely and comprehensive functional design verification is a ubiquitous problem in electronics. This panel offers perspectives on verification from designers of cardiac pacemakers, communications satellites, compute servers, networking equipment, and IP. The panelists will begin by dissecting the bottlenecks in their verification processes. For example, are simulators too slow? Or do test vector generation and coverage analysis consume the most time? The panelists will present their ideas for new EDA products which might accelerate verification. Finally, the panelists will discuss what compromises they would accept in order to achieve this acceleration. Would they learn a new HDL? Restrict their design styles? Forsake legacy designs?


Session 16: Floorplanning

Chair: Louis Scheffer
Organizers: Lou Scheffer, Patrick Groeneveld
16.1 A Timing-Driven Soft-Macro Resynthesis Method in Interaction with Chip Floorplanning [p. 262]
Hsiao-Pin Su, Allen C.-H. Wu, Youn-Long Lin

In this paper, we present a complete chip design method which incorporates a soft-macro resynthesis method in interaction with chip floorplanning for area and timing improvements. We develop a timing-driven design flow to exploit the interaction between HDL synthesis and physical design tasks. During each design iteration, we resynthesize soft macros with either a relaxed or a tightened timing constraint which is guided by the post-layout timing information. The goal is to produce area-efficient designs while satisfying the timing constraints. Experiments on a number of industrial designs have demonstrated that by effectively relaxing the timing constraint of the non-critical modules and tightening the timing constraint of the critical modules, a design can achieve 13% to 30% timing improvements with little to no increase in chip area.

16.2 An O-Tree Representation of Non-Slicing Floorplan and Its Applications [p. 268]
Pei-Ning Guo, Chung-Kuan Cheng, Takeshi Yoshimura

We present an ordered tree, O-tree, structure to represent non-slicing floorplans. The O-tree uses only n(2 + ⌈lg n⌉) bits for a floorplan of n rectangular blocks. We define an admissible placement as a compacted placement in both the x and y directions. For each admissible placement, we can find an O-tree representation. We show that the number of possible O-tree combinations is O(n! 2^(2n-2) / n^1.5). This is very concise compared to a sequence-pair representation, which has O((n!)^2) combinations. The approximate ratio of sequence-pair to O-tree combinations is O(n^2 (n/4e)^n). The complexity of O-tree is even smaller than that of a binary tree structure for slicing floorplans, which has O(n! 2^(5n-3) / n^1.5) combinations. Given an O-tree, it takes only linear time to construct the placement and its constraint graph. We have developed a deterministic floorplanning algorithm utilizing the structure of O-tree. Empirical results on MCNC benchmarks show promising performance with an average 16% improvement in wire length and 1% less dead space over a previous CPU-intensive cluster refinement method.
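
The gap between these representation counts is easy to appreciate numerically; the short Python sketch below simply evaluates the expressions quoted above for a few values of n (illustration only, using the formulas as stated):

    import math

    # Representation counts for n blocks, using the expressions quoted above.
    def otree(n):         # O-tree:              n! * 2^(2n-2) / n^1.5
        return math.factorial(n) * 2 ** (2 * n - 2) / n ** 1.5

    def sequence_pair(n): # sequence pair:       (n!)^2
        return math.factorial(n) ** 2

    def slicing_tree(n):  # slicing binary tree: n! * 2^(5n-3) / n^1.5
        return math.factorial(n) * 2 ** (5 * n - 3) / n ** 1.5

    def otree_bits(n):    # encoding length in bits: n * (2 + ceil(lg n))
        return n * (2 + math.ceil(math.log2(n)))

    for n in (10, 20, 30):
        print(f"n={n:2d}  bits={otree_bits(n):3d}  O-tree={otree(n):.2e}  "
              f"seq-pair={sequence_pair(n):.2e}  slicing={slicing_tree(n):.2e}")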

16.3 Module Placement for Analog Layout Using the Sequence-Pair Representation [p. 274]
Florin Balasa, Koen Lampaert

This paper addresses the problem of device-level placement for analog layout. Different from most existing approaches, which employ simulated annealing optimization algorithms operating on flat Gellat-Jepsen spatial representations [2], we use a more recent topological representation called the sequence-pair [7], which has the advantage of not being restricted to slicing floorplan topologies. In this paper, we explain how specific features essential to analog placement, such as the ability to deal with symmetry and device matching constraints, can be easily handled by employing the sequence-pair representation. Several analog examples substantiate the effectiveness of our placement tool, which is already in use in an industrial environment.


Session 17: Scheduling Algorithms for High Level Synthesis

Chair: Kazutoshi Wakabayashi
Organizers: Timothy Kam, Kazutoshi Wakabayashi
17.1 Genetic List Scheduling Algorithm for Scheduling and Allocation on a Loosely Coupled Heterogeneous Multiprocessor System [p. 280]
Martin Grajcar

Our problem consists of a partially ordered set of tasks communicating over a shared bus, which are to be mapped to a heterogeneous multiprocessor system. The goal is to minimize the makespan while satisfying constraints implied by data dependencies and exclusive resource usage. We present a new efficient heuristic approach based on list scheduling and genetic algorithms, which finds the optimum in a few seconds on average, even for large examples (up to 96 tasks) taken from [3]. The superiority of our algorithm compared to some other algorithms is demonstrated.
Keywords: heterogeneous system design, heuristic, genetic algorithms, list scheduling

17.2 Performance-Driven Scheduling with Bit-Level Chaining [p. 286]
Sanghun Park, Kiyoung Choi

This paper presents a new scheduling algorithm that maximizes the performance of a design under resource constraints in high-level synthesis. The algorithm tries to achieve maximal utilization of resources and minimal waste of clock slack time. Moreover, it exploits the technique of bit-level chaining to target high-speed designs. The algorithm tries non-integer multiple-cycling and chaining, which allows multiple-cycle execution of chained operations, to further increase performance at the cost of a small increase in the complexity of the control unit. Experimental results on several datapath-intensive designs show significant improvement in execution time over conventional scheduling algorithms.

17.3 A Model for Scheduling Protocol-Constrained Components and Environments [p. 292]
Steve Haynal, Forrest Brewer

This paper presents a technique for highly constrained event sequence scheduling. System resource protocols as well as an external interface protocol are described by non-deterministic finite automata (NFA). All valid schedules which adhere to interfacing constraints and resource bounds for flow graph described behavior are determined exactly. A model and scheduling results are presented for an extensive design example.
Keywords: Interface protocols, protocol-constrained scheduling, automata.

17.4 A Reordering Technique for Efficient Code Motion [p. 296]
Luiz C. V. dos Santos, Jochen A. G. Jess

Emerging design problems are prompting the use of code motion and speculative execution in high-level synthesis to shorten schedules and meet tight time-constraints. However, some code motions are not worth doing from a worst-case execution perspective. We propose a technique that selects the most promising code motions, thereby increasing the density of optimal solutions in the search space.


Session 18: Symbolic Model Checking

Chair: Andreas Kuehlmann
Organizers: Andreas Kuehlmann, Kunle Olukatun
18.1 Coverage Estimation for Symbolic Model Checking [p. 300]
Yatin Hoskote, Timothy Kam, Pei-Hsin Ho, Xudong Zhao

Although model checking is an exhaustive formal verification method, a bug can still escape detection if the erroneous behavior does not violate any verified property. We propose a coverage metric to estimate the "completeness" of a set of properties verified by model checking. A symbolic algorithm is presented to compute this metric for a subset of the CTL property specification language. It has the same order of computational complexity as a model checking algorithm. Our coverage estimator has been applied in the course of some real-world model checking projects. We uncovered several coverage holes including one that eventually led to the discovery of a bug that escaped the initial model checking effort.

18.2 Improving Symbolic Traversals by Means of Activity Profiles [p. 306]
Gianpiero Cabodi, Paolo Camurati, Stefano Quer

Symbolic techniques have undergone major improvements in the last few years. Nevertheless, they are still limited by the size of the involved BDDs, and extending their applicability to larger and real circuits is a key issue. Within this framework, we introduce "activity profiles" as a novel technique to characterize transition relations. In our methodology a learning phase is used to collect activity measures, related to time and space cost, for each BDD node of the transition relation. We use inexpensive reachability analysis as the learning technique, and we operate within the inner steps of image computations involving the transition relation and state sets. The above information can be used for several purposes. In particular, we present an application of activity profiles in the field of reachability analysis itself. We propose transition relation subsetting and partial traversals of the state transition graph. We show that a sequence of partial traversals is able to complete a reachability analysis problem with smaller memory requirements and improved time performance.

18.3 Improved Approximate Reachability Using Auxiliary State Variables [p. 312]
Shankar G. Govindaraju, David L. Dill, Jules P. Bergmann

Approximate reachability techniques trade off accuracy for the capacity to deal with bigger designs. Cho et al. [4] proposed partitioning the set of state bits into mutually disjoint subsets and doing symbolic forward reachability on the individual subsets to obtain an over-approximation of the reachable state set. Recently, this was improved upon [7] by dividing the set of state bits into various subsets that may overlap, and doing symbolic reachability over the overlapping subsets. In this paper, we further improve on this scheme by augmenting the set of state variables with auxiliary state variables. These auxiliary state variables are added to capture important internal conditions in the combinational logic. Approximate symbolic forward reachability on overlapping subsets of this augmented set of state variables yields much tighter approximations than earlier methods.

18.4 Symbolic Model Checking using SAT procedures instead of BDDs [p. 317]
A. Biere, A. Cimatti, E. M. Clarke, M. Fujita, Y. Zhu

In this paper, we study the application of propositional decision procedures in hardware verification. In particular, we apply bounded model checking, as introduced in [1], to equivalence and invariant checking. We present several optimizations that reduce the size of generated propositional formulas. In many instances, our SAT-based approach can significantly outperform BDD-based approaches. We observe that SAT-based techniques are particularly efficient in detecting errors in both combinational and sequential designs.
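
As a deliberately tiny illustration of the bounded model checking idea, the sketch below unrolls a toy sequential design for up to BOUND steps and searches for an input sequence that violates an invariant; exhaustive enumeration stands in for the SAT solver that would be run on the unrolled propositional formula. The 3-bit counter, the property, and all identifiers are hypothetical and not taken from the paper.

from itertools import product

WIDTH, BOUND = 3, 6

def step(state, inp):
    """Toy sequential design: a 3-bit counter that advances only when
    the single input bit is 1 (hypothetical example)."""
    return (state + inp) % (1 << WIDTH)

def violates(state):
    """Invariant being checked: the counter never reaches 5."""
    return state == 5

def bounded_check(init, bound):
    """Unroll the design up to `bound` steps and search for a violating
    input sequence -- the role a SAT solver plays on the unrolled formula."""
    for k in range(bound + 1):
        for inputs in product((0, 1), repeat=k):
            state = init
            for inp in inputs:
                state = step(state, inp)
            if violates(state):
                return k, inputs            # counterexample of length k
    return None

print(bounded_check(init=0, bound=BOUND))   # -> (5, (1, 1, 1, 1, 1))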


Session 19: DSP Design Techniques and Implementations

Chair: Jan M. Rabaey
Organizers: David Blaauw, Niel Weste
19.1 Power Efficient Mediaprocessors: Design Space Exploration [p. 321]
Johnson Kin, Chunho Lee, William H. Mangione-Smith, Miodrag Potkonjak

We present a framework for rapidly exploring the design space of low power application-specific programmable processors (ASPP), in particular media processors. We focus on a category of processors that are programmable yet optimized to reduce power consumption for a specific set of applications. The key components of the framework presented in this paper are a retargetable instruction level parallelism (ILP) compiler, processor simulators, a set of complete media applications written in a high level language and an architectural component selection algorithm. The fundamental idea behind the framework is that with the aid of a retargetable ILP compiler and simulators it is possible to arrange architectural parameters (e.g., the issue width, the size of cache memory units, the number of execution units, etc.) to meet low power design goals under area constraints.

19.2 Global Multimedia System Design Exploration using Accurate Memory Organization Feedback [p. 327]
Arnout Vandecappelle, Miguel Miranda, Erik Brockmeyer, Francky Catthoor, Diederik Verkest

Successful exploration of system-level design decisions is impossible without fast and accurate estimation of the impact on the system cost. In most multimedia applications, the dominant cost factor is related to the organization of the memory architecture. This paper presents a systematic approach which allows effective system-level exploration of memory organization design alternatives, based on accurate feedback by using our earlier developed tools. The effectiveness of this approach is illustrated on an industrial application. Applying our approach, a substantial part of the design search space has been explored in a very short time, resulting in a cost-efficient solution which meets all design constraints.

19.3 Implementation of a Scalable MPEG-4 Wavelet-Based Visual Texture Compression System [p. 333]
L. Nachtergaele, B. Vanhoof, M. Peón, G. Lafruit, J. Bormans, I. Bolsens

The realization of new MPEG-4 functionality, applicable to 3D graphics texture compression and image database access over the Internet, is demonstrated in a PC-based compression system. Applying our system-level design methodologies effectively removes all implementation bottlenecks. A first-of-a-kind ASIC, called Ozone, accelerates the Embedded Zero Tree based encoding and is capable of compressing 30 color CIF images per second.

19.4 A 10Mbit/s Upstream Cable Modem with Automatic Equalization [p. 337]
Patrick Schaumont, Radim Cmar, Serge Vernalde, Marc Engels

A fully digital QAM16 burst receiver ASIC is presented. The BO4 receiver demodulates at 10 Mbit/s and uses an advanced signal processing architecture that performs per-burst automatic equalization. It is a critical building block in a broadband access system for HFC networks. The chip was designed using a C++ based flow and is implemented as an 80 Kgate 0.7 um CMOS standard cell design.


Session 20: Panel: Cell Libraries - Build vs. Buy; Static vs. Dynamic

Chair: Kurt Keutzer
Organizer: Emil Girczyc
Panel Members: Kurt Wolf, David Pietromonaco, Jay Maxey, Jeff Lewis, Martin Lefebvre, Jeff Burns [p. 341]

Cell libraries determine the final density, performance, and power of most IC designs, much as construction materials determine the quality of a building. Nevertheless, the importance of libraries has often been a tertiary consideration in design projects, falling behind both design skill and tool quality. Choosing the right cell library for your project can have a significant impact on the characteristics of the circuit you design and, thus, the success of your product. Design teams need to consider a host of technical and business factors when selecting a library. Technical considerations include density, speed, power, design for reliability, and support for the designer's tools and flow. Business considerations include price, risk, time to market, and control of one's own destiny. This panel examines the technical as well as the current business issues associated with cell libraries. On the technical front, the advantages and disadvantages of static libraries versus "on the fly" or dynamic libraries will be discussed and quantified. On the business front, while designers have traditionally used the cell libraries provided by their silicon source (internal division or semiconductor vendor), recent changes in technology and business practices make several cell-library sources available to design groups: silicon vendors, third-party library vendors, and internal development. This panel will explore the business issues associated with the library choice and debate when designers should use each available source of cell libraries.


Session 21: Partitioning and Linear Placement

Chair: Chung-kuan Cheng
Organizers: Sharad Malik, Andrew Kahng
21.1 Multilevel k-way Hypergraph Partitioning [p. 343]
George Karypis, Vipin Kumar

In this paper, we present a new multilevel k-way hypergraph partitioning algorithm that substantially outperforms the existing state-of-the-art K-PM/LR algorithm for multi-way partitioning, both for optimizing local as well as global objectives. Experiments on the ISPD98 benchmark suite show that the partitionings produced by our scheme are on average 15% to 23% better than those produced by the K-PM/LR algorithm, both in terms of the hyperedge cut and the (K-1) metric. Furthermore, our algorithm is significantly faster, requiring 4 to 5 times less time than K-PM/LR.
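
For readers unfamiliar with the two objectives, the short sketch below (the hypergraph and partition are made up) computes them for a k-way partition: a hyperedge contributes 1 to the cut if it spans more than one block, and (number of blocks touched - 1) to the (K-1) metric.

def cut_metrics(hyperedges, part):
    """Hyperedge cut and (K-1) metric of a k-way partition.
    `hyperedges` is a list of vertex lists; `part[v]` is v's block."""
    cut = sum_k_minus_1 = 0
    for edge in hyperedges:
        blocks = {part[v] for v in edge}
        if len(blocks) > 1:
            cut += 1                          # edge spans more than one block
        sum_k_minus_1 += len(blocks) - 1      # (K-1) objective
    return cut, sum_k_minus_1

# Hypothetical 6-vertex hypergraph split into 3 blocks.
hyperedges = [[0, 1, 2], [2, 3], [3, 4, 5], [0, 5]]
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 2, 5: 2}
print(cut_metrics(hyperedges, part))          # -> (3, 3)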

21.2 Hypergraph Partitioning for VLSI CAD: Methodology for Heuristic Development, Experimentation and Reporting [p. 349]
Andrew E. Caldwell, Andrew B. Kahng, Andrew Kennings, Igor L. Markov

We illustrate how technical contributions in the VLSI CAD partitioning literature can fail to provide one or more of: (i) reproducible results and descriptions, (ii) an enabling account of the key understanding or insight behind a given contribution, and (iii) experimental evidence that is not only contrasted with the state-of-the-art, but also meaningful in light of the driving application. Such failings can lead to reporting of spurious and misguided conclusions. For example, new ideas may appear promising in the context of a weak experimental testbed, but in reality do not advance the state of the art. The resulting inefficiencies can be detrimental to the entire research community. We draw on several models (chiefly from the metaheuristics community) [5] for experimental research and reporting in the area of heuristics for hard problems, and suggest that such practices can be adopted within the VLSI CAD community. Our focus is on hypergraph partitioning.

21.3 Hypergraph Partitioning With Fixed Vertices [p. 355]
Andrew E. Caldwell, Andrew B. Kahng, Igor L. Markov

We empirically assess the implications of fixed terminals for hypergraph partitioning heuristics. Our experimental testbed incorporates a leading-edge multilevel hypergraph partitioner [14] [3] and IBM-internal circuits that have recently been released as part of the ISPD-98 Benchmark Suite [2, 1]. We find that the presence of fixed terminals can make a partitioning instance considerably easier (possibly to the point of being "trivial"): much less effort is needed to stably reach solution qualities that are near best-achievable. Toward development of partitioning heuristics specific to the fixed-terminals regime, we study the pass statistics of flat FM-based partitioning heuristics. Our data suggest that with more fixed terminals, the improvements in a pass are more likely to occur near the beginning of the pass. Restricting the length of passes - which degrades solution quality in the classic (free-hypergraph) context - is relatively safe for the fixed-terminals regime and considerably reduces run time of our FM-based heuristic implementations. We believe that the distinct nature of partitioning in the fixed-terminals regime has deep implications (i) for the design and use of partitioners in top-down placement, (ii) for the context in which VLSI hypergraph partitioning research is pursued, and (iii) for the development of new benchmark instances for the research community.

21.4 Relaxation and Clustering in a Local Search Framework: Application to Linear Placement [p. 360]
Sung-Woo Hur, John Lillis

This paper presents two primary results relevant to physical design problems in CAD/VLSI through a case study of the linear placement problem. First, a local search mechanism which incorporates a neighborhood operator based on constraint relaxation is proposed. The strategy exhibits many of the desirable features of analytical placement while retaining the flexibility and non-determinism of local search. The second and orthogonal contribution is in netlist clustering. We characterize local optima in the linear placement problem through a simple visualization tool, the displacement graph. This characterization reveals the relationship between clusters and local optima and motivates a dynamic clustering scheme designed specifically for escaping such local optima. Promising experimental results are reported.


Session 22: Technology Dependent Synthesis and Optimization

Chair: Massoud Pedram
Organizers: M. Marek-Sadowska, Jason Cong
22.1 An alpha-approximate Algorithm for Delay-Constraint Technology Mapping [p. 367]
Sumit Roy, Krishna Belkhale, Prithviraj Banerjee

Variants of delay-cost functions have been used in a class of technology mapping algorithms [1, 2, 3, 4]. We illustrate that in an industrial environment the delay-cost function can grow unboundedly and lead to very long run-times. The key contribution of this work is a novel bounded compression algorithm. We introduce the concept of an alpha delay-cost curve (alpha-DC-curve) that requires up to exponentially fewer delay-cost points to be stored compared to the full delay function. We prove that the solution obtained by this exponential compaction of the delay function is bounded to within alpha% of the optimal solution. We also suggest a large set of CAD applications which may benefit from using the alpha-DC-curve. Finally, we demonstrate the effectiveness of our compaction scheme on one such application, namely technology mapping for low power. Experimental results in an industrial environment show that we are more than 17 times faster than [2] on certain MCNC circuits.
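
One simple reading of the compaction idea, assuming the usual (delay, cost) Pareto-curve representation, is sketched below: within each delay band of relative width alpha, only the cheapest point is retained, so every dropped point is covered by a retained one whose delay is larger by at most a factor (1 + alpha). This is a simplified interpretation for illustration, not the authors' algorithm, and all data are hypothetical.

def prune_pareto(points):
    """Keep only non-dominated (delay, cost) points, sorted by delay."""
    pruned, best_cost = [], float("inf")
    for d, c in sorted(points):
        if c < best_cost:
            pruned.append((d, c))
            best_cost = c
    return pruned

def alpha_compress(curve, alpha):
    """Within each delay band [d, d*(1+alpha)], keep only the cheapest
    (largest-delay) point, so any dropped point is represented by one
    whose delay is at most a factor (1 + alpha) larger."""
    kept, i = [], 0
    while i < len(curve):
        j = i
        while j + 1 < len(curve) and curve[j + 1][0] <= curve[i][0] * (1 + alpha):
            j += 1
        kept.append(curve[j])
        i = j + 1
    return kept

curve = prune_pareto([(10, 9), (10.3, 8.5), (10.6, 8), (12, 5), (15, 2)])
print(alpha_compress(curve, alpha=0.05))   # 4 of the 5 points survive here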

22.2 Technology Mapping for FPGAs with Nonuniform Pin Delays and Fast Interconnections [p. 373]
Jason Cong, Yean-Yow Hwang, Songjie Xu

In this paper we study the technology mapping problem for FPGAs with nonuniform pin delays and fast interconnects. We develop the PinMap algorithm to compute the delay-optimal mapping solution for FPGAs with nonuniform pin delays in polynomial time based on efficient cut enumeration. Compared with FlowMap [5], which does not consider nonuniform pin delays, PinMap is able to reduce the circuit delay by 15% without any area penalty. For mapping with fast interconnects, we present two algorithms, an iterative-refinement-based algorithm, named ChainMap, and a Boolean-matching-based algorithm, named HeteroBM, which combines the Boolean matching techniques proposed in [2] and [3] with the heterogeneous technology mapping mechanism presented in [1]. It is shown that both ChainMap and HeteroBM are able to significantly reduce circuit delay by making efficient use of the FPGA's fast interconnect resources.

22.3 Automated Phase Assignment for the Synthesis of Low Power Domino Circuits [p. 379]
Priyadarshan Patra, Unni Narayanan

High-performance circuit techniques such as domino logic have migrated from the microprocessor world into more mainstream ASIC designs. The problem is that domino logic comes at a heavy cost in terms of total power dissipation. For mobile and portable devices such as laptops and cellular phones, high power dissipation is an unacceptable price to pay for high performance. Hence, we study synthesis techniques that allow designers to take advantage of the speed of domino circuits while minimizing total power consumption. Specifically, in this paper we present three results related to automated phase assignment for the synthesis of low-power domino circuits: (1) we demonstrate that the choice of phase assignment at the primary outputs of a circuit can significantly impact power dissipation in the domino block; (2) we propose a method for efficiently estimating power dissipation in a domino circuit; and (3) we apply the method to determine a phase assignment that minimizes power consumption in the final circuit implementation. Preliminary experimental results on a mixture of public domain benchmarks and real industry circuits show potential power savings as high as 34% over the minimum-area realization of the logic. Furthermore, the low-power synthesized circuits still meet timing constraints.


Session 23: Advances in Symbolic Simulation

Chair: Kunle Olukotun
Organizers: Kunle Olukotun, Andreas Kuehlmann
23.1 Enhancing Simulation with BDDs and ATPG [p. 385]
Malay K. Ganai, Adnan Aziz, Andreas Kuehlmann

We introduce SImulation Verification with Augmentation (SIVA), a tool for checking safety properties on digital hardware designs. SIVA integrates simulation with symbolic techniques for vector generation. Specifically, the core algorithm uses a combination of ATPG and BDDs to generate input vectors which cover behavior not excited by simulation. Experimental results demonstrate considerable improvement in state space coverage compared with either simulation or formal verification in isolation.
Keywords: Formal verification, ATPG, simulation, BDDs, coverage.

23.2 Cycle-Based Symbolic Simulation of Gate-Level Synchronous Circuits [p. 391]
Valeria Bertacco, Maurizio Damiani, Stefano Quer

Symbolic methods are often considered the state-of-the-art technique for validating digital circuits. Due to their complexity and unpredictable run-time behavior, however, their potential is currently limited to small-to-medium circuits. Logic simulation privileges capacity: it is nicely scalable, flexible, and has predictable run-time behavior. For this reason, it is the common choice for validating large circuits. Simulation, however, typically visits only a small fraction of the state space: the discovery of bugs relies heavily on the expertise of the designer of the test stimuli. In this paper we consider a symbolic simulation approach to the validation problem. Our objective is to trade off between formal and numerical methods in order to simulate a circuit with a "very large number" of input combinations and sequences in parallel. We demonstrate larger capacity with respect to symbolic techniques and better efficiency with respect to cycle-based simulation. We show that it is possible to symbolically simulate very large trace sets in parallel (over 100 symbolic inputs) for the largest ISCAS benchmark circuits, using 96 Mbytes of memory.

23.3 Exploiting Positive Equality and Partial Non-Consistency in the Formal Verification of Pipelined Microprocessors [p. 397]
Miroslav N. Velev, Randal E. Bryant

We study the applicability of the logic of Positive Equality with Uninterpreted Functions (PEUF) [2][3] to the verification of pipelined microprocessors with very large Instruction Set Architectures (ISAs). Abstraction of memory arrays and functional units is employed, while the control logic of the processors is kept intact from the original gate-level designs. PEUF is an extension of the logic of Equality with Uninterpreted Functions, introduced by Burch and Dill [4], that allows us to use distinct constants for the data operands and instruction addresses needed in the symbolic expression for the correctness criterion. We present several techniques that make PEUF scale very efficiently for the verification of pipelined microprocessors with large ISAs. These techniques are based on allowing a limited form of non-consistency in the uninterpreted functions, representing initial memory state and ALU behaviors. Our tool required less than 30 seconds of CPU time and 5 MB of memory to verify a 5-stage MIPS-like pipelined processor that implements 191 instructions of various classes. The verification was done by correspondence checking - a formal method, where a pipelined microprocessor is compared against a non-pipelined specification.

23.4 Formal Verification Using Parametric Representations of Boolean Constraints [p. 402]
Mark D. Aagaard, Robert B. Jones, Carl-Johan H. Seger

We describe the use of parametric representations of Boolean predicates to encode data-space constraints and significantly extend the capacity of formal verification. The constraints are used to decompose verifications by sets of case splits and to restrict verifications by validity conditions. Our technique is applicable to any symbolic simulator. We illustrate our technique on state-of-the-art Intel (R) designs, without removing latches or modifying the circuits in any way.


Session 24: System Level Design Methodology and Case Studies

Chair: Ivo Bolsens
Organizers: Asawaree Kalavade, Kenji Yoshida
24.1 Vertical Benchmarks for CAD [p. 408]
Christopher Inacio, Herman Schmit, David Nagle, Andrew Ryan, Donald E. Thomas, Yingfai Tong, Ben Klass

Vertical benchmarks are complex system designs represented at multiple levels of abstraction. More effective than component-based CAD benchmarks, vertical benchmarks enable quantitative comparison of CAD techniques within or across design flows. This work describes the notion of vertical benchmarks and presents our benchmark, which is based on a commercial DSP, by comparing two alternative design flows.

24.2 A Framework for User Assisted Design Space Exploration [p. 414]
X. Hu, G. W. Greenwood, S. Ravichandran, G. Quan

Much effort in hardware/software co-design has been devoted to developing "push-button" types of tools for automatic hardware/software partitioning. However, given the highly complex nature of embedded system design, user-guided design exploration can be more effective. In this paper, we propose a framework for designer-assisted partitioning that can be used in conjunction with any given search strategy. A key component of this framework is the visualization of the design space without enumerating all possible design configurations. Furthermore, this design space representation provides a straightforward way for a designer to identify promising partitions and hence guide the subsequent exploration process. Experiments have shown the effectiveness of this approach.

24.3 Fast Prototyping: a System Design Flow Applied to a Complex System-On-Chip Multiprocessor Design [p. 420]
Benoit Clement, Richard Hersemeule, Etienne Lantreibecq, Bernard Ramanadin, Pierre Coulomb, Francois Pogodalla

This paper describes a new design flow that significantly reduces time-to-market for highly complex multiprocessor-based System-On-Chip designs. This flow, called Fast Prototyping, enables concurrent hardware and software development, early verification, and productive re-use of intellectual property. We describe how, using this innovative system design flow, which combines different technologies such as C modeling, emulation, hard Virtual Component re-use, and CoWare N2C(TM), we achieve better productivity on a multi-processor SOC design.
Keywords: System design, Hardware/Software (HW/SW) co-design, Virtual Component (VC) re-use, Fast Prototyping, system verification, system modeling.

24.4 Verification and Management of a Multimillion-Gate Embedded Core Design [p. 425]
Johann Notbauer, Thomas Albrecht, Georg Niedrist, Stefan Rohringer

Verification is one of the most critical and time-consuming tasks in today's design processes. This paper demonstrates the verification process of an 8.8-million-gate design using HW simulation and cycle-simulation-based HW/SW co-verification. The main focuses are the overall methodology, testbench management, the verification task itself, and defect management. The chosen verification process was a real success: the quality of the designed hardware and software was increased, and the time needed for integration and test of the design in the context of the overall system was greatly reduced.


Session 25: Panel: Parasitic Extraction Accuracy: How Much Is Enough?

Chair: Paul Franzon
Organizer: Mark Basel
Panel Members: Mark Basel, Aki Fujimura, Sharad Mehrotra, Ron Preston, Robin C. Sarma, Marty Walker [p. 429]

The effect of parasitic elements on chip performance is well known; however, this effect is becoming ever more critical to a chip's performance. To cope with this design hazard, a number of parasitic extraction tools and methodology approaches are available to the circuit designer, some developed for the general market and some for internal use. With each product making its own claims and taking its own approach, deciding on a tool or extraction strategy is a confusing exercise. The purpose of this panel is to help the designer and CAD manager determine how to properly compare extractors and how to put them to use. The panel will address a number of questions, including: What is the best way to accurately handle parasitic extraction while dealing with increasingly large and complex (SOC, mixed-signal) chips? How can the required extraction accuracy for a particular design situation be determined and achieved? How is extraction accuracy measured? How can extractors be compared and contrasted with some degree of confidence? Can circuit design techniques and/or tool methodologies be used to reduce the extraction effort? How should process variations or inductive effects be handled? What is the best way to deal with the data volume problem?


Session 26: Circuit Optimization for Power and Performance

Chair: Andrew B. Kahng
Organizers: Andrew Kahng, Hidetoshi Onodera
26.1 Mixed-Vth (MVT) CMOS Circuit Design Methodology for Low Power Applications [p. 430]
Liqiong Wei, Zhanping Chen, Kaushik Roy, Yibin Ye, Vivek De

Dual-threshold techniques have been proposed to reduce leakage power in low-voltage, low-power circuits by applying a high threshold voltage to some transistors in non-critical paths, while a low threshold is used in critical path(s) to maintain performance. The Mixed-Vth (MVT) static CMOS design technique allows different thresholds within a logic gate, thereby increasing the number of high-threshold transistors compared to the gate-level dual-threshold technique. In this paper, a methodology for MVT CMOS circuit design is presented. Different MVT CMOS circuit schemes are considered and three algorithms are proposed for transistor-level threshold assignment under performance constraints. Results indicate that the MVT CMOS design technique can provide about 20% more leakage reduction than the corresponding gate-level dual-threshold technique.

26.2 Stand-by Power Minimization through Simultaneous Threshold Voltage Selection and Circuit Sizing [p. 436]
Supamas Sirichotiyakul, Tim Edwards, Chanhee Oh, Jingyan Zuo, Abhijit Dharchoudhury, Rajendran V. Panda, David Blaauw

We present a new approach for estimation and optimization of the average stand-by power dissipation in large MOS digital circuits. To overcome the complexity of state dependence in average leakage estimation, we introduce the concept of "dominant leakage states" and use state probabilities. Our method achieves speed-ups of 3 to 4 orders of magnitude over exhaustive SPICE simulations while maintaining accuracy within 9% of SPICE. This accurate estimation is used in a new sensitivity-based leakage and performance optimization approach for circuits using dual-Vt processes. In tests on a variety of industrial circuits, this approach was able to obtain 81-100% of the performance achievable with all low-Vt transistors, but with 1/3 to 1/6 the stand-by current.
Keywords: Low-power-design, Dual-Vt, Leakage

26.3 Leakage Control With Efficient Use of Transistor Stacks in Single Threshold CMOS [p. 442]
Mark C. Johnson, Dinesh Somasekhar, Kaushik Roy

The state dependence of leakage can be exploited to obtain modest leakage savings in CMOS circuits. However, one can modify circuits considering state dependence and achieve larger savings. We identify a low leakage state and insert leakage control transistors only where needed. Leakage levels are on the order of 35% to 90% lower than those obtained by state dependence alone.
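
The underlying stack effect can be illustrated with a toy model: a gate's leakage depends on how many of its series transistors are OFF, so stand-by leakage depends on the input state, and a low-leakage state can be found by searching over input vectors. The per-gate leakage numbers, the two-NAND circuit, and the exhaustive search below are hypothetical simplifications, not the authors' method.

from itertools import product

# Hypothetical per-gate leakage model (arbitrary units): a NAND gate whose
# NMOS pull-down stack has more OFF transistors in series leaks less.
def nand2_leakage(a, b):
    off_in_stack = (a == 0) + (b == 0)
    return {0: 10.0, 1: 4.0, 2: 0.8}[off_in_stack]

def circuit_leakage(x):
    """Leakage of a toy two-NAND circuit as a function of its input state."""
    a, b, c = x
    n1 = 1 - (a & b)                      # first NAND2 output
    return nand2_leakage(a, b) + nand2_leakage(n1, c)

best = min(product((0, 1), repeat=3), key=circuit_leakage)
print(best, circuit_leakage(best))        # -> (0, 0, 0) 4.8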

26.4 A Practical Gate Resizing Technique Considering Glitch Reduction for Low Power Design [p. 446]
Masanori Hashimoto, Hidetoshi Onodera, Keikichi Tamaru

We propose a method for power optimization that considers glitch reduction by gate sizing based on statistical estimation of glitch transitions. Our method reduces not only capacitive and short-circuit power consumption but also the power dissipated by glitches, which has not been exploited previously. The effect of our method is verified experimentally using 8 benchmark circuits with a 0.6 um standard cell library. Our method reduces power dissipation by a further 9.8% on average, and by up to 23.0%, relative to the minimum-sized circuits. We also verify that our method is effective under manufacturing variation.

26.5 Gradient-Based Optimization of Custom Circuits Using a Static-Timing Formulation [p. 452]
A. R. Conn, I. M. Elfadel, W. W. Molzen, Jr., P.R. O'Brien, P.N. Strenski, C. Visweswariah, C.B. Whan

This paper describes a method of optimally sizing digital circuits on a static-timing basis. All paths through the logic are considered simultaneously and no input patterns need be specified by the user. The method is unique in that it is based on gradient-based, nonlinear optimization and can accommodate transistor-level schematics without the need for pre-characterization. It employs efficient time-domain simulation and gradient computation for each channel-connected component. A large-scale, general-purpose, nonlinear optimization package is used to solve the tuning problem. A prototype tuner has been developed that accommodates combinational circuits consisting of parameterized library cells. Numerical results are presented.
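
A stripped-down analogue of the approach, assuming a simple RC (logical-effort-style) delay model for a chain of gates rather than transistor-level simulation: the path delay is a smooth function of the gate sizes, its gradient is evaluated numerically, and plain gradient descent reduces the delay. The three-stage example and all constants are hypothetical; the paper's formulation handles full static-timing graphs, transistor-level netlists, and a general-purpose nonlinear optimizer.

# Toy gate-chain sizing: stage i with size w[i] has delay
# (R / w[i]) * (C_IN * w[i+1] + C_PAR * w[i]); the last stage drives C_LOAD.
R, C_IN, C_PAR, C_LOAD = 1.0, 1.0, 0.5, 64.0

def path_delay(w):
    delay = 0.0
    for i, wi in enumerate(w):
        c_next = C_IN * w[i + 1] if i + 1 < len(w) else C_LOAD
        delay += (R / wi) * (c_next + C_PAR * wi)
    return delay

def numeric_grad(f, w, h=1e-6):
    base, grads = f(w), []
    for i in range(len(w)):
        bumped = list(w)
        bumped[i] += h
        grads.append((f(bumped) - base) / h)
    return grads

w = [1.0, 1.0, 1.0]                      # stage sizes; the first stays fixed
start = path_delay(w)
for _ in range(5000):                    # plain gradient descent on stages 2..n
    g = numeric_grad(path_delay, w)
    w = [w[0]] + [max(0.1, wi - 0.05 * gi) for wi, gi in zip(w[1:], g[1:])]
print(round(start, 1), "->", round(path_delay(w), 1), [round(x, 2) for x in w])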


Session 27: Interaction of Synthesis and Layout

Chair: Lukas Van Ginneken
Organizers: Jason Cong, M. Marek-Sadowska
27.1 Simultaneous Circuit Partitioning/Clustering with Retiming for Performance Optimization [p. 460]
Jason Cong, Honching Li, Chang Wu

Partitioning and clustering are crucial steps in circuit layout for handling the large-scale designs enabled by deep submicron technologies. Retiming is an important sequential logic optimization technique for reducing the clock period by optimally repositioning flip-flops [7]. In our exploration of a logical and physical co-design flow, we developed a highly efficient algorithm for combining retiming with circuit partitioning or clustering for clock period minimization. Compared with the recent result by Pan et al. [10] on quasi-optimal clustering with retiming, our algorithm is able to reduce both runtime and memory requirements by one order of magnitude without losing quality. Our results show that our algorithm can be over 1000X faster for large designs.

27.2 Wave Steering in YADDs: A Novel Non-Iterative Synthesis and Layout Technique [p. 466]
Arindam Mukherjee, Ranganathan Sudhakar, Malgorzata Marek-Sadowska, Stephen I. Long

In this paper we present a new synthesis and layout approach that avoids the normal iterations between synthesis, technology mapping and layout, and increases routing by abutment. It produces shorter and more predictable delays, and sometimes even layouts with reduced areas. This scheme equalizes delays along different paths, which makes low granularity pipelining a reality, and hence we can clock these circuits at much higher frequencies, compared to what is possible in a conventionally designed circuit. Since any circuit can be clocked at a fixed rate, this method does not require timing-driven synthesis. We propose the logic and layout synthesis schemes and algorithms, discuss the physical layout part of the process, and support our methodology with simulation results.

27.3 MERLIN: Semi-Order-Independent Hierarchical Buffered Routing Tree Generation Using Local Neighborhood Search [p. 472]
Amir H. Salek, Jinan Lou, Massoud Pedram

This paper presents a solution to the problem of performance-driven buffered routing tree generation in electronic circuits. Using a novel bottom-up construction algorithm and a local neighborhood search strategy, this method finds the best solution of the problem in an exponential size solution sub-space in polynomial time. The output is a hierarchical buffered rectilinear Steiner routing tree that connects the driver of a net to its sink nodes. The two variants of the problem, i.e. maximizing the driver required time subject to a total buffer area constraint and minimizing the total buffer area subject to a minimum driver required time constraint, are handled by propagating three-dimensional solution curves during the construction phase. Experimental results prove the effectiveness of this technique compared to the other solutions for this problem.

27.4 Buffer Insertion With Accurate Gate and Interconnect Delay Computation [p. 479]
Charles J. Alpert, Anirudh Devgan, Stephen T. Quay

Buffer insertion has become a critical step in deep submicron design, and several buffer insertion/sizing algorithms have been proposed in the literature. However, most of these methods use simplified interconnect and gate delay models. These models may lead to inferior solutions since the optimized objective is only an approximation for the actual delay. We propose to integrate accurate wire and gate delay models into Van Ginneken's buffer insertion algorithm [18] via the propagation of moments and driving point admittances up the routing tree. We have verified the effectiveness of our approach on an industry design.
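
For context, below is a minimal sketch of the classic Van Ginneken-style dynamic program that such methods build on, using the plain Elmore model rather than the moment propagation proposed in the paper. Each candidate at a node is a (downstream capacitance, required arrival time) pair; wires degrade the required time, a buffer may optionally be inserted, branches are merged, and dominated candidates are pruned. All wire, buffer, and sink numbers are hypothetical.

# Candidates are (downstream capacitance, required arrival time) pairs.
R_WIRE, C_WIRE = 0.5, 0.2            # wire R and C per unit length
R_BUF, C_BUF, D_BUF = 1.0, 0.5, 0.3  # buffer output R, input C, intrinsic delay

def add_wire(cands, length):
    r, c = R_WIRE * length, C_WIRE * length
    return [(cap + c, req - r * (c / 2 + cap)) for cap, req in cands]

def add_buffer_option(cands):
    """Optionally insert a buffer; it drives the best downstream candidate."""
    best = max(cands, key=lambda s: s[1] - R_BUF * s[0])
    buffered = (C_BUF, best[1] - R_BUF * best[0] - D_BUF)
    return cands + [buffered]

def merge(left, right):
    """At a branch point, capacitances add and required times take the min."""
    return [(cl + cr, min(ql, qr)) for cl, ql in left for cr, qr in right]

def prune(cands):
    """Discard dominated candidates (higher cap and worse required time)."""
    kept, best_req = [], -float("inf")
    for cap, req in sorted(cands):
        if req > best_req:
            kept.append((cap, req))
            best_req = req
    return kept

# Two sinks (load cap, required time) wired to a branch point, then to the driver.
sink_a = add_wire([(1.0, 10.0)], length=4)
sink_b = add_wire([(0.5, 12.0)], length=6)
cands = prune(merge(add_buffer_option(sink_a), add_buffer_option(sink_b)))
cands = prune(add_buffer_option(add_wire(cands, length=5)))
print(max(cands, key=lambda s: s[1]))   # candidate the root driver would pick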


Session 28: Interconnect Issues for Deep-Submicron Design

Chair: Mojy C. Chian
Organizers: Anantha Chandrakasan, Tadahiro Kuroda
28.1 Reducing Cross-Coupling Among Interconnect Wires in Deep-Submicron Datapath Design [p. 485]
Joon-Seo Yim, Chong-Min Kyung

As CMOS technology enters the deep submicron design era, the lateral inter-wire coupling capacitance becomes the dominant part of the load capacitance and makes RC delay on bus structures very data-dependent. Reducing the cross-coupling capacitance is crucial for achieving high-speed as well as low-power operation. In this paper, we propose two interconnect layout design methodologies for minimizing the coupling effect in the design of full-custom datapaths. First, we describe a control signal ordering scheme that was shown to reduce switching power consumption by 10% and wire delay by 15% for a given set of benchmark examples. Second, a track assignment algorithm based on evolutionary programming is used to minimize the cross-coupling capacitance. Experimental results show that chip performance improvements of as much as 40% can be obtained using the proposed interconnect schemes in various stages of datapath layout optimization.
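
The data dependence can be seen with the usual switching-factor (Miller) approximation: the coupling capacitance to a neighbor counts roughly twice when the neighbor switches in the opposite direction, once when it is quiet, and not at all when it switches the same way. The tiny sketch below uses hypothetical resistance and capacitance values and is not the authors' delay model; it only shows how the effective load, and hence an RC-style delay, swings with neighbor activity.

# Data-dependent effective load of a bus wire due to lateral coupling.
R_DRV = 1.0          # driver resistance
C_GND = 0.30         # wire capacitance to ground
C_CPL = 0.45         # lateral coupling capacitance to each neighbor

MILLER = {"same": 0.0, "quiet": 1.0, "opposite": 2.0}

def victim_delay(left, right):
    c_eff = C_GND + C_CPL * (MILLER[left] + MILLER[right])
    return 0.69 * R_DRV * c_eff          # simple RC (Elmore-style) delay

for case in [("same", "same"), ("quiet", "quiet"), ("opposite", "opposite")]:
    print(case, round(victim_delay(*case), 3))   # 0.207, 0.828, 1.449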

28.2 A Novel VLSI Layout Fabric for Deep Sub-Micron Applications [p. 491]
Sunil P. Khatri, Amit Mehrotra, Robert K. Brayton, Alberto Sangiovanni-Vincentelli, Ralph H.J.M. Otten

We propose a new VLSI layout methodology which addresses the main problems faced in Deep Sub-Micron (DSM) integrated circuit design. Our layout "fabric" scheme eliminates the conventional notion of power and ground routing on the integrated circuit die. Instead, power and ground are essentially "pre-routed" all over the die. By a clever arrangement of power/ground and signal pins, we almost completely eliminate the capacitive effects between signal wires. Additionally, we get a power and ground distribution network with a very low resistance at any point on the die. Another advantage of our scheme is that the arrangement of conductors ensures that on-chip inductances are uniformly negligible. Finally, characterization of the circuit delays, capacitances and resistances becomes extremely simple in our scheme, and needs to be done only once for a design. We show how the uniform parasitics of our fabric give rise to a reliable and predictable design. We have implemented our scheme using public domain layout software. Preliminary results show that it holds much promise as the layout methodology of choice in DSM integrated circuit design.

28.3 Improved Delay Prediction for On-Chip Buses [p. 497]
Real G. Pomerleau, Paul D. Franzon, Griff L. Bilbro

In this paper, we introduce a simple procedure to predict wiring delay in bi-directional buses and a way of properly sizing the driver for each of its ports. In addition, we propose a simple calibration procedure to improve the delay prediction over the Elmore delay of the RC tree. The technique is fast, accurate, and ideal for implementation in a floorplanner during behavioral synthesis.
Keywords: RC wiring delay, High-Level Synthesis, Floorplanning, Buffer Optimization, Interconnect optimization.
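
For reference, the Elmore delay that the proposed procedure calibrates against is straightforward to compute on an RC tree: the delay to a node is the sum, over every resistor on the path from the driver, of that resistance times the total capacitance downstream of it. The small tree and element values below are hypothetical.

# Elmore delay of an RC tree.  Topology and values are illustrative only.
tree = {                     # node -> (parent, R to parent, C at node)
    "n1": ("root", 2.0, 0.10),
    "n2": ("n1",   1.5, 0.05),
    "n3": ("n1",   3.0, 0.20),   # sink
    "n4": ("n2",   2.5, 0.15),   # sink
}

def downstream_cap(node):
    children = [k for k, (p, _, _) in tree.items() if p == node]
    return tree[node][2] + sum(downstream_cap(c) for c in children)

def elmore(node):
    delay = 0.0
    while node != "root":
        parent, r, _ = tree[node]
        delay += r * downstream_cap(node)   # R of this segment x cap below it
        node = parent
    return delay

for sink in ("n3", "n4"):
    print(sink, round(elmore(sink), 3))     # -> n3 1.6, n4 1.675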

28.4 Noise-Aware Repeater Insertion and Wire Sizing for On-Chip Interconnect Using Hierarchical Moment-Matching [p. 502]
Chung-Ping Chen, Noel Menezes

Recently, several algorithms for interconnect optimization via repeater insertion and wire sizing have appeared based on the Elmore delay model. Using the Devgan noise metric [6], a noise-aware repeater insertion technique has also been proposed recently. Recognizing the conservatism of these delay and noise models, we propose a moment-matching-based technique for interconnect optimization that allows much higher accuracy while preserving the hierarchical nature of Elmore-delay-based techniques. We also present a novel approach to noise computation that accurately captures the effect of several attackers in linear time with respect to the number of attackers and wire segments. Our practical experiments with industrial nets indicate that the reduction in error afforded by these more accurate models justifies the increase in runtime for aggressive designs, which are our targeted domain. Our algorithm yields delay and noise estimates within 5% of circuit simulation results.

28.5 Interconnect Estimation and Planning for Deep Submicron Designs [p. 507]
Jason Cong, David Zhigang Pan

This paper reports two sets of important results in our exploration of an interconnect-centric design flow for deep submicron (DSM) designs: (i) We obtain efficient yet accurate wiring area estimation models for optimal wire sizing (OWS). We also propose a simple metric to guide area-efficient performance optimization; (ii) Guided by our interconnect estimation models, we study the interconnect architecture planning problem for wire-width designs. We achieve a rather surprising result which suggests that two pre-determined wire widths per metal layer are sufficient to achieve near-optimal performance. This result will greatly simplify the routing architecture and tools for DSM designs. We believe that our interconnect estimation and planning results will have a significant impact on DSM designs.


Session 29: Design Entry and Specification

Chair: Asawaree Kalavade
Organizers: Kenji Yoshida, Ivo Bolsens
29.1 ECL: A Specification Environment for System-Level Design [p. 511]
Luciano Lavagno, Ellen M. Sentovich

We propose a new specification environment for system-level design called ECL. It combines the Esterel and C languages to provide a more versatile means for specifying heterogeneous designs. It can be viewed as the addition to C of explicit constructs from Esterel for waiting, concurrency and pre-emption, and thus makes these operations easier to specify and more apparent. An ECL specification is compiled into a reactive part (an extended finite state machine representing most of the ECL program), and a pure data looping part, thus nicely supporting a mix of control and data. The reactive part can be robustly estimated and synthesized to hardware or software, while the data looping part is implemented in software as specified.

29.2 Representation of Function Variants for Embedded System Optimization and Synthesis [p. 517]
K. Richter, D. Ziegenbein, R. Ernst, L. Thiele, J. Teich

Many embedded systems are implemented with a set of alternative function variants to adapt the system to different applications or environments. This paper proposes a novel approach for the coherent representation and selection of function variants in the different phases of the design process. In this context, the modeling of re-configuration of system parts is supported in a natural way. Using a real example from the video processing domain, the approach is explained and validated.

29.3 Vex - A CAD Toolbox [p. 523]
Jules P. Bergmann, Mark A. Horowitz

The increasing size and complexity of designs is making the use of hardware description languages (HDLs), such as Verilog and VHDL, more prevalent. They are able to describe both the initial design and intermediate representations of the design as it is readied for fabrication. For large designs, there inevitably are problems with the tool flow that require custom tools to be created. These tools must be able to access and modify the HDL for the design, requirements that often dwarf the tools’ actual functionality, making them difficult to create without a large effort or cutting corners. During the FLASH project at Stanford we created Vex -- a toolbox of components for dealing with Verilog, tied together with an interactive scripting language -- that simplifies the creation of these tools. It was used to create a number of tools that were critical to our design's tape-out and has also been useful in creating design exploration and research tools.

29.4 Constraint Management for Collaborative Electronic Design [p. 529]
Juan Antonio Carballo, Stephen W. Director

Today's complex design processes feature large numbers of varied, interdependent constraints, which often cross interdisciplinary boundaries. Therefore, a computer-supported constraint management methodology that automatically detects violations early in the design process, provides useful violation notification to guide redesign efforts, and can be integrated with conventional CAD software can be a great aid to the designer. We present such a methodology and describe its implementation in the Minerva II design process manager, along with an example design session.


Session 30: Panel: MEMS CAD Beyond Multi-Million Transistors

Chair: Kris Pister
Organizers: Takahide Inoue, Kris Pister
Panel Members: Albert P. Pisano, Nicholas Swart, Mike Horton, John Rychcik, John R. Gilbert, Gerry K. Fedder [p. 535]

Existing MEMS products boast more than a million electrical and mechanical components on a single chip. With several billion dollars in sales in 1997 and exponential growth it is clear that MEMS fabrication technology has leveraged decades of IC expertise to great advantage. As a result, the fabrication capabilities far outstrip the design capabilities in both industry and university environments. MEMS CAD tools are only now beginning to leverage corresponding decades of IC CAD expertise to address the exciting and unique electro-mechanical co-design problems from the physical through system level design. Perspectives represented in this panel include the industry needs at the system and device levels, tool developers at all levels, and the government research vision.
Questions to be addressed include:
- How do the current tools help or impede the development of new products?
- What are the breakthrough markets for MEMS and how will they challenge the existing CAD tools?
- What are the challenges for the next generation?


Session 31: Extraction of Capacitance and Substrate Models

Chair: David D. Ling
Organizers: Joel Phillips, Louis Scheffer
31.1 A Multiscale Method for Fast Capacitance Extraction [p. 537]
Johannes Tausch, Jacob White

The many levels of metal used in aggressive deep submicron process technologies have made fast and accurate capacitance extraction of complicated 3-D geometries of conductors essential, and many novel approaches have recently been developed. In this paper we present an accelerated boundary-element method, like the well-known FASTCAP program, but instead of using an adaptive fast multipole algorithm we use a numerically generated multiscale basis for constructing a sparse representation of the dense boundary-element matrix. Results are presented to demonstrate that the multiscale method can be applied to complicated geometries, generates a sparser boundary-element matrix than the adaptive fast multipole method, and provides an inexpensive but effective preconditioner. Examples are used to show that the better sparsification and the effective preconditioner yield a method that can be 25 times faster than FASTCAP while still maintaining accuracy in the smallest coupling capacitances.

31.2 Efficient Capacitance Computation for Structures with Non-Uniform Adaptive Surface Meshes [p. 543]
Vikram Jandhyala, Scott Savage, Eric Bracken, Zoltan Cendes

Circuit parasitic extraction problems are typically formulated using discretized integral equations that use basis functions defined over tessellated surface meshes. The Fast Multipole Method (FMM) accelerates the solution process by rapidly evaluating potentials and fields due to these basis functions. Unfortunately, the FMM suffers from the drawback that its efficiency degrades if the surface mesh has disparately sized elements in close proximity to each other. Closely spaced, non-uniformly sized elements can appear in realistic situations for a variety of reasons: owing to mesh refinement, due to accurate modeling requirements for fine structural features, and because of the presence of thin doubly-walled structures. In this paper, modifications to the standard multilevel FMM are presented that permit efficient potential and field evaluation over such non-uniform meshes. The efficiency of the new technique is demonstrated through examples involving large surface meshes with non-uniformly sized elements in close proximity.

31.3 Substrate Modeling and Lumped Substrate Resistance Extraction for CMOS ESD/Latchup Circuit Simulation [p. 549]
Tong Li, Ching-Han Tsai, Elyse Rosenbaum, Sung-Mo Kang

Due to interactions through the common silicon substrate, the layout and placement of devices and substrate contacts can have a significant impact on a circuit's ESD (Electrostatic Discharge) and latchup behavior in CMOS technologies. Proper substrate modeling is thus required for circuit-level simulation to predict a circuit's ESD performance and latchup immunity. In this work we propose a new substrate resistance network model and develop a novel substrate resistance extraction method that accurately calculates the distribution of injection current into the substrate during ESD or latchup events. With the proposed substrate model and resistance extraction, we can capture the three-dimensional layout parasitics in the circuit as well as the vertical substrate doping profile, and accurately simulate their effects on circuit behavior at the circuit level. The usefulness of this work for layout optimization is demonstrated with an industrial circuit example.


Session 32: High-Level Techniques for Low Power

Chair: Sunil D. Sherlekar
Organizers: Sunil Sherlekar, Farid Najm
32.1 Dynamic Power Management Based On Continuous-Time Markov Decision Processes [p. 555]
Qinru Qiu, Massoud Pedram

This paper introduces a continuous-time, controllable Markov process model of a power-managed system. The system model is composed of the corresponding stochastic models of the service queue and the service provider. The system environment is modeled by a stochastic service request process. The problem of dynamic power management in such a system is formulated as a policy optimization problem and solved using an efficient "policy iteration" algorithm. Compared to previous work on dynamic power management, our formulation allows better modeling of the various system components, the power-managed system as a whole, and its environment. In addition, it captures dependencies between the service queue and service provider status. Finally, the resulting power management policy is asynchronous; hence it is more power-efficient and more useful in practice. Experimental results demonstrate the effectiveness of our policy optimization algorithm compared to a number of heuristic (time-out and N-policy) algorithms.
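
To convey the flavor of the policy optimization step, the sketch below runs textbook policy iteration on a tiny discrete-time Markov decision process that caricatures a power-managed service provider (ON/OFF, request pending or not). The paper's continuous-time formulation and queue model are richer; the states, probabilities, and costs here are purely hypothetical.

# States: provider ON/OFF x request pending or not; "toggle" switches power mode.
STATES = ["idle_on", "idle_off", "busy_on", "busy_off"]
ACTIONS = ["keep", "toggle"]
GAMMA = 0.95
P_ARRIVE, P_FINISH = 0.3, 0.6

def step(state, action):
    """Return [(prob, next_state, cost)] for one decision epoch."""
    busy = state.startswith("busy")
    on = state.endswith("_on") ^ (action == "toggle")
    cost = (2.0 if on else 0.2)                 # power drawn by the provider
    cost += 4.0 if (busy and not on) else 0.0   # latency penalty: request waits
    cost += 0.5 if action == "toggle" else 0.0  # cost of a power-mode switch
    mode = "_on" if on else "_off"
    if busy:
        p_done = P_FINISH if on else 0.0
        outs = [(p_done, "idle" + mode, cost), (1 - p_done, "busy" + mode, cost)]
    else:
        outs = [(P_ARRIVE, "busy" + mode, cost), (1 - P_ARRIVE, "idle" + mode, cost)]
    return [(p, s, c) for p, s, c in outs if p > 0]

def evaluate(policy, sweeps=500):
    """Iterative policy evaluation of the discounted expected cost."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: sum(p * (c + GAMMA * v[t]) for p, t, c in step(s, policy[s]))
             for s in STATES}
    return v

def policy_iteration():
    policy = {s: "keep" for s in STATES}
    while True:
        v = evaluate(policy)
        improved = {s: min(ACTIONS, key=lambda a: sum(
            p * (c + GAMMA * v[t]) for p, t, c in step(s, a))) for s in STATES}
        if improved == policy:
            return policy, v
        policy = improved

policy, values = policy_iteration()
print(policy)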

32.2 Parallel Mixed-Level Power Simulation Based on Spatio-Temporal Circuit Partitioning [p. 562]
Mauro Chinosi, Roberto Zafalon, Carlo Guardiani

In this work we propose a technique for spatial and temporal partitioning of a logic circuit based on node activity computed using simulation at a higher level of abstraction. Only those components that are activated by a given input vector are added to the detailed simulation netlist. The methodology is suitable for parallel implementation in a multi-processor environment and allows arbitrary switching between fast and detailed levels of abstraction during the simulation run. Experimental results obtained on a significant set of benchmarks show that it is possible to obtain a considerable reduction in both CPU time and memory occupation while maintaining a considerable degree of accuracy. Furthermore, the proposed technique fits easily into existing industrial design flows.

32.3 Low-Power Behavioral Synthesis Optimization Using Multiple Precision Arithmetic [p. 568]
Milos Ercegovac, Darko Kirovski, Miodrag Potkonjak

Many modern multimedia applications such as image and video processing are characterized by a unique combination of arithmetic and computational features: fixed-point arithmetic, a variety of short data types, a high degree of instruction-level parallelism, strict timing constraints, and high computational requirements. Computationally intensive algorithms usually boost a device's power dissipation, which is often key to the efficiency of many communications and multimedia applications. Although virtually all recent general-purpose processors have been equipped with multiprecision operations, the current generation of behavioral synthesis tools for application-specific systems does not utilize this power/performance optimization paradigm.
In this paper, we explore the potential of using multiple precision arithmetic units to effectively support synthesis of low-power application-specific integrated circuits. We propose a new architectural scheme for collaborative addition of sets of variable-precision data. We have developed a novel resource allocation and computation assignment methodology for a set of multiple precision arithmetic units. The optimization algorithms explore the trade-off between allocating low-width bus structures and executing multiple-cycle operations. Experimental results indicate strong advantages of the proposed approach.


Session 33: System-Level Verification and Test

Chair: Kenji Yoshida
Organizers: Asawaree Kalavade, Ivo Bolsens
33.1 A Methodology for the Verification of a 'System on Chip' [p. 574]
Daniel Geist, Giora Biran, Tamara Arons, Michael Slavkin, Yvgeny Nustov, Monica Farkas, Karen Holtz, Andy Long, Dave King, Steve Barret

This paper summarizes the verification effort for a complex ASIC designated to be an "all in one" ISDN network router. This ASIC is unique because it actually consists of many independent components, called "cores" (including the processor). The integration of these components onto one chip results in an ISOC (Integrated System On a Chip). Verifying an ISOC is virtually impossible without a proper methodology. This paper presents the methodology developed for verifying the router. In particular, the verification method as well as the tools that were built to execute this method are presented. Finally, a summary of the verification results is given.
Keywords: Systems on chip, verification, test and debugging.

33.2 ICEBERG: An Embedded In-Circuit Emulator Synthesizer for Microcontrollers [p. 580]
Ing-Jer Huang, Tai-An Lu

This paper presents ICEBERG, a synthesis tool for embedded in-circuit emulators (ICEs), which are part of the development environment for microcontroller (or microprocessor)-based systems (PIPER-II). The tool inserts and integrates the necessary in-circuit emulation circuitry into a given RTL core of a microcontroller, thus turning the core into an embedded ICE. The ICE, based on the IEEE 1149.1 JTAG architecture, provides standard debugging mechanisms, including boundary scan paths, partial scan paths, single stepping, internal resource monitoring and modification, breakpoint detection, and mode switching between debugging and free-running modes. ICEBERG has been successfully applied to synthesize the embedded ICE for an industrial microcontroller, the HT48100, from its RTL core.

33.3 Microprocessor Based Testing for Core-Based System on Chip [p. 586]
C. A. Papachristou, F. Martin, M. Nourani

The purpose of this paper is to develop a flexible design-for-test methodology for testing a core-based system on chip (SOC). The novel feature of the approach is the use of an embedded microprocessor/memory pair to test the remaining components of the SOC. Test data is downloaded using DMA techniques directly into memory, while the microprocessor uses the test data to test the core. The test results are transferred to a MISR for evaluation. The approach has several important advantages over conventional ATPG, such as achieving at-speed testing, not limiting the chip speed to the tester speed during test, and achieving great flexibility since most of the testing process is based on software. Experimental results on an example system are discussed.


Session 34: Logic and High-Level Synthesis Methodologies

Chair: Mahadevamurty Nemani
Organizers: Randy Harr, Telle Whitney
34.1 Using Partitioning to Help Convergence in the Standard-Cell Design Automation Methodology [p. 592]
Hema Kapadia, Mark A. Horowitz

This paper explores a standard-cell design methodology based on netlist partitioning as a solution to the lack of convergence of the conventional methodology in deep submicron technologies. A synthesized design block is partitioned along unpredictable nets that are identified from the netlist structure. The size of each partition is restricted so that the longest possible local net in a partition can be sufficiently driven by an average library gate, hence allowing statistical wire-load modeling for the local nets. The block is resynthesized using a hybrid wire-load model that takes into account accurate wire-load information on the unpredictable nets derived after floorplanning the partitions, and uses custom statistical wire-load models within each partition. Final placement is restricted to respect the initial floorplan. The methodology was implemented using existing commercial tools for synthesis and layout. Experimental results show high correlation between synthesis estimates and post-placement measurements of wire-loads and gate delays with the new methodology. The trade-offs of partitioning, current limitations of the methodology, and future work to overcome these limitations are also discussed.

34.2 Comparing RTL and Behavioral Design Methodologies in the Case of a 2M-Transistors ATM Shaper [p. 598]
Imed Moussa, Zoltan Sugar, Rodolph Suescun, Mario Diaz-Nava, Marco Pavesi, Salvatore Crudo, Luca Gazzi, Ahmed Amine Jerraya

This paper describes the experience and the lessons learned during the design of an ATM traffic shaper circuit using behavioral synthesis. The experiment is based on a comparison of the results of two parallel design flows starting from the same specification. The first used a classical design method based on RTL synthesis; the second is based on behavioral synthesis. The experiment has shown that behavioral synthesis is able to produce efficient designs in terms of gate count and timing while bringing a threefold reduction in design effort compared to the RTL design methodology.

34.3 Engineering Change: Methodology and Applications to Behavioral and System Synthesis [p. 604]
Darko Kirovski, Miodrag Potkonjak

Due to the unavoidable need for system debugging, performance tuning, and adaptation to new standards, the engineering change (EC) methodology has emerged as one of the crucial components in synthesis of systems-on-chip. We introduce a novel design methodology which facilitates design-for-EC and post-processing to enable EC with minimal perturbation. Initially, as a synthesis pre-processing step, the original design specification is augmented with additional design constraints which ensure flexibility for future correction. Upon alteration of the initial design, a novel post-processing technique achieves the desired functionality with a near-minimal perturbation of the initially optimized design. The key contribution we introduce is a constraint manipulation technique which enables reduction of an arbitrary EC problem into its corresponding classical synthesis problem. As a result, in both pre- and post-processing for EC, classical synthesis algorithms can be used to provide flexibility and perform the correction process. We demonstrate the developed EC methodology on a set of behavioral and system synthesis tasks.


Session 35: Reconfigurable Systems

Chair: Telle Whitney
Organizers: Vivek Tiwari, Randy Harr
35.1 Reconfigurable Computing: What, Why, and Implications for Design Automation [p. 610]
André DeHon, John Wawrzynek

Reconfigurable Computing is emerging as an important new organizational structure for implementing computations. It combines the post-fabrication programmability of processors with the spatial computational style most commonly employed in hardware designs. The result changes traditional "hardware" and "software" boundaries, providing an opportunity for greater computational capacity and density within a programmable medium. Reconfigurable Computing must leverage traditional CAD technology for building spatial designs. Beyond that, however, reprogrammability introduces new challenges and opportunities for automation, including binding-time and specialization optimizations, regularity extraction and exploitation, and temporal partitioning and scheduling.

35.2 An Automated Temporal Partitioning and Loop Fission approach for FPGA based Reconfigurable Synthesis of DSP Applications [p. 616]
Meenakshi Kaul, Ranga Vemuri, Sriram Govindarajan, Iyad E. Ouaiss

We present an automated temporal partitioning and loop transformation approach for developing dynamically reconfigurable designs starting from behavior-level specifications. An Integer Linear Programming (ILP) model is formulated to achieve near-optimal-latency designs. We also present a loop restructuring method to achieve maximum throughput for a class of DSP applications. This restructuring transformation is performed on the temporally partitioned behavior and results in near-optimal throughput. We discuss efficient memory mapping and address generation techniques for the synthesis of reconfigurable designs. A case study on the Joint Photographic Experts Group (JPEG) image compression algorithm demonstrates the effectiveness of our approach.

35.3 Dynamically Reconfigurable Architecture for Image Processor Applications [p. 623]
Alexandro M. S. Adário, Eduardo L. Roehe, Sergio Bampi

This work presents an overview of the principles that underlie the speed-up achievable by dynamic hardware reconfiguration, proposes a more precise taxonomy of execution models for reconfigurable platforms, and demonstrates the advantage of dynamic reconfiguration in a new implementation of a neighborhood image processor, called DRIP. It achieves real-time performance, 3 times faster than its pipelined non-reconfigurable version.
Keywords: Reconfigurable architecture, image processing, FPGA


Session 36: RF Simulation

Chair: L. Miguel Silveira
Organizers: Alan Mantooth, Hidetoshi Onodera
36.1 Multi-Time Simulation of Voltage-Controlled Oscillators [p. 629]
Onuttom Narayan, Jaijeet Roychowdhury

We present a novel formulation, called the WaMPDE, for solving systems with forced autonomous components. An important feature of the WaMPDE is its ability to capture frequency modulation (FM) in a natural and compact manner. This is made possible by a key new concept: that of warped time, related to normal time through separate time scales. Using warped time, we obtain a completely general formulation that captures complex dynamics in autonomous nonlinear systems of arbitrary size or complexity. We present computationally efficient numerical methods for solving large practical problems using the WaMPDE. Our approach explicitly calculates a time-varying local frequency that matches intuitive expectations. Applied to VCOs, WaMPDE-based simulation results in speedups of two orders of magnitude over transient simulation.
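
For readers unfamiliar with the warped-time idea, the relations below give a schematic summary consistent with the abstract (not the paper's full WaMPDE derivation): the signal is represented by a bivariate function periodic in a warped variable, and the warping function supplies the time-varying local frequency,

x(t) = \hat{x}\bigl(t, \phi(t)\bigr), \qquad \omega(t) = \frac{d\phi(t)}{dt},

where \hat{x}(t_1, \tau) carries the slow envelope along t_1 and the fast oscillation along \tau.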

36.2 Efficient Computation of Quasi-Periodic Circuit Operating Conditions via a Mixed Frequency/Time Approach [p. 635]
Dan Feng, Joel Phillips, Keith Nabors, Ken Kundert, Jacob White

Design of communications circuits often requires computing steady-state responses to multiple periodic inputs of differing frequencies. Mixed frequency-time (MFT) approaches are orders of magnitude more efficient than transient circuit simulation, and perform better on highly nonlinear problems than traditional algorithms such as harmonic balance. We present algorithms for solving the huge nonlinear equation systems the MFT approach generates from practical circuits.

36.3 Time-Mapped Harmonic Balance [p. 641]
Ognen J. Nastov, Jacob K. White

Matrix-implicit Krylov-subspace methods have made it possible to efficiently compute the periodic steady-state of large circuits using either the time-domain shooting-Newton method or the frequency-domain harmonic balance method. However, the harmonic balance methods are not so efficient at computing steady-state solutions with rapid transitions, and the low-order integration methods typically used with shooting-Newton methods are not so efficient when high accuracy is required. In this paper we describe a Time-Mapped Harmonic Balance method (TMHB), a fast Krylov-subspace spectral method that overcomes the inefficiency of standard harmonic balance in the case of rapid transitions. TMHB features a non-uniform grid to resolve the sharp features in the signals. Results on several examples demonstrate that the TMHB method achieves several orders of magnitude improvement in accuracy compared to the standard harmonic balance method. The TMHB method is also several times faster than the standard harmonic balance method in reaching identical solution accuracy.


Session 37: Advanced Test Generation and Diagnosis

Chair: Janusz Rajski
Organizers: Yervant Zorian
37.1 Test Generation for Gigahertz Processors Using an Automatic Functional Constraint Extractor [p. 647]
Raghuram S. Tupuri, Arun Krishnamachary, Jacob A. Abraham

As the sizes of general and special purpose processors increase rapidly, generating high quality manufacturing tests which can be run at native speeds is becoming a serious problem. One solution is a novel method for functional test generation in which a transformed module is built manually, embodying functional constraints described using virtual logic. Test generation is then performed on the transformed module using commercial tools, and the transformed module patterns are translated back to the processor level. However, the technique is useful only if the virtual logic can be generated automatically. This paper describes an automatic functional constraint extraction algorithm and a procedure to build the transformed module. We describe the tool, FALCON, used to extract the functional constraints of a given embedded module from a Verilog RTL model. The constraint extraction for embedded modules of benchmark processors using FALCON takes only a few seconds. We show that this method can generate functional patterns in a time several orders of magnitude less than one using a conventional, flat view of the circuit.

37.2 PROPTEST: A Property Based Test Pattern Generator for Sequential Circuits Using Test Compaction [p. 653]
Ruifeng Guo, Sudhakar M. Reddy, Irith Pomeranz

We describe a property based test generation procedure that uses static compaction to generate test sequences that achieve high fault coverages at a low computational complexity. A class of test compaction procedures is proposed and used in the property based test generator. Experimental results indicate that these compaction procedures can be used to implement the proposed test generator to achieve high fault coverage with relatively small run times.

37.3 Multiple Error Diagnosis Based on Xlists [p. 660]
Vamsi Boppana, Rajarshi Mukherjee, Jawahar Jain, Masahiro Fujita, Pradeep Bollineni

In this paper, we present multiple error diagnosis algorithms to overcome two significant problems associated with current error diagnosis techniques targeting large circuits: their use of limited error models and a lack of solutions that scale well for multiple errors. Our solution is a non-enumerative analysis technique, based on logic simulation (3-valued and symbolic), for simultaneously analyzing all possible errors at sets of nodes in the circuit. Error models are introduced in order to address the "locality" aspect of error location and to identify sets of nodes that are "local" with respect to each other. Theoretical results are provided to guarantee the diagnosis of modeled errors, and robust diagnosis approaches are shown to address the cases when errors do not correspond to the modeled types. Experimental results on benchmark circuits demonstrate accurate and extremely rapid location of errors of large multiplicity.


Session 38: RTL Circuit Simulation

Chair: Luciano Lavagno
Organizers: Niel Weste, Teresa Meng
38.1 Simulation Vector Generation from HDL Descriptions for Observability-Enhanced-Statement Coverage [p. 666]
Farzan Fallah, Pranav Ashar, Srinivas Devadas

Validation of RTL circuits remains the primary bottleneck in improving design turnaround time, and simulation remains the primary methodology for validation. Simulation-based validation has suffered from a disconnect between the metrics used to measure the error coverage of a set of simulation vectors, and the vector generation process. This disconnect has resulted in the simulation of virtually endless streams of vectors which achieve enhanced error coverage only infrequently. Another drawback has been that most error coverage metrics proposed have either been too simplistic or too inefficient to compute. Recently, an effective observability-based statement coverage metric was proposed along with a fast companion procedure for evaluating it. The contribution of our work is the development of a vector generation procedure targeting the observability-based statement coverage metric. Our method uses repeated coverage computation to minimize the number of vectors generated. For vector generation, we propose a novel technique to set up constraints based on the chosen coverage metric. Once the system of interacting arithmetic and Boolean constraints has been set up, it can be solved using hybrid linear programming and Boolean satisfiability methods. We present heuristics to control the size of the constraint system that needs to be solved. We present experimental results which show the viability of automatically generating vectors using our approach for industrial RTL circuits. We envision our system being used during the design process, as well as during post-design debugging.

38.2 A Two-State Methodology for RTL Logic Simulation [p. 672]
Lionel Bening

This paper describes a two-state methodology for register transfer level (RTL) logic simulation in which the use of the X-state is completely eliminated inside ASIC designs. Examples are presented to show the gross pessimism and optimism that occurs with the X in RTL simulation. Random two-state initialization is offered as a way to detect and diagnose startup problems in RTL simulation. Random two-state initialization (a) is more productive than the X-state in gate-level simulation, and (b) provides better coverage of startup problems than X-state in RTL simulation. Consistent random initialization is applied (a) as a way to duplicate a startup state using a slower diagnosis-oriented simulator after a faster detection-oriented simulator reports the problem, and (b) to verify that the problem is corrected for that startup state after the design change intended to fix the problem. In addition to combining the earlier ideas of two-state simulation, and random initialization with consistent values across simulations, an original technique for treatment of tri-state Z's arriving into a two-state model is introduced.
Keywords: RTL, simulation, 2-state, X-state, pessimism, optimism, random, initialization.

38.3 An Approach for Extracting RT Timing Information to Annotate Algorithmic VHDL Specifications [p. 678]
Cordula Hansen, Francisco Nascimento, Wolfgang Rosenstiel

This paper presents a new approach for extracting timing information defined in a simulation vector set at the register transfer level (RTL) and reusing it in the behavioral specification. Using a VHDL RTL simulation vector set and a VHDL behavioral specification as input, the timing information is extracted and, together with the specification, transformed into a Partial Order based Model (POM). The POM expressing the timing information is then mapped onto the specification POM. The result contains the behavioral specification together with the RTL timing and is retransformed into a corresponding VHDL specification. Additionally, timing information contained in the specification can be checked using the RTL simulation vectors.


Session 39: CAD Solutions Using FPGAs

Chair: Rajesh K. Gupta
Organizers: Telle Whitney, Vivek Tiwari
39.1 A Massively-Parallel Easily-Scalable Satisfiability Solver Using Reconfigurable Hardware [p. 684]
Miron Abramovici, Jose T. de Sousa, Daniel Saab

Satisfiability (SAT) is a computationally expensive algorithm central to many CAD and test applications. In this paper, we present the architecture of a new SAT solver using reconfigurable logic. Our main contributions include new forms of massive fine-grain parallelism and structured design techniques based on iterative logic arrays that reduce compilation times from hours to a few minutes. Our architecture is easily scalable. Our results show several orders of magnitude speed-up compared with a state-of-the-art software implementation, and with a prior SAT solver using reconfigurable hardware.
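
As a point of reference for what the hardware parallelizes, a minimal software DPLL-style SAT search is sketched below. It is a generic textbook procedure with an assumed clause encoding (lists of signed integers), not the paper's iterative-logic-array architecture.

# Minimal DPLL-style SAT search, for reference only; clauses are lists of
# non-zero integers, negative meaning a negated variable.

def dpll(clauses, assignment=None):
    assignment = dict(assignment or {})
    changed = True
    while changed:                                    # unit propagation
        changed = False
        for clause in clauses:
            if any(assignment.get(abs(l)) == (l > 0) for l in clause):
                continue                              # clause already satisfied
            unassigned = [l for l in clause if abs(l) not in assignment]
            if not unassigned:
                return None                           # conflict
            if len(unassigned) == 1:
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0
                changed = True
    for clause in clauses:                            # branch on an unassigned variable
        for l in clause:
            if abs(l) not in assignment:
                for value in (True, False):
                    result = dpll(clauses, {**assignment, abs(l): value})
                    if result is not None:
                        return result
                return None
    return assignment                                 # every clause is satisfied

print(dpll([[1, 2], [-1, 3], [-2, -3]]))              # prints a satisfying assignment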

39.2 Dynamic Fault Diagnosis on Reconfigurable Hardware [p. 691]
Fatih Kocan, Daniel G. Saab

In this paper, we introduce a new approach for locating and diagnosing faults in combinational circuits. The approach is based on automatically designing a circuit which implements a closest-match fault location algorithm specialized for the combinational circuit under diagnosis (CUD). This approach eliminates the need for the large storage required by software-based fault diagnosis. In this paper, we show the approach's feasibility in terms of hardware resources and speed, and compare it with software-based techniques.

39.3 Hardware Compilation for FPGA-Based Configurable Computing Machines [p. 697]
Xiaohan Zhu, Bill Lin

Configurable computing machines are an emerging class of hybrid architectures where a field programmable gate array (FPGA) component is tightly coupled to a general-purpose microprocessor core. In these architectures, the FPGA component complements the general-purpose microprocessor by enabling a developer to construct application-specific gate-level structures on-demand while retaining the flexibility and rapid reconfigurability of a fully programmable solution. High computational performance can be achieved on the FPGA component by creating custom data paths, operators, and interconnection pathways that are dedicated to a given problem, thus enabling structural optimization benefits similar to those of ASICs. In this paper, we present a new programming environment for the development of applications on this new class of configurable computing machines. This environment enables developers to build hybrid hardware/software applications in a common integrated development framework. In particular, the focus of this paper is on the hardware compilation part of the problem, starting from a software-like algorithmic process-based specification.


Session 40: Technology Directions

Chair: Bryan D. Ackland
Organizers: Bryan D. Ackland
40.1 0.18um CMOS and Beyond [p. 703]
D. J. Eaglesham

As we move to the 0.18um node and beyond, the dominant trend in device and process technology is a simple continuation of several decades of scaling. However, some serious challenges to straightforward scaling are on the horizon. This paper will review the present status of process technology and examine the likely departures from scaling in the various areas. The 0.18um node is seeing the first major new materials introduced into the Si process for many years in the interconnect, and major departures from the traditional process are being actively considered for the transistor. However, it is probable that scaling will continue to dominate advanced processes for several generations to come.

40.2 SOI Digital CMOS VLSI - A Design Perspective [p. 709]
C. T. Chuang, R. Puri

This paper reviews the recent advances of SOI for digital CMOS VLSI applications with particular emphasis on the design issues and advantages resulting from the unique SOI device structure. The technology/device requirements and design issues/challenges for high-performance, general-purpose microprocessor applications are differentiated with respect to low-power portable applications. Particular emphasis is placed on the impact of the floating body in partially-depleted devices on circuit operation, stability, and functionality. Unique SOI design aspects such as the parasitic bipolar effect and hysteretic VT variation are addressed. Circuit techniques to improve the noise immunity and global design issues are discussed.


Session 41: Timing Analysis and Optimization

Chair: Thomas G. Szymanski
Organizers: Kunle Olukotun, Sharak Malik
41.1 Equivalent Elmore Delay for RLC Trees [p. 715]
Yehea I. Ismail, Eby G. Friedman, Jose L. Neves

Closed form solutions for the 50% delay, rise time, overshoots, and settling time of signals in an RLC tree are presented. These solutions have the same accuracy characteristics as the Elmore delay model for RC trees and preserve the simplicity and recursive characteristics of the Elmore delay. The solutions introduced here consider all damping conditions of an RLC circuit including the underdamped response, which is not considered by the classical Elmore delay model due to the non-monotone nature of the response. Also, the solutions have significantly improved accuracy as compared to the Elmore delay for an overdamped response. The solutions introduced here for RLC trees can be practically used for the same purposes that the Elmore delay is used for RC trees.
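
As background for the RC baseline that these RLC expressions generalize, the sketch below computes the classical Elmore delay of an RC tree by the usual recursive traversal (each node's delay is the sum, over the resistors on its path, of resistance times downstream capacitance). The three-node tree and its values are illustrative assumptions; the paper's closed-form RLC formulas are not reproduced here.

# Minimal sketch: classical Elmore delay of an RC tree (the RC baseline
# that the RLC expressions above generalize). Values are illustrative.

class Node:
    def __init__(self, name, r, c, children=None):
        self.name = name              # node label
        self.r = r                    # resistance of the branch driving this node (ohms)
        self.c = c                    # capacitance at this node (farads)
        self.children = children or []

def downstream_cap(node):
    """Total capacitance at this node and in its subtree."""
    return node.c + sum(downstream_cap(ch) for ch in node.children)

def elmore_delays(node, upstream=0.0, out=None):
    """Elmore delay to every node: sum over path resistors of R * downstream C."""
    if out is None:
        out = {}
    delay = upstream + node.r * downstream_cap(node)
    out[node.name] = delay
    for ch in node.children:
        elmore_delays(ch, delay, out)
    return out

# Hypothetical tree: source -> n1 -> {n2, n3}
tree = Node("n1", r=100.0, c=10e-15,
            children=[Node("n2", r=200.0, c=20e-15),
                      Node("n3", r=150.0, c=15e-15)])
print(elmore_delays(tree))            # delays in seconds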

41.2 Effects of Inductance on the Propagation Delay and Repeater Insertion in VLSI Circuits [p. 721]
Yehea I. Ismail, Eby G. Friedman

A closed form expression for the propagation delay of a CMOS gate driving a distributed RLC line is introduced that is within 5% of dynamic circuit simulations for a wide range of RLC loads. It is shown that the traditional quadratic dependence of the propagation delay on the length of an RC line approaches a linear dependence as inductance effects increase. The closed form delay model is applied to the problem of repeater insertion in RLC interconnect. Closed form solutions are presented for inserting repeaters into RLC lines that are highly accurate with respect to numerical solutions. An RC model as compared to an RLC model creates errors of up to 30% in the total propagation delay of a repeater system. Considering inductance in repeater insertion is also shown to significantly save repeater area and power consumption. The error between the RC and RLC models increases as the gate parasitic impedances decrease which is consistent with technology scaling trends. Thus, the importance of inductance in high performance VLSI design methodologies will increase as technologies scale.
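
For reference, the classical RC-only repeater insertion result (the kind of closed-form solution this paper extends to RLC lines) can be sketched as below. The formulas are the standard RC expressions for optimal repeater count and size; the paper's RLC expressions are not reproduced, and the line and buffer parameters are assumed values.

import math

# Sketch of classical RC-based repeater insertion (optimal repeater count k and
# size h for a distributed RC line driven by buffers with output resistance R0
# and input capacitance C0). This is the RC baseline, not the paper's RLC model.

def rc_repeaters(R_int, C_int, R0, C0):
    k = math.sqrt(0.4 * R_int * C_int / (0.7 * R0 * C0))   # number of repeaters
    h = math.sqrt(R0 * C_int / (R_int * C0))               # repeater size (x minimum)
    return max(1, round(k)), h

# Hypothetical 1 cm line: 100 ohm total resistance, 2 pF total capacitance
print(rc_repeaters(R_int=100.0, C_int=2e-12, R0=1000.0, C0=5e-15))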

41.3 Retiming for DSM with Area-Delay Trade-offs and Delay Constraints [p. 725]
Abdallah Tabbara, Robert K. Brayton, A. Richard Newton

The concept of improving the timing behavior of a circuit by relocating registers is called retiming and was first presented by Leiserson and Saxe. They showed that the problem of determining an equivalent minimum area (total number of registers) circuit is polynomial-time solvable. In this work we show how this approach can be reapplied in the DSM domain when area-delay trade-offs and delay constraints are considered. The main result is that the concavity of the trade-off function allows this DSM problem to be cast into a classical minimum area retiming problem which is solvable in polynomial time.

41.4 Functional Timing Analysis for IP Characterization [p. 731]
Hakan Yalcin, Mohammad Mortazavi, Robert Palermo, Cyrus Bamji, Karem Sakallah

A method that characterizes the timing of Intellectual Property (IP) blocks while taking into account IP functionality is presented. IP blocks are assumed to have multiple modes of operation specified by the user. For each mode, our method calculates IO path delays and timing constraints to generate a timing model. The method thus captures the mode-dependent variation in IP delays which, according to our experiments, can be as high as 90%. The special manner in which delay calculation is performed guarantees that IP delays are never underestimated. The resulting timing models are also compacted through a process whose accuracy is controlled by the user.
Keywords: Timing analysis, false path, functional (mode) dependency, IP characterization.

41.5 Detecting False Timing Paths: Experiments on PowerPCTM Microprocessors [p. 737]
Richard Raimi, Jacob Abraham

We present a new algorithm for detecting both combinationally and sequentially false timing paths, one in which the constraints on a timing path are captured by justifying symbolic functions across latch boundaries. We have implemented the algorithm and we present, here, the results of using it to detect false timing paths on a recent PowerPC microprocessor design. We believe these are the first published results showing the extent of the false path problem in industry. Our results suggest that the reporting of false paths may be compromising the effectiveness of static timing analysis.


Session 42: Built-In Self-Test Solutions

Chair: Yervant Zorian
Organizers: Janusz Rajski
42.1 On ILP Formulations for Built-In Self-Testable Data Path Synthesis [p. 742]
Han Bin Kim, Dong Sam Ha, Takeshi Takahashi

In this paper, we present a new method for built-in self-testable data path synthesis based on integer linear programming (ILP). Our method performs system register assignment, built-in self-test (BIST) register assignment, and interconnection assignment concurrently to yield optimal designs. Our experimental results show that our method successfully synthesizes BIST circuits for all six circuits in our experiments. All the BIST circuits have lower area overhead than those generated by existing high-level BIST synthesis methods.
Keywords: high-level BIST synthesis, built-in self-test, BIST, ILP.

42.2 Improving The Test Quality for Scan-Based BIST Using A General Test Application Scheme [p. 748]
Huan-Chih Tsai, Kwang-Ting Cheng, Sudipta Bhawmik

In this paper, we propose a general test application scheme for existing scan-based BIST architectures. The objective is to further improve the test quality without inserting additional logic into the Circuit Under Test (CUT). The proposed test scheme divides the entire test process into multiple test sessions. A different number of capture cycles is applied after scanning in a test pattern in each test session to maximize the fault detection for a distinct subset of faults. We present a procedure to find the optimal number of capture cycles following each scan sequence for every fault. Based on this information, the number of test sessions and the number of capture cycles after each scan sequence are determined to maximize the random testability of the CUT. We conduct experiments on ISCAS89 benchmark circuits to demonstrate the effectiveness of our approach.

42.3 Built-In Test Sequence Generation for Synchronous Sequential Circuits Based on Loading and Expansion of Test Subsequences [p. 754]
Irith Pomeranz, Sudhakar M. Reddy

We describe an on-chip test generation scheme for synchronous sequential circuits that allows at-speed testing of such circuits. The proposed scheme is based on loading of (short) input sequences into an on-chip memory, and expansion of these sequences on-chip into test sequences. Complete coverage of modeled faults is achieved by basing the selection of the loaded sequences on a deterministic test sequence T0, and ensuring that every fault detected by T0 is detected by the expanded version of at least one loaded sequence. Experimental results presented for benchmark circuits show that the length of the sequence that needs to be stored at any time is on average 10% of the length of T0, and that the total length of all the loaded sequences is on average 46% of the length of T0.


Session 43: Clock, Power Distribution and Analog Issues

Chair: Abhijit Dharchoudhury
Organizers: Teresa Meng, David Blaauw
43.1 Analysis of Performance Impact Caused by Power Supply Noise in Deep Submicron Devices [p. 760]
Yi-Min Jiang, Kwang-Ting Cheng

The paper addresses the problem of analyzing the performance degradation caused by noise in power supply lines for deep submicron CMOS devices. We first propose a statistical modeling technique for the power supply noise including inductive ΔI noise and power net IR voltage drop. The model is then integrated with a statistical timing analysis framework to estimate the performance degradation caused by the power supply noise. Experimental results of our analysis framework, validated by HSPICE, for benchmark circuits implemented on both 0.25 µm, 2.5 V and 0.55 µm, 3.3 V technologies are presented and discussed. The results show that on average, with the consideration of this noise effect, the circuit critical path delays increase by 33% and 18%, respectively, for circuits implemented on these two technologies.
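
To first order, the two noise components named in the abstract combine as a resistive IR drop plus an inductive L·dI/dt (ΔI) term. The toy calculation below illustrates that combination on assumed numbers; it is not the paper's statistical model.

# First-order illustration of supply noise as IR drop plus L*dI/dt (delta-I)
# noise. All values are assumed for illustration only.

R_grid = 0.05       # ohms, effective power-grid resistance seen by the block
L_pkg  = 0.2e-9     # henries, package/pin inductance
I_avg  = 2.0        # amperes, average current drawn by the block
dI     = 1.0        # amperes, current step when the block switches
dt     = 0.5e-9     # seconds, switching edge time

v_ir = I_avg * R_grid          # IR voltage drop
v_di = L_pkg * dI / dt         # inductive delta-I noise
print(f"IR drop = {v_ir:.2f} V, L*dI/dt = {v_di:.2f} V, total ~ {v_ir + v_di:.2f} V")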

43.2 A Floorplan-Based Planning Methodology for Power and Clock Distribution in ASICs [p. 766]
Joon-Seo Yim, Seong-Ok Bae, Chong-Min Kyung

In deep submicron technology, IR-drop and clock skew issues become more crucial to the functionality of the chip. This paper presents a floorplan-based power and clock distribution methodology for ASIC design. From the floorplan and the estimated power consumption, the power network size is determined at an early design stage. Next, without a detailed gate-level netlist, clock interconnect sizing and the number and strength of clock buffers are planned for balanced clock distribution. This early planning methodology at the full-chip level enables us to fix the global interconnect issues before the detailed layout composition is started.

43.3 Digital Detection of Analog Parametric Faults in SC Filters [p. 772]
Ramesh Harjani, Bapiraju Vinnakota

Many design-for-test techniques for analog circuits are ineffective at detecting multiple parametric faults because either their accuracy is poor, or the circuit is not tested in the configuration it is used in. We present a DFT scheme that offers the accuracy needed to test high-quality circuits. The DFT scheme is based on a circuit that digitally measures the ratio of a pair of capacitors. The circuit is used to completely characterize the transfer function of a switched capacitor circuit, which is usually determined by capacitor ratios. In our DFT scheme, capacitor ratios can be measured to within 0.01% accuracy, and filter parameters can be shown to be satisfied to within 0.1% accuracy. A filter can be shown to satisfy all its functional specifications through this characterization process. We believe the accuracy of our scheme is at least an order of magnitude greater than that offered by any other scheme reported in the literature.


Session 44: Modeling for Design Reuse

Chair: Han Yang
Organizers: Kenji Yoshida, David Ku
44.1 Application of High Level Interface-based Design to Telecommunications System Hardware [p. 778]
Dyson Wilkes, M.M. Kamal Hashmi

The assumption in moving system modelling to higher levels is that this improves the design process by allowing exploration of the architecture, providing an unambiguous specification and catching system errors early. We used the interface-based high-level abstractions of VHDL+ in a real design, in parallel with the actual project, to investigate the validity of these claims.

44.2 Hardware Reuse at the Behavioral Level [p. 784]
Patrick Schaumont, Radim Cmar, Serge Vernalde, Marc Engels, Ivo Bolsens

Standard interfaces for hardware reuse are currently defined at the structural level. In contrast to this, our contribution defines the reuse interface at the behavioral register transfer (RT) level. This promotes direct reuse of functionality and avoids the integration problems of structural reuse. We present an object oriented reuse interface in C++ and show the use of it within two real-life designs.

44.3 Description and Simulation of Hardware/Software Systems with Java [p. 790]
Tommy Kuhn, Wolfgang Rosenstiel, Udo Kebschull

In this paper a newly developed object model is presented which allows the description of hardware/software systems in all their parts. An adaptation of the JavaBeans component model allows different kinds of reuse to be combined in one unitary language. A model-based design flow and some tools are presented and applied to a JPEG example.
Keywords: Object oriented hardware modeling, simulation, codesign.

44.4 Java Driven Codesign and Prototyping of Networked Embedded Systems [p. 794]
Josef Fleischmann, Klaus Buchenrieder, Rainer Kress

While the number of embedded systems in consumer electronics is growing dramatically, several trends can be observed which challenge traditional codesign practice: An increasing share of functionality of such systems is implemented in software; flexibility or reconfigurability is added to the list of non-functional requirements. Moreover, networked embedded systems are equipped with communication capabilities and can be controlled over networks. In this paper, we present a suitable methodology and a set of tools targeting these novel requirements. JACOP is a codesign environment based on Java and supports specification, co-synthesis and prototyping of networked embedded systems.


Session 45: Panel: Subwavelength Lithography: How Will it Affect Your Design Flow?

Chair: Andrew B. Kahng
Organizers: Atul Sharan, Susan Lippincott
Panel Members: Y. C. Pati, Warren Grobman, Robert Pack, Lance Glasser, Kenneth V. Rousseau

In the sub 0.25 micron regime, IC feature sizes become smaller than the wavelength of light used for silicon exposure. Resultant light distortions create patterns on silicon that are substantially different from a GDSII layout. Although light distortions have traditionally not affected the design flow, the techniques used to control these distortions have a potential impact on the design flow that is as formidable as the recently addressed Deep Sub-Micron transition. This session will discuss the design implications arising from techniques used to control sub-wavelength lithography. It will begin with an embedded tutorial on subwavelength mask design techniques and their resultant effect on the IC design process. The panel will then debate the extent of the resulting impact on IC performance, design flow, and CAD tools.

45.2 Subwavelength Lithography and its Potential Impact on Design and EDA [p. 799]
Andrew B. Kahng, Y. C. Pati

This tutorial paper surveys the potential implications of subwavelength optical lithography for new tools and flows in the interface between layout design and manufacturability. We review control of optical process effects by optical proximity correction (OPC) and phase-shifting masks (PSM), then focus on the implications of OPC and PSM for layout synthesis and verification methodologies. Our discussion addresses the necessary changes in the design-to-manufacturing flow, including infrastructure development in the mask and process communities, evolution of design methodology, and opportunities for research and development in the physical layout and verification areas of EDA.


Session 46: Software Implementation for DSP Applications

Chair: Mahesh Mehendale
Organizers: Sunil Sherlekar, Luciano Lavagno
46.1 Synthesis of Embedded Software Using Free-Choice Petri Nets [p. 805]
Marco Sgroi, Luciano Lavagno, Yosinori Watanabe, Alberto Sangiovanni-Vincentelli

Software synthesis from a concurrent functional specification is a key problem in the design of embedded systems. A concurrent specification is well-suited for medium-grained partitioning. However, in order to be implemented in software, concurrent tasks need to be scheduled on a shared resource (the processor). The choice of the scheduling policy mainly depends on the specification of the system. For pure dataflow specifications, it is possible to apply a fully static scheduling technique, while for algorithms containing data-dependent control structures, like the if-then-else or while-do constructs, the dynamic behaviour of the system cannot be completely predicted at compile time and some scheduling decisions are to be made at run-time. For such applications we propose a quasi-static scheduling (QSS) algorithm that generates a schedule in which run-time decisions are made only for data-dependent control structures. We use Free Choice Petri Nets (FCPNs) as the underlying model, and define quasi-static schedulability for FCPNs. The proposed algorithm is complete, in that it can solve QSS for any FCPN that is quasi-statically schedulable. Finally, we show how to synthesize from a quasi-static schedule a C code implementation that consists of a set of concurrent tasks.

46.2 Exact Memory Size Estimation for Array Computations without Loop Unrolling [p. 811]
Ying Zhao, Sharad Malik

This paper presents a new algorithm for exact estimation of the minimum memory size required by programs dealing with array computations. Memory size is an important factor affecting the area and power cost of memory units. For programs dealing mostly with array computations, memory cost is a dominant factor in the overall system cost. Thus, exact estimation of the memory size required by a program is necessary to provide quantitative information for making high-level design decisions. Based on a formulation of live-variable analysis, our algorithm transforms minimum memory size estimation into an equivalent problem: integer point counting for intersections/unions of mappings of parameterized polytopes. A heuristic is then proposed to solve the counting problem. Experimental results show that the algorithm achieves the exactness traditionally associated with totally unrolling loops while retaining the reduced computational complexity of preserving the original loop structure.

46.3 Constraint Driven Code Selection for Fixed-Point DSPs [p. 817]
Steven Bashford, Rainer Leupers

Fixed-point DSPs are a class of embedded processors with highly irregular architectures. This irregularity makes it difficult to generate high-quality machine code from programming languages such as C. In this paper we present a novel constraint driven approach to code selection for irregular processor architectures, which provides a twofold improvement over earlier work. First, it handles complete data flow graphs instead of trees and thereby generates better code in the presence of common subexpressions. Second, the presented technique is not restricted to computing a single solution, but generates alternative solutions. This feature enables the tight coupling of different code generation phases, resulting in better exploitation of instruction-level parallelism. Experimental results indicate that our technique is capable of generating machine code that competes well with handwritten assembly code.

46.4 Rapid Development of Optimized DSP Code From a High Level Description Through Software Estimations [p. 823]
Alain Pegatoquet, Emmanuel Gresset, Michel Auguin, Luc Bianco

Generation of optimized DSP code from a high level language such as C is very time consuming since current DSP compilers are generally unable to produce efficient code. We present a software estimation methodology from a C description that supports rapid development of DSP applications. Our tool VESTIM provides both a performance evaluation for the assembly code generated by the compiler and an estimation of an optimized assembly code. Blocks of the applications G.721 and G.728 have been evaluated using VESTIM. Results show that the estimations are very accurate and allow software development time to be significantly reduced.
Keywords: DSP, Code generation, Performance Estimation.

46.5 Software Environment for a Multiprocessor DSP [p. 827]
Asawaree Kalavade, Joe Othmer, Bryan Ackland, K. J. Singh

In this paper, we describe the software environment for Daytona, a single-chip, bus-based, shared-memory, multiprocessor DSP. The software environment is designed around a layered architecture. Tools at the lower layer are designed to deliver maximum performance and include a compiler, debugger, simulator, and profiler. Tools at the higher layer focus on improving the programmability of the system and include a run-time kernel and parallelizing tools. The run-time kernel includes a low-overhead, preemptive, dynamic scheduler with multiprocessor support that guarantees real-time performance to admitted tasks.
Keywords: Multiprocessor DSP, media processor, software environment, run-time kernel, RTOS


Session 47: Intellectual Property Protection Using Watermarking

Chair: Ken Hodor
Organizers: Mike Murray, Neil Weste
47.1 Robust FPGA Intellectual Property Protection Through Multiple Small Watermarks [p. 831]
John Lach, William H. Mangione-Smith, Miodrag Potkonjak

A number of researchers have proposed using digital marks to provide ownership identification for intellectual property. Many of these techniques share three specific weaknesses: complexity of copy detection, vulnerability to mark removal after revelation for ownership verification, and mark integrity issues due to partial mark removal. This paper presents a method for watermarking field programmable gate array (FPGA) intellectual property (IP) that achieves robustness by responding to these three weaknesses. The key technique involves using secure hash functions to generate and embed multiple small marks that are more detectable, verifiable, and secure than existing IP protection techniques.
Keywords: Field programmable gate array (FPGA), intellectual property protection, watermarking
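
The abstract describes deriving multiple small marks from a signature with secure hash functions. A generic sketch of that idea is shown below: hash the owner's signature together with a mark index and keep a few bits per mark. The hash function, mark width, and mark count here are assumptions for illustration, not the authors' exact construction, and the embedding of each mark into an FPGA resource is omitted.

import hashlib

# Generic sketch: derive many small watermark fragments from one signature
# using a secure hash. SHA-256, 16-bit marks, and 8 marks are assumptions.

def small_marks(signature: bytes, num_marks: int, bits_per_mark: int = 16):
    marks = []
    for i in range(num_marks):
        digest = hashlib.sha256(signature + i.to_bytes(4, "big")).digest()
        value = int.from_bytes(digest, "big") & ((1 << bits_per_mark) - 1)
        marks.append(value)    # each small mark would be embedded in a separate resource
    return marks

print(small_marks(b"Example IP core, (c) 1999", num_marks=8))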

47.2 Robust Techniques For Watermarking Sequential Circuit Designs [p. 837]
Arlindo L. Oliveira

We present a methodology for the watermarking of synchronous sequential circuits that makes it possible to identify the authorship of designs by imposing a digital watermark on the state transition graph of the circuit. The methodology is applicable to sequential designs that are made available as firm Intellectual Property (IP), the designation commonly used to characterize designs specified as structural descriptions or circuit netlists. The watermarking is obtained by manipulating the state transition graph of the design in such a way as to make it exhibit a chosen property that is extremely rare in non-watermarked circuits, while, at the same time, not changing the functionality of the circuit. This manipulation is performed without ever actually computing this graph in either implicit or explicit form. We present both theoretical and experimental results that show that the watermarking can be created and verified efficiently.

47.3 Effective Iterative Techniques for Fingerprinting Design IP [p. 843]
Andrew E. Caldwell, Hyun-Jin Choi, Andrew B. Kahng, Stefanus Mantik, Miodrag Potkonjak, Gang Qu, Jennifer L. Wong

While previous watermarking-based approaches to intellectual property protection (IPP) have asymmetrically emphasized the IP provider's rights, the true goal of IPP is to ensure the rights of both the IP provider and the IP buyer. Symmetric fingerprinting schemes have been widely and effectively used to achieve this goal; however, their application domain has been restricted only to static artifacts, such as image and audio. In this paper, we propose the first generic symmetric fingerprinting technique which can be applied to an arbitrary optimization/synthesis problem and, therefore, to hardware and software intellectual property. The key idea is to apply iterative optimization in an incremental fashion to solve a fingerprinted instance; this leverages the optimization effort already spent in obtaining a previous solution, yet generates a uniquely fingerprinted new solution. We use this approach as the basis for developing specific fingerprinting techniques for four important problems in VLSI CAD: partitioning, graph coloring, satisfiability, and standard-cell placement. We demonstrate the effectiveness of our fingerprinting techniques on a number of standard benchmarks for these tasks. Our approach provides an effective tradeoff between runtime and resilience against collusion.

47.4 Behavioral Synthesis Techniques for Intellectual Property Protection [p. 849]
Inki Hong, Miodrag Potkonjak

The economic viability of the reusable core-based design paradigm depends on the development of techniques for intellectual property protection. We introduce the first dynamic watermarking technique for protecting the value of the intellectual property of CAD and compilation tools and reusable core components. The essence of the new approach is the addition of a set of design and timing constraints which encodes the author's signature. The constraints are selected in such a way that they result in minimal hardware overhead while embedding a signature which is unique and difficult to detect, remove and forge. We establish the first set of relevant metrics which forms the basis for the quantitative analysis, evaluation, and comparison of watermarking techniques. We develop a generic approach for signature data hiding in designs, which is applicable in conjunction with an arbitrary behavioral synthesis task, such as scheduling, assignment, allocation, and transformations. Error correcting codes are used to augment the protection of the signature data from tampering attempts. On a large set of design examples, studies indicate the effectiveness of the new approach in the sense that the signature data, which are highly resilient, difficult to detect and remove, and yet easy to verify, can be embedded in designs with very low hardware overhead.


Session 48: Low Power System Design Techniques

Chair: Vivek Tiwari
Organizers: Tadahiro Kuroda, Mojy Chian
48.1 Design and Implementation of a Scalable Encryption Processor with Embedded Variable DC/DC Converter [p. 855]
James Goodman, Anantha Chandrakasan, Abram P. Dancy

This work describes the design and implementation of an energy-efficient, scalable encryption processor that utilizes variable voltage supply techniques and a high-efficiency embedded variable output DC/DC converter. The resulting implementation dissipates 134 nJ/bit at VDD = 2.5 V when encrypting at its maximum rate of 1 Mb/s using a maximum datapath width of 512 bits. The embedded converter achieves an efficiency of 96% at this peak load. The processor is 2-3 orders of magnitude more energy efficient than optimized assembly code running on a low-power processor such as the StrongARM.

48.2 Design Considerations for Battery-Powered Electronics [p. 861]
Massoud Pedram, Qing Wu

In this paper, we consider the problem of maximizing the battery life (or duration of service) in battery-powered CMOS circuits. We first show that the battery efficiency (or utilization factor) decreases as the average discharge current drawn from the battery increases. The implication is that the battery life is a super-linear function of the average discharge current. Next we show that even when the average discharge current remains the same, different discharge current profiles (distributions) may result in very different battery lifetimes. In particular, the maximum battery life is achieved when the variance of the discharge current distribution is minimized. Analytical derivations and experimental results underline the importance of modeling the battery-hardware system as a whole and provide a more accurate basis (i.e., the product of battery discharge time and circuit delay) for comparing various low power optimization methodologies and techniques targeted toward battery-powered electronics. Finally, we calculate the optimal value of Vdd for a battery-powered VLSI circuit so as to minimize the product of the battery discharge time and the circuit delay.
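
The super-linear dependence of battery life on discharge current, and the benefit of a low-variance current profile, can be illustrated with a simple Peukert-style model in which capacity is consumed at a rate proportional to I^k with k > 1. The sketch below compares two profiles with the same average current; the exponent and numbers are assumptions, not the paper's battery model.

# Illustration (assumed Peukert-style model, not the paper's) of why battery
# life is super-linear in discharge current and why, at the same average
# current, a higher-variance profile drains the battery faster.

k = 1.3                              # Peukert-like exponent (> 1 for real cells)
capacity = 1.0                       # normalized capacity

def lifetime(profile):
    """Hours of service for a repeating per-hour current profile (amperes)."""
    drain_per_hour = sum(i ** k for i in profile) / len(profile)
    return capacity / drain_per_hour

flat   = [0.5, 0.5, 0.5, 0.5]        # constant 0.5 A
bursty = [0.1, 0.9, 0.1, 0.9]        # same 0.5 A average, higher variance

print(lifetime(flat), lifetime(bursty))   # the flat profile lasts longer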

48.3 Cycle-Accurate Simulation of Energy Consumption in Embedded Systems [p. 867]
Tajana Simunic, Luca Benini, Giovanni De Micheli

This paper presents a methodology for cycle-accurate simulation of energy dissipation in embedded systems. The ARM Ltd. [1] instruction-level cycle-accurate simulator is extended with energy models for the processor, the L2 cache, the memory, the interconnect and the DC-DC converter. A SmartBadge, which can be seen as an embedded system consisting of StrongARM-1100 processor, memory and the DC-DC converter, is used to evaluate the methodology with the Dhrystone benchmark. We compared performance and energy computed by our simulator with measurements in hardware and found them in agreement within a 5% tolerance. The simulation methodology was applied to design exploration for enhancing a SmartBadge with real-time MPEG feature.

48.4 Lowering Power Consumption in Clock by Using Globally Asynchronous Locally Synchronous Design Style [p. 873]
A. Hemani, T. Meincke, S. Kumar, A. Postula, T. Olsson, P. Nilsson, J. Oberg, P. Ellervee, D. Lundqvist

Power consumption in the clock of large high-performance VLSIs can be reduced by adopting a Globally Asynchronous, Locally Synchronous design style (GALS). GALS has small overheads for the global asynchronous communication and local clock generation. We propose methods to a) evaluate the benefits of GALS and account for its overheads, which can be used as the basis for partitioning the system into an optimal number and size of synchronous blocks, and b) automate the synthesis of the global asynchronous communication. Three realistic ASICs, ranging in complexity from 1 to 3 million gates, were used to evaluate the GALS benefits and overheads. The results show an average power saving of about 70% in the clock with negligible overheads.


Session 49: Emerging Directions

Chair: Randolph E. Harr
Organizers: Anantha Chandrakasan, Mojy Chian
49.1 A CAD Tool for Optical MEMS [p. 879]
Timothy P. Kurzweg, Steven P. Levitan, Philippe J. Marchand, Jose A. Martinez, Kurt R. Prough, Donald M. Chiarulli

Chatoyant models free-space opto-electronic components and systems and performs simulations and analyses that allow designers to make informed system level trade-offs. Recently, the use of MEM bulk and surface micro-machining technology has enabled the fabrication of micro-optical-mechanical systems. This paper presents our models for diffractive optics and new analysis techniques which extend Chatoyant to support optical MEMS design. We show these features in the simulation of two optical MEM systems.
Keywords: Optical MEMS, MEMS-CAD, MOEMS, micro-optics

49.2 On Thermal Effects in Deep Sub-Micron VLSI Interconnects [p. 885]
Kaustav Banerjee, Amit Mehrotra, Alberto Sangiovanni-Vincentelli, Chenming Hu

This paper presents a comprehensive analysis of the thermal effects in advanced high performance interconnect systems arising due to self-heating under various circuit conditions, including electrostatic discharge. Technology (Cu, low-k, etc.) and scaling effects on the thermal characteristics of the interconnects, and on their electromigration reliability, have been analyzed simultaneously, which will have important implications for providing robust and aggressive deep submicron interconnect design guidelines. Furthermore, the impact of these thermal effects on the design (driver sizing) and optimization of the interconnect length between repeaters at the upper-level signal lines is investigated.

49.3 Converting a 64b PowerPC Processor from CMOS Bulk to SOI Technology [p. 892]
D. Allen, D. Behrends, B. Stanisic

A 550MHz 64b PowerPC processor was developed for fabrication in Silicon-On-Insulator (SOI) technology from a processor previously designed and fabricated in bulk CMOS [1]. Both the design and the associated CAD methodology (point tools, flow, and models) were modified to handle demands specific to SOI technology. The challenge was to improve the cycle time by adapting the circuit design, timing, and chip integration methodologies to accommodate effects unique to SOI.

49.4 A Framework for Collaborative and Distributed Web-Based Design [p. 898]
Gangadhar Konduri, Anantha Chandrakasan

The increasing complexity and geographical separation of design data, tools and teams has created a need for a collaborative and distributed design environment. In this paper we present a framework that enables collaborative and distributed Web-based CAD, in which the designers can collaborate on a design and efficiently utilize existing design tools on the Internet. The framework includes a Java-based hierarchical collaborative schematic/block editor with interfaces to distributed Web tools and cell libraries, infrastructure to store and manipulate design objects, and protocols for tool communication, message passing and collaboration.


Session 50: Inductance Issues In High Frequency Design

Chair: David Blaauw
Organizers: Anantha Chandrakasan, David Blaauw
50.1 Dealing With Inductance In High-Speed Chip Design [p. 904]
Phillip Restle, Albert Ruehli, Steven G. Walker

Inductance effects in on-chip interconnects have become significant for specific cases such as clock distributions and other highly optimized networks [1,2]. Designers and CAD tool developers are searching for ways to deal with these effects. Unfortunately, accurate on-chip inductance extraction and simulation in the general case are much more difficult than capacitance extraction. In addition, even if ideal extraction tools existed, most chip designers have little experience designing with lossy transmission lines. This tutorial will attempt to demystify on-chip inductance through the discussion of several illustrative examples analyzed using full-wave extraction and simulation methods. A specialized PEEC (Partial Element Equivalent Circuit) method tailored for chip applications was used for most cases. Effects such as overshoot, reflections, frequency dependent effective resistance and inductance will be illustrated using animated visualizations of the full-wave simulations. Simple examples of design techniques to avoid, mitigate, and even take advantage of on-chip inductance effects will be described.

50.2 Interconnect Analysis: From 3-D Structures to Circuit Models [p. 910]
M. Kamon, N. Marques, Y. Massoud, L. Silveira, J. White

In this survey paper we describe the combination of discretized integral formulations, sparsification techniques, and Krylov-subspace based model-order reduction that has led to robust tools for automatic generation of macromodels that represent the distributed RLC effects in 3-D interconnect. A few computational results are presented, mostly to point out the problems yet to be addressed.

50.3 IC Analyses Including Extracted Inductance Models [p. 915]
Michael W. Beattie, Lawrence T. Pileggi

IC inductance extraction generally produces either port inductances based on simplified current path assumptions or a complete partial inductance matrix. Combining either of these results with the IC interconnect resistance and capacitance models significantly complicates most IC design and verification methodologies. In this tutorial paper we will review some of the analysis and verification problems associated with on-chip inductance, and present a subset of recent results for partially addressing the challenges which lie ahead.
Keywords: Interconnect; Inductance; Model Order Reduction.

50.4 On-chip Inductance Issues in Multiconductor Systems [p. 921]
Shannon V. Morton

As the family of Alpha microprocessors continues to scale into more advanced technologies with very high frequency edge rates and multiple layers of interconnect, the issue of characterizing inductive effects and providing a chip-wide design methodology becomes an increasingly complex problem. To address this issue, a test chip has been fabricated to evaluate various conductor configurations and verify the correctness of the simulation approach. The implementation of and results from this test chip are presented in this paper. Furthermore the analysis has been extended to the upcoming EV7 microprocessor, and important aspects of the derivation of its design methodology, as pertains to these inductive effects, are discussed.
Keywords: Alpha microprocessor, semiconductor, interconnect, buses, inductance, resistance, capacitance, RLC, noise, cross-talk, transmission line.


Session 51: Instruction-Level ASIP Design

Chair: Sharad Malik
Organizers: Rajesh Gupta, David Ku
51.1 A Methodology for Accurate Performance Evaluation in Architecture Exploration [p. 927]
George Hadjiyiannis, Pietro Russo, Srinivas Devadas

We present a system that automatically generates a cycle-accurate and bit-true Instruction Level Simulator (ILS) and a hardware implementation model given a description of a target processor. An ILS can be used to obtain a cycle count for a given program running on the target architecture, while the cycle length, die size, and power consumption can be obtained from the hardware implementation model. These figures allow us to accurately and rapidly evaluate target architectures within an architecture exploration methodology for system-level synthesis. In an architecture exploration scheme, both the ILS and the hardware model must be generated automatically, else a substantial programming and hardware design effort has to be expended in each design iteration. Our system uses the ISDL machine description language to support the automatic generation of the ILS and the hardware synthesis model, as well as other related tools.

51.2 LISA - Machine Description Language for Cycle-Accurate Models of Programmable DSP Architectures [p. 933]
Stefan Pees, Andreas Hoffmann, Vojin Zivojnovic, Heinrich Meyr

This paper presents the machine description language LISA for the generation of bit- and cycle-accurate models of DSP processors. Based on a behavioral operation description, the architectural details and pipeline operations of modern DSP processors can be covered. Beyond the behavioral model, LISA descriptions include other architecture-related information like the instruction set. The information provided by LISA models enables automatic generation of simulators and assemblers, which are essential elements of DSP software development environments. In order to prove the applicability of our approach, a realized model of the Texas Instruments TMS320C6201 DSP is presented and derived LISA code examples are given.

51.3 Exploiting Intellectual Properties in ASIP Designs for Embedded DSP Software [p. 939]
Hoon Choi, Ju Hwan Yi, Jong-Yeol Lee, In-Cheol Park, Chong-Min Kyung

The growing requirement to design high-performance systems correctly and in a short time forces the use of IP's in many designs. In this paper, we propose a new approach to select the optimal set of IP's and interfaces so that the application program meets the performance constraints in ASIP designs. The proposed approach selects IP's while considering interfaces, and supports concurrent execution of parts of a task as software code in the kernel with other parts in IP's, whereas previous state-of-the-art approaches do not consider IP's and interfaces simultaneously and cannot support such concurrent execution. The experimental results on real applications show that the proposed approach is effective in making application programs meet the performance constraints using IP's.


Session 52: Analog Synthesis and Modeling

Chair: Hidetoshi Onodera
Organizers: Alan Mantooth, Hidetoshi Onodera
52.1 MAELSTROM: Efficient Simulation-Based Synthesis for Custom Analog Cells [p. 945]
Michael Krasnicki, Rodney Phelps, Rob A. Rutenbar, L. Richard Carley

Analog synthesis tools have failed to migrate into mainstream use primarily because of difficulties in reconciling the simplified models required for synthesis with the industrial-strength simulation environments required for validation. MAELSTROM is a new approach that synthesizes a circuit using the same simulation environment created to validate the circuit. We introduce a novel genetic/ annealing optimizer, and leverage network parallelism to achieve efficient simulator-in-the-loop analog synthesis.
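
The abstract does not detail the combined genetic/annealing optimizer, but the simulator-in-the-loop pattern it relies on can be sketched generically: an optimization loop whose cost function is obtained by invoking the sign-off simulator on each candidate sizing. In the sketch below, run_spice is a placeholder for such a simulator call and the annealing schedule is an assumption, not MAELSTROM's actual algorithm.

import math, random

# Generic simulator-in-the-loop simulated annealing sketch. "run_spice" stands
# in for launching the validation simulator on a candidate sizing and parsing
# a scalar cost from its output; here it is faked with a smooth function.

def run_spice(sizes):
    # Placeholder cost: distance from an arbitrary target sizing.
    return sum((s - 3.0) ** 2 for s in sizes)

def anneal(n_devices=5, steps=2000, t0=10.0, alpha=0.995):
    x = [random.uniform(1.0, 10.0) for _ in range(n_devices)]    # device sizes
    cost, t = run_spice(x), t0
    for _ in range(steps):
        y = [max(0.5, xi + random.gauss(0.0, 0.2)) for xi in x]  # perturb sizes
        cy = run_spice(y)
        if cy < cost or random.random() < math.exp((cost - cy) / t):
            x, cost = y, cy                                      # accept move
        t *= alpha                                               # cool down
    return x, cost

print(anneal())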

52.2 Behavioral Synthesis of Analog Systems using Two-Layered Design Space Exploration [p. 951]
Alex Doboli, Adrian Nunez-Aldana, Nagu Dhanwada, Sree Ganesan, Ranga Vemuri

This paper presents a novel approach for synthesis of analog systems from behavioral VHDL-AMS specifications. We implemented this approach in the VASE behavioral-synthesis tool. The synthesis process produces a netlist of electronic components that are selected from a component library and sized such that the overall area is minimized and the remaining performance constraints, such as power, slew-rate, and bandwidth, are met. The gap between system level specifications and implementations is bridged using a hierarchically-organized design-space exploration methodology. Our methodology performs a two-layered synthesis, the first layer being architecture generation and the second component synthesis and constraint transformation. For architecture generation we suggest a branch-and-bound algorithm, while component synthesis and constraint transformation use a Genetic Algorithm based heuristic method. Crucial to the success of our exploration methodology is a fast and accurate performance estimation engine that embeds technology process parameters, SPICE models for basic circuits, and performance composition equations. We present a telecommunication application as an example to illustrate our synthesis methodology, and show that constraint-satisfying designs can be synthesized in a short time and with a reduced designer effort.

52.3 Circuit Complexity Reduction for Symbolic Analysis of Analog Integrated Circuits [p. 958]
Walter Daems, Georges Gielen, Willy Sansen

This paper presents a method to reduce the complexity of a linear or linearized (small-signal) analog circuit. The reduction technique, based on quality-error ranking, can be used as a standard reduction engine that ensures the validity of the resulting network model in a specific (set of) design point(s) within a given frequency range and a given magnitude and phase error. It can also be used as an analysis engine to extract symbolic expressions for poles and zeroes. The reduction technique is driven by analysis of the signal flow graph associated with the network model. Experimental results show the effectiveness of the approach.


Session 53: Simulation and Test

Chair: Miodrag Potkonjak
Organizers: Mike Murray, Miodrag Potkonjak
53.1 Cycle and Phase Accurate DSP Modeling and Integration for HW/SW Co-Verification [p. 964]
Lisa Guerra, Joachim Fitzner, Dipankar Talukdar, Chris Schläger, Bassam Tabbara, Vojin Zivojnovic

We present our practical experience in the modeling and integration of cycle/phase-accurate instruction set architecture (ISA) models of digital signal processors (DSPs) with other hardware and software components. A common approach to the modeling of processors for HW/SW co-verification relies on instruction-accurate ISA models combined (i.e. wrapped) with the bus interface models (BIM) that generate the clock/phase-accurate timing at the component's interface pins. However, for DSPs and new microprocessors with complex architectural features this approach is from our perspective not acceptable. The additional extensive modeling of the pipeline and other architectural details in the BIM would force us to develop two detailed processor models with a complex BIM API between them. We therefore propose an alternative approach in which the processor ISAs themselves are modeled in a full cycle/phase-accurate fashion. The bus interface model is then reduced to just modeling the connection to the pins. Our models have been integrated into a number of cycle-based and event-driven system simulation environments. We present one such experience in incorporating these models into a VHDL environment. The accuracy has been verified cycle-by-cycle against the gate/RTL level models. Multi-processor debugging and observability into the precise cycle-accurate processor state is provided. The use of co-verification models in place of the RTL resulted in system speedups up to 10 times, with the cycle-accurate ISA models themselves reaching performances of up to 123K cycles/sec.

53.2 A Study in Coverage-Driven Test Generation [p. 970]
Mike Benjamin, Daniel Geist, Alan Hartman, Yaron Wolfsthal, Gerard Mas, Ralph Smeets

One possible solution to the verification crisis is to bridge the gap between formal verification and simulation with hybrid techniques. This paper presents a study of such a functional verification methodology, which uses coverage of formal models to specify tests. The methodology was applied to a modern superscalar microprocessor, and the resulting tests were compared with tests generated using existing methods. The results showed roughly a 50% improvement in transition coverage with less than one third the number of test instructions, demonstrating that hybrid techniques can significantly improve functional verification.
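
An illustrative sketch (assumed, not the authors' tool) of transition coverage: an abstract FSM of the design is enumerated, simulated traces are replayed against it, and uncovered transitions become targets for directed tests. The toy FSM and trace format are placeholders.

# Abstract model: (state, input) -> next state (toy pipeline-control FSM).
NEXT = {
    ("idle", "issue"): "busy",
    ("idle", "nop"):   "idle",
    ("busy", "done"):  "idle",
    ("busy", "stall"): "busy",
}

def transitions_covered(traces):
    # A trace is a list of (state, input) pairs observed during simulation.
    covered = set()
    for trace in traces:
        for state, inp in trace:
            if (state, inp) in NEXT:
                covered.add((state, inp))
    return covered

all_transitions = set(NEXT)
traces = [[("idle", "issue"), ("busy", "done")]]       # e.g. from simulation
covered = transitions_covered(traces)
print("transition coverage: %.0f%%" % (100.0 * len(covered) / len(all_transitions)))
print("generate directed tests for:", sorted(all_transitions - covered))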
Keywords: Functional verification, test generation, formal models, transition coverage

53.3 IC Test Using the Energy Consumption Ratio [p. 976]
Wanli Jiang, Bapiraju Vinnakota

Dynamic-current-based test techniques can potentially address the drawbacks of traditional and Iddq test methodologies, but their quality is degraded by process variations in IC manufacturing. The energy consumption ratio (ECR) is a new metric that improves the effectiveness of dynamic current test by reducing the impact of process variations by an order of magnitude. We address several issues of significant practical importance to an ECR-based test methodology. We use the ECR to test a low-voltage submicron IC with a microprocessor core; the ECR more than doubles the effectiveness of the dynamic current test already used for the IC, and its fault coverage exceeds that of any other test applied, including Iddq. We develop a logic-level fault simulation tool for the ECR and techniques for setting the threshold of an ECR-based test process. Our results demonstrate that the ECR has the potential to be a high-quality, low-cost test methodology. To the best of our knowledge, this is the first dynamic-current-based test technique to be validated on manufactured ICs.
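
The abstract does not give the ECR formula; the sketch below assumes the intuitive form of a ratio of dynamic energies measured on the same die for two stimulus windows, so that a die-wide process scaling factor cancels. Function names and numbers are illustrative only.

def energy(current_samples, vdd, dt):
    """Approximate dynamic energy from sampled supply current."""
    return vdd * sum(i * dt for i in current_samples)

def energy_consumption_ratio(i_window_a, i_window_b, vdd=1.8, dt=1e-9):
    return energy(i_window_a, vdd, dt) / energy(i_window_b, vdd, dt)

# A die-wide process factor k scales both windows and cancels in the ratio:
nominal_a = [2e-3, 3e-3, 1e-3]
nominal_b = [1e-3, 1e-3, 1e-3]
for k in (0.8, 1.0, 1.2):   # fast / nominal / slow dice
    ecr = energy_consumption_ratio([k * i for i in nominal_a],
                                   [k * i for i in nominal_b])
    print(k, round(ecr, 3))  # identical ECR for all three dice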


Session 54: Designing On-Chip Inductors

Chair: Teresa Meng
Organizers: Teresa Meng, David Blaauw
54.1 Design Strategy of On-Chip Inductors for Highly Integrated RF Systems [p. 982]
C. Patrick Yue, S. Simon Wong

This paper describes a physical model for spiral inductors on silicon which is suitable for circuit simulation and layout optimization. Key issues related to inductor modeling such as skin effect and silicon substrate loss are discussed. An effective ground shield is devised to reduce substrate loss and noise coupling. A practical design methodology based on the trade-off between the series resistance and oxide capacitance of an inductor is presented. This method is applied to optimize inductors in state-of-the-art processes with multilevel interconnects. The impact of interconnect scaling, copper metallization and low-K dielectric on the achievable inductor quality factor is studied.
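
The following first-order textbook expressions (not the paper's full model) illustrate the series-resistance and quality-factor behavior the abstract refers to: the effective conductor thickness shrinks toward one skin depth at high frequency, and Q is approximated as wL/Rs, ignoring substrate loss. Geometry values are illustrative.

import math

MU0 = 4e-7 * math.pi
RHO_AL = 2.8e-8          # aluminum resistivity (ohm*m)

def skin_depth(freq, rho=RHO_AL):
    return math.sqrt(rho / (math.pi * freq * MU0))

def series_resistance(length, width, thickness, freq, rho=RHO_AL):
    # Effective thickness saturates at roughly one skin depth at high frequency.
    delta = skin_depth(freq, rho)
    t_eff = delta * (1.0 - math.exp(-thickness / delta))
    return rho * length / (width * t_eff)

def quality_factor(L, length, width, thickness, freq):
    rs = series_resistance(length, width, thickness, freq)
    return 2.0 * math.pi * freq * L / rs      # Q ~ wL/Rs, substrate loss ignored

# 5 nH spiral, 3 mm total length, 10 um wide, 1 um thick metal (illustrative)
for f in (1e9, 2.4e9, 5e9):
    print(f, round(quality_factor(5e-9, 3e-3, 10e-6, 1e-6, f), 2))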
Keywords: Spiral inductor, quality factor, skin effect, substrate loss, substrate coupling, patterned ground shield, interconnects

54.2 The Simulation and Design of Integrated Inductors [p. 988]
N. R. Belk, M. R. Frei, M. Tsai, A. J. Becker, K. L. Tokuda

At present, there are two common types of integrated-circuit inductor simulation tools. The first is based on the Greenhouse method [1] and obtains a solution in a fraction of a second; however, because it does not solve for the inductor's charge and current distributions, its accuracy is limited. The second, method-of-moments (MoM) solvers, determines the charge and current variations by decomposing the inductor into thousands of sub-elements and solving a matrix equation, but takes between minutes and hours to reach a reasonably accurate solution. In this paper, we present a series of algorithms for solving inductors whose radius is small compared to the wavelength of the electrical signal; these algorithms equal or exceed the accuracy of MoM solvers yet obtain their solutions in roughly one second.
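
A sketch of a Greenhouse-style inductance sum using the classical Grover closed-form filament formulas (lengths in cm, results in microhenries); the paper's own refinements for the charge and current distributions are not reproduced here, and the segment geometry is illustrative.

import math

def self_inductance(l, w, t):
    """Straight rectangular segment, Grover approximation (uH; l, w, t in cm)."""
    return 0.002 * l * (math.log(2.0 * l / (w + t)) + 0.50049 + (w + t) / (3.0 * l))

def mutual_inductance(l, d):
    """Two parallel filaments of equal length l at center distance d (uH; cm)."""
    q = l / d
    return 0.002 * l * (math.log(q + math.sqrt(1.0 + q * q))
                        - math.sqrt(1.0 + 1.0 / (q * q)) + 1.0 / q)

# Two parallel 0.1 cm segments, 10 um wide, 1 um thick, 20 um apart,
# carrying current in the same direction (positive mutual coupling).
L1 = self_inductance(0.1, 10e-4, 1e-4)
M  = mutual_inductance(0.1, 20e-4)
print("total (uH):", 2 * L1 + 2 * M)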

54.3 Optimization of Inductor Circuits via Geometric Programming [p. 994]
Maria del Mar Hershenson, Sunderarajan S. Mohan, Stephen P. Boyd, Thomas H. Lee

We present an efficient method for the optimal design and synthesis of CMOS inductors for use in RF circuits. The method uses the physical dimensions of the inductor as the design parameters and handles a variety of specifications, including a fixed inductance value, a minimum self-resonant frequency, and a minimum quality factor. Geometric constraints that can be handled include maximum and minimum values for every design parameter and a limit on total area. Our method is based on formulating the design problem as a special type of optimization problem called a geometric program, for which powerful and efficient interior-point methods have recently been developed. This allows us to solve the inductor synthesis problem globally and extremely efficiently, and to rapidly compute globally optimal trade-off curves between competing objectives such as quality factor and total inductor area. We have fabricated a number of inductors designed by the method and found good agreement between the experimental data and the specifications predicted by our method.
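
An illustrative geometric program in the same spirit, solved with the cvxpy modeling package in geometric-programming mode. The monomial models for inductance and resistance and all coefficient values are placeholders, not the fitted expressions from the paper.

import cvxpy as cp

n = cp.Variable(pos=True)      # number of turns
w = cp.Variable(pos=True)      # trace width (um)
d = cp.Variable(pos=True)      # outer diameter (um)

L_model = 1e-3 * n**2 * d          # inductance (nH), monomial placeholder
R_model = 0.05 * n * d / w         # series resistance (ohm), monomial placeholder
area    = d**2                     # layout area (um^2)

constraints = [
    L_model >= 5.0,                # at least 5 nH
    R_model <= 2.0,                # series-resistance budget
    w >= 2.0, w <= 20.0,
    n >= 2.0, n <= 10.0,
    d <= 400.0,
]

prob = cp.Problem(cp.Minimize(area), constraints)
prob.solve(gp=True)                # geometric-programming mode
print(n.value, w.value, d.value, prob.value)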


Session 55: Panel: What is the Proper System on Chip Design Methodology?

Chair: Richard Goering
Organizer: Stanley J. Krolikoski
Panel Members: Pierre Bricaud, James G. Dougherty, Steve Glaser, Michael Keating, Robert Payne, Davoud Samani [p. 999]

Over the past year, two distinct answers have emerged regarding SoC design methodologies. On the one hand, the Reuse Methodology Manual posits that a logic-synthesis-based design methodology can be used effectively to develop system chips. An alternative methodology focuses on integration (or "reference") platforms and the customization of a basic application-specific platform through the addition of selected SW and/or HW IP blocks. This panel session will debate the merits of these seemingly incompatible proposed SoC methodologies.


Tutorials:

Automated Layout and Migration in Ultra-Deep Submicron VLSI
Andrew B. Kahng, Shmuel Wimer, Cyrus S. Bamji, Chris Strolenberg

Analog and Mixed-Signal Modeling Using the VHDL-AMS Language
Ernst Christen, Kenneth Bakalar, Allen M. Dewey, Eduard Moser

Information Visualization: Creating Advanced Visual Interfaces for Analyzing Designs
Kevin Mullet, Erric Solomon, David Overhauser

Design Reuse Tools and Methodologies
Michael Keating, Warren Savage, Ulf Schlichtmann, Pierre Bricaud

Embedded Memories in System Design - From Technology to Systems Architecture
Francky Catthoor, Soren Hein, Vijay Nagasamy, Bernhard Rohfleisch, Christoforos E. Kozyrakis, Nikil D. Dutt

Built-In Self-Test for Systems on a Chip
Janusz Rajski, Jerzy Tyszer, Nilanjan Mukherjee