ICCAD 2001 Abstracts

Sessions: [1A] [1B] [1C] [1D] [2A] [2B] [3A] [3B] [3C] [3D] [Panel] [4A] [4B] [4C] [4D] [5A] [5B] [6A] [6B] [6C] [6D] [7A] [7B] [7C] [7D] [8A] [8B] [8C] [8D] [9A] [9B] [9C] [9D] [10A] [10B] [10C] [11A]


Session 1A: Dynamic Verification

Moderators: Jay Lawrence, Cadence Design Systems, Inc., Chelmsford, MA
Kunle Olokotun, Stanford University, Stanford, CA
1A.1 Static Scheduling of Multi-Domain Memories For Functional Verification [p. 2]
Murali Kudlugi, Charles Selvidge, Russell Tessier

Over the past decade both the quantity and complexity of available on-chip memory resources have increased dramatically. In order to ensure accurate ASIC behavior, both logic functions and memory resources must be successfully verified before fabrication. Often, the functional verification of contemporary ASIC memory is complicated by the presence of multiple design clocks that operate asynchronously to each other. The presence of multiple clock domains presents significant challenges for large parallel verification systems such as parallel simulators and logic emulators that model both design logic and memory. Specifically, multiple asynchronous design clocks make it difficult to verify that design hold times are met during memory model execution and that causality along memory data/control paths is preserved during signal communication. In this paper, we describe new scheduling heuristics for memory-based designs with multiple asynchronous clock domains that are mapped to parallel verification systems. Our scheduling approach scales to an unlimited number of clock domains and converges quickly to a feasible solution if one exists. It is shown that when our technique is applied to an FPGA-based emulator containing 48 MB of SRAM, evaluation fidelity is maintained and increased verification performance is achieved for large, memory-intensive circuits with multiple asynchronous clock domains.

1A.2 A Simulation-Based Method for the Verification of Shared Memory in Multiprocessor Systems [p. 10]
Scott A. Taylor, Carl Ramey, Craig Barner, David Asher

As processor architectural complexity increases, greater effort must be focused on functional verification of the chip as a component of the system. Multiprocessor verification presents a particular challenge in terms of both difficulty and importance. While formal methods have made significant progress in the validation of coherence protocols, these methods are not always practical to apply to the structural implementation of a complex microprocessor. This paper describes a simulation-based approach to modeling and checking the shared-memory properties of the Alpha architecture by using a directed acyclic graph to represent memory-access orderings. The resulting tool is integrated with a simulation model of an Alpha implementation, allowing the user to verify aspects of the implementation with respect to the overall architectural specification. Both an implementation-independent and an implementation-specific version of the tool are discussed.
Keywords: Microprocessor, Functional Verification, Shared Memory, Checking
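
The core data structure here, a directed acyclic graph of observed memory-access orderings, admits a compact consistency check: if the collected ordering edges ever form a cycle, the trace violates the memory model. A minimal sketch in Python, with hypothetical event and edge names rather than the tool's actual interface:

```python
# Minimal sketch: a legal trace must induce a DAG over its memory accesses,
# so an ordering violation shows up as a cycle. Names are illustrative only.
def has_ordering_violation(events, edges):
    """events: list of access ids; edges: set of (before, after) pairs."""
    succ = {e: [] for e in events}
    indeg = {e: 0 for e in events}
    for a, b in edges:
        succ[a].append(b)
        indeg[b] += 1
    # Kahn's algorithm: a topological order covering all events means no cycle.
    frontier = [e for e in events if indeg[e] == 0]
    visited = 0
    while frontier:
        n = frontier.pop()
        visited += 1
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                frontier.append(m)
    return visited != len(events)  # True => cycle => ordering violation

# Example: a write ordered both before and after the same read is a cycle.
print(has_ordering_violation(["w1", "r1"], {("w1", "r1"), ("r1", "w1")}))  # True
```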

1A.3 Predicting the Performance of Synchronous Discrete Event Simulation Systems [p. 18]
Jinsheng Xu, Moon Jung Chung

In this paper we propose a model to predict the performance of synchronous discrete event simulation. The model considers parameters including the number of active objects per cycle, event execution granularity, and communication cost. We derive a single formula that predicts the performance of synchronous simulation. We have benchmarked several VHDL circuits on an SGI Origin 2000. The benchmark results show that the prediction model explains more than 90% of parallel simulation execution time. We also measure the effect of computation granularity on performance. The benchmark results show that although higher granularity can yield better speedup because computation dominates communication, computational granularity cannot overshadow the inherent synchronization cost. This model can be used to predict the speedup expected for synchronous simulation, and to decide whether synchronous simulation is worthwhile before actually implementing it.
Keywords: Parallel Discrete Event Simulation, Synchronous Simulation, Synchronization Cost, Communication Cost, Granularity, Performance.
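
The abstract does not reproduce the formula itself; an illustrative form consistent with the parameters it names (active objects per cycle, event granularity, communication cost) might be, per simulation cycle on P processors:

```latex
% Illustrative cycle-time model (not the paper's exact formula):
% N = active objects per cycle, g = event execution granularity,
% C(P) = per-cycle synchronization/communication cost on P processors.
T_{\mathrm{cycle}}(P) \approx \frac{N\,g}{P} + C(P),
\qquad
\mathrm{Speedup}(P) \approx \frac{N\,g}{\frac{N\,g}{P} + C(P)}
```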


Session 1B: System-Level Exploration and Design

Moderators: Yanbing Li, Synopsys, Inc., Mountain View, CA
Xiaobo (Sharon) Hu, University of Notre Dame, Notre Dame, IN
1B.1 System-Level Exploration for Pareto-Optimal Configurations in Parameterized Systems-on-a-Chip [p. 25]
Tony D. Givargis, Frank Vahid, Joerg Henkel

In this work, we provide a technique for efficiently exploring the configuration space of a parameterized system-on-a-chip (SOC) architecture to find all Pareto-optimal configurations. These configurations represent the range of meaningful power and performance tradeoffs that are obtainable by adjusting parameter values for a fixed application mapped onto the SOC architecture. Our approach extensively prunes the potentially large configuration space by taking advantage of parameter dependencies. We have successfully incorporated our technique into the parameterized SOC tuning environment (Platune) and applied it to a number of applications.
Keywords: System-on-a-chip, parameterized architectures, configurable platforms, embedded systems, system-level exploration, low-power system design, platform tuning.
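
As a small illustration of the Pareto-optimality criterion driving the exploration, the function below is a generic dominance filter over (power, execution time) pairs, assuming lower is better in both metrics; it is illustrative only, not Platune's code:

```python
# Generic Pareto filter over (power, exec_time) pairs; lower is better in
# both dimensions. Illustrative only -- not the Platune implementation.
def pareto_optimal(configs):
    """configs: list of (config_id, power, exec_time) tuples."""
    def dominates(a, b):
        # a dominates b if a is no worse in both metrics and better in one.
        return (a[1] <= b[1] and a[2] <= b[2]) and (a[1] < b[1] or a[2] < b[2])
    return [c for c in configs
            if not any(dominates(other, c) for other in configs)]

# Example: "c1" is dominated by "c0" and filtered out.
print(pareto_optimal([("c0", 1.0, 9.0), ("c1", 2.0, 10.0), ("c2", 3.0, 4.0)]))
```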

1B.2 System Level Design with Spade: an M-JPEG Case Study [p. 31]
Paul Lieverse, Todor Stefanov, Pieter van der Wolf, Ed F. Deprettere

In this paper we present and evaluate the SPADE (System level Performance Analysis and Design space Exploration) methodology through an illustrative case study. SPADE is a method and tool for architecture exploration of heterogeneous signal processing systems. In this case study we start from an M-JPEG application and use SPADE to evaluate alternative multiprocessor architectures for implementing this application. SPADE follows the Y-chart paradigm for system level design; application and architecture are modeled separately and mapped onto each other in an explicit design step. SPADE permits architectures to be modeled at an abstract level using a library of generic building blocks, thereby reducing the cost of model construction and simulation. The case study shows that SPADE supports efficient exploration of candidate architectures; models can be easily constructed, modified and simulated in order to quickly evaluate alternative system implementations.
Keywords: system level design, design space exploration, application modeling, architecture modeling, M-JPEG

1B.3 NetBench: A Benchmarking Suite for Network Processors [p. 39]
Gokhan Memik, William H. Mangione-Smith, Wendong Hu

In this study we introduce NetBench, a benchmarking suite for network processors. NetBench contains a total of 9 applications that are representative of commercial applications for network processors. These applications span all levels of packet processing; small, low-level code fragments as well as large application-level programs are included in the suite. Using the SimpleScalar simulator, we study the NetBench programs in detail and characterize the network processor workloads. We also compare key characteristics such as instructions per cycle, instruction distribution, branch prediction accuracy, and cache behavior with the programs from MediaBench. Although the targeted architectures are similar for the MediaBench and NetBench suites, we show that these workloads have significantly different characteristics. Hence a separate benchmarking suite for network processors is a necessity. Finally, we present performance measurements from the Intel IXP1200 Network Processor to show how NetBench can be utilized.


Session 1C: Interconnect Planning

Moderators: Robi Dutta, Synopsys, Inc., Mountain View, CA
Chung-Kuan Cheng, University of California at San Diego, La Jolla, CA
1C.1 Analysis of Substrate Thermal Gradient Effects on Optimal Buffer Insertion [p. 44]
Amir H. Ajami, Kaustav Banerjee, Massoud Pedram

This paper studies the effects of the substrate thermal gradients on the buffer insertion techniques. Using a non-uniform temperature-dependent distributed RC interconnect delay model, the buffer insertion problem is analyzed and design guidelines are provided to ensure the near-optimality of the signal performance in the presence of the thermal gradients. In addition, the effect of temperature-dependent driver resistance on the buffer insertion is studied. Experimental results show that neglecting thermal gradients in the substrate and the interconnect lines can result in non-optimal solutions when using standard buffer insertion techniques and that these effects intensify with technology scaling.
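
The non-uniform temperature dependence enters the delay model through the wire resistance; a standard linear form, consistent with though not quoted from the paper, is:

```latex
% Linear temperature dependence of unit-length wire resistance: r_0 is the
% resistance at reference temperature T_0, \beta the temperature coefficient,
% and T(x) the substrate-induced thermal profile along the line.
r(x) = r_0\,\bigl[\,1 + \beta\,\bigl(T(x) - T_0\bigr)\bigr]
```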

1C.2 A New Algorithm for Routing Tree Construction with Buffer Insertion and Wire Sizing under Obstacle Constraints [p. 49]
Xiaoping Tang, Ruiqi Tian, Hua Xiang, D.F. Wong

Buffer insertion and wire sizing are critical in deep submicron VLSI design. This paper studies the problem of constructing routing trees with simultaneous buffer insertion and wire sizing in the presence of routing and buffer obstacles. No previous algorithm considers all these factors simultaneously. We first extend a previous dynamic programming (DP) based algorithm to solve the problem. However, as the routing graph grows and wire sizing is taken into account, its time and space requirements increase enormously. We then propose a new approach that formulates the problem as a series of graph problems: the routing tree solution is obtained by finding shortest paths in a series of graphs. In the new approach, wire sizing is handled almost without any additional time and space, and the time and space requirements are only polynomial in the size of the routing graph. Our algorithm differs from traditional dynamic programming and is capable of addressing inverter insertion and sink polarity. Both theoretical and experimental results show that the graph-based algorithm outperforms the DP-based algorithm by a large margin. We also propose a hierarchical approach to construct routing trees for large numbers of sinks.

1C.3 Bus Encoding to Prevent Crosstalk Delay [p. 57]
Bret M. Victor, Kurt Keutzer

The propagation delay across long on-chip buses is increasingly becoming a limiting factor in high-speed designs. Crosstalk between adjacent wires on the bus may create a significant portion of this delay. Placing a shield wire between each signal wire alleviates the crosstalk problem but doubles the area used by the bus, an unacceptable consequence when the bus is routed using scarce top-level metal resources. Instead, we propose to employ data encoding to eliminate crosstalk delay within a bus. This paper presents a rigorous analysis of the theory behind "self-shielding codes", and gives the fundamental theoretical limits on the performance of codes with and without memory. Specifically, we find that a 32-bit bus can be encoded with 40 wires using a code with memory or 46 wires with a memoryless code, in comparison to the 63 wires required with simple shielding.
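
The crosstalk delay these codes avoid comes from adjacent wires switching in opposite directions on the same cycle; a minimal sketch of that forbidden-transition check (illustrative, not the paper's code):

```python
# Sketch: a bus transition is "self-shielded" if no two adjacent wires
# switch in opposite directions simultaneously -- the worst-case crosstalk
# pattern the paper's codes are built to exclude. Illustrative only.
def self_shielded(prev_word, next_word):
    """prev_word, next_word: equal-length '0'/'1' strings, one char per wire."""
    delta = [int(n) - int(p) for p, n in zip(prev_word, next_word)]
    return all(a * b >= 0 for a, b in zip(delta, delta[1:]))

print(self_shielded("0011", "0101"))  # False: wires 1 and 2 switch oppositely
print(self_shielded("0011", "0111"))  # True: only wire 1 rises
```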


Session 1D: Analog Macromodeling

Moderators: Rob A. Rutenbar, Carnegie Mellon University, Pittsburgh, PA
Henry Chang, Cadence Design Systems, Inc., San Jose, CA
1D.1 Behavioral Modeling of Analog Circuits by Wavelet Collocation Method [p. 65]
Xin Li, Xuan Zeng, Dian Zhou, Xieting Ling

In this paper, we develop a wavelet collocation method with nonlinear companding for behavioral modeling of analog circuits. To construct the behavioral models, the circuit is first partitioned into building blocks and the input-output function of each block is then approximated by wavelets. As the blocks are mathematically represented by sets of simple wavelet basis functions, the computation cost for the behavioral simulation is significantly reduced. The proposed method presents several merits compared with conventional techniques. First, the algorithm for expanding input-output functions by wavelets is a general-purpose approach that can be applied to automatic modeling of different analog circuit blocks with different structures. Second, both the small signal effect and the large signal effect are modeled in a unified formulation, which eases the process of modeling and simulation. Third, a nonlinear companding method is developed to control the modeling error distribution. To demonstrate the promising features of the proposed method, a 4th order switched-current filter is employed to build the behavioral model.

1D.2 Simulation-Based Automatic Generation of Signomial and Posynomial Performance Models for Analog Integrated Circuit Sizing [p. 70]
Walter Daems, Georges Gielen, Willy Sansen

This paper presents a method to automatically generate posynomial response surface models for the performance parameters of analog integrated circuits. The posynomial models enable the use of efficient geometric programming techniques for circuit sizing and optimization. To avoid manual derivation of approximate symbolic equations and subsequent casting to posynomial format, techniques from design of experiments and response surface modeling in combination with SPICE simulations are used to generate signomial and posynomial models in an automatic way. Attention is paid to estimating the relative "goodness-of-fit" of the generated models. Experimental results make it possible to assess both the quality of the generated models and the strengths and limitations of the presented approach.
Keywords: Analog Circuit Modeling, Posynomial and Signomial Response Surface Modeling, Geometric Programming, Design of Experiments
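
For reference, the model class the generated templates must fit is the standard posynomial form; a signomial simply drops the positivity constraint on the coefficients:

```latex
% Posynomial in design variables x_1..x_n > 0: positive coefficients c_k,
% arbitrary real exponents a_{ik}. A signomial allows c_k of either sign.
f(x_1,\dots,x_n) \;=\; \sum_{k=1}^{K} c_k \prod_{i=1}^{n} x_i^{a_{ik}},
\qquad c_k > 0,\; a_{ik} \in \mathbb{R}
```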

1D.3 Power Grid Transient Simulation in Linear Time Based on Transmission-Line-Modeling Alternating-Direction-Implicit Method [p. 75]
Yu-Min Lee, Charlie Chung-Ping Chen

The soaring clock frequency and integration density demand robust and stable power delivery to support tens of millions of switching transistors. To ensure the design quality of power delivery, extensive transient power grid simulations need to be performed during the design process. However, traditional circuit simulation engines do not scale with the complexity of power delivery; as a result, simulating even a medium-size power grid circuit often requires long runtimes and huge amounts of memory. In this paper, we develop and present a new efficient transient simulation algorithm for power distribution. The proposed algorithm, TLM-ADI (Transmission-Line-Modeling Alternating-Direction-Implicit), first models the power delivery structure as a transmission line mesh, then solves the transient MNA matrices by the alternating-direction-implicit method. The algorithm, with linear runtime and memory requirements, is also unconditionally stable, which ensures that the time step is not limited by any stability requirement. Extensive experimental results show that the proposed algorithm is not only orders of magnitude faster than SPICE but also extremely memory-efficient and accurate.
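
The alternating-direction-implicit update behind TLM-ADI follows the classical two-half-step pattern; a generic form (the paper's exact operator split is not reproduced here) is:

```latex
% Generic ADI splitting of dx/dt = (A_x + A_y)x + b over one time step h:
% each half-step is implicit in only one direction, so each solve is
% tridiagonal and costs time linear in the grid size.
\Bigl(I - \tfrac{h}{2}A_x\Bigr)x^{n+1/2} = \Bigl(I + \tfrac{h}{2}A_y\Bigr)x^{n} + \tfrac{h}{2}\,b
\qquad
\Bigl(I - \tfrac{h}{2}A_y\Bigr)x^{n+1} = \Bigl(I + \tfrac{h}{2}A_x\Bigr)x^{n+1/2} + \tfrac{h}{2}\,b
```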


Session 2A: Embedded Tutorial: Platform-Based Designs

Moderator: Ellen M. Sentovich, Cadence Berkeley Labs., Berkeley, CA

Session 2B: Embedded Tutorial: VLSI Microsystems: The Power of Many

Moderator: Mattan Kamon, Coventor, Inc., Cambridge, MA

Session 3A: Sequential Synthesis

Moderators: Hamid Savoj, Magma Design Automation, Inc., Cupertino, CA
Diana Marculescu, Carnegie Mellon University, Pittsburgh, PA
3A.1 Sequential SPFDs [p. 84]
Subarnarekha Sinha, Andreas Kuehlmann, Robert K. Brayton

SPFDs are a mechanism to express flexibility in Boolean networks. Introduced by Yamashita et al. in the context of FPGA synthesis [4], they were extended later to general combinational networks [2]. We introduce the concept of sequential SPFDs and provide an algorithm to compute them based on a partition of the state bits. The SPFDs of each component in the partition are used to generate equivalence classes of states. We provide a formal relation between the resulting state classification and the equivalence classes produced by classical state minimization of completely specified machines [6]. The SPFDs associated with the state bits can be applied for re-encoding the state space. For this, we give an algorithm to re-synthesize the sequential circuit using sequential SPFDs and the new state re-encoding.

3A.2 On the Optimization Power of Redundancy Addition and Removal Techniques for Sequential Circuits [p. 91]
Enrique San Millán, Luis Entrena, José Alberto Espejo

This paper attempts to determine the capabilities of existing Sequential Redundancy Addition and Removal (SRAR) techniques for logic optimization of sequential circuits. To this purpose, we compare this method with Retiming and Resynthesis (RaR) techniques. For the RaR case, the set of possible transformations has been established by other authors by relating them to STG transformations. Following these works, we first formally demonstrate that the logic transformations provided by RaR are covered by SRAR as well. Then we also show that SRAR is able to identify transformations that cannot be found by RaR. This way we prove the higher potential of Sequential Redundancy Addition and Removal over Retiming and Resynthesis techniques.

3A.3 Placement Driven Retiming with a Coupled Edge Timing Model [p. 95]
Ingmar Neumann, Wolfgang Kunz

Retiming is a widely investigated technique for performance optimization. It performs powerful modifications on a circuit netlist. However, it is often unclear whether the predicted performance improvement will still be valid after placement has been performed. This paper presents a new retiming algorithm using a highly accurate timing model that takes into account the effect of retiming on the capacitive loads of single wires as well as fanout systems. We propose the integration of retiming into a timing-driven standard cell placement environment based on simulated annealing. Retiming is used as an optimization technique throughout the whole placement process. The experimental results show the benefit of the proposed approach. In comparison with a conventional design flow based on standard FEAS, our approach achieved an improvement in cycle time of up to 34% and 17% on average.

3A.4 Solution of Parallel Language Equations for Logic Synthesis [p. 103]
Nina Yevtushenko, Tiziano Villa, Robert K. Brayton, Alex Petrenko, Alberto L. Sangiovanni-Vincentelli

The problem of designing a component that, combined with a known part of a system, conforms to a given overall specification arises in several applications ranging from logic synthesis to the design of discrete controllers. We cast the problem as solving abstract equations over languages. Language equations can be defined with respect to several language composition operators, such as synchronous composition and parallel composition; conformity can be checked by language containment. In this paper we address parallel language equations.
Parallel composition arises in the context of modeling delay-insensitive processes and their environments. The parallel composition operator models an exchange protocol by which an input is followed by an output after a finite exchange of internal signals. It abstracts a system with two components with a single message in transit, such that at each instance either the components exchange messages or one of them communicates with its environment, which submits the next external input to the system only after the system has produced an external output in response to the previous input.
We study the most general solutions of the language equation A ◇ X ⊆ C, where ◇ denotes parallel composition and ⊆ language containment, and define the language operators needed to express them. Then we specialize such equations to languages associated with important classes of automata used for modeling systems, e.g., regular languages and FSM languages. In particular, for A ◇ X ⊆ C, we give algorithms for computing: the largest FSM language solution, the largest complete solution, and the largest solution whose composition with A yields a complete FSM language. We also solve FSM equations under bounded parallel composition. In this paper, we give concrete algorithms for computing such solutions, and state and prove their correctness.
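
For orientation, the classical shape of the most general solution carries over from the synchronous case: writing ◇ for parallel composition and overlines for language complement, the largest solution of the containment is

```latex
% Largest (most general) solution of A \diamond X \subseteq C:
% every solution is contained in X_max, and X_max itself is a solution
% whenever the equation is solvable at all.
X_{\max} \;=\; \overline{\,A \diamond \overline{C}\,}
```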


Session 3B: Compiler Techniques in System Level Design

Moderators: Radu Marculescu, Carnegie Mellon University, Pittsburgh, PA
Grant Martin, Cadence Design Systems, Inc., San Jose, CA
3B.1 CALiBeR: A Software Pipelining Algorithm for Clustered Embedded VLIW Processors [p. 112]
Cagdas Akturan, Margarida F. Jacome

In this paper we describe a software pipelining framework, CALiBeR (Cluster Aware Load Balancing Retiming Algorithm), suitable for compilers targeting clustered embedded VLIW processors. CALiBeR can be effectively used by embedded system designers to explore different code optimization alternatives, i.e., can assist the generation of high-quality customized retiming solutions for desired program memory size and throughput requirements, while minimizing register pressure. An extensive set of experimental results is presented, considering several representative benchmark loop kernels and a wide variety of clustered datapath configurations, demonstrating that our algorithm compares favorably with one of the best state-of-the-art algorithms, achieving up to 50% improvement in performance and up to 47% improvement in register requirements.

3B.2 Software-Assisted Cache Replacement Mechanisms for Embedded Systems [p. 119]
Prabhat Jain, Srinivas Devadas, Daniel Engels, Larry Rudolph

We address the problem of improving cache predictability and performance in embedded systems through the use of software-assisted replacement mechanisms. These mechanisms require additional software controlled state information that affects the cache replacement decision. Software instructions allow a program to kill a particular cache element, i.e., effectively make the element the least recently used element, or keep that cache element, i.e., the element will never be evicted. We prove basic theorems that provide conditions under which kill and keep instructions can be inserted into program code, such that the resulting performance is guaranteed to be as good as or better than the original program run using the standard LRU policy. We developed a compiler algorithm based on the theoretical results that, given an arbitrary program, determines when to perform software-assisted replacement, i.e., when to insert either a kill or keep instruction. Empirical evidence is provided that shows that performance and predictability (worst-case performance) can be improved for many programs.
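
A minimal sketch of the replacement policy the kill/keep hints induce, assuming a single fully associative set; the class and its structure are illustrative, not the authors' hardware:

```python
# Sketch of software-assisted replacement for one fully associative set:
# "kill" demotes a line to least-recently-used, "keep" pins it against
# eviction. Illustrative only -- not the paper's actual mechanism.
class HintedLRUSet:
    def __init__(self, ways):
        self.ways = ways
        self.lines = []          # MRU at the front, LRU at the back
        self.kept = set()        # tags pinned by "keep"

    def access(self, tag):
        if tag in self.lines:                    # hit: promote to MRU
            self.lines.remove(tag)
        elif len(self.lines) == self.ways:       # miss in a full set: evict
            # Prefer the LRU line not pinned by "keep"; fall back to the
            # true LRU if every line is pinned, to keep the sketch total.
            unpinned = [t for t in reversed(self.lines) if t not in self.kept]
            victim = unpinned[0] if unpinned else self.lines[-1]
            self.lines.remove(victim)
        self.lines.insert(0, tag)

    def kill(self, tag):
        # Demote tag to LRU so it becomes the next eviction victim.
        if tag in self.lines:
            self.lines.remove(tag)
            self.lines.append(tag)
        self.kept.discard(tag)

    def keep(self, tag):
        # Pin tag: it will not be chosen as a victim while pinned.
        self.kept.add(tag)
```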

3B.3 Instruction Generation for Hybrid Reconfigurable Systems [p. 127]
Ryan Kastner, Seda Ogrenci-Memik, Elaheh Bozorgzadeh, Majid Sarrafzadeh

In this work, we present an algorithm for simultaneous template generation and matching. The algorithm profiles the graph and iteratively contracts edges to create the templates. The algorithm is general and can be applied to any type of graph, including directed graphs and hypergraphs. We discuss how to target the algorithm towards the novel problem of instruction generation and selection for hybrid (re)configurable systems. In particular, we target the Strategically Programmable System, which embeds complex computational units like ALUs, IP blocks, etc. into a configurable fabric. We argue that an essential compilation step for these systems is instruction generation, as it is needed to specify the functionality of the embedded computational units. Additionally, instruction generation can be used to create soft macros: tightly sequenced, pre-specified operations placed in the configurable fabric.


Session 3C: Routing Architecture and Techniques for FPGAs

Moderators: Majid Sarrafzadeh, University of California, Los Angeles, CA
Rajeev Jayaraman, Xilinx, Inc., San Jose, CA
3C.1 Interconnect Resource-Aware Placement for Hierarchical FPGAs [p. 132]
Amit Singh, Ganapathy Parthasarathy, Malgorzata Marek-Sadowska

In this paper, we utilize Rent's rule as an empirical measure for efficient clustering and placement of circuits on hierarchical FPGAs. We show that careful matching of design complexity and architecture resources of hierarchical FPGAs can have a positive impact on the overall device area. We propose a circuit placement algorithm based on Rent's parameter and show that our clustering and placement techniques can improve the overall device routing area by as much as 21% for the same array size, when compared to a state-of-the-art FPGA placement and routing tool.
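
Rent's rule, the empirical measure the placement algorithm is built around, relates the terminal count of a block to the number of gates it contains:

```latex
% Rent's rule: T terminals for a block of g gates, with t the average
% number of terminals per gate and p (0 < p < 1) the Rent exponent;
% higher p means more interconnect-hungry logic.
T = t\,g^{\,p}
```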

3C.2 A Router for Symmetrical FPGAs Based on Exact Routing Density Evaluation [p. 137]
Nak-Woong Eum, Taewhan Kim, Chong Min Kyung

This paper presents a new performance and routability driven routing algorithm for symmetrical array based field-programmable gate arrays (FPGAs). A key contribution of our work is to overcome one essential limitation of previous routing algorithms: inaccurate estimations of routing density which were too general for symmetrical FPGAs. To this end, we derive an exact routing density calculation that is based on a precise analysis of the structure (switch block) of symmetrical FPGAs, and utilize it consistently in global and detailed routing. With the introduction of the proposed accurate routing metrics, we design a new routing algorithm, a cost-effective net-decomposition based routing, which is fast and yet produces remarkable routing results in terms of both routability and path/net delays. We performed an extensive experiment to show the effectiveness of our algorithm based on the proposed cost metrics.

3C.3 A Search-Based Bump-and-Refit Approach to Incremental Routing for ECO Applications in FPGAs [p. 144]
Vinay Verma, Shantanu S. Dutt

Incremental physical CAD is encountered frequently in the so-called engineering change order (ECO) process in which design changes are made typically late in the design process in order to correct logical and/or technological problems in the circuit. As far as routing is concerned, in order to capitalize on the enormous resources and time already spent on routing the circuit, and to meet time-to-market requirements, it is desirable to re-route only the ECO-affected portion of the circuit, while minimizing any routing changes in the much larger unaffected part of the circuit. Incremental re-routing also needs to be fast and to effectively use available routing resources. In this paper, we develop a complete incremental routing methodology for FPGAs using a novel approach called bump and refit (B&R); B&R was initially proposed in [4] in the much simpler context of extending some nets by a segment (for the purpose of fault tolerance) for FPGAs with simple i-to-i switchboxes. Here we significantly extend this concept to global and detailed incremental routing for FPGAs with complex switchboxes such as those in Lucent's ORCA and Xilinx's Virtex series. We also introduce new concepts such as B&R cost estimation during global routing, and determination of the optimal subnet set to bump for each bumped net, which we obtain using an efficient dynamic programming formulation. The basic B&R idea in our algorithms is to re-arrange some portions of some existing nets on other tracks within their current channels to find valid routings for the incrementally changed circuit without requiring any extra routing resources (i.e., completely unused tracks), and with little effect on the electrical properties of existing nets. We have developed optimal and near-optimal algorithms (called Sub-sec B&R and Subnet B&R, respectively) to find incremental routing solutions using the B&R paradigm in complex FPGAs. We implemented these algorithms for Lucent's ORCA-2C FPGA, and compared our algorithms to two recent incremental routing techniques, Standard and Rip-up&Reroute, and to Lucent's A PAR routing tool. Experimental results show that our incremental routers perform very well for ECO applications. Firstly, B&R is 10 to 20 times faster than complete re-routing using A PAR. Further, the B&R incremental routers are nearly 27% faster and the new nets have nearly 10% smaller lengths than in previous incremental techniques. Also, the B&R routers do not change either the lengths or topologies of existing nets, a significant advantage in ECO applications, in contrast to Rip-up&Reroute which increases the length of ripped up nets by an average of 8.75% to 13.6%.


Session 3D: Interconnect Performance and Reliability Optimization

Moderators: David D. Ling, IBM Corp. TJ Watson Research Center, Yorktown Heights, NY
Ken Kundert, Cadence Design Systems, Inc., San Jose, CA

3D.1 Area Minimization of Power Distribution Network Using Efficient Nonlinear Programming Techniques [p. 153]
Xiaohai Wu, Xianlong Hong, Yici Cai, C.K. Cheng, Jun Gu, Wayne Dai

This paper deals with area minimization of power distribution networks for VLSI. A new algorithm based on efficient nonlinear programming techniques is presented to solve this problem. Experimental results show that the algorithm achieves its objective, minimizing the area of power/ground networks, at higher speed.

3D.2 Coupled Analysis of Electromigration Reliability and Performance in ULSI Signal Nets [p. 158]
Kaustav Banerjee, Amit Mehrotra

In deep submicron VLSI circuits, interconnect reliability due to electromigration and thermal effects is fast becoming a serious design issue, particularly for long signal lines. This paper presents for the first time a rigorous coupled analysis of the AC electromigration that is prevalent in signal lines and the thermal effects arising from Joule heating of the wires. The analysis is applied to study the effect of technology scaling using ITRS data, wherein the effects of increasing interconnect (Cu) resistivity with shrinking line dimensions and the effect of a finite barrier metal thickness have been included. Finally, we have also quantified the reliability implications for minimum sized vias in optimally buffered signal nets. Our analysis suggests that for optimally buffered interconnects, while the maximum current density in the line remains limited by performance, the current density in the vias exceeds the reliability limits and therefore requires careful consideration in the physical design process flow.
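
The coupling the paper analyzes matters because electromigration lifetime is exponentially sensitive to the Joule-heating-induced metal temperature; in the classical Black's-equation form (the standard baseline for such analyses, not the paper's full coupled model):

```latex
% Black's equation: median time to failure of a line carrying current
% density J at metal temperature T_m, with activation energy E_a,
% Boltzmann constant k, current-density exponent n, and constant A.
\mathrm{MTTF} \;=\; A\,J^{-n}\exp\!\Bigl(\frac{E_a}{k\,T_m}\Bigr)
```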

3D.3 Compact Modeling and SPICE-Based Simulation for Electrothermal Analysis of Multilevel ULSI Interconnects [p. 165]
TingYen Chiang, Kaustav Banerjee, Krishna C. Saraswat

This paper presents both compact analytical models and a fast SPICE-based 3-D electro-thermal simulation methodology to characterize thermal effects due to Joule heating in high performance Cu/low-k interconnects under steady-state and transient stress conditions. The results demonstrate excellent agreement with experimental data and with Finite Element (FE) thermal simulations (ANSYS). The effect of vias, as additional heat sinking paths that alleviate the temperature rise in the metal wires, is included in our analysis to provide more accurate and realistic thermal diagnosis. It is shown that the effectiveness of vias in reducing the temperature rise in interconnects is highly dependent on the via separation and the dielectric materials used. The analytical model is then applied to estimate the temperature distribution in multi-level interconnects. In addition, we discuss the possibility that, under the impact of thermal effects, the performance improvement expected from the use of low-k dielectric materials may be degraded. Furthermore, thermal coupling between wires is evaluated and found to be significant. Finally, the impact of metal wire aspect ratio on interconnect thermal characteristics is discussed.


Panel:

Moderator: Andreas Kuehlmann, Cadence Berkeley Laboratories, Berkeley, CA
Panelists: Robert W. Dutton, Paul Franzon, Seth C. Goldstein, Philip Kuekes, Eric Parker, Thomas N. Theis
Will Nanotechnology Change the Way We Design and Verify Systems? [p. 174]

During the past few decades advances in semiconductor technology have had only modest implications for the design and verification of integrated systems. While "deep-submicron" effects created new challenges for guaranteeing electrical integrity and necessitated the bundling of traditionally separated design steps such as logic synthesis and physical layout, the general integrated circuit design flow did not change. But what will happen when semiconductor technology hits the wall? Which of the currently futuristic nanotechnologies will step in and how will they change the way we design and verify systems? Will we just replace the implementation of the NAND-gate and D-flipflop and then do business as usual? Or do these technologies force us to abandon convenient hierarchical modeling of electrical, logical, and system levels? The panel will discuss the exciting opportunities of nanotechnologies and their implication for the design and verification of future systems.


Session 4A: Circuit Structure in Formal Verification

Moderators: Masahiro Fujita, University of Tokyo, Tokyo, Japan
Vigyan Singhal, Tempus Fugit, Berkeley, CA
4A.1 Min-Area Retiming on Dynamic Circuit Structures [p. 176]
Jason Baumgartner, Andreas Kuehlmann

In this paper we present two techniques for improving min-area retiming that combine the actual register minimization with combinational optimization. First, we discuss an on-the-fly retiming approach based on a sequential AND/INVERTER/REGISTER graph. With this method the circuit structure is sequentially compacted using a combination of register "dragging" and AND vertex hashing. Second, we present an extension of the classical retiming formulation that allows an optimal sharing of fanin registers of AND clusters, similar to traditional fanout register sharing. The combination of both techniques is capable of minimizing the circuit size beyond that possible with a standard Leiserson and Saxe retiming approach on a static netlist structure. Our work is primarily aimed at optimizing the performance of reachability-based verification methods. However, the presented techniques are equally applicable to sequential redundancy removal in technology-independent logic synthesis. A large set of experiments using benchmark and industrial circuits demonstrate the effectiveness of the described techniques.

4A.2 Verification of Integer Multipliers on the Arithmetic Bit Level [p. 183]
Dominik A. Stoffel, Wolfgang Kunz

One of the most severe shortcomings of currently available equivalence checkers is their inability to verify integer multipliers. In this paper, we present a bit level reverse-engineering technique that can be integrated into standard equivalence checking flows. We propose a Boolean mapping algorithm that extracts a network of half adders from the gate netlist of an addition circuit. Once the arithmetic bit level representation of the circuit is obtained, equivalence checking can be performed using simple arithmetic operations. Experimental results show the promise of our approach.

4A.3 Induction-Based Gate-Level Verification of Multipliers [p. 190]
Ying-Tsai Chang, Kwang-Ting(Tim) Cheng

We propose a method based on unrolling the inductive definition of binary number multiplication to verify gate-level implementations of multipliers. The induction steps successively reduce the size of the multiplier under verification. Through induction, the verification of an n-bit multiplier is decomposed into n equivalence checking problems. The resulting equivalence checking problems could be significantly sped up by simple structural analysis. This method could be generalized to the verification of more general arithmetic circuits and the equivalence checking of complex data-path.
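
The inductive definition being unrolled is the familiar shift-and-add recursion on the width of one operand; with the low bit peeled off as b = 2b' + b0:

```latex
% Shift-and-add recursion on binary multiplication: each unrolling step
% peels off one bit of b, reducing an n-bit multiplier to an (n-1)-bit one.
a \cdot b \;=\; 2\,(a \cdot b') + a \cdot b_0,
\qquad b = 2b' + b_0,\ \ b_0 \in \{0,1\}
```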


Session 4B: System Level Power and Performance Modeling

Moderators: Wayne Wolf, Mediaworks Technology, Schaumburg, IL
Preeti Panda, Synopsys, Inc., Mountain View, CA
4B.1 An Assembly-Level Execution-Time Model for Pipelined Architectures [p. 195]
Giovanni Beltrame, Carlo Brandolese, William Fornaciari, Fabio Salice, Donatella Sciuto, Vito Trianni

The aim of this work is to provide an elegant and accurate static execution-time model for 32-bit microprocessor instruction sets, also covering inter-instruction effects. Such effects depend on the processor state and the pipeline behavior, and are related to the dynamic execution of assembly code. The paper proposes a mathematical model of the delays deriving from instruction dependencies and gives a statistical characterization of such timing overheads. The model has been validated on a commercial architecture, the Intel486, by means of timing analysis of a set of benchmarks, obtaining an error within 5%. This model can be seamlessly integrated with a static energy consumption model in order to obtain precise software power and energy estimations.
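
Schematically, a model of this shape sums per-instruction base times and pairwise inter-instruction penalties; the following is an illustrative form, not the paper's exact equations:

```latex
% Illustrative static timing model for an instruction sequence i_1..i_N:
% t(i_k) is the isolated execution time of instruction i_k and
% \delta(i_k, i_{k+1}) the pipeline-dependent inter-instruction overhead.
T \;=\; \sum_{k=1}^{N} t(i_k) \;+\; \sum_{k=1}^{N-1} \delta(i_k, i_{k+1})
```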

4B.2 Improving Memory Energy Using Access Pattern Classification [p. 201]
Mahmut T. Kandemir, Ugur Sezer, Victor Delaluz

In this paper, we propose a data-driven strategy to optimize the memory energy consumption in a banked memory system. Our compiler-based strategy modifies the original execution order of loop iterations in array-dominated applications to increase the length of the time period(s) in which memory banks are idle (i.e., not accessed by any loop iteration). To achieve this, it first classifies loop iterations according to their bank access patterns and then, with the help of a polyhedral tool, tries to bring the iterations with similar bank access patterns close together. Increasing the idle periods of memory banks brings two major benefits: first, it allows us to place more memory banks into low-power operating modes, and second, it enables us to use a more aggressive (i.e., more energy-saving) operating mode for a given bank. Our strategy has been evaluated using seven array-dominated applications on both a cacheless system and a system with cache memory. Our results indicate that the strategy is very successful in reducing the memory system energy, and improves the memory energy by as much as 34% on average.

4B.3 System-Level Power/Performance Analysis of Portable Multimedia Systems Communicating over Wireless Channels [p. 207]
Radu Marculescu, Amit Nandi, Luciano Lavagno, Alberto Sangiovanni-Vincentelli

This paper presents a new methodology for system-level power and performance analysis of wireless multimedia systems. More precisely, we introduce an analytical approach based on concurrent processes modeled as Stochastic Automata Networks (SANs) that can be effectively used to integrate power and performance metrics in system-level design. We show that 1) under various input traces and wireless channel conditions, the average-case behavior of a multimedia system consisting of a video encoder/decoder pair is characterized by very different probability distributions and power consumption values and 2) in order to identify the best trade-off between power and performance figures, one must take into consideration the entire environment (i.e., encoder, decoder, and channel) for which the system is being designed. Compared to using simulation, our analytical technique reduces the time needed to find the steady-state behavior by orders of magnitude, with some limited loss in accuracy compared to the exact solution. We illustrate the potential of our methodology using the MPEG-2 video as the driver application.


Session 4C: Topics in Physical Synthesis

Moderators: Andrew B. Kahng, University of California at San Diego, La Jolla, CA
Malgorzata Marek-Sadowska, University of California, Santa Barbara, CA
4C.1 Congestion Aware Layout Driven Logic Synthesis [p. 216]
Thomas Kutzschebauch, Leon Stok

In this paper, we present novel algorithms that effectively combine physical layout and early logic synthesis to improve overall design quality. In addition, we employ partitioning and clustering algorithms to achieve faster turnaround times. With the increasing complexity of designs, the traditional separation of logic and physical design leads to sub-optimal results, as the cost functions employed during logic synthesis do not accurately represent physical design information. While this problem has been addressed extensively, the existing solutions apply only simple synthesis transforms during physical layout and are generally unable to reverse decisions made during logic minimization and technology mapping that have a major negative impact on circuit structure. In our novel approach, we propose congestion aware algorithms for layout driven decomposition and technology mapping, two of the steps that affect congestion the most during logic synthesis, to effectively decrease wire length and improve congestion. In addition, to improve design turnaround time and handle large designs, we present an approach in which synthesis partitioning and placement clustering co-exist, reflecting the different characteristics of the logical and physical domains.

4C.2 Addressing the Timing Closure Problem by Integrating Logic Optimization and Placement [p. 224]
Wilsin Gosti, Sunil Khatri, Alberto Sangiovanni-Vincentelli

Timing closure problems occur when timing estimates computed during logic synthesis do not match with timing estimates computed from the layout of the circuit. In such a situation, logic synthesis and layout synthesis are iterated until the estimates match. The number of such iterations is becoming larger as technology scales. Timing closure problems occur mainly due to the difficulty in accurately predicting interconnect delay during logic synthesis. In this paper, we present an algorithm that integrates logic synthesis and global placement to address the timing closure problem. We introduce technology independent algorithms as well as technology dependent algorithms. Our technology independent algorithms are based on the notion of "wire-planning". All these algorithms interleave their logic operations with local and incremental/full global placement, in order to maintain a consistent placement while the algorithm is run. We show that by integrating logic synthesis and placement, we avoid the need to predict interconnect delay during logic synthesis. We demonstrate that our scheme significantly enhances the predictability of wire delays, thereby solving the timing closure problem. This is the main result of our paper. Our results also show that our algorithms result in a significant reduction in total circuit delay. In addition, our technology independent algorithms result in a significant circuit area reduction.

4C.3 An Algorithm for Simultaneous Pin Assignment and Routing [p. 232]
Hua Xiang, Xiaoping Tang, D.F. Wong

Macro-block pin assignment and routing are important tasks in physical design planning. Existing algorithms for these problems can be classified into two categories: 1) a two-step approach where pin assignment is followed by routing, and 2) a net-by-net approach where pin assignment and routing for a single net are performed simultaneously. None of the existing algorithms is "exact" in the sense that the algorithm may fail to route all nets even though a feasible solution exists. This remains true even if only 2-pin nets between two blocks are concerned. In this paper, we present the first polynomial-time exact algorithm for simultaneous pin assignment and routing for 2-pin nets from one block (source block) to all other blocks. In addition to finding a feasible solution whenever one exists, it guarantees to find a pin-assignment/routing solution with minimum cost αW + βV, where W is the total wirelength and V is the total number of vias. Our algorithm has various applications: 1) It is suitable in ECO (Engineering Change Order) situations where a designer wants to incrementally modify the existing solutions instead of redoing everything after a design change. 2) Given any pin assignment and routing solution obtained by any existing method, our algorithm can be used to increase the number of routed nets and reduce the routing cost. Furthermore, it provides an efficient algorithm for the pin assignment and routing problem of all blocks. The method is applicable to both global and detailed routing with arbitrary routing obstacles on multiple layers. Experimental results demonstrate its efficiency and effectiveness.


Session 4D: Model Order Reduction

Moderators: Eli Chiprout, Intel Corp., Chandler, AZ
Andreas C. Cangellaris, University of Illinois, Urbana, IL
4D.1 Techniques for Including Dielectrics when Extracting Passive Low-Order Models of High Speed Interconnect [p. 240]
Luca Daniel, Alberto Sangiovanni-Vincentelli, Jacob K. White

Interconnect structures including dielectrics can be modeled by an integral equation method using volume currents and surface charges for the conductors, and volume polarization currents and surface charges for the dielectrics. In this paper we describe a mesh analysis approach for computing the discretized currents in both the conductors and the dielectrics. We then show that this fully mesh-based formulation can be cast into a form using provably positive semidefinite matrices, making for easy application of Krylov-subspace based model-reduction schemes to generate accurate guaranteed passive reduced-order models. Several printed circuit board examples are given to demonstrate the effectiveness of the strategy.

4D.2 A Convex Programming Approach to Positive Real Rational Approximation [p. 245]
Carlos P. Coelho, Joel R. Phillips, Luis M. Silveira

As system integration evolves and tighter design constraints must be met, it becomes necessary to account for the non-ideal behavior of all the elements in a system. Certain devices common in high-frequency integrated circuit applications, such as spiral inductors, SAW filters, etc., are often described and studied in the frequency domain. Models take the form of frequency domain data obtained through measurement or through physical simulation. Usually the available data is sampled, incomplete, noisy, and covers only a finite range of the spectrum. In this paper we present a methodology for generating guaranteed passive time-domain models of frequency-described subsystems. The methodology presented is based on convex-programming algorithms for fixed denominator system identification. The algorithm is guaranteed to produce a passive system model that is optimal in the sense of having minimum weighted square error in the frequency band of interest over all models with a prescribed set of system poles. An incremental-fitting reformulation of the problem is also introduced that trades optimality for efficiency while still guaranteeing passivity. Results of the application of the proposed methodologies to the modeling of a variety of subsystems are presented and discussed.

4D.3 A Trajectory Piecewise-Linear Approach to Model Order Reduction and Fast Simulation of Nonlinear Circuits and Micromachined Devices [p. 252]
Michal Rewienski, Jacob K. White

In this paper we present an approach to nonlinear model reduction based on representing the nonlinear system with a piecewise-linear system and then reducing each of the pieces with a Krylov projection. However, rather than approximating the individual components as piecewise-linear and then composing hundreds of components to make a system with exponentially many different linear regions, we instead generate a small set of linearizations about the state trajectory which is the response to a "training input". Computational results and performance data are presented for a nonlinear circuit and a micromachined fixed-fixed beam example. These examples demonstrate that the macromodels obtained with the proposed reduction algorithm are significantly more accurate than models obtained with linear or the recently developed quadratic reduction techniques. Finally, it is shown that the proposed technique is computationally inexpensive, and that the models can be constructed "on-the-fly", to accelerate simulation of the system response.
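
The reduced model takes the standard trajectory piecewise-linear form, a state-dependent weighted combination of linearizations taken at points along the training trajectory (shown here before the Krylov projection is applied):

```latex
% TPWL approximation of dx/dt = f(x) + B u: a convex combination of s
% linearizations, where A_i is the Jacobian of f at trajectory point x_i
% and the weights w_i(x) depend on distance to the linearization points;
% each linear piece is subsequently reduced with a Krylov projection.
\frac{dx}{dt} \;\approx\; \sum_{i=1}^{s} w_i(x)\,\bigl[f(x_i) + A_i\,(x - x_i)\bigr] \;+\; B\,u,
\qquad \sum_{i=1}^{s} w_i(x) = 1
```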


Session 5A: Embedded Tutorial: Embedded Software and Systems

Moderators: Rolf Ernst, Technical University of Braunschweig, Braunschweig, Germany
Francky Catthoor, IMEC, Leuven, Belgium
5A.1 Low Power System Scheduling and Synthesis [p. 259]
Niraj K. Jha

Many scheduling techniques have been presented recently which exploit dynamic voltage scaling (DVS) and dynamic power management (DPM) for both uniprocessors and distributed systems, as well as both real-time and non-real-time systems. While such techniques are power-aware and aim at extending battery lifetimes for portable systems, they need to be augmented to make them battery-aware as well. We will survey such power-aware and battery-aware scheduling algorithms. Also, system synthesis algorithms for real-time systems-on-a-chip (SOCs), distributed and wireless client-server embedded systems, etc., have begun optimizing power consumption in addition to system price. We will survey such algorithms as well, and point out some open problems.

5A.3 Integral Design Representations for Embedded Systems [p. 264]
Lothar Thiele

Modern embedded computing systems tend to be heterogeneous assemblages of concurrent subsystems, typically described in different languages and semantics which are well established in the various application fields. For example, the specification of the functional and timing behavior necessitates a mixture of different basic models of computation and communication which come from transformative or reactive domains. There is little hope that a single language will replace this heterogeneous set of languages. In fact, experiments with system specification languages show that there is not a unique universal specification language to support the whole life cycle. A similar problem occurs when reused components shall be integrated, possibly described in another language and incompletely documented. Examples would be components of other companies or "legacy code". The lack of coherency of the different languages, methods and tools is a substantial obstacle on the way to higher design productivity and design quality. A design process must be able to bridge the semantic differences for verification and synthesis and should account for limited knowledge of system properties. The embedded tutorial will describe approaches which allow for the common representation of different languages and incomplete specifications. In particular, recent results on the use of internal design representations targeted to scheduling and design space exploration will be reviewed. Here, the information useful for methods like allocation of resources, partitioning the design and scheduling must be estimated or extracted from the various input languages and mapped onto internal representations which describe properties of the subsystems and their coordination. Design methods work on these internal representations and eventually refine them by adding components and reducing non-determinism.

5A.4 Optimisation Problems for Dynamic Concurrent Task-Based Systems [p. 265]
D. Verkest, P. Yang, C. Wong, P. Marchal

One of the most critical bottlenecks in many novel multimedia applications is their very dynamic concurrent behaviour. This is especially true because of the quality-of-service (QoS) aspects of these applications. Prominent examples of this can be found in the recent MPEG-4 and JPEG2000 standards and especially the new MPEG-21 standard. In order to deal with these dynamic issues, where tasks and complex data types are created and deleted at run-time based on non-deterministic events, a novel system design paradigm is required. Because of the high computation and communication requirements of these kinds of applications, a multi-processor SoC (system-on-chip) is more and more accepted as a solution (e.g., TriMedia), an approach also made feasible by advances in processing technology. It differs from traditional general-purpose computers because it is application specific and it is an embedded system, which means costs like energy consumption are of major concern. Task scheduling on such multiprocessor embedded multimedia systems forms a real challenge. This part of the tutorial will focus on the new requirements in system-level synthesis. In particular, a "task concurrency management" problem formulation will be proposed, with special focus on the formal definition of the most crucial optimization problems. The concept of Pareto-curve-based exploration is crucial in these formulations.


Session 5B: Embedded Tutorial: CAD Solutions and Outstanding Challenges for Mixed-Signal and RF IC Design [p. 270]

Moderator: Georges Gielen, Katholieke Universiteit Leuven, Leuven, Belgium
Domine Leenaerts, Rob A. Rutenbar, Georges Gielen
This tutorial paper addresses the problems and solutions that are posed by the design of mixed-signal integrated systems on chip (SoC). These include problems in mixed-signal design methodologies and flows, problems in analog design productivity, as well as open problems in analog, mixed-signal and RF design, modeling and verification tools. The tutorial explains the problems that are posed by these mixed-signal/RF SoC designs, describes the solutions and their underlying methods that exist today and outlines the challenges that still remain to be solved at present. In the first part the design of analog and mixed-signal circuits is addressed, while the second part focuses on the specific problems raised by RF wireless circuits.


Session 6A: BDDs and SAT

Moderators: Alan J. Hu, University of British Columbia, Vancouver, BC, Canada
Rajeev Ranjan, Real Intent, Santa Clara, CA
6A.1 Efficient Conflict Driven Learning in a Boolean Satisfiability Solver [p. 279]
Lintao Zhang, Conor Madigan, Matthew Moskewicz, Sharad Malik

One of the most important features of current state-of-the-art SAT solvers is the use of conflict based backtracking and learning techniques. In this paper, we generalize various conflict driven learning strategies in terms of different partitioning schemes of the implication graph. We re-examine the learning techniques used in various SAT solvers and propose an array of new learning schemes. Extensive experiments with real world examples show that the best performing new learning scheme has at least a 2X speedup compared with learning schemes employed in state-of-the-art SAT solvers.

6A.2 Partition-Based Decision Heuristics for Image Computation Using SAT and BDDs [p. 286]
Aarti Gupta, Zijiang Yang, Pranav N. Ashar, Lintao Zhang, Sharad Malik

Methods based on Boolean satisfiability (SAT) typically use a Conjunctive Normal Form (CNF) representation of the Boolean formula, and exploit the structure of the given problem through use of various decision heuristics and implication methods. In this paper, we propose a new decision heuristic based on separator-set induced partitioning of the underlying CNF graph. It targets those variables whose choice generates clause partitions with disjoint variable supports. This can potentially improve performance of SAT applications by decomposing the problem dynamically within the search. In the context of a recently proposed image computation method combining SAT and BDDs, this results in simpler BDD subproblems. We provide algorithms for CNF partitioning, one based on a clause-variable dependency matrix and another based on standard hypergraph partitioning techniques, and also for the use of partitioning information in decision heuristics for SAT. We demonstrate the effectiveness of our proposed partition-based heuristic with practical results for reachability analysis of benchmark sequential circuits.

6A.3 Non-linear Quantification Scheduling in Image Computation [p. 293]
Pankaj Chauhan, Edmund M. Clarke, Somesh Jha, Jim Kukula, Tom Shiple, Helmut Veith, Dong Wang

Computing the set of states reachable in one step from a given set of states, i.e. image computation, is a crucial step in several symbolic verification algorithms, including model checking and reachability analysis. So far, the best methods for quantification scheduling in image computation, with a conjunctively partitioned transition relation, have been restricted to a linear schedule. This results in a loss of flexibility during image computation. We view image computation as a problem of constructing an optimal parse tree for the image set. The optimality of a parse tree is defined by the largest BDD that is encountered during the computation of the tree. We present dynamic and static versions of a new algorithm, VarScore, which exploits the flexibility offered by the parse tree approach to the image computation. We show by extensive experimentation that our techniques outperform the best known techniques so far.
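
The flexibility being scheduled comes from the early-quantification identity: an existential quantifier can be pushed past any conjunct that does not depend on the quantified variable,

```latex
% Early quantification: if g does not depend on x, the quantifier
% distributes past it, letting image computation quantify each variable
% as soon as its last-referencing conjunct has been multiplied in.
\exists x.\,\bigl(f(x,y)\wedge g(y)\bigr) \;=\; \bigl(\exists x.\,f(x,y)\bigr)\wedge g(y)
```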


Session 6B: Convergence of Abstractions in High-Level Synthesis

Moderators: Sandeep Shukla, University of California, Irvine, CA
Francky Catthoor, IMEC, Leuven, Belgium
6B.1 Symbolic Algebra and Timing Driven Data-flow Synthesis [p. 300]
Armita Peymandoust, Giovanni De Micheli

The growing market of multi-media applications has required the development of complex ASICs with significant data-path portions. Unfortunately, most high-level synthesis tools and methods cannot automatically synthesize data paths such that complex arithmetic library blocks are intelligently used. Symbolic computer algebra has been previously used to automate mapping data flow into a minimal set of complex arithmetic components. In this paper, we present extensions to the previous methods in order to find the minimal critical path delay (CPD) mapping. A new algorithm is proposed that incorporates symbolic manipulations such as tree-height-reduction, factorization, expansion, and Horner transformation. Such manipulations are used as guidelines in initial library element selection. Furthermore, we demonstrate how substitution can be used for multi-expression component sharing and critical path delay optimization.
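
Of the symbolic manipulations listed, the Horner transformation is the easiest to illustrate: it rewrites a degree-n polynomial so it evaluates with n multiply-add steps, which maps naturally onto MAC-style library blocks. A minimal numeric sketch (our hypothetical example, not taken from the paper):

```python
# a3*x**3 + a2*x**2 + a1*x + a0  ->  ((a3*x + a2)*x + a1)*x + a0
def horner(coeffs, x):
    """Evaluate a polynomial given coefficients from highest to lowest
    degree, using one multiply-add per coefficient."""
    acc = 0
    for c in coeffs:
        acc = acc * x + c
    return acc

assert horner([2, -1, 0, 5], 3) == 2*27 - 9 + 5   # 2x^3 - x^2 + 5 at x = 3
```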

6B.2 Application-Driven Processor Design Exploration for Power-Performance Trade-off Analysis [p. 306]
Diana Marculescu, Anoop Iyer

This paper presents an efficient design exploration environment for high-end core processors. The heart of the proposed design exploration framework is a two-level simulation engine that combines detailed simulation for critical portions of the code with fast profiling for the rest. Our two-level simulation methodology relies on the inherent clustered structure of application programs and is completely general and applicable to any microarchitectural power/performance simulation engine. The proposed simulation methodology is 3-17X faster, while being sufficiently accurate (within 5%), when compared to the fully detailed simulator. The design exploration environment is able to vary different microarchitectural configurations and find the optimal one with respect to the energy-delay product in a matter of minutes. The parameters found to most drastically affect the core processor power/performance metrics are issue width, instruction window size, and pipeline depth, along with the correlated clock frequency. For very high-end configurations in which balanced pipelining may not be possible, opportunities exist for running faster stages at lower voltage. In such cases, using up to 3 voltage levels reduces the energy-delay product by 23-30% when compared to a single-voltage implementation.

6B.3 A System for Synthesizing Optimized FPGA Hardware from MATLAB [p. 314]
Malay Haldar, Anshuman Nayak, Alok Choudhary, Prith Banerjee

Efficient high-level design tools that can map behavioral descriptions to FPGA architectures are a key requirement for fully leveraging FPGAs for high-throughput computations and meeting time-to-market pressures. We present a compiler that takes as input algorithms described in MATLAB and generates RTL VHDL. The RTL VHDL can then be mapped to FPGAs using existing commercial tools. The input application is mapped to multiple FPGAs by parallelizing the application and embedding communication and synchronization primitives automatically. Our compiler infers the minimum number of bits required to represent each variable through a precision analysis framework. The compiler can leverage optimized IP cores to enhance the generated hardware. The compiler also exploits parallelism in the input algorithm by pipelining in the presence of resource constraints. We demonstrate the utility of the compiler by synthesizing hardware for a couple of signal/image processing algorithms and comparing them with manually designed hardware.
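
As a flavor of what such precision analysis does (our toy interval rules, not the compiler's framework), value ranges can be propagated bottom-up through the dataflow graph and each result's bit-width taken from its worst-case range:

```python
def bits_unsigned(lo, hi):
    """Bits needed to represent an unsigned integer range [lo, hi]."""
    assert 0 <= lo <= hi
    return max(1, hi.bit_length())

def add_range(a, b):       # interval rules propagate ranges through the DFG
    return (a[0] + b[0], a[1] + b[1])

def mul_range(a, b):
    prods = [x * y for x in a for y in b]
    return (min(prods), max(prods))

a, b = (0, 255), (0, 15)                  # e.g. 8-bit pixel times 4-bit weight
print(bits_unsigned(*mul_range(a, b)))    # 12 bits suffice for the product
```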

6B.4 Behavior-to-Placed RTL Synthesis with Performance-Driven Placement [p. 320]
Daehong Kim, Jinyong Jung, Sunghyun Lee, Jinhwan Jeon, Kiyoung Choi

Interconnect delay must be considered together with computation delay during architectural synthesis in order to achieve timing closure in deep-submicrometer technology. In this paper, we propose an architectural synthesis technique for distributed-register architecture, which separates interconnect delay for data transfer from component delay for computation. The technique incorporates performance-driven placement into the architectural synthesis to minimize the performance overhead due to interconnect delay. Experimental results show that our methodology achieves a performance improvement of up to 60%, and of 22% on average.


Session 6C: Signal Integrity and Clock Design

Moderators: Cheng-Kok Koh, Purdue University, West Lafayette, IN
Tong Gao, Monterey Design Systems, Inc., Sunnyvale, CA
6C.1 Formulae and Applications of Interconnect Estimation Considering Shield Insertion and Net Ordering [p. 327]
James D.Z. Ma, Lei He

It has been shown recently that simultaneous shield insertion and net ordering (called SINO/R as only random shields are used) provides an area-efficient solution to reduce the RLC noise. In this paper, we first develop simple formulae with errors less than 10% to estimate the number of shields in the min-area SINO/R solution. In order to accommodate pre-routed P/G wires that also serve as shields, we then formulate two new SINO problems: SINO/SPR and SINO/UPG, and propose effective and efficient two-phase algorithms to solve them. Compared to the existing dense wiring fabric scheme, the resulting SINO/SPR and SINO/UPG schemes maintain the regularity of the P/G structure, have negligible penalty on noise and delay variation, and reduce the total routing area by up to 42% and 36%, respectively. Further, we develop various pre-layout estimation formulae for shielding areas and optimal P/G structures under different routing styles. These formulae can be readily used to guide global routing and high-level design decisions.

6C.2 Hybrid Structured Clock Network Construction [p. 333]
Haihua Su, Sachin Sapatnekar

This paper hierarchically constructs a hybrid mesh/tree clock network structure consisting of overlying zero-skew clock meshes, with underlying zero-skew clock trees originating from the mesh nodes. We propose a mesh construction procedure, which guarantees zero skew under the Elmore delay model, using a simple and efficient linear programming formulation. Buffers are inserted to reduce the transition time (or rise time). As a post-processing step, wire width optimization under an accurate higher-order delay metric is performed to further minimize the transition time and propagation delay/skew. Experimental results show that the hybrid mesh/tree construction scheme can provide smaller propagation delay and transition time than a comparable clock tree.
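
For reference, the Elmore delay targeted by the construction is linear in the node capacitances once topology and resistances are fixed, which is what makes a linear-programming formulation natural. The recap below is our gloss, not the paper's derivation.

```latex
% Elmore delay from the clock source to sink i: R_{ik} is the resistance
% of the source-to-i path that is shared with the source-to-k path, and
% C_k is the capacitance at node k. Zero skew constrains all sink pairs.
t_i = \sum_{k} R_{ik} C_k, \qquad t_i = t_j \quad \text{for all sinks } i, j
```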

6C.3 CASh: A Novel "Clock as Shield" Design Methodology for Noise Immune Precharge-Evaluate Logic [p. 337]
Yonghee Im, Kaushik Roy

In gigascale integrated circuits (GSI), interconnects are expected to play a more dominant role in circuit performance than transistor cells. Circuit performance is affected by signal integrity as cross-talk becomes more significant with the scaling of feature sizes. Many attempts have been made to improve noise immunity, but all sacrifice speed as a trade-off, especially in dynamic circuits. Avoiding noise problems while maintaining the desired speed would require increased wire spacing or extensive shielding, both of which are unfavorable given the demands for high density and the relatively higher cost of wires in current process technologies. We propose a novel methodology in which clock lines are used as shielding wires to reduce cross-talk effects in domino circuits, thereby minimizing the possibility of functional failures. In addition, this method provides another benefit: a smaller buffer suffices to drive a long interconnect at iso-noise immunity. Since clock lines, which are always required in domino circuits, are used to shield signal lines, the speed penalty and area overhead that are drawbacks of previous work can be avoided. This design methodology, CASh (Clock As Shielding), demonstrates superiority over conventional methods. HSPICE simulations of a 2-input domino AND gate and of 4- and 8-bit full adders designed in CASh show higher noise immunity than conventional designs.


Session 6D: Analog Synthesis

Moderators: Koen Lampaert, Mindspeed Technology, Newport Beach, CA
Henry Chang, Cadence Design Systems, Inc., San Jose, CA
6D.1 The Sizing Rules Method for Analog Integrated Circuit Design [p. 343]
Helmut E. Graeb, Stephan Zizala, Josef Eckmueller, Kurt Antreich

This paper presents the sizing rules method for analog CMOS circuit design that consists of: first, the development of a hierarchical library of transistor pair groups as basic building blocks for analog CMOS circuits; second, the derivation of a hierarchical generic list of constraints that must be satisfied to guarantee the function of each block and its reliability with respect to physical effects; and third, the development of an automatic recognition of building blocks in a circuit schematic. The sizing rules method efficiently captures design knowledge on the technology-specific level of transistor pair groups. This reduces the preparatory modeling effort for analog circuit synthesis. Results of industrial applications to circuit sizing, design centering, response surface modeling and analog placement show the significance of the sizing rules method. Sizing rules especially make sure that automatic circuit sizing and design centering lead to technically meaningful and robust results.

6D.2 ASF: A Practical Simulation-Based Methodology for the Synthesis of Custom Analog Circuits [p. 350]
Michael J. Krasnicki, Rodney Phelps, James R. Hellums, Mark McClung, Rob A. Rutenbar, L. Richard Carley

This paper describes ASF, a novel cell-level analog synthesis framework that can size and bias a given circuit topology subject to a set of performance objectives and a manufacturing process. To manage complexity and time-to-market, SoC designs require a high level of automation and reuse. Digital methodologies are inapplicable to analog IP, which relies on tight control of low-level device and circuit properties that vary widely across manufacturing processes. This analog synthesis solution automates these tedious, technology specific aspects of analog design. Unlike previously proposed approaches, ASF extends the prevalent "schematic and SPICE" methodology used to design analog and mixed-signal circuits. ASF is topology and technology independent and can be easily integrated into a commercial schematic capture design environment. Furthermore, ASF employs a novel numerical optimization formulation that incorporates classical downhill techniques into stochastic search. ASF consistently produces results comparable to expert manual design with 10x fewer candidate solution evaluations than previously published approaches that rely on traditional stochastic optimization methods.

6D.3 A Layout-Aware Synthesis Methodology for RF Circuits [p. 358]
Peter J. Vancorenland, Geert Van der Plas, Michiel Steyaert, Georges Gielen, Willy Sansen

In this paper a layout-aware RF synthesis methodology is presented. The methodology combines the power of a differential evolution algorithm with cost function response modeling and integrated layout generation to synthesize RF circuits efficiently, taking into account all layout parasitics during the circuit optimization. The proposed approach has been successfully applied to the design of a high-performance downconverter mixer circuit, proving the effectiveness of the implemented design methodology.


Session 7A: Manufacturing Test: Stuck-at to Crosstalk

Moderators: Sujit Dey, University of California at San Diego, La Jolla, CA
Yervant Zorian, LogicVision, Inc., San Jose, CA
7A.1 On Identifying Don't Care Inputs of Test Patterns for Combinational Circuits [p. 364]
Seiji Kajihara, Kohei Miyase

Given a test set for stuck-at faults, some of the primary input values may be changed to the opposite logic values without losing fault coverage. We can regard such input values as don't cares (X). In this paper, we propose a method for identifying X inputs of test vectors in a given test set. While there are generally many combinations of X inputs in a test set, the proposed method finds one that includes as many X inputs as possible, using fault simulation and procedures similar to the implication and justification steps of ATPG algorithms. Experimental results for ISCAS benchmark circuits show that approximately 66% of the inputs of uncompacted test sets could be X on average. Even for compacted test sets, the method found that approximately 47% of inputs are X. Finally, we discuss how logic values can be reassigned to the identified X inputs in several applications that make test vectors more desirable.

7A.2 REDI: An Efficient Fault Oriented Procedure to Identify Redundant Faults in Combinational Logic Circuits [p. 370]
Chen Wang, Irith Pomeranz, Sudhakar M. Reddy

In this work, a new and effective procedure, called REDI, to efficiently identify redundant single stuck-at faults in combinational logic circuits is proposed. The method is fault oriented and uses sensitizability of partial paths to determine redundant faults. It uses only implications and hence may not determine all the redundant faults of a circuit. However, experimental results presented on benchmark circuits show that the procedure identifies nearly all the redundant faults in most of the benchmark circuits. The key features of REDI that make it efficient are: partial path sensitization, blockage learning, dynamic branch ordering and fault grouping. Experimental results on benchmark circuits demonstrate the efficiency of the proposed procedure in identifying redundant faults in combinational logic circuits.

7A.3 Crosstalk Fault Detection by Dynamic Idd [p. 375]
Xiaoyun Sun, Seonki Kim, Bapiraju Vinnakota

Undesired capacitive crosstalk between signals is expected to be a significant concern in deep submicron circuits. New test techniques are needed for these crosstalk faults since they may cause unacceptable performance degradation. We analyze the impact of crosstalk faults on a circuit's power dissipation. Crosstalk faults can be detected by monitoring the dynamic supply current. The test method is based on a recently developed dynamic Idd test metric, the energy consumption ratio (ECR). ECR-based test has been shown to be effective at tolerating the impact of process variations. In this paper, we apply an ECR-based test method called ECR-VDD test to detect crosstalk faults. The effectiveness of the method is demonstrated by simulation results.


Session 7B: Architecture Oriented Scheduling

Moderators: Miodrag Potkonjak, University of California, Los Angeles, CA
Kazutoshi Wakabayashi, NEC Corp., Kawasaki, Japan
7B.1 Color Permutation: An Iterative Algorithm for Memory Packing [p. 380]
Jianwen Zhu

It is predicted that 70% of the silicon real estate will be occupied by memories in future systems-on-chip. The minimization of on-chip memory hence becomes increasingly important for cost, performance and energy consumption. In this paper, we present a reasonably fast algorithm based on iterative improvement, which packs a large number of memory blocks into a minimum-size address space. The efficiency of the algorithm is achieved by two new techniques. First, in order to evaluate each solution in linear time, we propose a new algorithm based on the acyclic orientation of the memory conflict graph. Second, we propose a novel representation of the solution which effectively compresses the potentially infinite solution space to a finite value of n!, where n is the number of vertices in the memory conflict graph. Furthermore, if a near-optimal solution is satisfactory, this value can be dramatically reduced to c!, where c is the chromatic number of the memory conflict graph. Experiments show that a consistent improvement of 30% over the scalar method can be achieved.
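
The linear-time evaluation is the easiest piece to sketch: given a permutation of the blocks, orient every conflict edge from the earlier block to the later one, then place each block at the longest-path offset over its oriented in-edges. This is our reconstruction of the acyclic-orientation idea, not the paper's code.

```python
def pack(sizes, conflicts, order):
    """Evaluate one permutation in O(|V| + |E|). sizes: block -> bytes;
    conflicts: block -> set of conflicting blocks (must not overlap);
    order: the permutation being evaluated. Returns (span, bases)."""
    pos = {v: i for i, v in enumerate(order)}
    base = {}
    for v in order:                      # topological w.r.t. the orientation
        preds = [u for u in conflicts[v] if pos[u] < pos[v]]
        base[v] = max((base[u] + sizes[u] for u in preds), default=0)
    return max(base[v] + sizes[v] for v in order), base
```

The iterative improvement then amounts to searching over permutations (the "color permutations" of the title) using this fast evaluator.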

7B.2 Constraint Satisfaction for Relative Location Assignment and Scheduling [p. 384]
Carlos Alba-Pinto, Bart Mesman, Jochen Jess

Communication and multimedia applications impose tight data and timing constraints, and the architecture of the embedded processor implies resource constraints. Instead of random-access registers, relative location storages or rotating register files are used to exploit the available parallelism of resources by reducing the initiation interval in pipelined schedules. The compiler or synthesis tool must therefore deal with the difficult tasks of scheduling operations and assigning locations to values while respecting all the constraints, including the storage file capacity. This paper presents a method that handles constraints of relative location storages during scheduling, together with timing and resource constraints. The colorings of conflict graphs, representing the relative overlap of value instances, are analyzed in order to identify the bottlenecks for location assignment, with the aim of serializing the lifetimes of pairs of loop instances of values until it can be guaranteed that all constraints will be satisfied. Experiments show that high-quality schedules for kernels and inner loops can be obtained efficiently.

7B.3 A Super-Scheduler for Embedded Reconfigurable Systems [p. 391]
S. Ogrenci Memik, E. Bozorgzadeh, R. Kastner, M. Sarrafzadeh

Emerging reconfigurable systems attain high performance with embedded optimized cores. For mapping designs onto such special architectures, synthesis tools that are aware of the special capabilities of the underlying architecture are necessary. In this paper we propose an algorithm that performs simultaneous scheduling and binding, targeting embedded reconfigurable systems. Our algorithm differs from traditional scheduling methods in its capability of efficiently utilizing embedded blocks within the reconfigurable system. Our algorithm can be used to implement several other scheduling techniques, such as ASAP, ALAP, and list scheduling; hence we refer to it as a super-scheduler. Our algorithm is a path-based scheduling algorithm: at each step, an individual path from the input DFG is scheduled. Our experiments with several DFGs extracted from the MediaBench suite indicate promising results. Our scheduler can trade off between maximally utilizing the high-performance embedded blocks and exploiting parallelism in the schedule.


Session 7C: New Techniques in Routing

Moderators: Jo Dale Carothers, University of Arizona, Tucson, AZ
Amir H. Farrahi, IBM Corp. TJ Watson Research Center, Yorktown Heights, NY
7C.1 Multilevel Approach to Full-Chip Gridless Routing [p. 396]
Jason Cong, Jie Fang, Yan Zhang

This paper presents a novel gridless detailed routing approach based on multilevel optimization. The multilevel framework with recursive coarsening and refinement in a "V-shaped" flow allows efficient scaling of our gridless detailed router to very large designs. The downward pass of recursive coarsening builds the representations of routing regions at different levels, while the upward pass of iterative refinement allows a gradual convergence to a globally optimized solution. The use of a multicommodity flow-based routing algorithm for the initial routing at the coarsest level and a modified maze algorithm for the refinement at each level considerably improves the quality of gridless routing results. Compared with the recently published gridless detailed routing algorithm using wire planning [1], our multilevel gridless routing algorithm is 3x to 75x faster. We also compared our multilevel framework with a recently developed three-level routing approach [1] and a traditional hierarchical routing approach. Our multilevel algorithm generates better detailed routing results with higher completion rates. To our knowledge, this is the first time that multilevel optimization has been applied to IC routing.

7C.2 A Force-Directed Maze Router [p. 404]
Fan Mo, Abdallah Tabbara, Robert K. Brayton

A new routing algorithm is presented. It is based on a multiple star net model, force-directed placement and maze searching techniques. The algorithm inherits the power of maze routing in that it is able to route complex layouts with various obstructions. The large memory requirement of the conventional maze algorithm is alleviated through successive net refinement, which constrains the maze searching to small regions. The algorithm shows advantages in routing designs with complicated layout obstructions.
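
For readers unfamiliar with the underlying kernel, a bare-bones Lee-style maze search is sketched below; the paper's contribution lies in the multiple star net model and the successive refinement that confines such searches to small regions, which this sketch does not attempt.

```python
from collections import deque

def maze_route(grid, src, dst):
    """BFS maze search on a 0/1 obstruction grid (0 = free); returns a
    shortest rectilinear path from src to dst, or None."""
    rows, cols = len(grid), len(grid[0])
    prev = {src: None}
    queue = deque([src])
    while queue:
        cell = queue.popleft()
        if cell == dst:                      # retrace the wavefront
            path = []
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] == 0 and (nr, nc) not in prev:
                prev[(nr, nc)] = cell
                queue.append((nr, nc))
    return None
```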

7C.3 Minimum-Buffered Routing of Non-Critical Nets for Slew Rate and Reliability Control [p. 408]
Charles Alpert, Andrew B. Kahng, Bao Liu, Ion Mandoiu, Alexander Zelikovsky

In high-speed digital VLSI design, bounding the load capacitance at gate outputs is a well-known methodology to improve coupling noise immunity, reduce degradation of signal transition edges, and reduce delay uncertainty due to coupling noise. Bounding load capacitance also improves reliability with respect to hot-carrier oxide breakdown and AC self-heating in interconnects, and guarantees bounded input rise/fall times at buffers and sinks. This paper introduces a new minimum-buffer routing problem (MBRP) formulation which requires that the capacitive load of each buffer, and of the source driver, be upper-bounded by a given constant. Our contributions include the following.
- We give linear-time algorithms for optimal buffering of a given routing tree with a single (inverting or non-inverting) buffer type.
- For simultaneous routing and buffering with a single non-inverting buffer type, we give a factor 2(1+ε) approximation algorithm and prove that no algorithm can guarantee a factor smaller than 2 unless P=NP. For the case of a single inverting buffer type, we give a factor 4(1+ε) approximation algorithm.
- We give local-improvement and clustering based MBRP heuristics with improved practical performance, and present a comprehensive experimental study comparing the runtime/quality tradeoffs of the proposed MBRP heuristics on test cases extracted from recent industrial designs.


Session 7D: Issues in Substrate Coupling

Moderators: Mattan Kamon, Coventor, Inc., Cambridge, MA
Kenneth L. Shepard, Columbia University, New York, NY
7D.1 Highly Accurate Fast Methods for Extraction and Sparsification of Substrate Coupling Based on Low-Rank Approximation [p. 417]
Joe Kanapka, Jacob White

More aggressive design practices have created renewed interest in techniques for analyzing substrate coupling problems. Most previous work has focused primarily on faster techniques for extracting coupling resistances, but has offered little help for reducing the resulting resistance matrix, whose number of nonzero entries grows quadratically with the number of contacts. Wavelet-like methods have been applied to sparsifying the resistance matrix representing the substrate coupling, but the accuracy of the method is very sensitive to the particulars of the contact layout. In this paper we show that for the substrate problem it is possible to improve considerably on the wavelet-like methods by making use of the algorithmic structure common to the fast multipole and wavelet-like algorithms, but making judicious use of low-rank approximations. The approach, motivated by the hierarchical SVD algorithm, can achieve more than an order of magnitude better accuracy for commensurate sparsity, or can achieve much better sparsity at commensurate accuracy, when compared to the wavelet-like algorithm.
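
The core numerical observation is easy to demonstrate: the coupling block between two well-separated contact groups is numerically low-rank, so a truncated SVD stores it in O(r(m+n)) numbers instead of mn. The toy below uses a 1/r kernel as a stand-in for the actual substrate Green's function; everything here is illustrative, not the paper's hierarchical algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 1.0, (40, 2))          # contacts in one region
ys = rng.uniform(5.0, 6.0, (50, 2))          # contacts in a distant region
block = 1.0 / np.linalg.norm(xs[:, None, :] - ys[None, :, :], axis=2)

u, s, vt = np.linalg.svd(block, full_matrices=False)
r = int(np.searchsorted(-s, -s[0] * 1e-6))   # keep singular values above tol
approx = (u[:, :r] * s[:r]) @ vt[:r]         # rank-r factored representation
print(r, np.linalg.norm(block - approx) / np.linalg.norm(block))
```

Run on this example, only a handful of singular values survive the tolerance, which is exactly the sparsification opportunity the paper exploits hierarchically.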

7D.2 Fast 3-D Inductance Extraction in Lossy Multi-Layer Substrate [p. 424]
Minqing Liu, Tiejun Yu, Wayne W.-M. Dai

A mixed potential integral equation (MPIE) technique, combined with fast multi-layer Green's functions and high-order Gauss-Jacobi techniques, is used to compute 3-D frequency-dependent inductances and resistances in lossy multi-layered substrates. Compared to FastHenry, a multipole-accelerated 3-D inductance extraction program, the algorithm presented here is more accurate and faster for lossy multilayer structures for two reasons. First, the loss and coupling effects of the substrate and optional ground planes are efficiently modeled by multilayer Green's functions, which are themselves efficiently calculated via a window-based acceleration technique. Second, Gauss-Jacobi-based high-order techniques are used to capture the singularity of the current distribution at metal edges, leading to a significant reduction of problem size and a speed-up of the computation.

7D.3 Simulation Approaches for Strongly Coupled Interconnect Systems [p. 430]
Joel R. Phillips, L. Miguel Silveira

Shrinking feature sizes and increasing speeds of operation make interconnect-related effects very relevant for current circuit verification methodologies. Reliable and accurate system verification requires the full analysis of circuits together with the environment that surrounds them, including the common substrate, the packaging structures, and perhaps even board information. In this paper we discuss circuit-level simulation algorithms that enable the analysis of the impact of strongly coupled interconnect structures on nonlinear circuit operation, so as to allow reliable and accurate system verification.


Session 8A: Combinational Optimization

Moderators: Olivier Coudert, Monterey Design Systems, Inc., Sunnyvale, CA
Tiziano Villa, Parades Labs., Rome, Italy
8A.1 BOOM: A Heuristic Boolean Minimizer [p. 439]
Jan Hlavicka, Petr Fiser

We present a two-level Boolean minimization tool (BOOM) based on a new implicant generation paradigm. In contrast to all previous minimization methods, where the implicants are generated bottom-up, the proposed method uses a top-down approach. Instead of increasing the dimensionality of implicants by omitting literals from their terms, the dimension of a term is gradually decreased by adding new literals. Unlike most other minimization tools, such as ESPRESSO, BOOM does not use the definition of the function to be minimized as a basis for the solution; the original coverage thus influences the solution only indirectly, through the number of literals used. Most minimization methods use the two basic phases introduced by Quine-McCluskey, known as prime implicant (PI) generation and covering problem solution. Some more modern methods, like ESPRESSO, combine these two phases to reduce the number of PIs to be processed. This approach is also used in BOOM: the search for new literals to be included in a term aims at maximum coverage of the output function. The function to be minimized is defined by its on-set and off-set, listed in a truth table; the don't care set, often the dominant part of the truth table, therefore need not be specified explicitly. The proposed minimization method is especially efficient for functions with a large number of input variables but only a few defined care terms. The minimization procedure is very fast, so if the first solution does not meet the requirements, it can be improved iteratively. The method has been tested on several different kinds of problems, such as the MCNC standard benchmarks and larger, randomly generated problems.
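
A stripped-down rendition of the top-down step: starting from the empty term (the whole Boolean space), literals of a chosen on-set minterm are added until the term no longer intersects the off-set. The greedy selection rule shown is our own illustration of "aiming at maximum coverage"; BOOM's actual rule differs in detail.

```python
def grow_implicant(seed, on_set, off_set):
    """Top-down implicant generation sketch. Terms and minterms are
    dicts var -> 0/1; seed is an on-set minterm. Assumes the on-set and
    off-set are disjoint, so the loop always terminates."""
    term = {}

    def covers(t, m):
        return all(m.get(v) == b for v, b in t.items())

    while any(covers(term, m) for m in off_set):
        candidates = [(v, b) for v, b in seed.items() if v not in term]
        # add the literal that keeps the most on-set minterms covered
        v, b = max(candidates,
                   key=lambda lit: sum(covers({**term, lit[0]: lit[1]}, m)
                                       for m in on_set))
        term[v] = b
    return term
```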

8A.2 Faster SAT and Smaller BDDs via Common Function Structure [p. 443]
Fadi A. Aloul, Igor L. Markov, Karem A. Sakallah

The increasing popularity of SAT and BDD techniques in verification and synthesis encourages the search for additional speed-ups. Since typical SAT and BDD algorithms are exponential in the worst case, the structure of real-world instances is a natural source of improvements. While SAT and BDD techniques are often presented as mutually exclusive alternatives, our work points out that both can be improved via the same structural properties of instances. Our proposed methods are based on efficient problem partitioning and can be easily applied as pre-processing with arbitrary SAT solvers and BDD packages, without source code modifications. Finding a better variable ordering is a well recognized problem for both SAT solvers and BDD packages. Currently, all leading-edge variable-ordering algorithms are dynamic, in the sense that they are invoked many times in the course of the "host" algorithm that solves SAT or manipulates BDDs. Examples include the DLCS ordering for SAT solvers and variable sifting during BDD manipulations. In this work we propose a universal variable ordering, MINCE (MIN Cut Etc.), that preprocesses a given Boolean formula in CNF. MINCE is completely independent of target algorithms and outperforms both DLCS for SAT and variable sifting for BDDs. We argue that MINCE tends to capture structural properties of Boolean functions arising from real-world applications. Our contribution is validated on the ISCAS circuits and the DIMACS benchmarks. Empirically, our technique often outperforms existing techniques by a factor of two or more. Our results motivate the search for stronger dynamic ordering heuristics and combined static/dynamic techniques.

8A.3 Recursive Bipartitioning of BDDs for Performance Driven Synthesis of Pass Transistor Logic Circuits [p. 449]
Rupesh S. Shelar, Sachin S. Sapatnekar

In this paper, we address the problem of performance-oriented synthesis of pass transistor logic (PTL) circuits using a binary decision diagram (BDD) decomposition technique. We transform the BDD decomposition problem into a recursive bipartitioning problem and solve the latter using a max-flow min-cut technique. We use the area and delay cost of the PTL implementation of the logic function to guide the bipartitioning scheme. Using recursive bipartitioning and a one-hot multiplexer circuit, we show that our PTL implementation has logarithmic delay in the number of inputs, under certain assumptions. The experimental results on benchmark circuits are promising, showing significant delay reductions with little or no area overhead compared to previous approaches.

8A.4 A Probabilistic Constructive Approach to Optimization Problems [p. 453]
Jennifer L. Wong, Farinaz Koushanfar, Seapahn Meguerdichian, Miodrag Potkonjak

We propose a new optimization paradigm for solving intractable combinatorial problems. The technique, named Probabilistic Constructive (PC), combines the advantages of both constructive and probabilistic algorithms. The constructive aspect provides relatively short runtime and makes the technique amenable for the inclusion of insights through heuristic rules. The probabilistic nature facilitates a flexible trade-off between runtime and the quality of solution. In addition to presenting the generic technique, we apply it to the Maximal Independent Set problem. Extensive experimentation indicates that the new approach provides very attractive trade-offs between the quality of the solution and runtime, often outperforming the best previously published approaches.
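
In the spirit of the technique (all details here are our illustration, not the authors'), a probabilistic-constructive solver for Maximal Independent Set might build each solution greedily but choose the next vertex at random, biased by a low-degree heuristic, spending additional trials to trade runtime for quality:

```python
import random

def pc_mis(adj, trials=200, temp=1.0):
    """Probabilistic-constructive sketch for Maximal Independent Set.
    adj: dict vertex -> set of neighbors. Each trial constructs one
    maximal independent set; more trials improve the best result."""
    best = set()
    for _ in range(trials):
        avail, sol = set(adj), set()
        while avail:
            pool = list(avail)
            # heuristic insight: prefer low-degree vertices; temp controls
            # how strongly the bias is applied
            weights = [(1.0 / (1 + len(adj[v]))) ** (1 / temp) for v in pool]
            v = random.choices(pool, weights=weights)[0]
            sol.add(v)
            avail -= adj[v] | {v}        # remove v and its neighbors
        if len(sol) > len(best):
            best = sol
    return best
```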


Session 8B: Real Time Scheduling and Performance Analysis

Moderators: Pai Chou, University of California, Irvine, CA
Luciano Lavagno, Cadence Berkeley Labs., Berkeley, CA
8B.1 Energy Efficient Real-Time Scheduling [p. 458]
Amit Sinha, Anantha P. Chandrakasan

Real-time scheduling on processors that support dynamic voltage and frequency scaling is analyzed. The Slacked Earliest Deadline First (SEDF) algorithm is proposed, and it is shown that the algorithm is optimal in minimizing processor energy consumption and maximum lateness. An upper bound on the processor energy savings is also derived. Real-time scheduling of periodic tasks is also analyzed, and an optimal voltage and frequency allocation for a given task set is determined that guarantees schedulability and minimizes energy consumption.

8B.2 Efficient Performance Estimation for General Real-Time Task Systems [p. 464]
Hongchao (Stephanie) Liu, Xiaobo (Sharon) Hu

The paper presents a novel approach to computing tight upper bounds on processor utilization, independent of the implementation, for general real-time systems in which tasks are composed of subtasks and precedence constraints may exist among subtasks of the same task. We formulate the problem as a set of linear programming (LP) problems. Observations are made to reduce the number of LP problem instances that must be solved, which greatly improves the computation time of the utilization bounds. Furthermore, additional constraints can be included under certain circumstances to improve the quality of the bounds.

8B.3 Stars in VCC: Complementing Simulation with Worst-Case Analysis [p. 471]
Felice Balarin

STARS is a methodology for worst-case analysis of embedded systems. STARS manipulates abstract representations of system components to obtain upper bounds on the number of various events in the system, as well as a bound on the response time. VCC is a commercial discrete-event simulator that can be used for both functional and performance verification. We describe an extension of VCC to facilitate STARS. The extension allows the user to specify abstract representations of VCC modules. These abstractions are used by STARS, but their validity can also be checked by VCC simulation. We also propose a mostly automatic procedure to generate these abstractions. Finally, we illustrate on an example how STARS can be combined with simulation to find bugs that would be hard to find by simulation alone.


Session 8C: Power Analysis

Moderators: Anirudh Devgan, IBM Corp., Austin, TX
Carlo Guardini, PDF Solutions, San Jose, CA
8C.1 Multigrid-Like Technique for Power Grid Analysis [p. 480]
Joseph N. Kozhaya, Sani R. Nassif, Farid N. Najm

Modern sub-micron VLSI designs include huge power grids that are required to distribute large amounts of current, at increasingly lower voltages. The resulting voltage drop on the grid reduces noise margin and increases gate delay, resulting in a serious performance impact. Checking the integrity of the supply voltage using traditional circuit simulation is not practical, for reasons of time and memory complexity. We propose a novel multigrid-like technique for the analysis of power grids. The grid is reduced to a coarser structure, and the solution is mapped back to the original grid. Experimental results show that the proposed method is very efficient as well as suitable for both DC and transient analysis of power grids.
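
The sketch below shows the shape of one such cycle on a 1-D Laplacian stand-in: smooth on the fine grid, solve the restricted residual equation directly on the coarse grid, interpolate the correction back, and smooth again. The paper's contribution is doing this on large irregular RC power grids; none of the numerical choices here are the authors'.

```python
import numpy as np

def two_grid_cycle(u, f, A, P, sweeps=3):
    """One coarse-grid-correction cycle for A u = f, with interpolation
    matrix P (fine x coarse) and Galerkin coarse operator P^T A P."""
    for _ in range(sweeps):                  # damped-Jacobi pre-smoothing
        u = u + 0.66 * (f - A @ u) / 2.0     # diag(A) = 2 for this Laplacian
    e = np.linalg.solve(P.T @ A @ P, P.T @ (f - A @ u))
    u = u + P @ e                            # map coarse correction back
    for _ in range(sweeps):                  # post-smoothing
        u = u + 0.66 * (f - A @ u) / 2.0
    return u

n = 31                                       # interior nodes (odd)
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
nc = (n - 1) // 2
P = np.zeros((n, nc))                        # linear interpolation stencil
for j in range(nc):
    P[2 * j + 1, j] = 1.0
    P[2 * j, j] = P[2 * j + 2, j] = 0.5
f, u = np.ones(n), np.zeros(n)
for _ in range(10):
    u = two_grid_cycle(u, f, A, P)
print(np.max(np.abs(A @ u - f)))             # residual shrinks each cycle
```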

8C.2 An Analytical High-Level Battery Model for Use in Energy Management of Portable Electronic Systems [p. 488]
Daler N. Rakhmatov, Sarma B.K. Vrudhula

Once the battery becomes fully discharged, a battery-powered portable electronic system goes off-line. Therefore, it is important to take battery behavior into account. A system designer needs an adequate high-level model in order to make battery-aware decisions that target on-line maximization of the system's lifetime. We propose such a model: it allows a designer to predict the battery time-to-failure for a given load and provides a cost metric for lifetime optimization algorithms. Our model also allows for a tradeoff between accuracy and the amount of computation performed. The quality of the proposed model is evaluated using a detailed low-level simulation of a lithium-ion electrochemical cell.

8C.3 Power-Delay Modeling of Dynamic CMOS Gates for Circuit Optimization [p. 494]
J.L. Rossello, Jaume Segura

We present an accurate analytical expression to compute power and delay of domino CMOS circuits from a detailed description of internal capacitor switching and discharging currents. The expression obtained accounts for the main effects in complex sub-micron gates like velocity saturation effects, body effect, device sizes and coupling capacitors. The energy-delay product is also evaluated and analyzed. Results are compared to HSPICE simulations (level 50) for a 0.18 um CMOS technology.


Session 8D: Timing and Noise Analysis

Moderators: Tim Burks, Magma Design Automation, Inc., Cupertino, CA
Florentin Dartu, Intel Corp., Hillsboro, OR
8D.1 A Symbolic Simulation-Based Methodology for Generating Black-Box Timing Models of Custom Macrocells [p. 501]
Clayton McDonald, Randal E. Bryant

We present a methodology for generating black-box timing models for full-custom transistor-level CMOS circuits. Our approach utilizes transistor-level ternary symbolic timing simulation to explore the input arrival time space and determine the input arrival time windows that result in proper operation. This approach integrates symbolic timing simulation into existing static timing analysis flows and allows automated modelling of the timing behavior of aggressive full-custom circuit design styles.

8D.2 On the Signal Bounding Problem in Timing Analysis [p. 507]
Jin-Fuw Lee, D.L. Ostapko, Jeffery Soreff, C.K. Wong

In this paper, we study the propagation of slew dependent bounding signals and the corresponding slew problem in static timing analysis. The selection of slew from the latest arriving signal, a commonly used strategy, may violate the rule of monotonic delay. Several methods for generating bounding signals to overcome this difficulty are described. The accuracy and monotonicity of each method is analyzed. These methods can be easily implemented in a static timer to improve the accuracy.

8D.3 False-Noise Analysis using Logic Implications [p. 515]
Alexey Glebov, Sergey Gavrilov, David Blaauw, Supamas Sirichotiyakul, Chanhee Oh, Vladimir Zolotov

Cross-coupled noise analysis has become a critical concern in today's VLSI designs. Typically, noise analysis assumes that all aggressor nets can switch simultaneously in the same direction. This creates a worst-case noise pulse on the victim net that often leads to false noise violations. In this paper, we present a new approach that uses logic implications to identify the maximum set of aggressor nets that can inject noise simultaneously under the logic constraints of the circuit. We propose an approach to efficiently generate logic implications from a transistor-level description and propagate them through the circuit using ROBDD representations and a newly proposed lateral propagation method. We then show that the problem of finding the worst-case logically feasible noise can be represented as a maximum weighted independent set problem and show how to solve it efficiently. Initially, we restrict our discussion to zero-delay implications, which are valid for glitch-free circuits, and then extend our approach to timed implications. The proposed approaches were implemented in an industrial noise analysis tool, and results are shown for a number of industrial test cases. We demonstrate that a significant reduction in the number of noise failures can be obtained by considering logic implications as proposed in this paper, underscoring the need for false-noise analysis.


Session 9A: System Level Test and Reliability

Moderators: Bapi Vinnakota, University of Minnesota, Minneapolis, MN
Seiji Kajihara, Kyushu Institute of Technology, Iizuka, Japan
9A.1 The Design and Optimization of SOC Test Solutions [p. 523]
Erik Larsson, Zebo Peng, Gunnar Carlsson

We propose an integrated technique for extensive optimization of the final test solution for a System-on-Chip using Simulated Annealing. The technique produces a minimized test schedule that satisfies test conflicts and test power constraints, together with an optimized design of the test access mechanism. We have implemented the proposed algorithm and performed experiments on several benchmarks and industrial designs to show the usefulness and efficiency of our technique.

9A.2 Accurate CMOS Bridge Fault Modeling with Neural Network-Based VHDL Saboteurs [p. 531]
Don Shaw, Dhamin Al-Khalili, C™me Rozon

This paper presents a new bridge fault model that is based on a multiple layer feedforward neural network and implemented within the framework of a VHDL saboteur cell. Empirical evidence and experimental results show that it satisfies a prescribed set of bridge fault model criteria better than existing approaches. The new model computes exact bridged node voltages and propagation delay times with due attention to surrounding circuit elements. This is significant since, with the exception of full analog simulation, no other technique attempts to model the delay effects of bridge defects. Yet, compared to these analog simulations, the new approach is orders of magnitude faster and achieves reasonable accuracy; computing bridged node voltages with an average error near 0.006 volts and propagation delay times with an average error near 14 ps.
Keywords: Bridge Defects, Fault Models, Neural Networks, VHDL, CMOS ICs, Fault Simulation

9A.3 Algorithm Level Re-Computing - A Register Transfer Level Concurrent Error Detection Technique [p. 537]
Kaijie Wu, Ramesh Karri

In this paper we propose two algorithm-level, time-redundancy based Concurrent Error Detection (CED) schemes that exploit diversity in a Register Transfer (RT) level implementation. RT-level diversity can be achieved either by changing the operation-to-operator allocation (allocation diversity) or by shifting the operands before re-computation (data diversity). By causing a fault to affect the normal result and the re-computed result in two different ways, RT-level diversity yields good CED capability with low area overhead. We used Synopsys Behavioral Compiler (BC) to implement the technique.


Session 9B: Power Issues in High Level Synthesis

Moderators: Sridevan Parameswaran, The University of New South Wales, Kensington, Australia
Nikil Dutt, University of California, Irvine, CA
9B.1 Transient Power Management Through High Level Synthesis [p. 545]
Vijay Raghunathan, Srivaths Ravi, Anand Raghunathan, Ganesh Lakshminarayana

The use of nanometer technologies is making it increasingly important to consider transient characteristics of a circuit's power dissipation (e.g., peak power, and power gradient or differential) in addition to its average power consumption. Current transient power analysis and reduction approaches are mostly at the transistor and logic levels. We argue that, as was the case with average power minimization, architectural solutions to transient power problems can complement and significantly extend the scope of lower-level techniques. In this work, we present a high-level synthesis approach to transient power management. We demonstrate how high-level synthesis can impact the cycle-by-cycle peak power and peak power differential of the synthesized implementation. Further, we demonstrate that it is necessary to consider transient power metrics judiciously in order to minimize or avoid area and performance overheads. In order to alleviate the limits on parallelism imposed by peak power constraints, we propose a novel technique based on the selective insertion of data monitor operations in the behavioral description. We present enhanced scheduling algorithms that can accept constraints on transient power characteristics (in addition to the conventional resource and performance constraints). Experimental results on several example designs obtained using a state-of-the-art commercial design flow and technology library indicate that high-level synthesis with transient power management yields significant benefits: peak power reductions of up to 32% (average of 25%) and peak power differential reductions of up to 58% (average of 42%), with minimal performance overheads.

9B.2 An Integrated Data Path Optimization for Low Power Based on Network Flow Method [p. 553]
Chun-Gi Lyuh, Taewhan Kim, C.L. Liu

We propose an effective algorithm for power optimization in behavioral synthesis. Previous work has shown that several hardware allocation/binding problems for power optimization can be formulated as network flow problems and solved optimally. However, these formulations assume a fixed schedule. In this context, one key problem is: given an optimal network flow solution to a hardware allocation/binding problem for a schedule, how can a new optimal network flow solution be generated rapidly for a local change of the schedule? To this end, from a comprehensive analysis of the relation between network structure and flow computation, we devise a two-step procedure: (Step 1) a max-flow computation step, which finds a valid (maximum) flow solution while retaining the previous (maximum flow of minimum cost) solution as much as possible; (Step 2) a min-cost computation step, which incrementally refines the flow solution obtained in Step 1 using the concept of finding a negative-cost cycle in the residual graph of the flow. The proposed algorithm can be applied effectively to several important high-level data path optimization problems (e.g., allocation/binding of functional units, registers, buses, and memory ports) when we have the freedom to choose a schedule that will minimize power consumption. Experimental results (for bus synthesis) on benchmark problems show that our designs are 5.2% more power-efficient than the best known results, owing to (a) exploitation of the effect of scheduling and (b) optimal binding for every schedule instance. Furthermore, our algorithm is about 2.8 times faster than the full network flow based (optimal) bus synthesis algorithm, owing to (c) our novel two-step mechanism, which utilizes the previous flow solution to avoid redundant flow computations.

9B.3 What is the Limit of Energy Saving by Dynamic Voltage Scaling? [p. 560]
Gang Qu

Dynamic voltage scaling (DVS) is a technique that varies the supply voltage and clock frequency based on the computation load to provide the desired performance with the minimal amount of energy consumption. It has been demonstrated to be one of the most effective low-power system design techniques, in particular for real-time systems. Previous work has addressed both extremes of DVS systems: the ideal variable-voltage system, which can change its voltage with no physical constraints, and the multiple-voltage system, which has a number of discrete voltages available simultaneously. In this paper, we study the DVS systems between these two extreme cases. We consider systems that can vary the operating voltage dynamically under various real-life physical constraints. Based on a system's behavior during voltage transitions, we define the feasible DVS system and the practical DVS system. We build a mathematical model to analyze the potential of DVS for energy saving in these different systems. Finally, we simulate the behavior of a secure wireless communication network with DVS systems. The results show that DVS yields energy reductions from 36% to 79%, and that real-life DVS systems can come very close to the ideal system in energy saving.
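
The analysis rests on the standard first-order CMOS scaling relations (our recap, not the paper's exact model): energy per operation falls roughly quadratically with the supply voltage, while gate delay grows as the supply approaches the threshold voltage.

```latex
% Energy per operation and gate delay as functions of supply voltage
% V_dd, with threshold voltage V_t and the alpha-power-law exponent.
E \propto C_{\mathrm{eff}} V_{dd}^{2}, \qquad
t_{d} \propto \frac{V_{dd}}{(V_{dd} - V_{t})^{\alpha}}, \quad 1 < \alpha \leq 2
```

These two relations pull in opposite directions, which is why the voltage schedule, and the physical constraints on how fast voltage can change, determine the achievable savings.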


Session 9C: Advances in Placement

Moderators: Jason Cong, University of California, Los Angeles, CA
Patrick Groeneveld, Magma Design Automation, Inc., Cupertino, CA
9C.1 Local Search for Final Placement in VLSI Design [p. 565]
Oluf Faroe, David Pisinger, Martin Zachariasen

A new heuristic is presented for the general cell placement problem, where the objective is to minimize total bounding-box netlength. The heuristic is based on the Guided Local Search (GLS) metaheuristic. GLS modifies the objective function in a constructive way to escape local minima. Previous attempts to use local search on final (or detailed) placement problems have often failed, as the neighborhood quickly becomes too large for big circuits. Nevertheless, by combining GLS with Fast Local Search it is possible to focus the search on appropriate sub-neighborhoods, thus reducing the time complexity considerably. Comprehensive computational experiments with the developed algorithm are reported on small, medium and large industrial circuits, and for standard cell and general cell variants of the problem. The experiments demonstrate that the developed algorithm is able to improve the estimated routing length of large general cell layouts by as much as 20 percent. The general nature of the proposed method makes it easy to incorporate other goals, such as routability and timing constraints, into the objective function. Current layout algorithms use a feedback approach in which a placement is evaluated by performing (partial) routing and timing analysis; the output of this analysis is then used to construct an improved placement. This iterative nature of the design process calls for placement algorithms that take an existing placement and construct an improved placement that resembles the original one, but in which the information from the routing/timing analysis is taken into account.
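
For readers unfamiliar with GLS, its one-line core (our recap of the standard scheme, not the paper's exact formulation) augments the objective with penalties on solution features; when the search stalls, the penalties of the highest-utility features are incremented and the search resumes.

```latex
% g is the placement objective (here, bounding-box netlength), I_i(s)
% indicates whether feature i (e.g., a particular cell-to-location
% assignment) is present in placement s, p_i is its penalty counter,
% and lambda scales the penalty term.
h(s) = g(s) + \lambda \sum_{i} p_i \, I_i(s)
```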

9C.2 Congestion Reduction During Placement Based on Integer Programming [p. 573]
Xiaojian Yang, Ryan Kastner, Majid Sarrafzadeh

This paper presents a novel method to reduce routing congestion during placement stage. The proposed approach is used as a post-processing step in placement. Congestion reduction is based on local improvement on the existing layout. However, the approach has a global view of the congestion over the entire design. It uses integer linear programming (ILP) to formulate the conflicts between multiple congested regions, and performs local improvement according to the solution of ILP. Experiments show that the proposed approach can effectively reduce the total overflow of global routing result. The short running time of the algorithm indicates good scalability on large designs.

9C.3 Direct Transistor-Level Layout for Digital Blocks [p. 577]
Prakash Gopalakrishnan, Rob A. Rutenbar

We present a complete transistor-level layout flow, from logic netlist to final shapes, for blocks of combinational logic up to a few thousand transistors in size. The direct transistor-level attack easily accommodates the demands for careful custom sizing necessary in high-speed design, and is also significantly denser than a comparable cell-based layout. The key algorithmic innovations are (a) early identification of essential diffusion-merged MOS device groups called clusters, but (b) deferred binding of clusters to a specific shape-level layout until the very end of a multi-phase placement strategy. A global placer arranges uncommitted clusters; a detailed placer optimizes clusters at shape level for density and for overall routability. A commercial router completes the flow. Experiments comparing to a commercial standard cell-level layout flow show that, when flattened to transistors, our tool consistently achieves 100% routed layouts that average 23% less area.


Session 9D: Interconnect Analysis and Extraction

Moderators: Dennis M. Sylvester, University of Michigan, Ann Arbor, MI
Mustafa Celik, Monterey Design Systems, Inc., Sunnyvale, CA
9D.1 Model Reduction of Variable-Geometry Interconnects using Variational Spectrally-Weighted Balanced Truncation [p. 586]
Payam Heydari, Massoud Pedram

This paper presents a spectrally-weighted balanced truncation technique for RLC interconnects, a technique needed when the interconnect circuit parameters change as a result of variations in the manufacturing process. The salient features of this algorithm are the inclusion of parameter variations in the RLC interconnect, the guaranteed stability of the reduced transfer function, and the availability of provable frequency-weighted error bounds for the reduced-order system. This paper shows that the balanced truncation technique is an effective model-order reduction technique when variations in the circuit parameters are taken into consideration. Experimental results show that the new variational spectrally-weighted balanced truncation attains, on average, 20% more accuracy than the variational Krylov-subspace-based model-order reduction techniques while the run-time is also, on average, 5% faster.
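
As a baseline for what is being extended, plain (unweighted) balanced truncation can be written in a few lines; the paper's contributions, spectral weighting and parameter variations, sit on top of this skeleton. The sketch assumes a stable (Hurwitz) and minimal state-space model, so both Gramians are positive definite.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, svd, cholesky

def balanced_truncation(A, B, C, r):
    """Textbook square-root balanced truncation of (A, B, C) to order r."""
    P = solve_continuous_lyapunov(A, -B @ B.T)      # controllability Gramian
    Q = solve_continuous_lyapunov(A.T, -C.T @ C)    # observability Gramian
    Lp = cholesky(P, lower=True)
    Lq = cholesky(Q, lower=True)
    U, s, Vt = svd(Lq.T @ Lp)                       # Hankel singular values
    S = np.diag(s[:r] ** -0.5)
    T = Lp @ Vt[:r].T @ S                           # balancing projection
    Ti = S @ U[:, :r].T @ Lq.T                      # with Ti @ T = I_r
    return Ti @ A @ T, Ti @ B, C @ T
```

Truncating where the Hankel singular values drop off gives the provable error bound mentioned above; the spectrally-weighted variant reshapes that bound over the frequency band of interest.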

9D.2 Improving the Robustness of a Surface Integral Formulation for Wideband Impedance Extraction of 3D Structures [p. 592]
Zhenhai Zhu, Jingfang Huang, Ben Song, Jacob White

In order for parasitic extraction of high-speed integrated circuit interconnect to be sufficiently efficient, and to fit with model-order reduction techniques, a robust wideband surface integral formulation is essential. One recently developed surface integral formulation has shown promise, but was plagued by numerical difficulties of poorly understood origin. In this paper we show that one of that formulation's difficulties was related to inaccuracy in the approach used to evaluate integrals over discretization panels, and we present an accurate approach based on an adapted piecewise quadrature scheme. We also show that the condition number of the original system of integral equations can be reduced by differentiating one of the integral equations. Computational results on a ring and a spiral inductor are used to show that the new quadrature scheme and the differentiated integral formulation improve accuracy and accelerate the convergence of iterative solution methods.

9D.3 Practical Considerations in RLCK Crosstalk Analysis for Digital Integrated Circuits [p. 598]
Steven C. Chan, K.L. Shepard

Inductance and inductive crosstalk have become an important new concern for on-chip wires in deep-submicron integrated circuits. Recent advances in extractors to include inductance make possible the extraction of coupled RLCK interconnect networks from large, complex on-chip layouts. In this paper, we describe the techniques we use in a commercial static noise analysis tool to analyze crosstalk noise due to fully-coupled RLCK networks extracted from layout. Notable are the approaches we use to filter and lump aggressor couplings, as well as the techniques used to handle degeneracies in the modified nodal analysis (MNA) formulation. Furthermore, the nonmonotonicity of interconnect responses in the presence of inductance requires additional "sensitizations" when searching the possible switching events that induce the worst-case noise. Comparisons with silicon indicate the need to include the substrate in the extracted models in certain cases.


Session 10A: Don't Care Optimization and Boolean Matching

Moderators: Yuji Kukimoto, Silicon Perspective Corp., Santa Clara, CA
Prabhakar Kudva, IBM Corp. TJ Watson Research Center, Yorktown Heights, NY
10A.1 Single-Pass Redundancy Addition and Removal [p. 606]
Chih-Wei (Jim) Chang, Malgorzata Marek-Sadowska

Redundancy-addition-and-removal is a rewiring technique which for a given target wire wt finds a redundant alternative wire wa. Addition of wa makes wt redundant and hence removable without changing the overall circuit functionality. Incremental logic restructuring based on this technique has been used in many applications. However, the search for valid alternative wires requires trial-and-error redundancy testing of a potentially large set of candidate wires. In this paper, we study the fundamental theory behind this technique and propose a new reasoning scheme which directly identifies alternative wires without performing trial-and-error tests. Experimental results show up to 15 times speedup in comparison to the best techniques in literature.

10A.2 Efficient Canonical Form for Boolean Matching of Complex Functions in Large Libraries [p. 610]
Jovanka Ciric, Carl Sechen

A new algorithm is developed which transforms the truth table or implicant table of a Boolean function into a canonical form under any permutation of inputs. The algorithm is used for Boolean matching in large libraries that contain cells with large numbers of inputs and implicants. The minimum-cost canonical form is used as a unique identifier when searching for a cell in the library. The search time is nearly constant if a hash table is used to store the cells' canonical representations in the library. Experimental results on more than 100,000 gates confirm the validity and feasible run-time of the algorithm.

10A.3 Compatible Observability Don't Cares Revisited [p. 618]
R.K. Brayton

CODCs are compatible observability don't cares. We first examine the definition of compatibility and when a set of CODCs is compatible. We then discuss Savoj's CODC computation for propagating CODCs from a node's output to its fanins, and show by example that the results can depend on the current implementation of the node. We then generalize the computation so that the result is independent of the implementation at the node. The CODCs propagated by this computation are proved to be maximal in a certain sense. Local don't cares (LDCs) are CODCs of a node, pre-imaged to the primary inputs and then imaged and projected onto the local fanins of the node. LDCs combine CODCs with SDCs (satisfiability don't cares), but only the CODC part is propagated to the fanin network. Another form of local don't cares propagates both the CODC and SDC parts to the fanin network. Both are shown to be compatible in a certain sense, but conservative. We give a method for updating both kinds of local don't cares incrementally when other nodes in the network are changed.


Session 10B: Power Saving Techniques for Embedded Processors

Moderators: Preeti Panda, Synopsys, Inc., Mountain View, CA
Tony Givargis, University of California, Irvine, CA
10B.1 A Methodology for the Design of Application Specific Instruction Set Processors (ASIP) using the Machine Description Language LISA [p. 625]
Andreas Hoffmann, Oliver Schliebusch, Achim Nohl, Gunnar Braun, Oliver Wahlen, Heinrich Meyr

The development of application specific instruction set processors (ASIP) is currently the exclusive domain of the semiconductor houses and core vendors. This is due to the fact that building such an architecture is a difficult task requiring expert knowledge in different domains: application software development tools, processor hardware implementation, and system integration and verification. This paper presents a retargetable framework for ASIP design based on machine descriptions in the LISA language. From such a description, software development tools can be generated automatically, including an HLL C-compiler, assembler, linker, simulator, and debugger frontend. Moreover, synthesizable HDL code can be derived and processed by standard synthesis tools. Implementation results are given for a low-power ASIP for DVB-T acquisition and tracking algorithms designed with the presented methodology.

10B.2 Area and Power Reduction of Embedded DSP Systems using Instruction Compression and Re-Configurable Encoding [p. 631]
Subash G. Chandar, Mahesh Mehendale, R. Govindarajan

In this paper, we propose a reconfiguration mechanism that allows multiple instructions to be compressed, reducing both code size, which in turn reduces cost, and instruction-fetch power, which extends battery lifetime; both are key considerations in embedded DSP systems. We enhance the Texas Instruments TMS320C27x DSP core to incorporate this mechanism and evaluate the improvements in code size and instruction fetch energy using real-life embedded control application programs. We show that even with minimal hardware overhead, we can reduce code size by over 10% and instruction fetch energy by over 40%.
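
As a rough, generic illustration (not the paper's reconfigurable encoding scheme), dictionary-based instruction compression places the most frequent instruction words in a small decode table and replaces their occurrences with short indices; the word and index widths below are our assumptions:

    from collections import Counter

    def compress(program, table_size=256, word_bits=32, index_bits=8):
        # keep the most frequent instruction words in a decode table
        table = [w for w, _ in Counter(program).most_common(table_size)]
        hits = set(table)
        # word in the table -> short index; others pay an escape penalty
        bits = sum(index_bits if w in hits else index_bits + word_bits
                   for w in program)
        return table, bits

    program = [0x1000, 0x2004, 0x1000, 0x3008, 0x1000, 0x2004] * 100
    table, new_bits = compress(program)
    old_bits = len(program) * 32
    print(f"compressed code size: {new_bits / old_bits:.0%} of original")

Static code size shrinks because the stored words are narrower, and fetch energy shrinks because fewer bits are driven per fetch; this is, loosely, the effect the abstract quantifies on the TMS320C27x.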

10B.3 I-CoPES: Fast Instruction Code Placement for Embedded Systems to Improve Performance and Energy Efficiency [p. 635]
Sri Parameswaran, Jörg Henkel

The ratio of cache hits to cache misses in a computer system is, to a large extent, responsible for characteristics such as energy consumption and performance. In recent years, energy efficiency has become one of the dominating design constraints due to the rapid growth in market share of mobile computing/communication/internet devices. In this paper, we present a novel, fast, constructive technique that relocates instruction code within main memory in such a manner that the cache is utilized more efficiently. The technique is applied as a pre-processing step, i.e., before the code is executed. It is applicable to embedded systems where the number and characteristics of the tasks running on the system are known a priori. The technique imposes no computational overhead on the system. Applying our technique to a variety of real-world applications, we measured (through simulation) that the number of cache misses drops significantly. This in turn reduces the energy consumption of the whole system (CPU, caches, buses, main memory) by up to 65%, at the cost of a memory size increase of only 13% on average.
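
The flavor of such a technique can be sketched as follows (our toy, not the I-CoPES algorithm): given routine sizes in cache lines and weights recording how often pairs of routines alternate, greedily choose each routine's cache-set offset so as to minimize weighted overlap with routines already placed. Packing the chosen offsets into actual memory addresses is ignored here:

    def sets_of(offset, size, n_sets):
        # cache sets occupied by a routine of `size` lines at `offset`
        return {(offset + i) % n_sets for i in range(size)}

    def conflict_weight(conflicts, a, b):
        return conflicts.get((a, b), 0) + conflicts.get((b, a), 0)

    def place(routines, conflicts, n_sets):
        # routines: {name: size_in_lines}; conflicts: {(a, b): weight}
        placed = {}
        # handle the most conflict-laden routines first
        order = sorted(routines, key=lambda r: -sum(
            w for (a, b), w in conflicts.items() if r in (a, b)))
        for r in order:
            best_off, best_cost = 0, None
            for off in range(n_sets):
                mine = sets_of(off, routines[r], n_sets)
                cost = sum(conflict_weight(conflicts, r, q)
                           * len(mine & sets_of(o, routines[q], n_sets))
                           for q, o in placed.items())
                if best_cost is None or cost < best_cost:
                    best_off, best_cost = off, cost
            placed[r] = best_off
        return placed

    routines = {"isr": 4, "filter": 6, "logger": 5}
    conflicts = {("isr", "filter"): 90, ("filter", "logger"): 10}
    print(place(routines, conflicts, n_sets=16))

In this toy, the heavily conflicting "isr" and "filter" end up on disjoint cache sets, so alternating between them no longer evicts each other's lines.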


Session 10C: Embedded Tutorial: IC Power Distribution Challenges

Moderator: Farid Najm, University of Toronto, Toronto, ON, Canada
10C.1 IC Power Distribution Challenges [p. 643]
Sudhakar Bobba, Tyler Thorp, Kathirgamar Aingaran, Dean Liu

With each technology generation, delivering a time-varying current with small variation on a reduced nominal supply voltage is becoming more difficult due to increasing current and power requirements. The power delivery network thus becomes much more complex to design and requires accurate analysis and optimization at all levels of abstraction in order to meet specifications. In this paper, we describe techniques for estimating the supply voltage variations that can be used in the design of the power delivery network. We also describe the decoupling capacitor hierarchy that presents a low impedance to the increasing high-frequency current demand and limits the supply voltage variations. Techniques for high-level power estimation that can be used in performance vs. power trade-offs to reduce the current and power requirements of the circuit are also presented.
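
As a first-order illustration of how such specifications are often set (a common rule of thumb, included by us, not a formula from the paper): if the supply must stay within a ripple fraction k of V_dd under a worst-case current demand I_max, the network must satisfy

    Z_{target} = \frac{k \, V_{dd}}{I_{max}}

over the frequency range of interest; each level of the decoupling hierarchy (on-chip, package, board) is then sized to hold the impedance below Z_{target} in its own frequency band, with on-chip capacitance covering the highest frequencies.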

10C.2 Challenges in Power-Ground Integrity [p. 651]
Shen Lin, Norman Chang

With the advance of semiconductor manufacturing, EDA, and VLSI design technologies, circuits of ever higher speed are being integrated at ever higher density. This trend causes correspondingly larger voltage fluctuations in the on-chip power distribution network due to IR drop, L di/dt noise, and LC resonance. Power-ground integrity is therefore a serious challenge in the design of future high-performance circuits. In this paper, we introduce power-ground integrity, addressing its importance, verification methodology, and solution techniques.
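
To first order (our summary of the standard decomposition behind these noise sources), the fluctuation seen on the supply combines a resistive and an inductive term,

    \Delta V \approx I R + L \frac{di}{dt},

while the package inductance L resonating with the on-chip decoupling capacitance C can amplify noise around

    f_{res} = \frac{1}{2\pi \sqrt{L C}}.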


Session 11A: Panel: Automatic Hierarchical Design: Fantasy or Reality?

Moderator: Rob A. Rutenbar, Carnegie Mellon University, Pittsburgh, PA
Panelists: Olivier Coudert, Patrick Groeneveld, Juergen Koehl, Scott Peterson, Vivek Raghavan, Naresh Soni
11A.1 Automatic Hierarchical Design: Fantasy or Reality? [p. 656]

The debate over whether to design integrated systems flat or hierarchically is almost as old as VLSI design itself. Through the years, experts have predicted that future generations of design automation tools would have to adopt a strictly hierarchical scheme in order to master algorithmic and logistic complexity. However, the steady improvement of tools and algorithms and the desire to push integrated circuit performance to the extreme have pushed the flat design style into the multi-million-gate domain. Will this development continue, or will hierarchy finally win? The panel will discuss the different views on this topic and explore possible options for future design and tool flows. In particular, the panel will address questions related to verification, system design, logic synthesis, and physical design.