ASP-DAC 2000 Abstracts

Sessions: [A1] [B1] [C1] [D1] [A2] [B2] [C2] [D2] [E2] [A3] [B3] [C3] [D3] [E3] [A4] [B4] [C4] [D4] [E4] [A5] [B5] [C5] [D5] [E5] [A6] [B6] [C6] [D6] [E6] [A7] [B7] [C7] [D7] [E7] [A8] [B8] [D8]


Session A1: (Special Session) University LSI Design Contest

Co-Chairs : Ryota Kasai, NTT, Japan
Anantha Chandrakasan, MIT, USA
A1.1 A VLSI Implementation of the Blowfish Encryption/Decryption Algorithm [p. 1]
Michael C.-J. Lin, Youn-L. Lin

We propose an efficient hardware architecture for the Blowfish algorithm [1]. The speed is up to 4 bit/clock, which is 9 times faster than a Pentium. By applying operator-rescheduling method, the critical path delay is improved by 21.7%. We have successfully implemented it using Compass cell library targeted at a 0.6 mm TSMC SPTM CMOS process. The die size is 5.7 x 6.1 mm2 and the maximum frequency is 50MHz.

A1.2 VLSI Implementation of Rake Receiver for IS-95 CDMA Testbed using FPGA [p. 3]
Oliver Y-h. Leung, Chi-Y.Tsui, Roger S. Cheng

In this work, an implementation of a time-multiplexed downlink Rake receiver complied with the IS-95 CDMA standard is presented. A low power architecture of the Rake receiver is implemented. A structure which provides the offset changing for the pseudo-random sequence (PN sequence) used for despreading of the CDMA signals is discussed. Architecture for the efficient time multiplexing of the Rake fingers is also presented. The design was implemented using Xilinx FPGA. It was tested to be functionally correct and the performance was complied with IS-95.

A1.3 VLSI Implementation of a Switch Fabric for Mixed ATM and IP Traffic [p. 5]
Chi-Y. Tsui, Louis Chung-Y. Kwan, Chin-T. Lea

A VLSI implementation of a multistage self-routing ATM switch fabric is presented. The size of the switch is 16x16 and can handle the OC-12 (622 Mbps) link rate. Based on a bit-slice architecture, the entire 16x16 switch is implemented using four identical chips. The switch has multiple paths, created by a randomizer in front of the routing stages, between each input-output pair. The switch uses an input/output-buffering scheme and contains no buffers inside the fabric. A priority structure, which supports four levels, allows the delay sensitive ATM cells to be switched with the shortest latency. It also enables the non-interleaving routing scheme of IP cells. The switch fabric was designed and fabricated using MOSIS 0.8µm technology and was tested to run at 93MHz with 3.3V supply voltage.

A1.4 Design of Digital Neural Cell Scheduler for Intelligent IB-ATM Switch [p. 7]
J.-K. Lee, S.-M. Lee, M.M.-O.Lee, D.-W. Lee, Y.-C. Kim, S.-J. Jeong

We present the architecture of the ATM banyan switch composed of pattern process and high-speed digital neural cell scheduler. An input buffer type ATM switch with a window-based contention algorithm is proposed, modeling in VHDL for high-speed cell scheduler of ATM switching. A digital hopfield neural cell scheduler which has the ability of real-time processing is used to solve loss of throughput due to head-of-line( HOL) and internal blocking when FIFO queueing is employed at the Banyan network. In this scheduler, it is found we can minimize the delay for scheduling and select non-blocking cells in maximum leading to high performance. Our proposed ATM switch is modeled in VHDL, synthesized, implemented into a FPGA chip set and fabricated using 0.6¤ Š CMOS technology.

A1.5 Genetic Algorithm Accelerator GAA-II [p. 9]
Shin'ichi Wakabayashi, Tetsuhi Koide, Nayoshi Toshine, Masataka Yamane, Hajime Ueno

We have developed a new GA hardware called GAA-I(Genetic Algorithm Accelerator-I), in which the crossover operator to be applied to each individual was dynamically selected during the algorithm execution, GAA-I has some restrictions due to the limited chip size. In this paper, we extend the GAA-I and propose a new GA hardware, GAA-II, so that large, complex optimization problems can be solved. Furthermore, GAA-II has capability of parallel processing with other GAA-II chips. The GAA-II chip has been fabricated as a CMOS standard cell chip with 0.6 technology.

A1.6 A Programmable Built-In Self-Test Core for Embedded Memories [p. 11]
Chih-T. Huang, Jing-R. Huang, Cheng-W. Wu

Testing embedded memories is becoming an industry-wide concern with the advent of deep-submicron technology and system-on-chip applications. We present a prototype chip for a programmable built-in self-test (BIST) design that is used for testing embedded memories, especially DRAMs. The BIST chip supports various memory test algorithms by a novel controller and sequencer design. The area of the core circuit is about 1,020 x 1,020 um 2 using a 0:6um CMOS process, and the clock speed is over 100MHz.

A1.7 An Algorithm for VLSI Implementation of Highly Efficient Cubic-Polynomial Evaluation [p. 13]
Fan Mo, Yihua Zhang, Jun Yu, Qianling Zhang

In this paper, we present a novel cubic-polynomial evaluation algorithm. It is suitable for VLSI implementation and the computational cost is reduced to about 66% of the previously reported method.

A1.8 Design of Self-timed Asynchronous Boothās multiplier [p. 15]
Tin-Y. Tang, Chin-S. Choy, Pui-L. Siu, Cheong-F. Chan

This paper presents a design of multiplier for the multiplication of two 8-bit two-complement numbers. The multiplier applies the self-timed asynchronous methodology such that the multiplier can be assumed to operate on average case delay. And also, modified booth's algorithm [1] is used to reduce the number of partial product generated. As a result, the speed of the multiplier can be improved.

A1.9 High Speed and Ultra-Low Power 16x16 MAC Design using TG Techniques for Web-based Multimedia System [p. 17]
Seung-M. Lee, Jin-H. Chung, Hying-S. Yoon, Mike M-O. Lee

In this paper a study has been presented on High Speed(HS) and 79mW Low Power(LP) 16X16 MAC performance of new XOR-Based circuits using transmission gate logic(TG) implemented on 0.65um CMOS DLP/DLM technology. It is shown that our proposed MAC results in better performance than other published MACs due to no DC leakage currents for low power and bypassing unnecessary switching activities with latches before and after multiplier for high speed.

A1.10 A Smart Imager for the Vision Processing Front-END [p. 19]
Noriaki Takeda, Mituru Homma, Makoto Nagata, Takashi Morie, Atsushi Iwata

A CMOS PWM imager which realizes block summation and 2D projection of a thresholded image, in addition to row-parallel PWM readout with gray scale, is reported. An imager including 56 x 56 pixels, an address signal generator, and a signal processing circuit is fabricated in a 6mm x 6mm chip with a 0.8um CMOS technology.

A1.11 A Binary Image Sensor with Flexible Motion Vector Detection using Block Matching Method [p. 21]
Tomohiro Nezuka, Takafumi Fujita, Makoto Ikeda, Kunihiro Asada

A binary image sensor with motion vector detection has been developed and successfully tested. The sensor acquires binary images and detects motion vectors using block matching method. The size of matching blocks and the search area of motion vectors are variable. The block matching is realized by a 2-D shift-register, XOR circuits and block current-sum circuits. The sensor integrates 32x32 pixels on a 7.3mmx7.3mm die in a 1.2um CMOS 2-Metal 2-poly-Si process.

A1.12 An Arbitrary Chaos Generator Core Circuit Using PWM/PPM Signals [p. 23]
Kenichi Murakoshi, Takashi Morie, Makoto Nagata, Atushi Iwata

A compact chaos generator core circuit has been designed, fabricated, and tested. It generates PWM and PPM (pulse width and phase modulation) signals that follow arbitrary nonlinear analog dynamics. The measurement results show that PWM/PPM chaotic signals are generated at 1 MHz with a calculation precision of 6.7 bits at a supply voltage of 3.3 V. The circuit can be used for large-scale integrated noise sources or high-performance advanced neural networks using nonlinear analog dynamics.

A1.13 An Application Specific Java Processor with Reconfigurabilities [p. 25]
Shinji Kimura, Hiroyuki Kida, Kazuyoshi Takagi, Tatsumori Abematsu, Katsumasa Watanabe

The paper presents an application specific Java processor including reconfigurabilities, which is a DLX like pipeline processor with 5 stages and executes Java byte codes directly. Reconfigurabilities are the key technologies for application specific operations. We have introduced two reconfigurabilities: (1) a mechanism to override the control signals for a specific instruction, (2) external components can be attached with the same input and output ports as the internal ALU.

A1.14 Reconfigurable Synchronized Dataflow Processor [p. 27]
Hiroshi Sasaki, Hitoshi Maruyama, Hideaki Tsukioka, Nobuyoshi Shoji, Hiroaki Kobayashi, Tadao Nakamura

This paper describes the design and implementation of a reconfigurable synchronized dataflow processor (RSDP). The RSDP can configure its hardware to directly represent dataflow graphs (DFGs) of applications. Data are processed while they flow along application-specific datapaths in the RSDP. We have designed three DFGs for benchmarking and evaluated their performance on an RSDP board. The results show that the RSDP running at relatively lower frequency can achieve a competitive performance with a general-purpose processor.

A1.15 Prototype Microprocessor LSI with Scheduling Support Hardware for Operating System on Multiprocessor System [p. 29]
Naoki Nishimura, Takahiro Sasaki, Tetsuo Hironaka

In this paper we describe the details of SSH (Scheduling Support Hardware) which makes it possible to use fine grain parallelism effectively in multiprocessor systems. The SSH supports fast and concurrent thread scheduling that is very important in fine grain parallel processing.

A1.16 A Floating Point Arithmetic Unit for a Static Scheduling and Compiler Oriented Multiprocessor System ASCA [p. 31]
Takahiro Kawaguchi, Takayuki Suzuki, Hideharu Amano

We describe a floating point arithmetic unit (FPU) which supports static scheduling by automatic parallelizing compiler. This FPU designed to work with 50MHz clock with the assistance of EDA synthesis and layout tools. Under the clock rate condition, it appears that this FPU requires about 120,000 gates and marks 8.2 MFLOPS with the clock level simulation.

A1.17 A 16-bit Redundant Binary Multiplier Using Low-Power Pass-Transistor Logic SPL [p. 33]
Hirofumi Sakamoto, Ken'ichiro Uda, Bu-Y. Lee, Hiroyuki Ochi, Kazuo Taki, Takao Tsuda

We have designed a 16-bit redundant binary multiplier using pass-transistor logic SPL on a 0.35 um technology. Number of transistors is 12,349, and area is 1,322 um x 332 um. Measured power dissipation and maximum delay at T=25 C and VDD =3.3 V are 33.7 mW/100 MHz and 7.4 nsec, respectively.

A1.18 An 8x8 nRERL Serial Multiplier for Ultra-Low-Power Applications [p. 35]
Joonho Lim, Dong-G. Kim, Sang-C. Kang, Soo-I. Chae

An 8 x 8-b nRERL serial multiplier is implemented in a 0.6-um n-well 3-metal CMOS process. nERL (nMOS Reversible Energy Recovery Logic) is a new reversible adiabatic logic circuit, which can be operated at the leakage-current level for ultra-low-energy applications. Measurement results showed that the nRERL serial multiplier consumed only 0.9 % of the energy dissipation of the static CMOS one at the operating frequency 100 kHz at 5V, where its adiabatic and leakage losses were about equal.


Session B1: IP Reuse and Protection Methods

Co-Chairs : Graham R. Hellestrand, VaST Systems Technology Corp., USA
Minoru Yamamoto, Fujitsu Ltd., Japan
B1.1 Embedded Tutorial: Essential Issues for IP Reuse [p. 37]
Daniel D. Gajski, Allen C.-H. Wu, Viraphol Chaiyakul, Shojiro Mori, Tom Nukiyama, Pierre Bricaud

With widespread recent emphasis on System-On-a-Chip (SOC), IP reuse has emerged as a vital and growing business in semiconductor industry. In this paper, we will address essential issues for IP reuse by discussing current challenges to the success of IP businesses and identifying the obstacles that need to be overcome.

B1.2 Usage-Based Characterization of Complex Functional Blocks for Reuse in Behavioral Synthesis [p. 43]
Nong Fan, Viraphol Chaiyakul, Daniel D. Gajski

This paper presents a novel usage-based characterization method to capture pre-designed complex functional blocks for automatic reuse in behavioral synthesis. We identify attributes necessary for reuse of such complex components and illustrate how the attributes are captured into a design database. A complex component MTX_MULT 8X8, which computes product of two 8 x 8 matrices, is captured with the proposed method, and its reuse in behavioral synthesis is demonstrated with design of a DCT example. Feasibility of the method for capturing various components is demonstrated as well.

B1.3 Reuse and Protection of Intellectual Property in the SpecC System [p. 49]
Rainer Dömer, Daniel D. Gajski

In system-level design, the key to cope with the complexities involved with System-on-Chip (SOC) designs, is the reuse of Intellectual Property (IP). With the increasing demand for IP, the mechanism to protect an IP component from being copied, modified, or reverse-engineered, becomes very important. This paper describes how reuse and protection of IP is supported by the SpecC language and the SpecC design environment.

B1.4 Fair Watermarking Techniques [p. 55]
Gang Qu, Jennifer L. Wong, Miodrag Potkonjak

Many intellectual property protection (IPP) techniques have been proposed. Their primary objectives are providing convincible proof of authorship with least degradation of the quality of the intellectual property (IP), and achieving robustness against attacks. These are also well accepted as the most important criteria to evaluate different IPP techniques. The essence of such techniques is to limit the solution space by embedding signatures as constraints. One key issue that should be addressed but has not been discussed is the fairness of the techniques: what is the quality of the solution subspace for different signatures, that is, how large the solution subspace is (uniqueness), and how difficulty it is to get a solution from such subspace (hardness)? In this paper, we introduce fairness as one of the metrics for good IPP techniques and post the challenge problem of how to design fair watermarking techniques. We claim that all fair techniques have to be instance-oriented and due to the complexity of the problem itself, we propose an approach that utilizes the statistical information of the problem instance. We use the satisfiability (SAT) problem as an example to illustrate how fairness could be achieved. We make the observation that the unfairness of the previous watermarking techniques comes from the global embedding of the signature and propose fair watermarking techniques. We test the uniqueness and hardness on a model with full knowledge of the solution and real life benchmarks as well. The experimental results show fairness can be achieved.


Session C1: Decision Diagrams and Verification Methods

Co-Chairs : Yirng-An Chen, National Chiao Tung Univ., Taiwan
Kiyoharu Hamaguchi, Osaka Univ., Japan
C1.1 An Efficient Heuristic for State Encoding Minimizing the BDD Representations of the Transition Relations of Finite State Machines [p. 61]
Riccardo Forth, Paul Molitor

An efficient representation of the (smoothed) transition relation of a synchronous finite state machine (FSM) speeds up the traversal based symbolic verification and the functional simulation of the FSM. When using reduced ordered binary decision diagrams (BDDs), dynamic reordering algorithms are applied in order to keep the sizes of the BDDs tractable. However, when FSMs are represented by BDDs, the state encoding can be used as an additional optimization criteria. In this paper, we present a new algorithm for state encoding of FSMs that minimizes the BDD representations of the corresponding (smoothed) transition relations. Experimental results show the approach to be very efficient.

C1.2 Automatic Partitioning for Efficient Combinational Verification [p. 67]
Rajarshi Mukherjee, Jawahar Jain, Koichiro Takayama, Masahiro Fujita

A majority of the state-of-the-art combinational verification techniques are based on the extraction and use of internal equivalences between two circuits. Verification can become difficult if the two circuits have none or very few internal correspondences. In this paper we investigate automatic circuit partitioning as a methodology to make otherwise intractable circuits relatively tractable to the verifier. We show that given any two circuits to be verified, finding the best partitions that minimize the verification runtime is NP-hard. Therefore, we propose efficient heuristics to utilize certain characteristics of typical circuit design styles to find good partitions for the circuits. A key difference between our approach and earlier approaches to circuit partitioning is that ours is fully automated and does not require any prior knowledge of the type of function being implemented by the circuit. Using circuit partitioning we are able to verify several hard industrial circuits that could not be verified otherwise.

C1.3 A Hardware Simulation Engine Based on Decision Diagrams (Short Paper) [p. 73]
Yukihiro Iguchi, Tsutomu Sasao, Munehiro Matsuura, Atsumu Iseno

A hardware logic simulation engine based on decision diagrams is presented. For the data structure of the engine, we propose PMDDs (Paged reduced ordered Multi-valued Decision Diagrams). A unit of this engine consists of memory (RAMs) and control circuits : RAMs store the PMDD data, and the control circuits trace the edges according to the input vectors. The engine consists of several units, and is accelerated by pipelining. Experimental results using a prototype are shown.
Keywords : Logic simulation, Hardware simulation engine, BDD, MDD, PMDD.

C1.4 Formal Verification based on Assume and Guarantee Approach - A Case Study (Short Paper) [p. 77]
Subir K. Roy, Hiroaki Iwashita, Tsuneo Nakata

In this paper, we present a verification case study of the Audio Output Interface (AOF) sub-system based on a compositional verification approach known as the Assume-Guarantee approach. We illustrate, how designers can leverage their detailed knowledge of a design to partition it at the module level, and thereby, enable the Assume-Guarantee approach to overcome intrinsic limitations of a formal verification tool to successfully verify large designs.

C1.5 Multi-Clock Path Analysis Using Propositional Satisfiability [p. 81]
Kazuhiro Nakamura, Shinji Maruoka, Shinji Kimura, Katsumasa Watanabe

We present a satisfiability based multi-clock with analysis method. The method uses propositional satisfiability (SAT) in the direction of multi-clock paths. We show a method to reduce the multi-clock path detection problems to SAT problems. We also show heuristics on the conversion from multi-level circuits into CNF formulae. We have applied out method to ISCAS89 benchmarks and other sample circuits. Experimental results show the improvement on the manipulatable size of circuits by using SAT.


Session D1: Routing in Deep-Submicron

Co-Chairs : Jason Cong, Univ. of California, Los Angeles, USA
Takumi Okamoto, NEC Corp., Japan
D1.1 Self-Reforming Routing for Stochastic Search in VLSI Interconnection Layout [p. 87]
Yukiko Kubo, Yasuhiro Takashima, Shigetoshi Nakatake, Yoji Kajitani

Given a route which connects terminals on a one-layer routing area(Steiner tree), ip is a procedure that makes a current route change its configuration within its peripheral domain. A ip reforms a route by replacing one of its edges with a minimal detour. A route can ip one nearby obstacle. If the obstacle is another route, a more organized operation, called the dual ip, is applied to a pair of routes. The idea is enhanced to the 2-layer hv-routing. The performance of ip and dual ip was experimented in simulated annealing which reforms a route or a set of routes with respect to the evaluation function of multiple objectives. Some unique and satisfiable results were observed.

D1.2 An Interconnect Topology Optimization by a Tree Transformation [p. 93]
Naofumi Tsujii, Katsutoshi Baba, Shuji Tsukiyama

Since the interconnect delay has become the dominating factor in circuit performance, demands for a good delay-minimization router are very high. In this paper, we propose two algorithms to find an interconnect tree of a net which minimizes a weighted sum T of delays to all sinks, where the weight assigned to a sink represents a criticality of the delay to the sink. The algorithms start from a Steiner tree and repeat a tree transformation as monitoring the change of T. We also show some experimental results to evaluate the performance of the algorithms.

D1.3 Timing-Driven Hierarchical Global Routing with Wire-Sizing and Buffer-Insertion for VLSI with Multi-Routing-Layer [p. 99]
Takahiro Deguchi, Tetsushi Koide, Shin'ichi Wakabayashi

In the high performance VLSI with multi-layer lay-out model, the complexity of the global routing problem becomes much high under timing constraints. This paper presents a hierarchical global routing method based on a multi-layer routing model for the high performance standard cell layout. In each hierarchical level, the routes of nets are determined by solving a linear programming problem considering wire-sizing and buffer-insertion under timing constraints. We have implemented the proposed method on a workstation and showed the effectiveness of the method from experimental results.

D1.4 Area Routing Oriented Hierarchical Corner Stitching with Partial Bin [p. 105]
Zhang Yan, Wang Baohua, Cai Yici, Hong Xianlong

An efficient layout data structure is of great importance to a gridless area routing algorithm. Nowadays, the rectangular corner stitching structure, whose point searching and tile insertion are O(N1/2), is the most popular data structure used by gridless area routers. In this paper, we discuss a novel database - hierarchical corner stitching with partial bin, or hierarchical PB corner stitching, which combined the bin-based structure and the trapezoidal corner stitching. Its point searching and tile insertion are both enhanced to O(N1/2/r. We also derive the algorithms of the operations, such as point searching, area searching, plowing, etc., according to the specialties of area routers.


Session A2: (Special Session) CAD for Embedded Systems

Chair : Miodrag Potkonjak, Univ. of California, Los Angeles, USA
A2.1 Offline Program Re-mapping to Improve Branch Prediction Efficiency in Embedded Systems [p. 111]
Stephen S. Brown, Jeet Asher, William H. Mangione-Smith

This work presents a technique for improving the efficiency of hardware branch predictors. The key approach is to apply techniques of off-line re-mapping of the program-space in order to reduce the incidence of conflict misses in the branch hardware. This work also presents a new model for organizing temporal information between blocks in the address space, which can be applied effectively to previous re-mapping systems as well. The increased efficiency can be translated to improved performance for fixed hardware specifications, or used to reduce the hardware cost for achieving targeted performance during the design cycle.

A2.2 Timing Driven Co-design of Networked Embedded Systems [p. 117]
Dinesh Ramanathan, Ravindra Jejurikar, Rajesh K. Gupta

Advances in microelectronics integration have led to emergence of tightly integrated systems with high performance network interfaces. Design of such systems especially for single chip implementation is a delicate balance of functionality and available time budget to perform the tasks. Computer-aided design tools and methodologies are needed to ensure the correctness of the design and efficiency of the design process, especially for networked systems that have strict timing requirements both due to technology as well as networking needs. We present an overview of a timing-driven design methodology for networked systems, developed at the University of California, Irvine.

A2.3 Low-power Design Methodology and Applications utilizing Dual Supply Voltages [p. 123]
Kimiyoshi Usami, Mutsunori Igarashi

This paper describes a gate-level power minimization methodology using dual supply voltages. Gates and flip-flops off the critical paths are made to operate at the reduced supply voltage to save power. Core technologies are dual-V DD circuit synthesis and P&R. We give a brief overview on existing low-power EDA technologies as background and discuss advantages and challenges of the dual-VDD approach. Through real design examples, we will show that the approach reduces power effectively while keeping the performance at negligible area overhead.

A2.4 Co-Synthesis with custom ASICs [p. 129]
Yuan Xie, Wayne Wolf

This paper introduces the first hardware/software co-synthesis algorithm that optimizes the implementations of ASICs that are used as processing elements for the embedded systems. Many real time embedded systems are composed of heterogeneous processing elements, such as general purpose CPUs, ASICs and FPGAs. Previous work has not considered how to select one of several possible ASIC implementations for a specific task. We have developed a heuristic iterative improvement algorithm for distributed embedded system co-synthesis. We use Monet, a behavioral level architectural exploration system, to generate multiple implementations of a behavioral description of an ASIC and to analyze their performance. To the best of our knowledge, this is the first co-synthesis algorithm that takes into account the impact of different ASIC implementations of tasks on system performance and cost in the co-synthesis process.


Session B2: System-Level Power Optimization & Estimation

Co-Chairs : Akira Matsuzawa, Matsushita Electrical Industrial, Co. Ltd., Japan
Masato Edahiro, NEC Corp., Japan
B2.1 A New Method for Constructing IP Level Power Model Based on Power Sensitivity [p. 135]
Heng-Liang Huang, Jiing-Yuan Lin, Wen-Zen Shen, Jing-Yang Jou

This paper proposes a nominal point selection method for IP (Intellectual Property) level power model based on power sensitivity. By analyzing the relationship between the dynamic power consumption of CMOS circuits and their input signal statistics, three nominal points are efficiently selected to construct a power model based on power sensitivity. Our experimental results on a number of benchmark circuits show the effectiveness of the proposed method. Estimation accuracy within 5.78% of transistor level simulations is achieved.

B2.2 A hybrid approach for core-based system-level power modeling [p. 141]
Tony Givargis, Frank Vahid, Jörg Henkel

Reducing power consumption has become a key goal for system-on-a-chip (SOC) designs. Fast and accurate power estimation is needed early in the design process, since power reduction methods tend to have greater impact at higher abstraction levels. Unfortunately, current approaches to power estimation, which concentrate on register-transfer-level models or lower, are quite slow. Higher-level approaches, while faster, may suffer from inaccuracy. However, the advent of cores enables a hybrid approach, described in this paper, yielding both fast and accurate estimates from high-level models. In particular, we use power estimation data obtained from the gate-level for a core's representative input stimuli data (instructions), and we propagate this data to a higher (object-oriented) system-level model, which is parameterizable and executable. Depending on the kind of cores, various parameterizable equation or look-up table based techniques are used, resulting in self-analyzing core models. We have applied our technique to several cores of a digital camera SOC and have achieved simulation speedups of over 1000 with accuracies suitable for making reliable power-related system-level design decisions. Although we focus on power estimation, our approach can be used for estimating other metrics as well, such as performance and size.

B2.3 Voltage Reduction of Application-Specific Heterogeneous Multiprocessor Systems for Power Minimisation [p. 147]
Allan Rae, Sri Parameswaran

We present a design strategy to reduce power demands in application-specific, heterogeneous multiprocessor systems with interdependent subtasks. This power reduction scheme can be used with a randomised search such as a genetic algorithm where multiple trial solutions are tested. The scheme is applied to each trial solution after allocation and scheduling have been performed. Power savings are achieved by equally expanding each processor's execution time with a corresponding reduction in their respective operating voltage. Lowest cost solutions achieve average reductions of 24% while minimum power solutions average 58%.

B2.4 Withdrawn

B2.5 Synthesis of Low Power Folded Programmable Coefficient FIR Digital Filters (Short Paper) [p. 153]
Vijay Sundararajan, Keshab K. Parhi

A novel low-power synthesis technique is presented for the design of folded or time-multiplexed programmable-coefficient FIR filters where storage area is traded-off for lowering power consumption. A systematic technique is developed for low power mapping of FIR filters to architectures with arbitrary number of multipliers and adders. Power consumed in multipliers is reduced by reducing switching activity at both the data-input as well as the coefficient input. Common input operands can be exposed by unfolding, which, however leads to a memory increase. Simulation results are obtained for folding 65 and 129 tap bandpass FIR filters. The average power consumed in a multiplier for a fixed number of hardware multipliers with varying unfolding factors is compared. It is observed that most of the gains due to unfolding are obtained for relatively small unfolding factors and therefore for relatively small memory area overhead. Depending on the unfolding factor employed the average power consumed in a multiplier is seen to reduce anywhere from 54.75% to 81.73% when transpose FIR filters are synthesized as opposed to synthesizing direct-form FIR filters with no unfolding.


Session C2: Design Environment for FPGA

Co-Chairs : Shinji Kimura, Nara Inst. of Science and Technology, Japan
Kazutoshi Wakabayashi, NEC Corp., Japan
C2.1 Invited Talk : Synthesis Challenges for Next-Generation High-Performance and High-Density PLDs [p. 157]
Jason Cong, Songjie Xu

Rapid evolution of programmable logic device (PLD) architectures and the increasingly demanding requirements to meet with high-density and high-performance PLD designs call for a new generation of PLD synthesis tools to support many important capabilities not available in the existing synthesis tools. In this paper, we analyze the synthesis challenges and opportunities in supporting next-generation high-performance and high-density PLDs. We classify these challenges into two categories: (i) those 2 from the advance of PLD architectures, including synthesis for hierarchical architectures and synthesis for heterogenous architectures, and (ii) those driven by the requirements for high-density and high-performance PLD designs, including layout-driven synthesis, incremental synthesis, and IP-based synthesis. We shall discuss existing and/or potential solutions to these problems and outline research opportunities and directions for the development of next-generation PLD synthesis systems.

C2.2 KressArray Xplorer: A New CAD Environment to Optimize Reconfigurable Datapath Array Architectures [p. 163]
Reiner Hartenstein, Michael Herz, Thomas Hoffmann, Ulrich Nageldinger

This paper introduces a CAD environment for design-space exploration of coarse grain reconfigurable KressArray architectures and similar platforms. To find an optimal solution to a given application domain this KressArray Xplorer supports experimenting with different architectures. An estimator analyses the input and suggests an architecture onto which then the application is mapped using an heuristic algorithm. A graphic editor supports user interaction. The system generates performance analysis data, which is used by an automatic tool to refine the result iteratively.

C2.3 Hardware-Software Cosynthesis for Run-time Incrementally Reconfigurable FPGAs [p. 169]
Byungil Jeong, Sungjoo Yoo, Sunghyun Lee, Kiyoung Choi

This paper presents a method for hardware-software cosynthesis with run-time incrementally reconfigurable FPGAs. To reduce the run-time overhead of reconfiguring FPGAs, we present a concept called early partial reconfiguration (EPR) which minimizes the overhead by performing reconfiguration for an operation (or a task in our terms) mapped to an FPGA as early as possible so that the operation is ready to start when its execution is requested. For further reduction of the overhead, we integrate the incremental reconfiguration (IR) of FPGAs with the EPR concept. We present an ILP formulation and an efficient heuristic algorithm based on the EPR and IR concepts. Experiments on embedded system examples and synthetic examples show the efficiency of the proposed method.


Session D2: Placement Consistent with Routing

Co-Chairs : Yu-Liang Wu, Chinese University of Hong Kong, China
Hiroshi Murata, Microark, Japan
D2.1 A New Encoding Scheme for Rectangle Packing Problem [p. 175]
Toshihiko Takahashi

In the rectangle packing problem, encoding schemes to represent the placements of rectangles are the key factors of efficient algorithms. In 1995, an epoch-making encoding scheme, known as SEQ-PAIR, was developed[2]. The solution space of SEQ-PAIR has been considered sufficiently small. In this paper, however, we present a simple and natural encoding scheme of rectangle packings whose solution space is smaller than that of SEQ-PAIR.

D2.2 Analytical Minimization of Half-Perimeter Wirelength [p. 179]
Andrew A. Kennings, Igor L. Markov

Global placement of hypergraphs is critical in the top-down placement of large timing-driven designs [10, 16]. Placement quality is evaluated in terms of the half-perimeter wirelength (HPWL) of hyperedges in the original circuit hypergraph provided timing constraints are met. Analytical placers are instrumental in handling non-linear timing models [9, 3], but have two important drawbacks: (a) corresponding optimization algorithms are typically slower than top-down methods driven by multi-level mincut partitioning [2], and (b) hyperedges must be represented with net models [10, 17, 15, 8, 21, 20] which imply a mismatch of objective functions, with the alternative of computationally expensive linear programming (LP) [25, 16]. By comparing to optimal solutions produced by linear programming, we show that net models lead to solution quality loss. To address this problem, we present the first analytical algorithm that does not require net models and permits a direct inclusion of non-linear delay terms [3]; this allows to avoid expensive linearization of delay terms in [16]. Our numerical engine utilizes well-known quadratically convergent Newton-type methods [5, 11] for speed; it produces solutions within 12% of the LP optimum. Empirical results are for industrial placement instances.

D2.3 Modeling and Minimization of Routing Congestion [p. 185]
Maogang Wang, Majid Sarrafzadeh

Typical placement objectives involve reducing net-cut cost or minimizing wirelength. Congestion minimization is least understood, however, it models routability most accurately. In this paper, we study the congestion minimization problem during placement. First we pointed out that the bounding box router used in [12] is not an accurate measurement of the congestion in the placement. We use a realistic global router to evaluate congestion in the placement stage. This ensures that the final placement is truely congestion minimized. We also proposed two new post processing algorithms, the flow-based cell-centric algorithm and the net-centric algorithm. While the flow-based cell-centric algorithm can move multiple cells at the same time to minimize the congestion, it suffers large consumption of memory. Experimental results show that the net-centric algorithm can effectively identify the congested spots in the placement and reduce the congestion. It can produce on an average 7: 7% less congestion than the method proposed in [12]. Finally, we use a final global router to verify that the placement obtained from our algorithm has 39% less congestion than a wirelength-optimized placement obtained by TimberWolf (commercial version 1.3.1).


Session E2: (Special Session) System-In-Package (SIP)

Chair : Wayne W.-M. Dai, Univ. of California, Santa Cruz, USA
E2.1 System-In-Package (SIP): Challenges and Opportunities [p. 191]
King L. Tai

In this paper, we propose the concept of System-In-Package (SIP) as a generalization of System-On-Chip (SOC). System-In-Package overcomes formidable integration barriers without compromising individual chip technologies. The goal of SIP is to match or exceed SOC performance with lower cost. SIP technology platform that provides the needed integration is described. Applications include integrating memory with logic and integrating radio frequency ICs with high-quality (high-Q) passive components. SIP technology will unleash the innovations of designers and unlock the full potential of IC technology.

E2.2 Taiwan Foundry for System-In-Package (SIP) [p. 197]
Albert Lin

System-In-Package (SIP) is a cost-effective alternative to System-On-Chip (SOC) and chips with embedded memory. The key elements of SIP technology include I/O redistribution, solder bumping, flip chip assembly, and high density thin film interconnect substrate with/without integrated passives. To meet the need of SIP, as well as other advanced packaging solutions, APack Technologies has been set up to serve worldwide customers. High quality of the APack's service is built upon the SIP technology licensed from Bell Labs of Lucent Technologies, and the experience of cost competitive wafer fabrication of foundries in Taiwan.

E2.3 Integration of Large-Scale FPGA and DRAM in a Package Using Chip-On-Chip Technology [p. 205]
Michael X. Wang, Katsuharu Suzuki, Wayne W.-M. Dai, Yee L. Low, Kevin J. O'conner, King L. Tai

A field-programmable multi-chip module containing one ORCA 3T/125 FPGA and 4 MByte DRAM was built using chip-on-chip technology. Module architecture and physical design issues are presented. A PCI board consisting of four chip-on-chip modules is also built as the test vehicle. The design environment for this multi-chip module, including visual or C++ design entry and bit-serial datapath synthesis system, is also discussed. Some ongoing approaches, like double-ip technology and area IO are also addressed.

E2.4 Modeling and Analysis of Integrated Spiral Inductors for RF System-In-Package [p. 211]
Minqing Liu, Wayne W.-M. Dai

In this paper, a fast efficient Method of Moments(MoM) based on a new subdomain partitioning technique is presented for rapidly extracting the distributed capacitance and inductance of spiral inductors. By using this approach, the analytical formula of Green's function for multi-layer substrates is obtained through the prioritized wave-tracing method. The computing time is much faster than the state of art capacitance and inductance solvers, such as FMMS[6]. Good agreement between numerical results and measurement data is shown to demonstrate the accuracy of the proposed new method.


Session A3: Low Power Design : Implementation

Co-Chairs : Shoji Kawahito, Toyohashi Univ. of Technology, Japan
Jan M. Rabaey, Univ. of California, Berkeley, USA
A3.1 Narrow Bus Encoding for Low Power Systems [p. 217]
Youngsoo Shin, Kiyoung Choi

High integration in integrated circuits often leads to the problem of running out of pins. Narrow data buses can be used to alleviate this problem at the cost of performance degradation due to wait cycles. In this paper, we address bus coding methods for low power core-based systems incorporating narrow buses. Although the conventional Bus-Invert code performs well for completely random patterns, we show that transition signaling combined with Bus-Invert, which we call BITS coding, can achieve much more power saving for data patterns of typical DSP applications. The application of BITS coding to a real circuit design is limited by the overhead of the encoder and decoder circuits and the extra bus line introduced. We propose an approximate version of BITS coding, which do not require the extra bus line while retaining the advantage of BITS coding.

A3.2 Data Transmission over a Bus with Peak-Limited Transition Activity [p. 221]
Vijay Sundararajan, Keshab K. Parhi

Transitions on high capacitance busses in VLSI systems result in considerable power dissipation. Various coding schemes have been proposed in literature to encode the input signal in order to reduce the number of transitions. Reducing number of transitions comes in exchange for redundancy in data transferred over the busses. For a given amount of redundancy there exists a lower bound on the average number of transitions. In recent times noise and reliability problems have brought the peak/instantaneous power consumed in VLSI systems in to prominence. There has been limited study done on reducing the number of instantaneous transitions and hence the peak power consumed in busses. In this paper we model a bus with a limit on the maximum instantaneous transition activity as a constrained channel and derive an upper bound on the data-rate obtainable using the capacity of the underlying channel. We then demonstrate that some existing bus encoding schemes are near-optimal with respect to the derived bounds thus, perhaps, obviating the need to search for newer more complicated coding schemes. Also considered is a bus with a constraint on number of transitions in a fixed number of (k) bus transmissions. The capacity of such a bus is derived in the same manner as a bus with a constraint on maximum instantaneous transition activity.

A3.3 Power Analysis and Implementation of a Low-Power 300 MHz 8-b x 8-b Pipelined Multiplier [p. 225]
Jinn-Shyan Wang, Po-Hui Yang

This paper analyzes the power consumption of an array pipelined multiplier. To precisely realize a low power pipelined multiplier, the analytical model for a clocking system is presented. Simulation results show that the storage element is the key-component in a high performance pipelined multiplier macro. Compared with the conventional DFF and latch, the new low power DFF as PTTFF [6] achieves total power reduction ranging between 34 and 62 percents in a pipelined multiplier macro.


Session B3: Embedded Software

Co-Chairs : Peter Marwedel, Univ. of Dortmund, Germany
Hiroaki Takada, Toyohashi University of Technology, Japan
B3.1 A New Approach to Assembly Software Retargeting for Microcontrollers [p. 229]
Ing-Jer Huang, Dao-Zhen Chen

A new approach is proposed to translate existing software programs from one instruction set to other instruction sets at the assembly level. The behaviors of instructions are abstractly represented as manipulation of the machine state. The behavior of each basic block of the software program is then represented as a pair of state transition. Instruction set retargeting is then modeled as the problem of finding sequences of instructions accomplishing the same machine state transitions at the target machine as does the software program at the source machine. The proposed approach has been successfully demonstrated on the software translation between several industrial microcontrollers.

B3.2 Register Allocation for Common Subexpressions in DSP Data Paths [p. 235]
Rainer Leupers

This paper presents a new code optimization technique for DSPs with irregular data path structures. We consider the problem of generating machine code for data flow graphs with common subexpressions (CSEs). While in previous work CSEs are supposed to be strictly stored in memory, the technique proposed in this paper also permits the allocation of special purpose registers for temporarily storing CSEs. As a result, both the code size and the number of memory accesses are reduced. The optimization is controlled by a simulated annealing algorithm. We demonstrate its effectiveness for several DSP applications and a widespread DSP processor.

B3.3 A Technique for QoS-based System Partitioning [p. 241]
Johnson S. Kin, Chunho Lee, William H. Mangione-Smith, Miodrag Potkonjak

Quality of service (QoS) has been an important topic of many research communities. Combined with an advanced and retargetable compiler, variability of applications-specific very large instruction word (VLIW) processors has been shown to provide excellent platform for optimizing performance under area constraints. Although media servers and others that are natural candidates for QoS management are embedded systems and implementing QoS management using the applications-specific VLIW platform can provide a new horizon for efficient and effective system design, until now no effort has been reported. In this paper, we introduce a QoS-based system partitioning scheme that addresses the need for maximizing net benefit of a system under resource constraints by exploiting applications-specific VLIW processors. The approach utilizes the modern advances in compiler technology and architectural enhancements that are well matched to the compiler technology. Experimental results indicate that the approach presented in this paper is very effective in system partitioning for QoS given a set of heterogeneous processors.


Session C3: Implementation of Boolean Functions

Co-Chairs : Yusuke Matsunaga, Fujitsu Laboratories, Ltd., Japan
Hiroyuki Ochi, Hiroshima City Univ., Japan
C3.1 Exact Minimization of Fixed Polarity Reed-Muller Expressions for Incompletely Specified Functions [p. 247]
Debatosh Debnath, Tsutomu Sasao

This paper presents an exact minimization algorithm for fixed polarity Reed-Muller expressions (FPRMs) for incompletely specified functions. For an n-variable function with %alpha; unspecified minterms there are 2 n+%alpha; distinct FPRMs. A minimum FPRM is one with the fewest products. The minimization algorithm is based on the multi-terminal binary decision diagrams. Experimental results for a set of functions are shown. The algorithm can be extended to obtain exact minimum Kronecker expressions for incompletely specified functions. Index Terms -AND-EXOR, Reed-Muller expression, Kronecker expression, exact minimization, incompletely specified function.

C3.2 An Efficient Framework of Using Various Decomposition Methods to Synthesize LUT Networks and Its Evaluation [p. 253]
Shigeru Yamashita, Hiroshi Sawada, Akira Nagoya

We present an efficient framework for synthesizing look-up table (LUT) networks. Some of the existing LUT network synthesis methods are based on functional (boolean) decompositions. Our method also uses functional decompositions, but we try to use various decomposition methods, which include algebraic decompositions. Therefore, this method can be thought of as a general framework for synthesizing LUT networks by integrating various decomposition methods. We use a cost database file which is a unique characteristic in our method. We also present comparisons between our method and some well-known LUT network synthesis methods, and evaluate the final results after placement and routing. Although our method is rather heuristic in nature, the experimental results are encouraging.

C3.3 Three Parameters to Find Functional Decompositions [p. 259]
Tsutomu Sasao, Ken-ichi Kurimoto

Finding simple disjoint functional decompositions is a basic problem, but is generally time-consuming since there are nearly 2 n bipartitions of input variable. This paper introduces three parameters to find bipartitions of the input variables. It also defines "ideal random logic functions," and derives their properties. Experimental results using randomly generated functions and benchmark functions show the usefulness of the approach.


Session D3: Physical Design Planning

Co-Chairs : Wayne W.-M. Dai, Univ. of California, Santa Cruz, USA
Shuji Tsukiyama, Chuo Univ., Japan
D3.1 Delay-Optimal Wiring Plan for the Microprocessor of High Performance Computing Machines [p. 265]
Jun Kikuchi, Tetsuo Sasaki, Tohru Hashimoto, Kazuhisa Miyamoto

This paper presents a delay-optimal designing method for high performance microprocessor. In order to develop such a microprocessor in short period, a novel planning method for global wires, called "wiring plan" has been introduced. The goal of this wiring plan is to solve timing issue of wires with minimum usage of wiring resources; wiring channels and inserted buffers. In this wiring plan, global assignment of wiring resources and improvement of local congestion for their usage have been done. As a result of applying them, our chip has been able to work at 400MHz, and channel usage has not exceeded its capacity and gate overhead for buffer insertions has been only 1.6% of total gates.

D3.2 MMP : A Novel Placement Algorithm for Combined Macro Block and Standard Cell Layout Design [p. 271]
Hong Yu, Xianlong Hong, Yici Cai

In this paper, an efficient mixed-mode placement algorithm called MMP is presented for the high performance mixed block and standard cell designs. Our approach combines the well-known quadratic placement with bottom-up clustering, as well as the slicing partitioning strategy. This approach can account for macro blocks and standard cells simultaneously. Our method is both very efficient and effective, while it can be run very fast, too. We have tested our algorithm on a set of sample circuits from industry and consistently obtained excellent results.
Keywords: Mixed-Mode, Clustering, Quadratic Placement, Partitioning

D3.3 Dynamic Weighting Monte Carlo for Constrained Floorplan Designs in Mixed Signal Application [p. 277]
Jason Cong, Tianming Kong, Faming Liang, Jun S. Liu, Wing Hung Wong, Dongmin Xu

Simulated annealing has been one of the most popular stochastic optimization methods used in the VLSI CAD field in the past two decades. Recently, a new Monte Carlo and optimization method, named dynamic weighting Monte Carlo [WL97], has been introduced and successfully applied to the traveling salesman problem, neural network training [WL97], and spin-glasses simulation [LW99]. In this paper, we have successfully applied dynamic weighting Monte Carlo algorithm to the constrained floorplan design with consideration of both area and wirelength minimization. Our application scenario is the constrained floorplan design for mixed signal MCMs, where we need to place all the analog modules together in groups so that they can share common power and ground planes, which are separate from those used by the digital modules. Our experiments indicate that the dynamic weighting Monte Carlo algorithm is very effective for constrained floorplan optimization. It out-performs the simulated annealing for a real mixed signal MCM design by 19.5% in wirelength, with slight area improvement. This is the first work adopting the dynamic weighting Monte Carlo optimization method for solving VLSI CAD problems. We believe that this method has applications to many other VLSI CAD optimization problems.


Session E3: Methodologies for Reliable Design

Co-Chairs : Mitiko Miura-Mattausch, Hiroshima Univ., Japan
Andrzej J. Strojwas, Carnegie Mellon Univ., USA
E3.1 Symbolic Circuit-Noise Analysis and Modeling with Determinant Decision Diagrams [p. 283]
XiangDong Tan, C.-J. Richard Shi

In this paper, a new symbolic noise analysis and modeling technique is presented. The new method exploits the sharing of symbolic expressions in the noise models by using a recently introduced graph, called determinant decision diagrams (DDDs), for symbolic determinant representations. With efficient DDD-based graph manipulations, we are able to generate the exact noise models for analog blocks. Symbolic noise analysis and modeling on real analog circuit examples are presented and compared with SPICE noise simulation.

E3.2 Gate-Level Aged Timing Simulation Methodology for Hot-Carrier Reliability Assurance [p. 289]
Yoshiyuki Kawakami, Jingkun Fang, Hirokazu Yonezawa, Nobufusa Iwanishi, Lifeng Wu, Alvin I-Hsien Chen, Norio Koike, Ping Chen, Chune-Sin Yeh, Zhihong Liu

This paper presents a new aged timing simulation methodology that can be used for hot-carrier reliability assurance of VLSI. This methodology consists of a compact model and a unique algorithm. The ratio based model simplifies the aging I-V characteristics of MOSFET over time into the aged timing and the corresponding ratio at gate-level. A new algorithm is proposed including a gate primitive decomposition method and an aged slew rate propagation method. This algorithm provides good stress representation and can achieve comparable accuracy with the conventional transistor-level approach. The above methodology has been implemented in a new simulator. Experimental results demonstrate that the simulator based on this methodology realizes full-chip circuit capacity and can be applied to various reliability analyses including degradation-sensitive critical paths and clock skew.

E3.3 Embedded Tutorial : Subwavelength Lithography (PSM, OPC) [p. 295]
Tsuneo Terasawa

Fabrication of fine features of smaller 0.15um is vital for future ultra-large scale integrated (ULSI) devices. An area of particular concern is whether and how optical lithography can delineate such feature sizes, i.e., smaller than the exposure wavelength. Resolution enhancement techniques for achieving subwavelength optical lithography are presented. Various types of phase shift mask (PSM) techniques and their imaging characteristics are discussed and compared to conventional binary mask technique. To apply these masks effectively to practical patterns, optical proximity effect correction (OPC) technique and a phase shifter pattern design tool must be established. These techniques offer the capability to improve resolution to exceed the wavelength limitation and to increase depth of focus.


Session A4: Synthesis for System-On-A-Chip

Co-Chairs : Rolf Ernst, Technical Univ. Braunschweig, Germany
Ahmed A. Jerraya, TIMA Laboratory, France
A4.1 Embedded Tutorial : IC Design Technology for Building System-On-A-Chip [p. 301]
Rajesh Gupta

A4.2 Thread Partitioning Method for Hardware Compiler Bach [p. 303]
Mizuki Takahashi, Nagisa Ishiura, Akihisa Yamada, Takashi Kambe

This paper presents a method for thread partitioning for a hardware compiler Bach. Bach synthesizes RT level circuits from a system description written in Bach-C language, where a system is modeled as communicating processes running in parallel. The system description is decomposed into threads, i.e., strings of sequential processes, and then converted into synthesizable behavioral VHDL models. The proposed method attempts to find a partitioning of a given system description into threads that maximize resource sharing among processes in the threads. Experiments on two real designs show that the circuit sizes were reduced by 3.7% and 14.7%. We also show the detailed statistics and analysis of the size of the resulting gate level circuits.

A4.3 An Area/Time Optimizing Algorithm in High-Level Synthesis for Control-Based Hardwares (Short Paper) [p. 309]
Nozomu Togawa, Masayuki Ienaga, Masao Yanagisawa, Tatsuo Ohtsuki

This paper proposes an area/time optimizing algorithm in high-level synthesis for control-based hardwares. Given a call graph whose node corresponds to a control flow of an application program, the algorithm generates a set of state-transition graphs which represents the input call graph under area and timing constraints. In the algorithm, first state-transition graphs which satisfy only timing constraint are generated and second they are transformed so that they can satisfy area constraint. Since the algorithm is directly applied to control-flow graphs, it can deal with control flows such as bit-wise processes and conditional branches. Further, the algorithm synthesizes more than one hardware architecture candidates from a single call graph for an application program. Designers of an application program can select several good hardware architectures among candidates depending on multiple design criteria. Experimental results for several control-based hardwares demonstrate effectiveness and efficiency of the algorithm.

A4.4 A Timing-Driven Synthesis of Arithmetic Circuits using Carry-Save-Adders (Short Paper) [p. 313]
Taewhan Kim, Junhyung Um

Carry-save-adder (CSA) is one of the most widely used types of operation in implementing a fast computation of arithmetics. An inherent limitation of the conventional CSA applications is that the applications are confined to the sections of arithmetic circuit that can be directly translated into addition expressions. To overcome this limitation, from the analysis of the structures of arithmetic circuits found in industry, we derive a set of simple, but effective CSA transformation techniques. Those are (1) optimization across multiplexors, (2) optimization across design boundaries (restricted notion of [3]), and (3) optimization across multiplications. Based on the techniques, we develop a new timing-driven CSA transformation algorithm that is able to utilize CSAs extensively throughout the whole circuits. Experimental data for practical testcases are provided to show the effectiveness of our algorithm.


Session B4: Reconfigurable Computation

Co-Chairs : Toshiaki Miyazaki, NTT Network Innovation Laboratories, Japan
Tetsuo Hironaka, Hiroshima City Univ., Japan
B4.1 Communicating Logic : An Alternative Embedded Stream Processing Paradigm [p. 317]
Norbert Imlig, Ryusuke Konishi, Tsunemichi Shiozawa, Kiyoshi Oguri, Kouichi Nagami, Hideyuki Ito, Minoru Inamori, Hiroshi Nakada

This paper introduces a new concurrent processing model called "Communicating Logic" (CL). As the target architecture we adopt the dual layered "Plastic Cell Architecture" (PCA). Data-path-oriented processing functionality is encapsulated in asynchronous hardware objects with variable graining and implemented using look-up tables. Communication (i.e. connectivity and control) between the distributed processing objects is achieved by means of inter-object message passing. The key point of the CL approach is that it offers the merits of scalable, high performance hardware implementation with the compilation and linking capabilities unique to software.

B4.2 A Scheduling and Allocation Method to Reduce Data Transfer Time by Dynamic Reconfiguration [p. 323]
Kazuhito Ito

In the era of deep submicron technology, wire delay on an LSI chip is becoming relatively larger than operation delay. Increase of execution speed by parallel processing may be limited due to the data transfer time between functional units. If we can dynamically reconfigure nearby functional units into desired operation type and execute operations on the reconfigured units, long data transfer is reduced and hence fast processing can be achieved. In this paper we propose a scheduling method to determine static operation execution time and functional unit allocation to achieve fast signal processing by considering dynamic reconfiguration of functional units. Results show the effectiveness of the proposed method.

B4.3 Invited Talk : Reconfigurable Computing: Its Concept and a Practical Embodiment using Newly Developed Dynamically Reconfigurable Logic (DRL) LSI [p. 329]
Masakazu Yamashina, Masato Motomura

This paper first outlines a broad range of reconfigurable computing research activities from a perspective of system LSI designs. Then, the paper focuses onto dynamically reconfigurable logic (DRL) LSI, a prototype chip that we developed to evaluate the reconfigurable computing concept. Through its ability to exchange hardware contexts quickly, this chip can accelerate media/communication applications with customized hardware configurations, yet maintaining scalability towards varying application sizes.


Session C4: Synthesis for Low Power

Co-Chairs : Akira Nagoya, NTT Communication Science Laboratories, Japan
Manish Pandey, Cadence Design Systems, Inc., USA
C4.1 Power Reduction by Simultaneous Voltage Scaling and Gate Sizing [p. 333]
Chunhong Chen, Majid Sarrafzadeh

This paper proposes to use voltage-scaling (VS) and gate-sizing (GS) simultaneously for reducing power consumption without violating the timing constraints. We present algorithms for simultaneous VS and GS based on the Maximum-Weighted-Independent-Set problem. We describe the slack distribution of circuit, completeness of gate library and discreteness of supply voltage, and discuss their effects on power optimization. Experimental results show that the average power reduction ranges from 23.3% to 56.9% over all tested circuits.

C4.2 Analysis of Power-Clocked CMOS with Application to the Design of Energy-Recovery Circuits [p. 339]
Massoud Pedram, Xunwei Wu

This paper presents our research results on power-clocked CMOS design. First we provide algebraic expressions and describe properties of clocked signals. Next two types of power-clocked CMOS circuit constructions are introduced and analyzed in detail. Since the adiabatic switching requires slow-ramping of the power-clock, a clocked transmission gate and a four-stage clocked NP-domino circuit are presented, which receive trapezoidal and sinusoidal power-clocks, respectively. PSPICE simulations demonstrate the correct operation and energy-saving advantage of the proposed circuits.

C4.3 Low-Power Design of Sequential Circuits Using a Quasi-Synchronous Derived Clock [p. 345]
Xunwei Wu, Jian Wei, Massoud Pedram, Qing Wu

This paper presents a novel circuit design technique to reduce the power dissipation in sequential circuits by generating a quasi-synchronous derived clock from the master clock and using it to isolate the flip flops in the circuit from the unwanted triggering action of the master clock. An example design of a decimal counter demonstrates the large power saving and improved performance of the resulting circuit.

C4.4 FSM Decomposition by Direct Circuit Manipulation Applied to Low Power Design [p. 351]
José C. Monteiro, Arlindo L. Oliveira

Clock-gating techniques are very effective in the reduction of the switching activity in sequential logic circuits. In particular, recent work has shown that significant power reductions are possible with techniques based on finite state machine (FSM) decomposition. A serious limitation of previously proposed techniques is that they require the state transition graph (STG) of the FSM to be given or extracted from the circuit. Since the size of the STG can be exponential on the number of registers in the circuit, explicit techniques can only be applied to relatively small sequential circuits. In this paper, we present a new approach to perform FSM decomposition by direct manipulation of the circuit. This way, we do not require the STG, either explicit or implicit, thus further avoiding the limitations imposed by the use of BDDs. Therefore, this technique can be applied to circuits with very large STGs. We provide a set of experimental results that show that power consumption can be substantially reduced, in some cases by more than 70%.


Session D4: Panel Discussion

Timing Closure : The Solution and Its Problems [p. 359] Organizer: Ralph H.J.M. Otten, Delft University of Technology, The Netherlands
Moderator: Ralph H.J.M. Otten, Delft University of Technology, The Netherlands
Panelists:
Raul Camposano (Synopsys, Inc., USA)
Oliver Coudert (Monterey Design Systems, Inc., USA)
Patrick Groeneveld (Magma Design Automation, USA)
Leon Stok (Thomas J. Watson Research Center, IBM, USA)

In this paper we summarize the derivation of the size equations, the key to timing closure. Next we present a number of problems when applying these equations in practice. The main ones are network generation, discrete libraries, size constraints, and resistive interconnect.


Session E4: MOSFET Device Optimization

Co-Chairs : Naoyuki Shigyo, Toshiba Corp., Japan
C.-J. Richard Shi, Univ. of Washington, USA
E4.1 High Performance of Short-Channel MOSFETs due to an Elevated Central-Channel Doping [p. 365]
M. Tanaka, N. Tokida, T. Okagaki, M. Miura-Mattausch, W. Hansch, H. J. Mattausch

An elevated central-channel doping with a depth similar to the S/D junctions is proposed as the best measure for simultaneously improving MOSFET device and high speed circuit performances as well as minimizing their fluctuations. We base our arguments on hydrodynamic device simulation, measured device data of vertical MOSFETs with a central delta-doped impurity profile and include experimental results on doping-profile fluctuations along the channel, which have not been available previously.

E4.2 Circuit Performance Oriented Device Optimization using BSIM3 Pre-Silicon Model Parameters [p. 371]
Mikako Miyama, Shiro Kamohara

We propose a circuit performance oriented device optimization methodology using pre-silicon parameters and critical paths which represent the performance of the chip. Based on our methodology, we successfully reduced the power consumption by 90% and, at the same time, increased the frequency by 30% from the initial design. The key to this optimization methodology is the pre-silicon parameter generation method, which can predict the device performance within 5% accuracy in a few minutes.

E4.3 Embedded Tutorial : Design for Manufacturability : A Path from System Level to High Yielding Chips [p. 375]
Andrzej J. Strojwas

This tutorial describes a wide spectrum of the Design for Manufacturability (DFM) activities. We start by presenting a new approach to IC design, which takes full advantage of leading edge technology. Then we propose a new methodology for acceleration of yield ramping which accounts for all the dominant yield loss mechanisms and a set of software tools that have been developed to address the yield learning problems. Several real-life examples demonstrate the practical results of employing such a yield ramping strategy.


Session A5: Low Power Design : System Approach

Co-Chairs : Keshab K. Parhi, Univ. of Minnesota, USA
Koji Kotani, Tohoku Univ., Japan
A5.1 Embedded Tutorial : Low-Power Silicon Architectures for Wireless Communications [p. 377]
Jan M. Rabaey

Wireless communication and networking is experiencing a dramatic growth, and all indicators point to an extension of this growth in the foreseeable future. This paper reflects on the demands and the opportunities offered with respect to the integrated implmentataion of these applications in the "systems-on-a-chip" era.

A5.2 Run-time Power Control Scheme using Software Feedback Loop for Low-power Real-time Application [p. 381]
Seongsoo Lee, Takayasu Sakurai

A novel hardware-software cooperative power control scheme, namely a software feedback loop, is proposed to lower power consumption of VLSI systems. The proposed run-time voltage and frequency control scheme guarantees the real-time execution of applications. It avoids interface problems and also provides binary-code compatibility. Using a software analysis environment, the power control scheme is shown to achieve more than 90% power reduction for real-time MPEG-4 SP@L1 video encoding, taking into consideration the transition delay between voltage and frequency levels.

A5.3 An Interleaved Dual-Battery Power Supply for Battery-Operated Electronics [p. 387]
Qing Wu, Qinru Qiu, Massoud Pedram

After a detailed analysis and discussion of two important characteristics of today's battery cells (i.e., their current-capacity and current-voltage curves), this paper describes the design principles and architecture of a dual-battery power supply system for portable electronics. The key idea is to integrate two battery types with different energy capacity and current rate curves into the power supply system, and then use them in an interleaved manner in response to varying current requirement of the VLSI circuit that is powered by this dual-battery system. Analytical and empirical results demonstrate the effectiveness of the new battery architecture in maximizing the service life of a battery system with fixed volume (or weight).


Session B5: System Design and Debugging

Co-Chairs : Rajesh Gupta, Univ. of California, Irvine, USA
Takashi Kambe, Sharp Corp., Japan
B5.1 Embedded Tutorial : Embedded System Design with Multiple Languages [p. 391]
Rolf Ernst, Ahmed A. Jerraya

The use of several languages in the design of embedded systems is very convenient for application development and optimization but it can become an obstacle on the way to higher design productivity. This paper explains solutions and future trends.

B5.2 Symbolic Debugging of Globally Optimized Behavioral Specifications [p. 397]
Inki Hong, Darko Kirovski, Miodrag Potkonjak, Marios C. Papaefthymiou

Symbolic debuggers are system development tools that can accelerate the validation speed of behavioral specifications by allowing a user to interact with an executing code at the source level. In response to a user query, the debugger must be able to retrieve and display the value of a source variable in a manner consistent with what the user expects with respect to the source statement where execution has halted. However, when a behavioral specification has been optimized using transformations, values of variables may either be inaccessible in the run-time state or inconsistent with what the user expects. We address the problem that pertains to the retrieval of source values for the globally optimized behavioral specifications. We describe how transformations effect the retrieval of source values. We present an approach for a symbolic debugger to retrieve and display the value of a variable correctly and efficiently in response to a user inquiry about the variable in the source specification. The implementation of the new debugging approach poses several optimization tasks. We formulate the optimization tasks and develop heuristics to solve them. We demonstrated the effectiveness of the proposed approach on a set of designs.

B5.3 Fast Development of Source-level Debugging System Using Hardware Emulation (Short Paper) [p. 401]
Sang-Joon Nam, Jun-Hee Lee, Byoung-Woon Kim, Yeon-Ho Im, Young-Su Kwon, Kyong-Gu Kang, Chong-Min Kyung

We describe the co-development of a processor and its source-level debugging system using an emulation-based validation technology including hardware emulation, not simulation. Since a source-level debugging system is essential to develop an application system and it takes a long time to validate the functionality of the source-level debugging system, we have adopted hardware emulation for a fast validation and system development. Using this methodology, we were able to validate the source-level debugging system successfully before the chip fabrication.

B5.4 Methodology for Hardware/Software Co-verification in C/C++ (Short Paper)[p. 405]
Luc Séméria, Abhijit Ghosh

In this paper we present our C/C++-based design environment for hardware/software co-verification. Our approach is to use C/C++ to describe both hardware and software throughout the design flow. Our methodology supports the efficient mapping of C/ C++ functional descriptions directly into hardware and software. The advantages of a C/C++-based flow from the verification point of view are presented. The use of C/C++ to model all parts of the system provides great flexibility and enables faster simulation compared to existing methodologies. We show how co-verification can be done efficiently and effectively at the various levels of abstraction, how co-verification can be used to drive co-design through performance estimation and give an example of implementation for the 8051 architecture.


Session C5: Optimization Issues in Logic Synthesis

Co-Chairs : Tsutomu Sasao, Kyushu Inst. of Technology, Japan
Shin'ichi Minato, NTT Network Innovation Laboratories, Japan
C5.1 Performance-Optimal Clustering with Retiming for Sequential Circuits [p. 409]
Tzu-Chieh Tien, Youn-Long Lin

We propose an exact clustering with retiming algorithm to minimize the clock period for sequential circuits. Without moving ip-ops (FF's) by retiming, conventional clustering algorithms can only handle combinational parts and therefore cannot achieve the best cycle time. Pan et al. [2] have proposed an optimal algorithm under the unit gate delay model. We propose a more powerful and faster algorithm that produces optimal results even under the more realistic general gate delay model. Experimental results show that our algorithm is twice as fast as Pan's.

C5.2 IBAW: An Implication-Tree Based Alternative-Wiring Logic Transformation Algorithm [p. 415]
Wangning Long, Yu-Liang Wu, Jinian Bian

The well-known ATPG-based alternative wiring technique, RAMBO, has been shown to be very useful because of its proven powerfulness and flexibility in attacking many design automation problems (e.g. logic optimization, circuit partitioning, and post-layout logic transformation ..etc). Since the ATPG based alternative wire locating procedure is the center engine for all its applications, speeding up of this process should be very crucial and useful. We observe that the bottleneck of the technique lies in the costly redundancy tests among a large number of candidate alternative wires. In this paper, we develop a so-called implication-tree data structure which stores implication relationship between nodes with determined logic values, and propose a new ATPG-based alternative-wiring algorithm to speed up the engine. The algorithm, Implication-tree Based Alternative-Wiring (IBAW), differs from other ATPG-based algorithms in terms that it selects source node of alternative wires from the implication-tree, which makes IBAW be able to trim out many unnecessary redundancy checking quite easily without calling for complicated procedures. Hence, it produces a steady speeding up of around 3.6 times faster while maintains the same rewiring capability of the original RAMBO. Our experimental results show that the overall circuit area optimized by IBAW can be slightly better than that by RAMBO, while the runtime is just one-half of the latter.

C5.3 On Mixture Density and Maximum Likelihood Power Estimation via Expectation-Maximization [p. 423]
R. Chandramouli, Vamsi K. Srikantam

A maximum-likelihood estimation procedure for computing the average power consumption of VLSI circuits is proposed. The method can handle data that has a mixture-density with multiple components unlike most of the previous approaches. An iterative computational procedure based on the expectation-maximization principle is also discussed. This can be used to estimate the parameters of an arbitrary (but finite) number of components of the probability distribution of the simulated power data. Experimental results for ISCAS '85 benchmark circuits and a large industrial circuit are given in order to validate the efficiency and practicality of the algorithm. Comparisons show that the proposed method estimates the multiple components (even those with a low probability of occurrence) while the Monte Carlo estimate captures only the most probable component.


Session D5: Novel Techniques in Advanced Partitioning

Co-Chairs : Shin'ichi Wakabayashi, Hiroshima Univ., Japan
Tetsushi Koide, Univ. of Tokyo, Japan
D5.1 Edge Separability Based Circuit Clustering with Application to Circuit Partitioning [p. 429]
Jason Cong, Sung Kyu Lim

In this paper, we introduce a new efficient O(n log n) graph search based bottom-up clustering algorithm named ESC (Edge Separability based Clustering). Unlike existing bottom-up algorithms that are based on local connectivity information of the netlist, ESC exploits more global connectivity information "edge separability" to guide clustering process while carefully monitoring cluster area balance. Computing the edge separability for a given edge e = (x; y) in an edge weighted undirected graph G(V; E; s; w) is equivalent to finding the x-y mincut. Then, we show that a simple and efficient algorithm CAPFOREST [14] can be used to provide a good estimation of edge separability for all edges in G without using any network flow computation. Related experiments based on large scale ISPD98 [1] benchmark circuits confirm that exploiting edge separability yields better quality partitioning solution compared to various bottom-up clustering algorithms proposed in the literature including Absorption [18], Density [9, 11], Rent Parameter [15], Ratio Cut [19], Closeness [17], and Connectivity [16] method. In addition, our ESC based multiway partitioning algorithm LR/ESC-PM provides comparable results to state-of-the-art hMetis [12] and hMetis-Kway [13].

D5.2 Feasible Two-Way Circuit Partitioning with Complex Resource Constraints [p. 435]
Hsun-Cheng Lee, Ting-Chi Wang

We study in this paper the feasibility problem for two-way circuit partitioning subject to complex resource constraints. We then consider that the problem is in general NP-complete. We then consider two special cases of the problem, and present polynomail-time algorithms for them. Finally, we give a backtracking algorithm to solve the general case. To reduce the run time and the storage space of the backtracking algorithm, an incremental flow computation technique is employed. For each algorithm presented in this paper, the corresponding experimental results are also given to support its efficiency. To the best of our knowledge, this paper is the first one to address the feasibility problem for two-way circuit partitioning with complex resource constraints.

D5.3 Performance Driven Multiway Partitioning [p. 441]
Jason Cong, Sung Kyu Lim

Under the interconnect-centric design paradigm, partitioning is seen as the crucial step that defines the interconnect [1]. To meet the performance requirement of today's complex design, performance driven partitioners must consider the amount of interconnect induced by partitioning as well as its impact on performance. In this paper, we provide new performance driven formulation for cell move based top-down multiway partitioning algorithms with consideration of the local and global interconnect delay. In our "constrained acyclic partitioning" formulation, cell moves are restricted to maintain acyclicity in partitioning solution to prevent cyclic dependency among cells in different partitions. In our "relaxed acyclic partitioning" formulation, acyclic constraints are relaxed to give partitioners capability of minimizing cutsize and delay. Our new acyclic constraint based performance driven multiway partitioning algorithm FLARE obtains (i) 4% to 13% better delay compared to the state-of-the-art cutsize minimization based hMetis [10] at almost no increase in cutsize, and (ii) 84% better cutsize compared to the state-of-the-art delay minimization based PRIME [2] at an expense of 16% increase in delay.


Session E5: Efficient Estimation for Interconnection

Co-Chairs : Hiroshi Matsumoto, NEC Corp., Japan
Xianlong Hong, Tsinghua Univ., China
E5.1 Hierarchical Computation of 3-D Interconnect Capacitance using Direct Boundary Element Method [p. 447]
Jiangchun Gu, Zeyi Wang, Xianlong Hong

The idea of Appel's hierarchical algorithm handling the many-body problem is implemented in the direct boundary element method (BEM) for computation of 3-D VLSI parasitic capacitance. Both the electric potential and normal electric field intensity on the boundary are involved, so it can be much easier to handle problems with multiple dielectrics and finite dielectric structure than the indirect BEM. Three kinds of boundaries (forced boundary, natural boundary and dielectric interface) are treated. Two integral kernels with different singularity (1/r, 1/r3) are involved while computing the interaction between the boundary elements. These features make it significantly distinct from the hierarchical algorithm based on the indirect BEM, which only handles single dielectric, one integral kernal and one forced boundary. The coefficient matric is generated and stored hierarchically in this paper. As a result, computation cost of the matrix is reduced, and the matric-vector multiplication in the GMRES iteration is accelerated, so computation speed is improved significantly.
Key Words: VLSI, Parasitic Capacitance, Direct Boundary Element Method, Hierarchical Computation.

E5.2 A Simplified Hybrid Method for Calculating the Frequency-dependent Inductances of Transmission Lines with Rectangular Cross Section [p. 453]
Shuzhou Fang, Xiaobo Tang, Zeyi Wang, Xianlong Hong

This paper describes a simplified hybrid method to calculate the frequency-dependent inductances of transmission lines, which combines cross-section coupled circuit method with boundary element method (BEM) just solving Laplace's equation. The coupled circuit method is used to calculate inductance in low frequency range; The BEM builds the capacitance matrix, then we invert this matrix (omit the row and column relate to ground plane), we can obtain inductance relating to the high frequency at which the current only distributes on surfaces of conductor. A cubic spline interpolation between inductance at low and very high frequency gives a good result over entire frequency range. Comparing this method with previous hybrid method [3], it avoids solving Helmholtz's equation and obtains the capacitance matrix at the same time, greatly reducing the total burden of calculation. Numerical tests show it is applicable and efficient.

E5.3 An Analytic Calculation Method for Delay Time of RC-class Interconnects [p. 457]
W. K. Kal, S. Y. Kim

This paper presents an analytic 3rd order calculation method, without simulations, for delay time of RC-class circuits, which can be conveniently used to model on-chip interconnects. While the proposed method requires comparable evaluation time to the previous 2nd order calculation method, it ensures more accurate results than those of 2nd order method. The proposed analytic delay calculation method guarantees allowable error tolerance when compared to the results obtained from the AWE technique or HSPICE and has better performance in evaluation time as well as numerical stability. The first algorithm of the proposed method requires 7 moments for the 3rd order approximation and yields accurate delay time approximation. The second algorithm requires 5 moments for the 3rd order approximation and results in shorter evaluation time, the accuracy of which may be less than the first algorithm.

E5.4 A New Efficient Waveform Simulation Method for RLC Interconnect via Amplitude and Phase Approximation [p. 463]
Xiaodong Yang, Walter H. Ku, Chung-Kuan Cheng

In this paper we use a segmented Chebyshev orthongonal polynomial expansion approach to approximate amplitude and phase response for RLC interconnect simulation. This method has been shown to be efficient in reducing frequency sampling point number or expansion order. Experiments also show that the number of sampling points is not necessarily dependent on the interconnect circuit size, which makes possible large network simulation. Additionally, because of the smoothing-effect of the atan function in evaluation, the phase response, which has major impact on the transient edge of output response, can be easily approximated. Experiments show the proposed simulation approach achieves 30-50 times speed-up over spice3f4.


Session A6: Optimized LSI Design

Co-Chairs : Chong-Min Kyung, Korea Advanced Inst. of Science and Technology Korea
Takashi Morie, Hiroshima Univ., Japan
A6.1 Optimization of VDD and VTH for Low-Power and High Speed Applications [p. 469]
Koichi Nose, Takayasu Sakurai

Closed-form formulas are presented for optimum supply voltage (VDD) and threshold voltage (VTH) that minimize power dissipation when technology parameters and required speed are given. The formulas take into account short-channel effects and the variation of VTH and temperature. Using typical device parameters, it is shown that a simple guideline to optimize the power consumption is to set the ratio of maximum leakage power to total power about 30%. Extending the analysis, the future VLSI design trend is discussed. The optimum VDD coincides with the SIA roadmap and the optimum VTH for logic blocks at the highest temperature and at the lowest process variation corner is in the range of 0V~0.1V over generations.

A6.2 Compact yet High Performance (CyHP) Library for Short Time-to-Market with New Technologies [p. 475]
Nguyen Minh Duc, Takayasu Sakurai

Two compact yet high performance standard cell libraries (CyHP libraries), which contain only 11 and 20 cells respectively, are proposed. The first CyHP library leads to 5% increase in delay compared to a library containing about 400 cells. The second CyHP library suppresses delay increase up to 2%. The compact nature of these libraries not only reduces the cost and time for generation and maintenance substantially (thus enables new technologies to be adopted immediately), but also shortens synthesis time to about a half. Application results of these libraries to standard benchmarks and an industry design of about 80K gates are presented.

A6.3 A New CMAC Neural Network Architecture and Its ASIC Realization [p. 481]
Yuan-Bao Hsu, Kao-Shing Hwang, Chien-Yuan Pao, Jinn-Shyan Wang

A new VLSI architecture of the CMAC neural network, called CCMAC, which is composed of an embedded content addressable memory (CAM) is proposed. The CCMAC can determine whether the addressing content resides in the memory or not faster. If a match doesn't occur, the activated address is stored in a vacant cell of the CAM. With this mechanism, memory utilization rate can be improved to 100% and control noise can be suppressed. Moreover, the associated mask function, which is triggered while the CAM is full, can improve control performance. Comparison between CCMAC and a conventional CMAC is examined on the truck backer-upper control simulation. A CMAC ASIC chip has been implemented using a 0.6-um CMOS technology under 3.3- or 2.5-V supply voltages after obtaining design trade-off. The chip contains 0.33-million transistors and operates at 20 MH for a 2.5-V supply with only 10-mW power dissipation.


Session B6: DSP and Memory Architecture

Co-Chairs : Ichiro Kuroda, NEC Corp., Japan
Keshab K. Parhi, Univ. of Minnesota, USA
B6.1 Retargetable Estimation Scheme for DSP Architecture Selection [p. 485]
Naji Ghazal, Richard Newton, Jan Rabaey

Given the recent wave of innovation and diversification in digital signal processor (DSP) architecture, the need for quickly evaluating the true potential of considered architectural choices for a given application has been rising. We propose a new scheme, called Retargetable Estimation, that involves analysis of a high-level description of a DSP application, with aggressive optimization search, to provide a performance estimate of its optimal implementation on the architectures considered. With this scheme, we present a new parameterized architecture model that allows quick retargeting to a wide range of architectural choices, and that emphasizes capturing an architecture's salient optimizing features. We show that for a set of DSP benchmarks and two full applications, hand-optimized performance can be predicted reliably. We applied this scheme to two different processors.

B6.2 Data Memory Minimization by Sharing Large Size Buffers [p. 491]
Hyunok Oh, Soonhoi Ha

This paper presents software synthesis techniques to deal with non-primitive data type from graphical dataflow programs based on the synchronous dataflow (SDF) model. Non-primitive data types, often used in multimedia and graphics applications, require buffer memory of large size. To minimize the buffer requirement, we separate global data buffers and local pointer buffers. The proposed approach first allocates the minimum size of global buffers and next binds the local buffers to the global buffers by setting the pointers. Static binding and dynamic binding techniques are devised. Experimental results prove the significance of the proposed techniques.

B6.3 Array Allocation Taking into Account SDRAM Characteristics [p. 497]
Hong-Kai Chang, Youn-Long Lin

Multimedia, image processing and other signal processing applications often involve data stored in large arrays. Due to chip area limitation, arrays are typically assigned to off-chip memories, such as DRAM. This being the case, we try to optimize off-chip memory accesses to improve performance. We take the characteristics of the current mainstream SDRAM memory into account. We propose an algorithm to allocate arrays to different banks to increase the probability of utilizing SDRAMās multi-bank characteristic. Experimental results show significant improvement over traditional approaches.


Session C6: Validation and Test

Co-Chairs : Cheng-Wen Wu, Univ. of California, Santa Barbara, USA
Tomoo Inoue, Hiroshima City Univ., Japan
C6.1 Causality Based Generation Of Directed Test Cases [p. 503]
Nina Saxena, Jacob Abraham, Avijit Saha

Simulation is considered to be the irreplaceable part of design verification. However, the efficiency of this method depends greatly on the test cases used. Random test cases can be generated quickly but have poor coverage. Directed test cases on the other hand require time and manual effort. This paper presents a method for generating directed test cases automatically by making use of signal relationships in the specification. An algorithm is presented that was applied to the GL85 microprocessor, a clone of Intel's 8085. The results are compared with other methods to see the gain with the proposed method.

C6.2 Embedded Tutorial : Fault Models and Test Generation for IDDQ Testing [p. 509]
Yoshinobu Higami, Yuzo Takamatsu, Kewal K. Saluja, Kozo Kinoshita

This paper surveys recent research related to IDDQ testing, particularly focuses on fault models and test generation methods. (1) The paper provides a taxonomy of fault models that have been studied in literature, and classifies these models into a small set of faults. (2) The paper describes efficient test generation methods and fault simulation methods. Test compaction methods, including reduction of the total number of test vectors and selection of IDDQ measurement vectors, are also described.

C6.3 Embedded Tutorial : Issues on SOC Testing in DSM Era [p. 515]
Takashi Aikyo

Deep sub-micron technology is rapidly leading to exceedingly complex, billion-transistor chips. By these technology evolutions, a system is integrated into a chip so called a system-on-a-chip (SOC). In order to bridge the productivity gap between available transistors and able to be designed in SOC, higher-level behavioral language and re-use of already designed intellectual property (IP) will become more common. However, these techniques affect test methodologies and failure analysis of SOC.


Session D6: Cell Generation & Process Dependent Issues

Co-Chairs : Jun-Dong Cho, Sungkyunkwan Univ., Korea
Yoichi Shiraishi, Gunma Univ., Japan
D6.1 A Cell Synthesis Method for Salicide Process [p. 517]
Kazuhisa Okada, Takayuki Yamanouchi, Takashi Kambe

In this paper, we propose a cell synthesis method for a Salicide process. Our method utilizes the local interconnect between adjacent transistors, which is available in some Salicide processes, and optimizes the transistor placement of a cell considering both area and the number of local interconnects. In this way we reduce the number of metal wires and contacts. The circuit model is not restricted to conventional series-parallel CMOS logic, and our method enables us to synthesize CMOS pass-transistor circuits. Experimental results show that our method uses the local interconnect effectively, and optimizes both cell area and metal wire length.

D6.2 Monte-Carlo Algorithms for Layout Density Control [p. 523]
Yu Chen, Andrew B. Kahng, Gabriel Robins, Alex Zelikovsky

Chemical-mechanical polishing (CMP) and other manufacturing steps in very deep submicron VLSI have varying effects on device and interconnect features, depending on local characteristics of the layout. To enhance manufacturability and performance predictability, we seek to make the layout uniform with respect to prescribed density criteria, by inserting "fill" geometries into the layout. We propose several new Monte-Carlo based filling methods with fast dynamic data structures and report the tradeoff between runtime and accuracy for the suggested methods. Compared to existing linear programming based approaches, our Monte-Carlo methods seem very promising as they produce nearly-optimal solutions within reasonable runtimes.

D6.3 Layout Generation of Array Cell for NMOS 4-phase Dynamic Logic (Short Paper) [p. 529]
Makoto Furuie, Bao-Yu Song, Yukihiro Yoshida, Takao Onoye, Isao Shirakawa

An array cell (AC) architecture for the layout design is described, which is dedicated to low-power design by means of the NMOS 4-phase dynamic logic. An AC is constructed of (MxN)+2 transistors so as to constitute each type of NMOS 4-phase logic gates. A graph theoretic approach is exploited in the layout design to reduce the layout area. A number of experimental results demonstrate the practicability of the proposed approach.

D6.4 A New Efficient Method for Substrate-Aware Device-Level Placement (Short Paper) [p. 533]
C. Lin, D. M. W. Leenaerts

The performance of high-frequently mixed-signal and analog designs heavily relies on the actual layout of the circuit components on device level. Substrate coupling has been recognized as one of the major physical design bottlenecks in this field. To deal with this problem in an efficient way, a new technique is proposed based on the sequence pair structure and corner stitching. Performance gain is guaranteed at the cost of O(M) run-time complexity, which is optimal. Simulations with randomly generated problem instances show that practical run-times are indeed linear in the size of the problem instance, thus enabling other design optimization algorithms to incorporate the proposed method without increasing run-time complexity.


Session E6: Analysis Techniques for Analog Circuits

Co-Chairs : Mineo Kaneko, Japan Advanced Inst. of Science and Technology, Japan
Peter M. Lee, Hitachi Ltd., Japan
E6.1 The Enhancing of Efficiency of the Harmonic Balance Analysis by Adaptation of Preconditioner to Circuit Nonlinearity [p. 537]
M.M. Gourary, S. G. Rusakov, S. L. Ulyanov, M.M. Zharov, K.K. Gullapalli, B.J. Mulvaney

Krylov subspace techniques in harmonic balance simulations become increasingly ineffective when applied to strongly nonlinear circuits. This limitation is particularly important in the simulation if the circuit it has components being operated in a very nonlinear region. Even if the circuit contains only a few very nonlinear components, Krylov methods using standard preconditioners can become ineffective. To overcome this problem, we present two adaptive preconditioners that dynamically exploit the properties of the harmonic balance Jacobian. With these techniques we have been able to retain the advantages of Krylov methods even for strongly nonlinear circuits. Some numerical experiments illustrating the techniques are presented.

E6.2 Analog-Testability Analysis by Determinant-Decision-Diagrams based Symbolic Analysis [p. 541]
Tao Pi, C.-J. Richard Shi

The use of the column-rank of the sensitivity matrix as a testability measure for parametric faults in linear analog circuits was pioneered by Saeks in 1970s, and later re-discovered by several others. Its practical use has been, however, limited by how it can be calculated. Numerical algorithms suffer from inevitable round-off errors, while classical symbolic techniques can only handle very small circuits. In this paper , an innovative and efficient graph based symbolic analysis approach, called Determinant Decision Diagrams, is applied to testability measurement and selection for optimum test vectors. The new approach is promising in testability analysis of much larger analog circuits.

E6.3 A Method for Linking Process-level Variability to System Performances [p. 547]
Tomohiro Fujita, Ken-ichi Okada, Hiroaki Fujita, Hidetoshi Onodera, Keikichi Tamaru

In this paper we present a statistical analysis method which bridges the statistical information between process-level and system-level. This enables us to evaluate the effect of process variation at the system level. Also, we can derive constraints on the process variation from a performance requirement. We show an example of the hierarchical statistical analysis applied to a Phase Locked Loop (PLL) circuit.


Session A7: Advanced Design Techniques for Deep-Submicron System-On-A-Chip

Co-Chairs : Kazuo Yano, Hitachi, Ltd., Japan
Takao Onoye, Kyoto Univ., Japan
A7.1 Embedded Tutorial : Design Challenges for 0.1um and Beyond [p. 553]
Takayasu Sakurai

If we look into the scaling law carefully, we find that three crises can be stringent in realizing LSI's for 0.1um and beyond: namely power crisis, interconnection crisis, and complexity crisis. This paper describes these crises and possible solutions to cope with the problems.

A7.2 A Hardware Accelerator for the Specular Intensity of Phong Illumination Model in 3-Dimensional Graphics [p. 559]
Young-Su Kwon, In-Cheol Park, Chong-Min Kyung

This paper presents a special hardware implementation developed for the computation of the specular term which is the most time consuming part in the Phong's illumination. In the Phong shading, the exponentiation operation of two floating-point numbers is necessary for each point inside a polygon. An approximation algorithm is developed to speed up the exponentiation operation, and it is supported by simple hardware that can be easily merged into a floating-point multiplier. The exponentiation operation takes just 4 cycles in the proposed hardware while it takes about 100-200 cycles in conventional floating-point units. Although an approximation algorithm is employed for the exponentiation operation, the amount of error is so minimal that the difference is virtually indistinguishable.

A7.3 Radix-4 Modular Multiplication and Exponentiation Algorithms for the RSA Public-Key Cryptosystem [p. 565]
Jin-Hua Hong, Cheng-Wen Wu

We propose a radix-4 modular multiplication algorithm based on Montgomery's algorithm, and a radix-4 cellular-array modular multiplier based on Booth's multiplication algorithm. The radix-4 modular multiplier can be used to implement fast RSA cryptosystem. Due to reduced number of iterations and pipelining, our modular multiplier is four times faster than the cellular-array modular multiplier based on the original Montgomery's algorithm. The time to calculate a modular exponentiation is about n 2 clock cycles, where n is the word length, and the clock cycle is roughly equal to the delay time of a full adder. The utilization of the multiplier is 100% by interleaving consecutive exponentiations. Locality, regularity, and modularity make the proposed architecture suitable for VLSI implementation.
Keywords: cellular array, Montgomery algorithm, modular multiplication, high radix modular multiplier, public-key cryptography, RSA.


Session B7: (Special Session) Future of System Level Design Languages

Chair : Masaharu Imai, Osaka Univ., Japan
B7.1 An Introduction to SLDL and Rosetta [p. 571]
Steven E. Schulz, Texas Instruments, Inc., USA

The Rosetta Stone (permanently displayed at the British Museum of Art) has played a pivotal role in enabling scientists to translate across a number of widely varying languages used around the world throughout our history. Through this mapping process, linguists have been able to more fully understand the semantic meaning embodied in the written word of our culture. Similarly, the Rosetta language (as we will refer to it in this text) represents a modern-day effort to map across semantic domains within the electronics-centric systems engineering world, motivated by the trend towards System-Level Integration, or System-on-a-Chip (SoC). Rosetta plays a key role in the System Level Design Language (SLDL) industry standards initiative.

B7.2 SystemC Standard [p. 573]
Guido Arnout

The emergence and great popularity of system-on-chip (SoC) designs has brought with it a variety of suggestions for a single language that can describe all of the functional requirements for those highly complex designs. This paper takes a look at the requirements for system-level design languages and evaluates what it will take for any of these languages to be successful.

B7.3 Java Based Object Oriented Hardware Specification and Synthesis [p. 579]
Tommy Kuhn, Wolfgang Rosenstiel

In this contribution we show how the object oriented programming language Java can be used for the specification of synthesizable hardware. In this context the JavaBeans component model plays an important role. We show an integration of the JavaBeans model which allows the specification of hardware from different views and on different levels of abstraction. Furthermore we point out restrictions of the language necessary to perform high level synthesis.

B7.4 Superlog, A Unified Design Language for System-on-chip [p. 583]
Peter L. Flake, Simon J. Davidmann

The design of systems consisting of custom software controlling custom digital hardware is easier if a single language can be used for system specification, software development, hardware design and hardware verification. Superlog takes features of existing languages for software development and hardware design, adds features for system specification and hardware verification, and blends them into a single, coherent language.


Session C7: Delay Testing and Design-For-Testability

Co-Chairs : Yukiya Miura, Tokyo Metropolitan Univ., Japan
Hiroshi Date, Inst. of Systems & Information Technologies Kyushu, Japan
C7.1 Performance Sensitivity Analysis Using Statistical Method and Its Applications to Delay Testing [p. 587]
Jing-Jia Liou, Angela Krstic, Kwang-Ting Cheng, Deb Aditya Mukherjee, Sandip Kundu

The performance of deep submicron designs can be affected by various parametric variations, manufacturing defects, noise or modeling errors that are all statistical in nature. We propose a statistical framework for analyzing the performance sensitivity of designs to various timing related defects/noise/variations. The core engine of our approach is a highly efficient statistical timing analysis tool. We describe the application of our framework for delay fault modeling and analysis of resistive opens and shorts and as well as interconnect crosstalk. We present experimental results demonstrating the accuracy of our statistical framework as compared to SPICE (for a given set of input patterns) and nominal worst-case analysis. Experimental results for analysis of resistive opens and shorts are also included.

C7.2 A Testability Metric for Path Delay Faults and Its Application [p. 593]
Huan-Chih Tsai, Kwang-Ting Cheng, Vishwani D. Agrawal

In this paper, we propose a new testability metric for path delay faults. The metric is computed efficiently using a non-enumerative algorithm. It has been validated through extensive experiments and the results indicate a strong correlation between the proposed metric and the path delay fault testability of the circuit. We further apply this metric to derive a path delay fault test application scheme for scan-based BIST. The selection of the test scheme is guided by the proposed metric. The experimental results illustrate that the derived test application scheme can achieve a higher path delay fault coverage in scan-based BIST. Because of the effectiveness and efficient computation of this metric, it can be used to derive other design-for-testability techniques for path delay faults.

C7.3 A Non-Scan DFT Method at Register-Transfer Level to Achieve Complete Fault Efficiency [p. 599]
Satoshi Ohtake, Hiroki Wada, Toshimitsu Masuzawa, Hideo Fujiwara

This paper presents a non-scan design-for-testability (DFT) method for VLSIs designed at register-transfer level (RTL) to achieve complete fault efficiency. In RTL design, a VLSI generally consists of a controller and a data path. The controller and the data path are connected with internal signals: control signals and status signals. The proposed method consists of the following two steps. First, we apply our DFT methods [1] and [2, 3] to the controller and the data path, respectively. Then, to support at-speed testing, we append a test plan generator which generates a sequence of test control vectors for the modified data path. Our experimental results show that the proposed method can reduce significantly both of test generation time and test application time compared with the full-scan design, though the hardware overhead of our method is slightly larger than that of the full-scan design.

C7.4 A Sigma-Delta Modulation Based BIST Scheme for Mixed-Signal Circuits [p. 605]
Jiun-Lang Huang, Kwang-Ting Cheng

In this work, we present the analysis of a built-in self-test (BIST) scheme for mixed-signal circuits that is intended to provide on-chip stimulus generation and response analysis. Based on the sigma-delta modulation principle, the proposed scheme can produce high-quality stimuli and obtain accurate measurements without the need of precise analog circuitry. Numerical simulations are conducted to validate our idea and the results show that the scheme is a promising BIST approach for mixed-signal circuits.


Session D7: (Panel Discussion) Industry-Academia Cooperation [p. 611]

Organizer: Tokinori Kozawa, STARC, Japan
Moderator: Hiroto Yasuura, Kyushu Univ., Japan
Panelists:
Bill Joyner (SRC, USA)
Jan Rabaey (Univ. of California, Berkeley, USA)
Ivo Bolson (IMEC, Belgium)
Yoshiaki Masuhara (Hitachi Ltd., Japan)


Session E7: Signal Integrity / Noise Issues in Deep-Submicron

Co-Chairs : Hidetoshi Onodera, Kyoto Univ., Japan
Nobuo Fujii, Tokyo Inst. of Technology, Japan
E7.1 A 12b 50 MHz 3.3V CMOS Acquisition Time Minimized A/D Converter [p. 613]
Young-Deuk Jeon, Byeong-Lyeol Jeon, Seung-Chul Lee, Sang-Min Yoo, Seung-Hoon Lee

A 12b 50 MHz CMOS analog-to-digital converter (ADC) based on a pipelined architecture was designed to demonstrate acquisition time minimization techniques for high-speed two-stage amplifiers. The proposed techniques reduce overshoots and undershoots of amplifier outputs and acquisition time by controlling the bias currents of amplifiers. The prototype ADC was fabricated in a 0.35 um double-poly triple-metal n-well CMOS technology. The measured signal-to-noise-and-distortion ratio is improved by more than 5 dB using the proposed techniques at a 50 MHz clock. The ADC power consumption is 200 mW at 3.3 V and 50 MHz.

E7.2 A Benchmark Suite for Substrate Analysis [p. 617]
Edoardo Charbon, Luis Miguel Silveira, Paolo Miliozzi

The paper proposes an initial benchmark set, suitable for substrate analysis and test. The aim is to help accurately represent electrical noise injected into and picked up from substrate in a variety of high performance circuits. Creating an accurate image of such noise is becoming a critical requirement with the expansion of real plug-and-play style designs. Several important methods for the analysis of substrate parasitic coupling are reviewed in light of the effect substrate noise has on the performance of analog and digital ICs over a wide frequency spectrum. The requirements and formats for each benchmark are described in full detail to allow possible algorithmic as well as signal integrity tests.

E7.3 Embedded Tutorial : Substrate Crosstalk Analysis in Mixed Signal CMOS Integrated Circuits [p. 623]
Makoto Nagata, Atsushi Iwata

Substrate crosstalk modeling and analysis techniques for a reliable SoC oriented mixed signal IC design are outlined. Substrate noise measurements by use of a source follower and a latched comparator clearly indicate that L_di/dt and R_i effects on Vdd/Gnd wirings appear in the substrate. An experimental full chip substrate crosstalk verification system based on a macroscopic substrate noise model efficiently analyzes the analog performance degradation relevant to the digital activity on the same chip.


Session A8: High Speed LSI Design for Entertainment Application

Co-Chairs : Takayasu Sakurai, Univ. of Tokyo, Japan
Makoto Ikeda, Univ. of Tokyo, Japan
A8.1 Invited Talk: Importance of CAD Tools and Methodology in High Speed CPU Design [p. 631]
Haruyuki Tago, Kazuhiro Hashimoto, Nobuyuki Ikumi, Masato Nagamatsu, Masakazu Suzuoki, Yasuyuki Yamamoto

Design methodologies and CAD for "Emotion Engine" LSI are presented with emphasis on practical aspects of verification and timing closure. A combination of simulation, emulation and formal verification ensured the functional first silicon for system evaluation. In order to control wire delay in early design stage, floor-plan based synthesis and wire load estimation are adopted for quick timing closure.

A8.2 300MHz Design Methodology of VU for Emotion Synthesis [p. 635]
Takayuki Kamei, Hideki Takeda, Yukio Ootaguro, Takayoshi Shimazawa, Kazuhiko Tachibana, Shin'ichi Kawakami, Seiji Norimatsu, Fujio Ishihara, Toshinori Sato, Hiroaki Murakami, Nobuhiro Ide, Yukio Endo, Akira Aono, Atsushi Kunimatsu

With advance in silicon technology and system integrating technology, utmost care for RC-delay must be taken in high performance LSI design. This paper describes the new design method applied to the Vector Unit(VU) implemented in the recently announced high-performance 3D-graphics engine, "Emotion Engine". The key features of the design method are the followings:
careful functional design of the VU architecture
design hierarchy consistency between RTL descriptions and floorplan plan
accurate estimation of wire load in pre-layout static timing analysis
With the above design features, the VU is has achieved the operating frequency of 300MHz.

A8.3 Repeater Insertion Method and its Application to a 300MHz 128-bit 2-way Superscalar Microprocessor [p. 641]
Norman Kojima, Yukiko Parameswar, Christian Klingner, Yukio Ohtaguro, Masataka Matsui, Shigeaki Iwasa, Tatsuo Teruyama, Takayoshi Shimazawa, Hideki Takeda, Kouji Hashizume, Haruyuki Tago, Masaaki Yamada

Interconnect delay is dominant in today's high speed VLSI circuits and there have been various works to resolve it [1],[2],[3]. A repeater insertion tool "RePertory" has been newly developed to solve the interconnect timing problem on a 300MHz 128-bit 2-way Superscalar Microprocessor. Because of its practical simple algorithm, the location of over 5700 repeaters distributed over 5 modules is calculated in 26 minutes on 5 UltraSPARC-II 360MHz in parallel. This paper describes the goals of RePertory for this project, the timing improvement flow with RePertory, the results and the analysis of this repeater insertion method.

A8.4 Clock Design of 300MHz 128-bit 2-way Superscalar Microprocessor [p. 647]
Fujio Ishihara, Christian Klingner, Ken-ichi Agawa

Less than 116ps overall clock skew has been achieved across the 15.02mm x 15.04mm die by balanced clock path routing and differential clock signal distribution in the global clock tree of 300 MHz 128-bit 2-way superscalar microprocessor. The shared clock wire configuration and clock buffer layout patterns over the whole die enhance the clock skew insensitivity to process fluctuation. Combination of three different clock tuning methods is successfully applied to the entire clock tree and the clock skew is minimized efficiently within a limited design period.


Session B8: Panel Discussion

One Language or More? -How Can We Design an SoC at a System Level? - [p. 653] Organizer: Masaharu Imai, Osaka Univ., Japan
Moderator: Gary Smith, Dataquest, USA
Panelists:
Steven Schulz - Texas Instruments, Inc., USA (SLDL)
Karen Bartleson - Synopsys, Inc., USA (SystemC)
Daniel D. Gajski - Univ. of Calfornia, Irvine, USA (SpecC)
Wolfgang Rosenstiel - Univ. of Tuebingen, Germany (Java)
Peter Flake - Co-Design Automation, Inc., USA (Superlog)
Hiroto Yasuura - Kyushu Univ., Japan (C,C++)
Masaharu Imai - Osaka Univ., Japan(VHDL)

The emerging Core Based Design Methodology is forcing engineers up into the Electronic System level of design. It is only there that they can make the hardware and software trade-offs needed for optimal design performance. Unfortunately, as we move up into ES Level Design, our two hardware description languages, VHDL and Verilog, are running out of steam. The design language dilemma is, "Which language will fill our need for a new ESL language ?" Some start-ups are pushing alternate hardware languages, while others are approaching the problem with modifications of software languages. The modifications are necessary, as software languages are sequential languages. A hardware design language must be concurrent. This panel will discuss the merits of various languages and various methodologies being presented to solve this problem.


Session D8: High Performance Partitioning

Co-Chairs : Chung-Kuan Cheng, Univ. of California, San Diego, USA
Toshiyuki Shibuya, Fujitsu Laboratories, Ltd., Japan
D8.1 Circuit Partitioning with Coupled Logic Restructuring Techniques [p. 655]
Yu-Liang Wu, Xiao-Long Yuan, David Ihsin Cheng

Traditionally, the circuit partitioning is done by modeling the circuit as a graph and the partitioning is carried out without altering the circuit itself. Applying the technique of circuit rewiring, the partitioning can be further improved by doing some logic logic perturbation along the cut-line to drag the solution out of some local minimal. In this paper, we propose an effective coupling scheme of two powerful rewiring techniques to further improve upon those already selected best partition results produced by other conventional partition tools. The improvement is attributed to the additional capability of exercising a guided circuit perturbation, which was not available in the conventional schemes. The known ATPG-based and the recently proposed graph-based rewiring techniques compose our coupling scheme. The ATPG-based rewiring technique is very flexible; however the graph-based rewiring technique is faster and can couple quite well with the ATPG-based scheme to exploit a much larger room for logic perturbations. Our encouraging experiment results show that these two techniques couple each other quite well for this application without costing much CPU overhead. This scheme is also quite efficient thus should be very useful for large circuits.

D8.2 Improved Algorithms for Hypergraph Bipartitioning [p. 661]
Andrew E. Caldwell, Andrew B. Kahng, Igor L. Markov

Multilevel Fiduccia-Mattheyses (MLFM) hypergraph partitioning [3, 22, 24] is a fundamental optimization in VLSI CAD physical design. The leading implementation, hMetis [23], has since 1997 proved itself substantially superior in both runtime and solution quality to even very recent works (e.g., [13, 17, 25]). In this work, we present two sets of results: (i) new techniques for at FM-based hypergraph partitioning (which is the core of multilevel implementations), and (ii) a new multilevel implementation that offers leading-edge performance. Our new techniques for at partitioning confirm the conjecture from [10], suggesting that specialized partitioning heuristics may be able to actively exploit fixed nodes in partitioning instances arising in the driving top-down placement context. Our FM variant is competitive with traditional FM on instances without terminals [1] and considerably superior on instances with fixed nodes (i.e., arising during top-down placement [8]). Our multilevel FM variant avoids several complex heuristics and non-trivial tunings that often lead to complex implementations; it achieves trade-offs between solution quality and run time that are comparable or better than those achieved by hMetis-1.5.3 (the latest available version). Following [6], we attempt to provide algorithm descriptions that are as detailed and unambiguous as possible, to allow replicability and speed improvements in future research.

D8.3 Multi-way Partitioning Using Bi-partition Heuristics [p. 667]
Maogang Wang, Sung Lim, Jason Cong, Majid Sarrafzadeh

The multi-way partition problem is very important in various applications. In this paper, we use analytical and experimental results to study the k -way partition problem. We introduce the concept of embedding graph for the the k -way partition problem. Based on this concept, we explain different scenarios of using a bi-partition heuristic to solve the k -way partition problem. If C denote the optimal cut cost for the k -way partition problem and the bi-partition heuristics we use are %delta;-approzimation heuristics (defined in Section 2), we prove that the cut cost from the hierarchical approach has an approximate upper bound of %delta;C_log k while the cut cost from the all-way bipartition, or at approach, has an upper bound of %delta;Ck . This is contrary to some claims made in recent literature ( and CAD tools designed based on it). Experimental results strongly support our theoretical analysis. Our results show that for large target graph, the hierarchical approach is about 77% better than the single-pass all-way bi-partition approach. The all-way bi-partition approach will perform better in a multi pass set-up. However, the hierarchical approach is still on average 7 : 1% better in quality and 144 times faster than the multi-partition all-way bi-partition approach. The main conclusion of this paper is, contradicted to what has been suggested in literature, hierarchical bi-partitioning is a more effective multi-way partitioning scheme.