| |
ASP-DAC 2000 Abstracts
Sessions:
[A1]
[B1]
[C1]
[D1]
[A2]
[B2]
[C2]
[D2]
[E2]
[A3]
[B3]
[C3]
[D3]
[E3]
[A4]
[B4]
[C4]
[D4]
[E4]
[A5]
[B5]
[C5]
[D5]
[E5]
[A6]
[B6]
[C6]
[D6]
[E6]
[A7]
[B7]
[C7]
[D7]
[E7]
[A8]
[B8]
[D8]
Co-Chairs : Ryota Kasai, NTT, Japan
Anantha Chandrakasan, MIT, USA
-
A1.1 A VLSI Implementation of the Blowfish Encryption/Decryption Algorithm [p. 1]
- Michael C.-J. Lin, Youn-L. Lin
We propose an efficient hardware architecture for
the Blowfish algorithm [1]. The speed is up to 4 bit/clock,
which is 9 times faster than a Pentium. By applying
operator-rescheduling method, the critical path delay is
improved by 21.7%. We have successfully implemented it
using Compass cell library targeted at a 0.6 mm TSMC
SPTM CMOS process. The die size is 5.7 x 6.1 mm2
and the maximum frequency is 50MHz.
-
A1.2 VLSI Implementation of Rake Receiver for IS-95 CDMA Testbed using FPGA [p. 3]
- Oliver Y-h. Leung, Chi-Y.Tsui, Roger S. Cheng
In this work, an implementation of a time-multiplexed
downlink Rake receiver complied with the IS-95 CDMA standard
is presented. A low power architecture of the Rake receiver
is implemented. A structure which provides the offset
changing for the pseudo-random sequence (PN sequence) used
for despreading of the CDMA signals is discussed. Architecture
for the efficient time multiplexing of the Rake fingers is
also presented. The design was implemented using Xilinx FPGA.
It was tested to be functionally correct and the performance
was complied with IS-95.
-
A1.3 VLSI Implementation of a Switch Fabric for Mixed ATM and IP Traffic [p. 5]
- Chi-Y. Tsui, Louis Chung-Y. Kwan, Chin-T. Lea
A VLSI implementation of a multistage self-routing
ATM switch fabric is presented. The size of the
switch is 16x16 and can handle the OC-12 (622 Mbps) link
rate. Based on a bit-slice architecture, the entire 16x16
switch is implemented using four identical chips. The switch
has multiple paths, created by a randomizer in front of the
routing stages, between each input-output pair. The switch
uses an input/output-buffering scheme and contains no
buffers inside the fabric. A priority structure, which
supports four levels, allows the delay sensitive ATM cells to
be switched with the shortest latency. It also enables the
non-interleaving routing scheme of IP cells. The switch
fabric was designed and fabricated using MOSIS 0.8µm
technology and was tested to run at 93MHz with 3.3V supply
voltage.
-
A1.4 Design of Digital Neural Cell Scheduler for Intelligent IB-ATM Switch [p. 7]
- J.-K. Lee, S.-M. Lee, M.M.-O.Lee, D.-W. Lee, Y.-C. Kim, S.-J. Jeong
We present the architecture of the ATM banyan
switch composed of pattern process and high-speed digital
neural cell scheduler. An input buffer type ATM switch with a
window-based contention algorithm is proposed, modeling in
VHDL for high-speed cell scheduler of ATM switching. A digital
hopfield neural cell scheduler which has the ability of real-time
processing is used to solve loss of throughput due to head-of-line(
HOL) and internal blocking when FIFO queueing is
employed at the Banyan network. In this scheduler, it is found
we can minimize the delay for scheduling and select non-blocking
cells in maximum leading to high performance. Our
proposed ATM switch is modeled in VHDL, synthesized,
implemented into a FPGA chip set and fabricated using 0.6¤ Š
CMOS technology.
-
A1.5 Genetic Algorithm Accelerator GAA-II [p. 9]
- Shin'ichi Wakabayashi, Tetsuhi Koide, Nayoshi Toshine, Masataka Yamane,
Hajime Ueno
We have developed a new GA hardware called GAA-I(Genetic Algorithm
Accelerator-I), in which the crossover operator to be applied to each
individual was dynamically selected during the algorithm execution,
GAA-I has some restrictions due to the limited chip size. In this
paper, we extend the GAA-I and propose a new GA hardware, GAA-II,
so that large, complex optimization problems can be solved. Furthermore,
GAA-II has capability of parallel processing with other GAA-II chips.
The GAA-II chip has been fabricated as a CMOS standard cell chip with
0.6 technology.
-
A1.6 A Programmable Built-In Self-Test Core for Embedded Memories [p. 11]
- Chih-T. Huang, Jing-R. Huang, Cheng-W. Wu
Testing embedded memories is becoming an
industry-wide concern with the advent of deep-submicron technology
and system-on-chip applications. We present a prototype
chip for a programmable built-in self-test (BIST) design that is
used for testing embedded memories, especially DRAMs. The
BIST chip supports various memory test algorithms by a novel
controller and sequencer design. The area of the core circuit is
about 1,020 x 1,020 um 2 using a 0:6um CMOS
process, and the clock speed is over 100MHz.
-
A1.7 An Algorithm for VLSI Implementation of Highly Efficient Cubic-Polynomial Evaluation [p. 13]
- Fan Mo, Yihua Zhang, Jun Yu, Qianling Zhang
In this paper, we present a novel cubic-polynomial evaluation algorithm.
It is suitable for VLSI implementation and the computational cost is
reduced to about 66% of the previously reported method.
-
A1.8 Design of Self-timed Asynchronous Boothās multiplier [p. 15]
- Tin-Y. Tang, Chin-S. Choy, Pui-L. Siu, Cheong-F. Chan
This paper presents a design of multiplier for the
multiplication of two 8-bit two-complement numbers.
The multiplier applies the self-timed asynchronous
methodology such that the multiplier can be assumed
to operate on average case delay. And also, modified
booth's algorithm [1] is used to reduce the number of
partial product generated. As a result, the speed of
the multiplier can be improved.
-
A1.9 High Speed and Ultra-Low Power 16x16 MAC Design using TG Techniques
for Web-based Multimedia System [p. 17]
- Seung-M. Lee, Jin-H. Chung, Hying-S. Yoon, Mike M-O. Lee
In this paper a study has been presented on
High Speed(HS) and 79mW Low Power(LP) 16X16 MAC
performance of new XOR-Based circuits using transmission
gate logic(TG) implemented on 0.65um CMOS DLP/DLM
technology. It is shown that our proposed MAC results in better
performance than other published MACs due to no DC leakage
currents for low power and bypassing unnecessary switching
activities with latches before and after multiplier for high
speed.
-
A1.10 A Smart Imager for the Vision Processing Front-END [p. 19]
- Noriaki Takeda, Mituru Homma, Makoto Nagata, Takashi Morie, Atsushi Iwata
A CMOS PWM imager which realizes block summation
and 2D projection of a thresholded image, in addition to row-parallel
PWM readout with gray scale, is reported. An imager
including 56 x 56 pixels, an address signal generator, and a signal
processing circuit is fabricated in a 6mm x 6mm chip with a 0.8um
CMOS technology.
-
A1.11 A Binary Image Sensor with Flexible Motion Vector Detection using Block
Matching Method [p. 21]
- Tomohiro Nezuka, Takafumi Fujita, Makoto Ikeda, Kunihiro Asada
A binary image sensor with motion vector detection has been
developed and successfully
tested. The sensor acquires binary images and detects
motion vectors using block matching method. The
size of matching blocks and the search area of motion
vectors are variable. The block matching is realized by
a 2-D shift-register, XOR circuits and block current-sum
circuits. The sensor integrates 32x32 pixels on a
7.3mmx7.3mm die in a 1.2um CMOS 2-Metal 2-poly-Si
process.
-
A1.12 An Arbitrary Chaos Generator Core Circuit Using PWM/PPM Signals [p. 23]
- Kenichi Murakoshi, Takashi Morie, Makoto Nagata, Atushi Iwata
A compact chaos generator core circuit has been
designed, fabricated, and tested. It generates PWM and PPM
(pulse width and phase modulation) signals that follow arbitrary
nonlinear analog dynamics. The measurement results show that
PWM/PPM chaotic signals are generated at 1 MHz with a calculation
precision of 6.7 bits at a supply voltage of 3.3 V. The circuit
can be used for large-scale integrated noise sources or high-performance
advanced neural networks using nonlinear analog
dynamics.
-
A1.13 An Application Specific Java Processor with Reconfigurabilities [p. 25]
- Shinji Kimura, Hiroyuki Kida, Kazuyoshi Takagi, Tatsumori Abematsu, Katsumasa
Watanabe
The paper presents an application specific Java
processor including reconfigurabilities, which is a DLX like
pipeline processor with 5 stages and executes Java byte codes directly.
Reconfigurabilities are the key technologies for application
specific operations. We have introduced two reconfigurabilities:
(1) a mechanism to override the control signals for a specific instruction,
(2) external components can be attached with the same
input and output ports as the internal ALU.
-
A1.14 Reconfigurable Synchronized Dataflow Processor [p. 27]
- Hiroshi Sasaki, Hitoshi Maruyama, Hideaki Tsukioka, Nobuyoshi Shoji, Hiroaki
Kobayashi, Tadao Nakamura
This paper describes the design and implementation of a reconfigurable
synchronized dataflow processor (RSDP). The RSDP can configure its
hardware to directly represent dataflow graphs (DFGs) of applications.
Data are processed while they flow along application-specific datapaths
in the RSDP. We have designed three DFGs for benchmarking and evaluated
their performance on an RSDP board. The results show that the RSDP
running at relatively lower frequency can achieve a competitive performance
with a general-purpose processor.
-
A1.15 Prototype Microprocessor LSI with Scheduling Support Hardware for
Operating System on Multiprocessor System [p. 29]
- Naoki Nishimura, Takahiro Sasaki, Tetsuo Hironaka
In this paper we describe the details of SSH (Scheduling
Support Hardware) which makes it possible to use fine grain
parallelism effectively in multiprocessor systems. The SSH supports
fast and concurrent thread scheduling that is very important
in fine grain parallel processing.
-
A1.16 A Floating Point Arithmetic Unit for a Static Scheduling and Compiler
Oriented Multiprocessor System ASCA [p. 31]
- Takahiro Kawaguchi, Takayuki Suzuki, Hideharu Amano
We describe a floating point arithmetic unit (FPU) which
supports static scheduling by automatic parallelizing
compiler. This FPU designed to work with 50MHz clock
with the assistance of EDA synthesis and layout tools.
Under the clock rate condition, it appears that this FPU
requires about 120,000 gates and marks 8.2 MFLOPS with the
clock level simulation.
-
A1.17 A 16-bit Redundant Binary Multiplier Using Low-Power Pass-Transistor
Logic SPL [p. 33]
- Hirofumi Sakamoto, Ken'ichiro Uda, Bu-Y. Lee, Hiroyuki Ochi, Kazuo Taki, Takao Tsuda
We have designed a 16-bit redundant
binary multiplier using pass-transistor logic SPL on a
0.35 um technology. Number of transistors is 12,349,
and area is 1,322 um x 332 um. Measured power dissipation
and maximum delay at T=25 C and VDD =3.3
V are 33.7 mW/100 MHz and 7.4 nsec, respectively.
-
A1.18 An 8x8 nRERL Serial Multiplier for Ultra-Low-Power Applications [p. 35]
- Joonho Lim, Dong-G. Kim, Sang-C. Kang, Soo-I. Chae
An 8 x 8-b nRERL serial multiplier is
implemented in a 0.6-um n-well 3-metal CMOS process.
nERL (nMOS Reversible Energy Recovery
Logic) is a new reversible adiabatic logic circuit, which
can be operated at the leakage-current level for ultra-low-energy
applications. Measurement results showed
that the nRERL serial multiplier consumed only 0.9
% of the energy dissipation of the static CMOS one
at the operating frequency 100 kHz at 5V, where its
adiabatic and leakage losses were about equal.
Co-Chairs : Graham R. Hellestrand, VaST Systems Technology Corp., USA
Minoru Yamamoto, Fujitsu Ltd., Japan
-
B1.1 Embedded Tutorial: Essential Issues for IP Reuse [p. 37]
-
Daniel D. Gajski,
Allen C.-H. Wu,
Viraphol Chaiyakul,
Shojiro Mori,
Tom Nukiyama,
Pierre Bricaud
With widespread recent emphasis on System-On-a-Chip (SOC),
IP reuse has emerged as a vital
and growing business in semiconductor industry.
In this paper, we will address essential issues for
IP reuse by discussing current challenges to the
success of IP businesses and identifying the
obstacles that need to be overcome.
-
B1.2 Usage-Based Characterization of Complex Functional Blocks for Reuse in
Behavioral Synthesis [p. 43]
- Nong Fan, Viraphol Chaiyakul, Daniel D. Gajski
This paper presents a novel usage-based characterization
method to capture pre-designed complex functional blocks
for automatic reuse in behavioral synthesis. We identify attributes
necessary for reuse of such complex components and illustrate
how the attributes are captured into a design database. A complex
component MTX_MULT 8X8, which computes product of two
8 x 8 matrices, is captured with the proposed method, and its
reuse in behavioral synthesis is demonstrated with design of a
DCT example. Feasibility of the method for capturing various
components is demonstrated as well.
-
B1.3 Reuse and Protection of Intellectual Property in the SpecC System [p. 49]
- Rainer Dömer, Daniel D. Gajski
In system-level design, the key to cope with the complexities
involved with System-on-Chip (SOC) designs, is the reuse of Intellectual
Property (IP). With the increasing demand for IP, the
mechanism to protect an IP component from being copied, modified,
or reverse-engineered, becomes very important. This paper
describes how reuse and protection of IP is supported by the
SpecC language and the SpecC design environment.
-
B1.4 Fair Watermarking Techniques [p. 55]
- Gang Qu, Jennifer L. Wong, Miodrag Potkonjak
Many intellectual property protection (IPP) techniques
have been proposed. Their primary objectives are providing
convincible proof of authorship with least degradation
of the quality of the intellectual property (IP),
and achieving robustness against attacks. These are
also well accepted as the most important criteria to
evaluate different IPP techniques. The essence of such
techniques is to limit the solution space by embedding
signatures as constraints. One key issue that should be
addressed but has not been discussed is the fairness
of the techniques: what is the quality of the solution
subspace for different signatures, that is, how large the
solution subspace is (uniqueness), and how difficulty it
is to get a solution from such subspace (hardness)? In
this paper, we introduce fairness as one of the metrics
for good IPP techniques and post the challenge problem
of how to design fair watermarking techniques.
We claim that all fair techniques have to be instance-oriented
and due to the complexity of the problem itself,
we propose an approach that utilizes the statistical
information of the problem instance. We use the
satisfiability (SAT) problem as an example to illustrate
how fairness could be achieved. We make the observation
that the unfairness of the previous watermarking
techniques comes from the global embedding of the
signature and propose fair watermarking techniques.
We test the uniqueness and hardness on a model with
full knowledge of the solution and real life benchmarks
as well. The experimental results show fairness can be
achieved.
Co-Chairs : Yirng-An Chen, National Chiao Tung Univ., Taiwan
Kiyoharu Hamaguchi, Osaka Univ., Japan
-
C1.1 An Efficient Heuristic for State Encoding Minimizing the BDD Representations of the
Transition Relations of Finite State Machines [p. 61]
- Riccardo Forth, Paul Molitor
An efficient representation of the (smoothed) transition
relation of a synchronous finite state machine (FSM) speeds
up the traversal based symbolic verification and the functional
simulation of the FSM. When using reduced ordered binary decision
diagrams (BDDs), dynamic reordering algorithms are applied
in order to keep the sizes of the BDDs tractable. However,
when FSMs are represented by BDDs, the state encoding can be
used as an additional optimization criteria. In this paper, we
present a new algorithm for state encoding of FSMs that minimizes
the BDD representations of the corresponding (smoothed)
transition relations. Experimental results show the approach to
be very efficient.
-
C1.2 Automatic Partitioning for Efficient Combinational Verification [p. 67]
- Rajarshi Mukherjee, Jawahar Jain, Koichiro Takayama, Masahiro Fujita
A majority of the state-of-the-art combinational
verification techniques are based on the extraction
and use of internal equivalences between two circuits.
Verification can become difficult if the two
circuits have none or very few internal correspondences.
In this paper we investigate automatic circuit
partitioning as a methodology to make otherwise
intractable circuits relatively tractable to the
verifier. We show that given any two circuits to be
verified, finding the best partitions that minimize
the verification runtime is NP-hard. Therefore, we
propose efficient heuristics to utilize certain characteristics
of typical circuit design styles to find good
partitions for the circuits. A key difference between
our approach and earlier approaches to circuit partitioning
is that ours is fully automated and does
not require any prior knowledge of the type of function
being implemented by the circuit. Using circuit
partitioning we are able to verify several hard industrial
circuits that could not be verified otherwise.
-
C1.3 A Hardware Simulation Engine Based on Decision Diagrams (Short Paper) [p. 73]
- Yukihiro Iguchi, Tsutomu Sasao, Munehiro Matsuura, Atsumu Iseno
A hardware logic simulation engine based
on decision diagrams is presented. For the data structure
of the engine, we propose PMDDs (Paged reduced
ordered Multi-valued Decision Diagrams). A unit of this
engine consists of memory (RAMs) and control circuits
: RAMs store the PMDD data, and the control circuits
trace the edges according to the input vectors. The engine
consists of several units, and is accelerated by pipelining.
Experimental results using a prototype are shown.
Keywords : Logic simulation, Hardware simulation
engine, BDD, MDD, PMDD.
-
C1.4 Formal Verification based on Assume and Guarantee Approach - A Case Study
(Short Paper) [p. 77]
- Subir K. Roy, Hiroaki Iwashita, Tsuneo Nakata
In this paper, we present a verification case study of
the Audio Output Interface (AOF) sub-system based on
a compositional verification approach known as the
Assume-Guarantee approach. We illustrate, how designers
can leverage their detailed knowledge of a design to
partition it at the module level, and thereby, enable
the Assume-Guarantee approach to overcome intrinsic
limitations of a formal verification tool to successfully
verify large designs.
-
C1.5 Multi-Clock Path Analysis Using Propositional Satisfiability [p. 81]
- Kazuhiro Nakamura, Shinji Maruoka, Shinji Kimura, Katsumasa Watanabe
We present a satisfiability based multi-clock with analysis
method. The method uses propositional satisfiability (SAT)
in the direction of multi-clock paths. We show a method to
reduce the multi-clock path detection problems to SAT problems.
We also show heuristics on the conversion from multi-level circuits
into CNF formulae. We have applied out method to ISCAS89 benchmarks
and other sample circuits. Experimental results show the improvement
on the manipulatable size of circuits by using SAT.
Co-Chairs : Jason Cong, Univ. of California, Los Angeles, USA
Takumi Okamoto, NEC Corp., Japan
-
D1.1 Self-Reforming Routing for Stochastic Search in VLSI Interconnection Layout [p. 87]
- Yukiko Kubo, Yasuhiro Takashima, Shigetoshi Nakatake, Yoji Kajitani
Given a route which connects terminals on a one-layer
routing area(Steiner tree), ip is a
procedure that makes a current route change its configuration
within its peripheral domain. A ip reforms
a route by replacing one of its edges with a minimal
detour. A route can ip one nearby obstacle. If the
obstacle is another route, a more organized operation,
called the dual ip, is applied to a pair of routes. The
idea is enhanced to the 2-layer hv-routing. The performance
of ip and dual ip was experimented in
simulated annealing which reforms a route or a set of
routes with respect to the evaluation function of multiple
objectives. Some unique and satisfiable results
were observed.
-
D1.2 An Interconnect Topology Optimization by a Tree Transformation [p. 93]
- Naofumi Tsujii, Katsutoshi Baba, Shuji Tsukiyama
Since the interconnect delay has become the dominating factor in circuit
performance, demands for a good delay-minimization router are very high.
In this paper, we propose two algorithms to find an interconnect tree
of a net which minimizes a weighted sum T of delays to all sinks, where
the weight assigned to a sink represents a criticality of the delay to
the sink. The algorithms start from a Steiner tree and repeat a tree
transformation as monitoring the change of T. We also show some
experimental results to evaluate the performance of the algorithms.
-
D1.3 Timing-Driven Hierarchical Global Routing with Wire-Sizing and
Buffer-Insertion for VLSI with Multi-Routing-Layer [p. 99]
- Takahiro Deguchi, Tetsushi Koide, Shin'ichi Wakabayashi
In the high performance VLSI with multi-layer lay-out
model, the complexity of the global routing problem becomes
much high under timing constraints. This paper presents a hierarchical
global routing method based on a multi-layer routing
model for the high performance standard cell layout. In each hierarchical
level, the routes of nets are determined by solving a
linear programming problem considering wire-sizing and buffer-insertion
under timing constraints. We have implemented the proposed
method on a workstation and showed the effectiveness of the
method from experimental results.
-
D1.4 Area Routing Oriented Hierarchical Corner Stitching with Partial Bin [p. 105]
- Zhang Yan, Wang Baohua, Cai Yici, Hong Xianlong
An efficient layout data structure is of great importance to a gridless area
routing algorithm. Nowadays, the rectangular corner stitching structure,
whose point searching and tile insertion are O(N1/2), is the
most popular data structure used by gridless area routers. In this paper,
we discuss a novel database - hierarchical corner stitching with partial
bin, or hierarchical PB corner stitching, which combined the bin-based
structure and the trapezoidal corner stitching. Its point searching
and tile insertion are both enhanced to O(N1/2/r. We also
derive the algorithms of the operations, such as point searching, area
searching, plowing, etc., according to the specialties of area routers.
Chair : Miodrag Potkonjak, Univ. of California, Los Angeles, USA
-
A2.1 Offline Program Re-mapping to Improve Branch Prediction Efficiency in
Embedded Systems [p. 111]
- Stephen S. Brown, Jeet Asher, William H. Mangione-Smith
This work presents a technique for improving the efficiency of hardware branch
predictors. The key approach is to apply techniques of off-line re-mapping of
the program-space in order to reduce the incidence of conflict misses in the
branch hardware. This work also presents a new model for organizing temporal
information between blocks in the address space, which can be applied
effectively to previous re-mapping systems as well. The increased efficiency
can be translated to improved performance for fixed hardware specifications,
or used to reduce the hardware cost for achieving targeted performance
during the design cycle.
-
A2.2 Timing Driven Co-design of Networked Embedded Systems [p. 117]
- Dinesh Ramanathan, Ravindra Jejurikar, Rajesh K. Gupta
Advances in microelectronics integration have led to emergence of tightly
integrated systems with high performance network interfaces. Design of such
systems especially for single chip implementation is a delicate balance
of functionality and available time budget to perform the tasks. Computer-aided
design tools and methodologies are needed to ensure the correctness of the
design and efficiency of the design process, especially for networked systems
that have strict timing requirements both due to technology as well as
networking needs. We present an overview of a timing-driven design
methodology for networked systems, developed at the University of
California, Irvine.
-
A2.3 Low-power Design Methodology and Applications utilizing Dual Supply
Voltages [p. 123]
- Kimiyoshi Usami, Mutsunori Igarashi
This paper describes a gate-level power
minimization methodology using dual supply voltages.
Gates and flip-flops off the critical paths are made to
operate at the reduced supply voltage to save power.
Core technologies are dual-V DD circuit synthesis and
P&R. We give a brief overview on existing low-power
EDA technologies as background and discuss advantages
and challenges of the dual-VDD approach. Through real
design examples, we will show that the approach reduces
power effectively while keeping the performance at negligible
area overhead.
-
A2.4 Co-Synthesis with custom ASICs [p. 129]
- Yuan Xie, Wayne Wolf
This paper introduces the first hardware/software
co-synthesis algorithm that optimizes the implementations of
ASICs that are used as processing elements for the embedded
systems. Many real time embedded systems are composed of
heterogeneous processing elements, such as general purpose
CPUs, ASICs and FPGAs. Previous work has not considered
how to select one of several possible ASIC implementations for
a specific task. We have developed a heuristic iterative
improvement algorithm for distributed embedded system co-synthesis.
We use Monet, a behavioral level architectural
exploration system, to generate multiple implementations of a
behavioral description of an ASIC and to analyze their
performance. To the best of our knowledge, this is the first co-synthesis
algorithm that takes into account the impact of
different ASIC implementations of tasks on system
performance and cost in the co-synthesis process.
Co-Chairs : Akira Matsuzawa, Matsushita Electrical Industrial, Co. Ltd., Japan
Masato Edahiro, NEC Corp., Japan
-
B2.1 A New Method for Constructing IP Level Power Model Based on Power
Sensitivity [p. 135]
- Heng-Liang Huang, Jiing-Yuan Lin, Wen-Zen Shen, Jing-Yang Jou
This paper proposes a nominal point selection method for IP (Intellectual
Property) level power model based on power sensitivity. By analyzing
the relationship between the dynamic power consumption of CMOS circuits
and their input signal statistics, three nominal points are efficiently
selected to construct a power model based on power sensitivity. Our
experimental results on a number of benchmark circuits show the effectiveness
of the proposed method. Estimation accuracy within 5.78% of transistor
level simulations is achieved.
-
B2.2 A hybrid approach for core-based system-level power modeling [p. 141]
- Tony Givargis, Frank Vahid, Jörg Henkel
Reducing power consumption has become a key goal for system-on-a-chip
(SOC) designs. Fast and accurate power estimation is
needed early in the design process, since power reduction methods
tend to have greater impact at higher abstraction levels. Unfortunately,
current approaches to power estimation, which concentrate
on register-transfer-level models or lower, are quite slow. Higher-level
approaches, while faster, may suffer from inaccuracy. However,
the advent of cores enables a hybrid approach, described in
this paper, yielding both fast and accurate estimates from high-level
models. In particular, we use power estimation data obtained from
the gate-level for a core's representative input stimuli data (instructions),
and we propagate this data to a higher (object-oriented)
system-level model, which is parameterizable and executable. Depending
on the kind of cores, various parameterizable equation or
look-up table based techniques are used, resulting in self-analyzing
core models. We have applied our technique to several cores of a
digital camera SOC and have achieved simulation speedups of over
1000 with accuracies suitable for making reliable power-related
system-level design decisions. Although we focus on power estimation,
our approach can be used for estimating other metrics as well,
such as performance and size.
-
B2.3 Voltage Reduction of Application-Specific Heterogeneous Multiprocessor
Systems for Power Minimisation [p. 147]
- Allan Rae, Sri Parameswaran
We present a design strategy to reduce power
demands in application-specific, heterogeneous multiprocessor
systems with interdependent subtasks. This power
reduction scheme can be used with a randomised search
such as a genetic algorithm where multiple trial solutions
are tested. The scheme is applied to each trial solution after
allocation and scheduling have been performed. Power
savings are achieved by equally expanding each processor's
execution time with a corresponding reduction in their respective
operating voltage. Lowest cost solutions achieve
average reductions of 24% while minimum power solutions
average 58%.
-
B2.4 Withdrawn
-
B2.5 Synthesis of Low Power Folded Programmable Coefficient FIR Digital Filters (Short Paper) [p. 153]
- Vijay Sundararajan, Keshab K. Parhi
A novel low-power synthesis technique is presented for the design of
folded or time-multiplexed programmable-coefficient FIR filters where
storage area is traded-off for lowering power consumption. A systematic
technique is developed for low power mapping of FIR filters to architectures
with arbitrary number of multipliers and adders. Power consumed in
multipliers is reduced by reducing switching activity at both the data-input
as well as the coefficient input. Common input operands can be exposed by
unfolding, which, however leads to a memory increase. Simulation results
are obtained for folding 65 and 129 tap bandpass FIR filters. The average
power consumed in a multiplier for a fixed number of hardware multipliers
with varying unfolding factors is compared. It is observed that most of the
gains due to unfolding are obtained for relatively small unfolding factors
and therefore for relatively small memory area overhead. Depending on
the unfolding factor employed the average power consumed in a multiplier
is seen to reduce anywhere from 54.75% to 81.73% when transpose FIR
filters are synthesized as opposed to synthesizing direct-form FIR filters
with no unfolding.
Co-Chairs : Shinji Kimura, Nara Inst. of Science and Technology, Japan
Kazutoshi Wakabayashi, NEC Corp., Japan
-
C2.1 Invited Talk : Synthesis Challenges for Next-Generation High-Performance and High-Density PLDs [p. 157]
- Jason Cong, Songjie Xu
Rapid evolution of programmable logic device (PLD) architectures and the
increasingly demanding requirements to meet with high-density and high-performance PLD designs call for a new generation of PLD synthesis tools to support many
important capabilities not available in the existing synthesis tools. In this
paper, we analyze the synthesis challenges and opportunities in supporting
next-generation high-performance and high-density PLDs. We classify these
challenges into two categories: (i) those 2 from the advance of PLD
architectures, including synthesis for hierarchical architectures and synthesis
for heterogenous architectures, and (ii) those driven by the requirements for
high-density and high-performance PLD designs, including layout-driven
synthesis, incremental synthesis, and IP-based synthesis. We shall discuss
existing and/or potential solutions to these problems and outline research
opportunities and directions for the development of next-generation PLD
synthesis systems.
-
C2.2 KressArray Xplorer: A New CAD Environment to Optimize
Reconfigurable Datapath Array Architectures [p. 163]
- Reiner Hartenstein, Michael Herz, Thomas Hoffmann, Ulrich Nageldinger
This paper introduces a CAD environment for
design-space exploration of coarse grain reconfigurable KressArray
architectures and similar platforms. To find an optimal solution to a
given application domain this KressArray Xplorer supports experimenting
with different architectures. An estimator analyses the
input and suggests an architecture onto which then the application
is mapped using an heuristic algorithm. A graphic editor supports
user interaction. The system generates performance analysis data,
which is used by an automatic tool to refine the result iteratively.
-
C2.3 Hardware-Software Cosynthesis for Run-time Incrementally Reconfigurable FPGAs [p. 169]
- Byungil Jeong, Sungjoo Yoo, Sunghyun Lee, Kiyoung Choi
This paper presents a method for hardware-software
cosynthesis with run-time incrementally reconfigurable
FPGAs. To reduce the run-time overhead of reconfiguring FPGAs,
we present a concept called early partial reconfiguration (EPR)
which minimizes the overhead by performing reconfiguration for
an operation (or a task in our terms) mapped to an FPGA as
early as possible so that the operation is ready to start
when its execution is requested. For further reduction
of the overhead, we integrate the incremental reconfiguration
(IR) of FPGAs with the EPR concept. We
present an ILP formulation and an efficient heuristic
algorithm based on the EPR and IR concepts. Experiments
on embedded system examples and synthetic
examples show the efficiency of the proposed method.
Co-Chairs : Yu-Liang Wu, Chinese University of Hong Kong, China
Hiroshi Murata, Microark, Japan
-
D2.1 A New Encoding Scheme for Rectangle Packing Problem [p. 175]
- Toshihiko Takahashi
In the rectangle packing problem, encoding schemes to
represent the placements of rectangles are the key factors of
efficient algorithms. In 1995, an epoch-making encoding scheme,
known as SEQ-PAIR, was developed[2]. The solution space of
SEQ-PAIR has been considered sufficiently small. In
this paper, however, we present a simple and natural
encoding scheme of rectangle packings whose solution
space is smaller than that of SEQ-PAIR.
-
D2.2 Analytical Minimization of Half-Perimeter Wirelength [p. 179]
- Andrew A. Kennings, Igor L. Markov
Global placement of hypergraphs is critical in the top-down
placement of large timing-driven designs [10, 16]. Placement
quality is evaluated in terms of the half-perimeter
wirelength (HPWL) of hyperedges in the original circuit
hypergraph provided timing constraints are met. Analytical
placers are instrumental in handling non-linear timing
models [9, 3], but have two important drawbacks: (a) corresponding
optimization algorithms are typically slower than
top-down methods driven by multi-level mincut partitioning
[2], and (b) hyperedges must be represented with net
models [10, 17, 15, 8, 21, 20] which imply a mismatch of
objective functions, with the alternative of computationally
expensive linear programming (LP) [25, 16].
By comparing to optimal solutions produced by linear
programming, we show that net models lead to solution
quality loss. To address this problem, we present the first
analytical algorithm that does not require net models and
permits a direct inclusion of non-linear delay terms [3]; this
allows to avoid expensive linearization of delay terms in [16].
Our numerical engine utilizes well-known quadratically convergent
Newton-type methods [5, 11] for speed; it produces
solutions within 12% of the LP optimum. Empirical results
are for industrial placement instances.
-
D2.3 Modeling and Minimization of Routing Congestion [p. 185]
- Maogang Wang, Majid Sarrafzadeh
Typical placement objectives involve reducing net-cut cost or
minimizing wirelength. Congestion minimization is least understood,
however, it models routability most accurately. In this paper,
we study the congestion minimization problem during placement. First
we pointed out that the bounding box router used in [12] is not an
accurate measurement of the congestion in the placement. We use a
realistic global router to evaluate congestion in the placement
stage. This ensures that the final placement is truely congestion
minimized. We also proposed two new
post processing algorithms, the flow-based cell-centric
algorithm and the net-centric algorithm. While the
flow-based cell-centric algorithm can move multiple
cells at the same time to minimize the congestion, it
suffers large consumption of memory. Experimental
results show that the net-centric algorithm can effectively
identify the congested spots in the placement
and reduce the congestion. It can produce on an average
7: 7% less congestion than the method proposed
in [12]. Finally, we use a final global router to verify
that the placement obtained from our algorithm
has 39% less congestion than a wirelength-optimized
placement obtained by TimberWolf (commercial version 1.3.1).
Chair : Wayne W.-M. Dai, Univ. of California, Santa Cruz, USA
-
E2.1 System-In-Package (SIP): Challenges and Opportunities [p. 191]
- King L. Tai
In this paper, we propose the concept of System-In-Package
(SIP) as a generalization of System-On-Chip (SOC).
System-In-Package overcomes formidable integration barriers
without compromising individual chip technologies. The goal of
SIP is to match or exceed SOC performance with lower cost. SIP
technology platform that provides the needed integration is
described. Applications include integrating memory with logic and
integrating radio frequency ICs with high-quality (high-Q) passive
components. SIP technology will unleash the innovations of
designers and unlock the full potential of IC technology.
-
E2.2 Taiwan Foundry for System-In-Package (SIP) [p. 197]
- Albert Lin
System-In-Package (SIP) is a cost-effective
alternative to System-On-Chip (SOC) and chips with embedded
memory. The key elements of SIP technology include I/O
redistribution, solder bumping, flip chip assembly, and high
density thin film interconnect substrate with/without integrated
passives. To meet the need of SIP, as well as other advanced
packaging solutions, APack Technologies has been set up to
serve worldwide customers. High quality of the APack's service
is built upon the SIP technology licensed from Bell Labs of
Lucent Technologies, and the experience of cost competitive
wafer fabrication of foundries in Taiwan.
-
E2.3 Integration of Large-Scale FPGA and DRAM in a Package Using Chip-On-Chip Technology [p. 205]
- Michael X. Wang, Katsuharu Suzuki, Wayne W.-M. Dai, Yee L. Low, Kevin J.
O'conner, King L. Tai
A field-programmable multi-chip module
containing one ORCA 3T/125 FPGA and 4 MByte
DRAM was built using chip-on-chip technology. Module
architecture and physical design issues are presented.
A PCI board consisting of four chip-on-chip modules is also
built as the test vehicle. The design environment for this
multi-chip module, including visual or C++ design entry
and bit-serial datapath synthesis system, is also discussed.
Some ongoing approaches, like double-ip technology and area
IO are also addressed.
-
E2.4 Modeling and Analysis of Integrated Spiral Inductors for RF
System-In-Package [p. 211]
- Minqing Liu, Wayne W.-M. Dai
In this paper, a fast efficient Method
of Moments(MoM) based on a new subdomain partitioning
technique is presented for rapidly extracting
the distributed capacitance and inductance of
spiral inductors. By using this approach, the analytical
formula of Green's function for multi-layer
substrates is obtained through the prioritized wave-tracing
method. The computing time is much faster
than the state of art capacitance and inductance
solvers, such as FMMS[6]. Good agreement between
numerical results and measurement data is shown
to demonstrate the accuracy of the proposed new
method.
Co-Chairs : Shoji Kawahito, Toyohashi Univ. of Technology, Japan
Jan M. Rabaey, Univ. of California, Berkeley, USA
-
A3.1 Narrow Bus Encoding for Low Power Systems [p. 217]
- Youngsoo Shin, Kiyoung Choi
High integration in integrated circuits often leads to the problem
of running out of pins. Narrow data buses can be used to
alleviate this problem at the cost of performance degradation due
to wait cycles. In this paper, we address bus coding methods for
low power core-based systems incorporating narrow buses. Although
the conventional Bus-Invert code performs well for completely
random patterns, we show that transition signaling combined
with Bus-Invert, which we call BITS coding, can achieve
much more power saving for data patterns of typical DSP applications.
The application of BITS coding to a real circuit design is
limited by the overhead of the encoder and decoder circuits and
the extra bus line introduced. We propose an approximate version
of BITS coding, which do not require the extra bus line while
retaining the advantage of BITS coding.
-
A3.2 Data Transmission over a Bus with Peak-Limited Transition Activity [p. 221]
- Vijay Sundararajan, Keshab K. Parhi
Transitions on high capacitance busses in VLSI systems result
in considerable power dissipation. Various coding schemes have
been proposed in literature to encode the input signal in order
to reduce the number of transitions. Reducing number of transitions
comes in exchange for redundancy in data transferred over
the busses. For a given amount of redundancy there exists a lower
bound on the average number of transitions. In recent times noise
and reliability problems have brought the peak/instantaneous
power consumed in VLSI systems in to prominence. There has
been limited study done on reducing the number of instantaneous
transitions and hence the peak power consumed in busses. In this
paper we model a bus with a limit on the maximum instantaneous
transition activity as a constrained channel and derive an
upper bound on the data-rate obtainable using the capacity of
the underlying channel. We then demonstrate that some existing
bus encoding schemes are near-optimal with respect to the
derived bounds thus, perhaps, obviating the need to search for
newer more complicated coding schemes. Also considered is a
bus with a constraint on number of transitions in a fixed number
of (k) bus transmissions. The capacity of such a bus is derived
in the same manner as a bus with a constraint on maximum
instantaneous transition activity.
-
A3.3 Power Analysis and Implementation of a Low-Power 300 MHz 8-b x 8-b
Pipelined Multiplier [p. 225]
- Jinn-Shyan Wang, Po-Hui Yang
This paper analyzes the power consumption of an
array pipelined multiplier. To precisely realize a low power
pipelined multiplier, the analytical model for a clocking system
is presented. Simulation results show that the storage element is
the key-component in a high performance pipelined multiplier
macro. Compared with the conventional DFF and latch, the new
low power DFF as PTTFF [6] achieves total power reduction
ranging between 34 and 62 percents in a pipelined multiplier
macro.
Co-Chairs : Peter Marwedel, Univ. of Dortmund, Germany
Hiroaki Takada, Toyohashi University of Technology, Japan
-
B3.1 A New Approach to Assembly Software Retargeting for Microcontrollers [p. 229]
- Ing-Jer Huang, Dao-Zhen Chen
A new approach is proposed to translate existing software
programs from one instruction set to other instruction sets at the
assembly level. The behaviors of instructions are abstractly
represented as manipulation of the machine state. The behavior
of each basic block of the software program is then represented
as a pair of state transition. Instruction set retargeting is then
modeled as the problem of finding sequences of instructions
accomplishing the same machine state transitions at the target
machine as does the software program at the source machine.
The proposed approach has been successfully demonstrated on
the software translation between several industrial
microcontrollers.
-
B3.2 Register Allocation for Common Subexpressions in DSP Data Paths [p. 235]
- Rainer Leupers
This paper presents a new code optimization technique for
DSPs with irregular data path structures. We consider
the problem of generating machine code for data flow
graphs with common subexpressions (CSEs). While in
previous work CSEs are supposed to be strictly stored
in memory, the technique proposed in this paper also
permits the allocation of special purpose registers
for temporarily storing CSEs. As a result, both the
code size and the number of memory accesses are reduced.
The optimization is controlled by a simulated annealing
algorithm. We demonstrate its effectiveness for
several DSP applications and a widespread DSP processor.
-
B3.3 A Technique for QoS-based System Partitioning [p. 241]
- Johnson S. Kin, Chunho Lee, William H. Mangione-Smith, Miodrag Potkonjak
Quality of service (QoS) has been an important
topic of many research communities. Combined with an advanced
and retargetable compiler, variability of applications-specific
very large instruction word (VLIW) processors has been
shown to provide excellent platform for optimizing performance
under area constraints. Although media servers and others that
are natural candidates for QoS management are embedded systems
and implementing QoS management using the applications-specific
VLIW platform can provide a new horizon for efficient
and effective system design, until now no effort has been reported.
In this paper, we introduce a QoS-based system partitioning
scheme that addresses the need for maximizing net benefit of
a system under resource constraints by exploiting applications-specific
VLIW processors. The approach utilizes the modern advances
in compiler technology and architectural enhancements
that are well matched to the compiler technology.
Experimental results indicate that the approach presented in
this paper is very effective in system partitioning for QoS given a
set of heterogeneous processors.
Co-Chairs : Yusuke Matsunaga, Fujitsu Laboratories, Ltd., Japan
Hiroyuki Ochi, Hiroshima City Univ., Japan
-
C3.1 Exact Minimization of Fixed Polarity Reed-Muller Expressions for Incompletely
Specified Functions [p. 247]
- Debatosh Debnath, Tsutomu Sasao
This paper presents an exact minimization algorithm
for fixed polarity Reed-Muller expressions (FPRMs)
for incompletely specified functions. For an n-variable
function with %alpha; unspecified minterms there are 2
n+%alpha; distinct
FPRMs. A minimum FPRM is one with the fewest
products. The minimization algorithm is based on the
multi-terminal binary decision diagrams. Experimental results
for a set of functions are shown. The algorithm can be
extended to obtain exact minimum Kronecker expressions
for incompletely specified functions.
Index Terms -AND-EXOR, Reed-Muller expression, Kronecker
expression, exact minimization, incompletely specified
function.
-
C3.2 An Efficient Framework of Using Various Decomposition Methods to
Synthesize LUT Networks and Its Evaluation [p. 253]
- Shigeru Yamashita, Hiroshi Sawada, Akira Nagoya
We present an efficient framework for synthesizing
look-up table (LUT) networks.
Some of the existing LUT network synthesis methods are based
on functional (boolean) decompositions. Our method also uses functional
decompositions, but we try to use various decomposition
methods, which include algebraic decompositions. Therefore, this
method can be thought of as a general framework for synthesizing
LUT networks by integrating various decomposition methods.
We use a cost database file which is a unique characteristic in our
method.
We also present comparisons between our method and some
well-known LUT network synthesis methods, and evaluate the
final results after placement and routing. Although our method
is rather heuristic in nature, the experimental results are encouraging.
-
C3.3 Three Parameters to Find Functional Decompositions [p. 259]
- Tsutomu Sasao, Ken-ichi Kurimoto
Finding simple disjoint functional decompositions is a basic problem,
but is generally time-consuming since there are nearly 2 n
bipartitions of input variable. This paper introduces three parameters
to find bipartitions of the input variables. It also defines "ideal
random logic functions," and derives their properties. Experimental
results using randomly
generated functions and benchmark functions show the usefulness of
the approach.
Co-Chairs : Wayne W.-M. Dai, Univ. of California, Santa Cruz, USA
Shuji Tsukiyama, Chuo Univ., Japan
-
D3.1 Delay-Optimal Wiring Plan for the Microprocessor of High Performance
Computing Machines [p. 265]
- Jun Kikuchi, Tetsuo Sasaki, Tohru Hashimoto, Kazuhisa Miyamoto
This paper presents a delay-optimal designing
method for high performance microprocessor. In order to
develop such a microprocessor in short period, a novel
planning method for global wires, called "wiring plan"
has been introduced. The goal of this wiring plan is to
solve timing issue of wires with minimum usage of
wiring resources; wiring channels and inserted buffers.
In this wiring plan, global assignment of wiring
resources and improvement of local congestion for their
usage have been done. As a result of applying them, our
chip has been able to work at 400MHz, and channel
usage has not exceeded its capacity and gate overhead
for buffer insertions has been only 1.6% of total gates.
-
D3.2 MMP : A Novel Placement Algorithm for Combined Macro Block and Standard
Cell Layout Design [p. 271]
- Hong Yu, Xianlong Hong, Yici Cai
In this paper, an efficient mixed-mode placement algorithm
called MMP is presented for the high performance mixed
block and standard cell designs. Our approach combines the
well-known quadratic placement with bottom-up clustering,
as well as the slicing partitioning strategy. This approach
can account for macro blocks and standard cells simultaneously.
Our method is both very efficient and effective, while it can be
run very fast, too. We have tested our algorithm on a set of
sample circuits from industry and consistently obtained excellent
results.
Keywords: Mixed-Mode, Clustering, Quadratic Placement, Partitioning
-
D3.3 Dynamic Weighting Monte Carlo for Constrained Floorplan Designs in Mixed
Signal Application [p. 277]
- Jason Cong, Tianming Kong, Faming Liang, Jun S. Liu, Wing Hung Wong,
Dongmin Xu
Simulated annealing has been one of the most popular
stochastic optimization methods used in the VLSI
CAD field in the past two decades. Recently, a new
Monte Carlo and optimization method, named dynamic
weighting Monte Carlo [WL97], has been introduced
and successfully applied to the traveling salesman
problem, neural network training [WL97], and
spin-glasses simulation [LW99]. In this paper, we have
successfully applied dynamic weighting Monte Carlo
algorithm to the constrained floorplan design with
consideration of both area and wirelength minimization.
Our application scenario is the constrained floorplan design
for mixed signal MCMs, where we need
to place all the analog modules together in groups
so that they can share common power and ground
planes, which are separate from those used by the
digital modules. Our experiments indicate that the
dynamic weighting Monte Carlo algorithm is very effective
for constrained floorplan optimization. It out-performs
the simulated annealing for a real mixed signal MCM design
by 19.5% in wirelength, with slight
area improvement. This is the first work adopting the
dynamic weighting Monte Carlo optimization method
for solving VLSI CAD problems. We believe that this
method has applications to many other VLSI CAD
optimization problems.
Co-Chairs : Mitiko Miura-Mattausch, Hiroshima Univ., Japan
Andrzej J. Strojwas, Carnegie Mellon Univ., USA
-
E3.1 Symbolic Circuit-Noise Analysis and Modeling with Determinant Decision
Diagrams [p. 283]
- XiangDong Tan, C.-J. Richard Shi
In this paper, a new symbolic noise analysis
and modeling technique is presented. The new
method exploits the sharing of symbolic expressions
in the noise models by using a recently introduced
graph, called determinant decision diagrams (DDDs),
for symbolic determinant representations. With efficient
DDD-based graph manipulations, we are able
to generate the exact noise models for analog blocks.
Symbolic noise analysis and modeling on real analog
circuit examples are presented and compared with
SPICE noise simulation.
-
E3.2 Gate-Level Aged Timing Simulation Methodology for Hot-Carrier Reliability
Assurance [p. 289]
- Yoshiyuki Kawakami, Jingkun Fang, Hirokazu Yonezawa, Nobufusa Iwanishi,
Lifeng Wu, Alvin I-Hsien Chen, Norio Koike, Ping Chen, Chune-Sin Yeh, Zhihong Liu
This paper presents a new aged timing simulation
methodology that can be used for hot-carrier reliability
assurance of VLSI. This methodology consists of a compact
model and a unique algorithm. The ratio based model simplifies
the aging I-V characteristics of MOSFET over time into the
aged timing and the corresponding ratio at gate-level. A new
algorithm is proposed including a gate primitive decomposition
method and an aged slew rate propagation method. This
algorithm provides good stress representation and can achieve
comparable accuracy with the conventional transistor-level
approach. The above methodology has been implemented in a
new simulator. Experimental results demonstrate that the
simulator based on this methodology realizes full-chip circuit
capacity and can be applied to various reliability analyses
including degradation-sensitive critical paths and clock skew.
-
E3.3 Embedded Tutorial : Subwavelength Lithography (PSM, OPC) [p. 295]
- Tsuneo Terasawa
Fabrication of fine features of smaller 0.15um is vital for future
ultra-large scale integrated (ULSI) devices. An area of particular
concern is whether and how optical lithography can delineate such
feature sizes, i.e., smaller than the exposure wavelength. Resolution
enhancement techniques for achieving subwavelength optical lithography
are presented. Various types of phase shift mask (PSM) techniques and
their imaging characteristics are discussed and compared to conventional
binary mask technique. To apply these masks effectively to practical
patterns, optical proximity effect correction (OPC) technique and a
phase shifter pattern design tool must be established. These techniques
offer the capability to improve resolution to exceed the wavelength limitation
and to increase depth of focus.
Co-Chairs : Rolf Ernst, Technical Univ. Braunschweig, Germany
Ahmed A. Jerraya, TIMA Laboratory, France
-
A4.1 Embedded Tutorial : IC Design Technology for Building System-On-A-Chip
[p. 301]
- Rajesh Gupta
-
A4.2 Thread Partitioning Method for Hardware Compiler Bach [p. 303]
- Mizuki Takahashi, Nagisa Ishiura, Akihisa Yamada, Takashi Kambe
This paper presents a method for thread partitioning
for a hardware compiler Bach. Bach synthesizes RT level circuits
from a system description written in Bach-C language, where a
system is modeled as communicating processes running in parallel.
The system description is decomposed into threads, i.e., strings
of sequential processes, and then converted into synthesizable behavioral
VHDL models. The proposed method attempts to find a
partitioning of a given system description into threads that maximize
resource sharing among processes in the threads. Experiments
on two real designs show that the circuit sizes were reduced
by 3.7% and 14.7%. We also show the detailed statistics and analysis
of the size of the resulting gate level circuits.
-
A4.3 An Area/Time Optimizing Algorithm in High-Level Synthesis for Control-Based
Hardwares (Short Paper) [p. 309]
- Nozomu Togawa, Masayuki Ienaga, Masao Yanagisawa, Tatsuo Ohtsuki
This paper proposes an area/time optimizing algorithm in high-level synthesis
for control-based hardwares. Given a call graph whose node corresponds to
a control flow of an application program, the algorithm generates a set of
state-transition graphs which represents the input call graph under area
and timing constraints. In the algorithm, first state-transition graphs
which satisfy only timing constraint are generated and second they are
transformed so that they can satisfy area constraint. Since the algorithm
is directly applied to control-flow graphs, it can deal with control flows
such as bit-wise processes and conditional branches. Further, the algorithm
synthesizes more than one hardware architecture candidates from a single call
graph for an application program. Designers of an application program can
select several good hardware architectures among candidates depending on
multiple design criteria. Experimental results for several control-based
hardwares demonstrate effectiveness and efficiency of the algorithm.
-
A4.4 A Timing-Driven Synthesis of Arithmetic Circuits using Carry-Save-Adders (Short
Paper) [p. 313]
- Taewhan Kim, Junhyung Um
Carry-save-adder (CSA) is one of the most widely used types of operation
in implementing a fast computation of arithmetics. An inherent limitation
of the conventional CSA applications is that the applications are confined
to the sections of arithmetic circuit that can be directly translated
into addition expressions. To overcome this limitation, from the analysis
of the structures of arithmetic circuits found in industry, we derive a
set of simple, but effective CSA transformation techniques. Those are
(1) optimization across multiplexors, (2) optimization across
design boundaries (restricted notion of [3]), and (3) optimization
across multiplications. Based on the techniques, we develop a new
timing-driven CSA transformation algorithm that is able to utilize CSAs
extensively throughout the whole circuits. Experimental data for practical
testcases are provided to show the effectiveness of our algorithm.
Co-Chairs : Toshiaki Miyazaki, NTT Network Innovation Laboratories, Japan
Tetsuo Hironaka, Hiroshima City Univ., Japan
-
B4.1 Communicating Logic : An Alternative Embedded Stream Processing Paradigm [p. 317]
- Norbert Imlig, Ryusuke Konishi, Tsunemichi Shiozawa, Kiyoshi Oguri, Kouichi
Nagami, Hideyuki Ito, Minoru Inamori, Hiroshi Nakada
This paper introduces a new concurrent processing model called
"Communicating Logic" (CL). As the target architecture we adopt the
dual layered "Plastic Cell Architecture" (PCA). Data-path-oriented
processing functionality is encapsulated in
asynchronous hardware objects with variable graining and implemented
using look-up tables. Communication (i.e. connectivity and control)
between the distributed processing objects is achieved by means of
inter-object message passing. The key point of the CL
approach is that it offers the merits of scalable, high
performance hardware implementation with the compilation
and linking capabilities unique to software.
-
B4.2 A Scheduling and Allocation Method to Reduce Data Transfer Time by Dynamic Reconfiguration [p. 323]
- Kazuhito Ito
In the era of deep submicron technology, wire delay on an LSI chip
is becoming relatively larger than operation delay. Increase of
execution speed by parallel processing may be limited due to the
data transfer time between functional units. If we can dynamically
reconfigure nearby functional units into desired operation type and
execute operations on the reconfigured units, long data transfer is
reduced and hence fast processing can be achieved. In this paper
we propose a scheduling method to determine static operation
execution time and functional unit allocation to achieve fast
signal processing by considering dynamic reconfiguration of
functional units. Results show the effectiveness of the proposed
method.
-
B4.3 Invited Talk : Reconfigurable Computing: Its Concept and a Practical Embodiment
using Newly Developed Dynamically Reconfigurable Logic (DRL) LSI [p. 329]
- Masakazu Yamashina, Masato Motomura
This paper first outlines a broad range of
reconfigurable computing research activities from a perspective
of system LSI designs. Then, the paper focuses onto
dynamically reconfigurable logic (DRL) LSI, a prototype chip
that we developed to evaluate the reconfigurable computing
concept. Through its ability to exchange hardware contexts
quickly, this chip can accelerate media/communication
applications with customized hardware configurations, yet
maintaining scalability towards varying application sizes.
Co-Chairs : Akira Nagoya, NTT Communication Science Laboratories, Japan
Manish Pandey, Cadence Design Systems, Inc., USA
-
C4.1 Power Reduction by Simultaneous Voltage Scaling and Gate Sizing [p. 333]
- Chunhong Chen, Majid Sarrafzadeh
This paper proposes to use voltage-scaling (VS)
and gate-sizing (GS) simultaneously for reducing power
consumption without violating the timing constraints. We present
algorithms for simultaneous VS and GS based on the
Maximum-Weighted-Independent-Set problem. We describe the slack
distribution of circuit, completeness of gate library and
discreteness of supply voltage, and discuss their effects on power
optimization. Experimental results show that the average power
reduction ranges from 23.3% to 56.9% over all tested circuits.
-
C4.2 Analysis of Power-Clocked CMOS with Application to the Design of Energy-Recovery
Circuits [p. 339]
- Massoud Pedram, Xunwei Wu
This paper presents our research results on
power-clocked CMOS design. First we provide algebraic
expressions and describe properties of clocked signals. Next two
types of power-clocked CMOS circuit constructions are introduced
and analyzed in detail. Since the adiabatic switching requires
slow-ramping of the power-clock, a clocked transmission gate and a
four-stage clocked NP-domino circuit are presented, which receive
trapezoidal and sinusoidal power-clocks, respectively. PSPICE
simulations demonstrate the correct operation and energy-saving
advantage of the proposed circuits.
-
C4.3 Low-Power Design of Sequential Circuits Using a Quasi-Synchronous Derived
Clock [p. 345]
- Xunwei Wu, Jian Wei, Massoud Pedram, Qing Wu
This paper presents a novel circuit design technique to
reduce the power dissipation in sequential circuits by generating
a quasi-synchronous derived clock from the master clock and
using it to isolate the flip flops in the circuit from the unwanted
triggering action of the master clock. An example design of a
decimal counter demonstrates the large power saving and
improved performance of the resulting circuit.
-
C4.4 FSM Decomposition by Direct Circuit Manipulation Applied to Low Power
Design [p. 351]
-
José C. Monteiro, Arlindo L. Oliveira
Clock-gating techniques are very effective in the reduction
of the switching activity in sequential logic circuits. In
particular, recent work has shown that significant power reductions
are possible with techniques based on finite state machine
(FSM) decomposition. A serious limitation of previously proposed
techniques is that they require the state transition graph
(STG) of the FSM to be given or extracted from the circuit. Since
the size of the STG can be exponential on the number of registers
in the circuit, explicit techniques can only be applied to relatively
small sequential circuits. In this paper, we present a new
approach to perform FSM decomposition by direct manipulation
of the circuit. This way, we do not require the STG, either explicit
or implicit, thus further avoiding the limitations imposed by the
use of BDDs. Therefore, this technique can be applied to circuits
with very large STGs. We provide a set of experimental results
that show that power consumption can be substantially reduced,
in some cases by more than 70%.
Timing Closure : The Solution and Its Problems [p. 359]
Organizer: Ralph H.J.M. Otten, Delft University of Technology, The Netherlands
Moderator: Ralph H.J.M. Otten, Delft University of Technology, The Netherlands
Panelists:
Raul Camposano (Synopsys, Inc., USA)
Oliver Coudert (Monterey Design Systems, Inc., USA)
Patrick Groeneveld (Magma Design Automation, USA)
Leon Stok (Thomas J. Watson Research Center, IBM, USA)
In this paper we summarize the derivation of the size
equations, the key to timing closure.
Next we present a number of problems when applying
these equations in practice. The main ones are network
generation, discrete libraries, size constraints,
and resistive interconnect.
Co-Chairs : Naoyuki Shigyo, Toshiba Corp., Japan
C.-J. Richard Shi, Univ. of Washington, USA
-
E4.1 High Performance of Short-Channel MOSFETs due to an Elevated Central-Channel
Doping [p. 365]
- M. Tanaka, N. Tokida, T. Okagaki, M. Miura-Mattausch, W. Hansch, H. J. Mattausch
An elevated central-channel doping with
a depth similar to the S/D junctions is proposed as the
best measure for simultaneously improving MOSFET
device and high speed circuit performances as well as
minimizing their fluctuations. We base our arguments
on hydrodynamic device simulation, measured device
data of vertical MOSFETs with a central delta-doped
impurity profile and include experimental results on
doping-profile fluctuations along the channel, which
have not been available previously.
-
E4.2 Circuit Performance Oriented Device Optimization using BSIM3 Pre-Silicon Model
Parameters [p. 371]
- Mikako Miyama, Shiro Kamohara
We propose a circuit performance oriented device
optimization methodology using pre-silicon parameters
and critical paths which represent the performance of
the chip. Based on our methodology, we successfully
reduced the power consumption by 90% and, at the
same time, increased the frequency by 30% from the
initial design. The key to this optimization
methodology is the pre-silicon parameter generation
method, which can predict the device performance
within 5% accuracy in a few minutes.
-
E4.3 Embedded Tutorial : Design for Manufacturability : A Path from System
Level to High Yielding Chips [p. 375]
- Andrzej J. Strojwas
This tutorial describes a wide
spectrum of the Design for Manufacturability
(DFM) activities. We start by presenting a new
approach to IC design, which takes full advantage
of leading edge technology. Then we propose a new
methodology for acceleration of yield ramping
which accounts for all the dominant yield loss
mechanisms and a set of software tools that have
been developed to address the yield learning
problems. Several real-life examples demonstrate
the practical results of employing such a yield
ramping strategy.
Co-Chairs : Keshab K. Parhi, Univ. of Minnesota, USA
Koji Kotani, Tohoku Univ., Japan
-
A5.1 Embedded Tutorial : Low-Power Silicon Architectures for Wireless
Communications [p. 377]
- Jan M. Rabaey
Wireless communication and networking
is experiencing a dramatic growth, and all indicators
point to an extension of this growth in the foreseeable
future. This paper reflects on the demands and the
opportunities offered with respect to the integrated
implmentataion of these applications in the "systems-on-a-chip"
era.
-
A5.2 Run-time Power Control Scheme using Software Feedback Loop for Low-power Real-time
Application [p. 381]
- Seongsoo Lee, Takayasu Sakurai
A novel hardware-software cooperative power
control scheme, namely a software feedback loop, is proposed to
lower power consumption of VLSI systems. The proposed run-time
voltage and frequency control scheme guarantees the real-time
execution of applications. It avoids interface problems and
also provides binary-code compatibility. Using a software
analysis environment, the power control scheme is shown to
achieve more than 90% power reduction for real-time MPEG-4
SP@L1 video encoding, taking into consideration the transition
delay between voltage and frequency levels.
-
A5.3 An Interleaved Dual-Battery Power Supply for Battery-Operated Electronics
[p. 387]
- Qing Wu, Qinru Qiu, Massoud Pedram
After a detailed analysis and discussion of two
important characteristics of today's battery cells (i.e., their
current-capacity and current-voltage curves), this paper describes
the design principles and architecture of a dual-battery power
supply system for portable electronics. The key idea is to integrate
two battery types with different energy capacity and current rate
curves into the power supply system, and then use them in an
interleaved manner in response to varying current requirement of
the VLSI circuit that is powered by this dual-battery system.
Analytical and empirical results demonstrate the effectiveness of
the new battery architecture in maximizing the service life of a
battery system with fixed volume (or weight).
Co-Chairs : Rajesh Gupta, Univ. of California, Irvine, USA
Takashi Kambe, Sharp Corp., Japan
-
B5.1 Embedded Tutorial : Embedded System Design with Multiple Languages [p. 391]
- Rolf Ernst, Ahmed A. Jerraya
The use of several languages in the design of embedded
systems is very convenient for application development
and optimization but it can become an obstacle on the way to
higher design productivity. This paper explains solutions and
future trends.
-
B5.2 Symbolic Debugging of Globally Optimized Behavioral Specifications
[p. 397]
- Inki Hong, Darko Kirovski, Miodrag Potkonjak, Marios C. Papaefthymiou
Symbolic debuggers are system development tools that can
accelerate the validation speed of behavioral specifications
by allowing a user to interact with an executing code at
the source level. In response to a user query, the debugger
must be able to retrieve and display the value of a source
variable in a manner consistent with what the user expects
with respect to the source statement where execution has
halted. However, when a behavioral specification has been
optimized using transformations, values of variables may
either be inaccessible in the run-time state or inconsistent
with what the user expects. We address the problem that
pertains to the retrieval of source values for the globally
optimized behavioral specifications. We describe how
transformations effect the retrieval of source values.
We present an approach for a symbolic debugger to retrieve
and display the value of a variable correctly and efficiently
in response to a user inquiry about the variable in the
source specification. The implementation of the new debugging
approach poses several optimization tasks. We formulate the
optimization tasks and develop heuristics to solve them. We
demonstrated the effectiveness of the proposed approach on a
set of designs.
-
B5.3 Fast Development of Source-level Debugging System Using Hardware Emulation
(Short Paper) [p. 401]
- Sang-Joon Nam, Jun-Hee Lee, Byoung-Woon Kim, Yeon-Ho Im, Young-Su Kwon,
Kyong-Gu Kang, Chong-Min Kyung
We describe the co-development of a
processor and its source-level debugging system using
an emulation-based validation technology including
hardware emulation, not simulation. Since a source-level
debugging system is essential to develop an application
system and it takes a long time to validate the
functionality of the source-level debugging system, we
have adopted hardware emulation for a fast validation
and system development. Using this methodology, we
were able to validate the source-level debugging system
successfully before the chip fabrication.
-
B5.4 Methodology for Hardware/Software Co-verification in C/C++ (Short Paper)[p. 405]
- Luc Séméria, Abhijit Ghosh
In this paper we present our C/C++-based design environment
for hardware/software co-verification. Our approach is to
use C/C++ to describe both hardware and software throughout the
design flow. Our methodology supports the efficient mapping of C/
C++ functional descriptions directly into hardware and software.
The advantages of a C/C++-based flow from the verification point of
view are presented. The use of C/C++ to model all parts of the system
provides great flexibility and enables faster simulation compared
to existing methodologies. We show how co-verification can be done
efficiently and effectively at the various levels of abstraction, how
co-verification can be used to drive co-design through performance
estimation and give an example of implementation for the 8051
architecture.
Co-Chairs : Tsutomu Sasao, Kyushu Inst. of Technology, Japan
Shin'ichi Minato, NTT Network Innovation Laboratories, Japan
-
C5.1 Performance-Optimal Clustering with Retiming for Sequential Circuits [p. 409]
- Tzu-Chieh Tien, Youn-Long Lin
We propose an exact clustering with retiming algorithm to minimize the clock period for sequential circuits.
Without moving ip-ops (FF's) by retiming, conventional clustering algorithms can only handle combinational parts and therefore cannot achieve the best cycle
time. Pan et al. [2] have proposed an optimal algorithm
under the unit gate delay model. We propose a more
powerful and faster algorithm that produces optimal results even under the more realistic general gate delay
model. Experimental results show that our algorithm is
twice as fast as Pan's.
-
C5.2 IBAW: An Implication-Tree Based Alternative-Wiring Logic Transformation
Algorithm [p. 415]
- Wangning Long, Yu-Liang Wu, Jinian Bian
The well-known ATPG-based alternative wiring technique, RAMBO, has been shown
to be very useful because of its proven powerfulness and flexibility in attacking
many design automation problems (e.g. logic optimization, circuit partitioning,
and post-layout logic transformation ..etc). Since the ATPG based alternative
wire locating procedure is the center engine for all its applications, speeding
up of this process should be very crucial and useful.
We observe that the bottleneck of the technique lies in the costly redundancy
tests among a large number of candidate alternative wires. In this paper,
we develop a so-called implication-tree data structure which stores implication
relationship between nodes with determined logic values, and propose a new
ATPG-based alternative-wiring algorithm to speed up the engine. The algorithm,
Implication-tree Based Alternative-Wiring (IBAW), differs from other ATPG-based
algorithms in terms that it selects source node of alternative wires from the
implication-tree, which makes IBAW be able to trim out many unnecessary
redundancy checking quite easily without calling for complicated procedures.
Hence, it produces a steady speeding up of around 3.6 times faster while
maintains the same rewiring capability of the original RAMBO. Our
experimental results show that the overall circuit area optimized by IBAW
can be slightly better than that by RAMBO, while the runtime is just one-half
of the latter.
-
C5.3 On Mixture Density and Maximum Likelihood Power Estimation via
Expectation-Maximization [p. 423]
- R. Chandramouli, Vamsi K. Srikantam
A maximum-likelihood estimation procedure for computing the average power
consumption of VLSI circuits is proposed. The method can
handle data that has a mixture-density with multiple
components unlike most of the previous approaches.
An iterative computational procedure based on the
expectation-maximization principle is also discussed.
This can be used to estimate the parameters of an
arbitrary (but finite) number of components of the
probability distribution of the simulated power data.
Experimental results for ISCAS '85 benchmark circuits and a large industrial circuit are given in order
to validate the efficiency and practicality of the algorithm.
Comparisons show that the proposed method
estimates the multiple components (even those with a
low probability of occurrence) while the Monte Carlo
estimate captures only the most probable component.
Co-Chairs : Shin'ichi Wakabayashi, Hiroshima Univ., Japan
Tetsushi Koide, Univ. of Tokyo, Japan
-
D5.1 Edge Separability Based Circuit Clustering with Application to Circuit Partitioning [p. 429]
- Jason Cong, Sung Kyu Lim
In this paper, we introduce a new efficient O(n log n) graph search based
bottom-up clustering algorithm named ESC (Edge Separability based
Clustering). Unlike existing bottom-up algorithms
that are based on local connectivity information of the
netlist, ESC exploits more global connectivity information
"edge separability" to guide clustering process
while carefully monitoring cluster area balance. Computing
the edge separability for a given edge e = (x; y)
in an edge weighted undirected graph G(V; E; s; w) is
equivalent to finding the x-y mincut. Then, we show
that a simple and efficient algorithm CAPFOREST
[14] can be used to provide a good estimation of edge
separability for all edges in G without using any network
flow computation. Related experiments based on
large scale ISPD98 [1] benchmark circuits confirm that
exploiting edge separability yields better quality partitioning
solution compared to various bottom-up clustering
algorithms proposed in the literature including
Absorption [18], Density [9, 11], Rent Parameter [15],
Ratio Cut [19], Closeness [17], and Connectivity [16]
method. In addition, our ESC based multiway partitioning
algorithm LR/ESC-PM provides comparable results to
state-of-the-art hMetis [12] and hMetis-Kway [13].
-
D5.2 Feasible Two-Way Circuit Partitioning with Complex Resource Constraints
[p. 435]
- Hsun-Cheng Lee, Ting-Chi Wang
We study in this paper the feasibility problem for two-way
circuit partitioning subject to complex resource constraints.
We then consider that the problem is in general NP-complete.
We then consider two special cases of the problem, and present
polynomail-time algorithms for them. Finally, we give a backtracking
algorithm to solve the general case. To reduce the run time and
the storage space of the backtracking algorithm, an incremental
flow computation technique is employed. For each algorithm presented
in this paper, the corresponding experimental results are also given
to support its efficiency. To the best of our knowledge, this paper
is the first one to address the feasibility problem for two-way
circuit partitioning with complex resource constraints.
-
D5.3 Performance Driven Multiway Partitioning [p. 441]
- Jason Cong, Sung Kyu Lim
Under the interconnect-centric design paradigm, partitioning
is seen as the crucial step that
defines the interconnect [1]. To meet the performance
requirement of today's complex design, performance
driven partitioners must consider the amount of interconnect
induced by partitioning as well as its impact
on performance. In this paper, we provide new
performance driven formulation for cell move based
top-down multiway partitioning algorithms with consideration
of the local and global interconnect delay. In our "constrained
acyclic partitioning" formulation, cell moves are restricted to
maintain acyclicity in partitioning solution to prevent cyclic dependency
among cells in different partitions. In our "relaxed
acyclic partitioning" formulation, acyclic constraints
are relaxed to give partitioners capability of minimizing
cutsize and delay. Our new acyclic constraint
based performance driven multiway partitioning algorithm
FLARE obtains (i) 4% to 13% better delay
compared to the state-of-the-art cutsize minimization
based hMetis [10] at almost no increase in cutsize, and
(ii) 84% better cutsize compared to the state-of-the-art
delay minimization based PRIME [2] at an expense
of 16% increase in delay.
Co-Chairs : Hiroshi Matsumoto, NEC Corp., Japan
Xianlong Hong, Tsinghua Univ., China
-
E5.1 Hierarchical Computation of 3-D Interconnect Capacitance using Direct Boundary
Element Method [p. 447]
- Jiangchun Gu, Zeyi Wang, Xianlong Hong
The idea of Appel's hierarchical algorithm handling
the many-body problem is implemented in the direct
boundary element method (BEM) for computation of 3-D
VLSI parasitic capacitance. Both the electric potential
and normal electric field intensity on the boundary
are involved, so it can be much easier to handle problems
with multiple dielectrics and finite dielectric structure
than the indirect BEM. Three kinds of boundaries (forced
boundary, natural boundary and dielectric interface) are
treated. Two integral kernels with different singularity
(1/r, 1/r3) are involved while computing the
interaction between the boundary elements. These features
make it significantly distinct from the hierarchical algorithm
based on the indirect BEM, which only handles single dielectric,
one integral kernal and one forced boundary. The coefficient
matric is generated and stored hierarchically in this paper.
As a result, computation cost of the matrix is reduced, and the
matric-vector multiplication in the GMRES iteration is accelerated,
so computation speed is improved significantly.
Key Words: VLSI, Parasitic Capacitance, Direct Boundary
Element Method, Hierarchical Computation.
-
E5.2 A Simplified Hybrid Method for Calculating the Frequency-dependent Inductances
of Transmission Lines with Rectangular Cross Section [p. 453]
- Shuzhou Fang, Xiaobo Tang, Zeyi Wang, Xianlong Hong
This paper describes a simplified hybrid method
to calculate the frequency-dependent inductances of
transmission lines, which combines cross-section coupled circuit
method with boundary element method (BEM) just solving
Laplace's equation. The coupled circuit method is used to
calculate inductance in low frequency range; The BEM builds
the capacitance matrix, then we invert this matrix (omit the row
and column relate to ground plane), we can obtain inductance
relating to the high frequency at which the current only
distributes on surfaces of conductor. A cubic spline interpolation
between inductance at low and very high frequency gives a good
result over entire frequency range. Comparing this method with
previous hybrid method [3], it avoids solving Helmholtz's
equation and obtains the capacitance matrix at the same time,
greatly reducing the total burden of calculation. Numerical tests
show it is applicable and efficient.
-
E5.3 An Analytic Calculation Method for Delay Time of RC-class Interconnects
[p. 457]
- W. K. Kal, S. Y. Kim
This paper presents an analytic 3rd
order calculation method, without simulations, for delay
time of RC-class circuits, which can be conveniently
used to model on-chip interconnects. While the proposed
method requires comparable evaluation time to the previous
2nd order calculation method, it ensures
more accurate results than those of 2nd
order method. The proposed analytic delay calculation method guarantees
allowable error tolerance when compared to the
results obtained from the AWE technique or HSPICE
and has better performance in evaluation time as well
as numerical stability. The first algorithm of the proposed
method requires 7 moments for the 3rd
order
approximation and yields accurate delay time approximation.
The second algorithm requires 5 moments
for the 3rd
order approximation and results in shorter
evaluation time, the accuracy of which may be less
than the first algorithm.
-
E5.4 A New Efficient Waveform Simulation Method for RLC Interconnect via Amplitude
and Phase Approximation [p. 463]
- Xiaodong Yang, Walter H. Ku, Chung-Kuan Cheng
In this paper we use a segmented Chebyshev orthongonal polynomial
expansion approach to approximate amplitude and phase
response for RLC interconnect simulation. This method has been
shown to be efficient in reducing frequency sampling point number
or expansion order. Experiments also show that the number
of sampling points is not necessarily dependent on the interconnect
circuit size, which makes possible large network simulation.
Additionally, because of the smoothing-effect of the atan function
in evaluation, the phase response, which has major impact on the
transient edge of output response, can be easily approximated.
Experiments show the proposed simulation approach achieves
30-50 times speed-up over spice3f4.
Co-Chairs : Chong-Min Kyung, Korea Advanced Inst. of Science and
Technology Korea
Takashi Morie, Hiroshima Univ., Japan
-
A6.1 Optimization of VDD and VTH for Low-Power and High Speed Applications [p. 469]
- Koichi Nose, Takayasu Sakurai
Closed-form formulas are presented for optimum
supply voltage (VDD) and threshold voltage (VTH) that minimize
power dissipation when technology parameters and required
speed are given. The formulas take into account short-channel
effects and the variation of VTH and temperature. Using
typical device parameters, it is shown that a simple guideline to
optimize the power consumption is to set the ratio of maximum
leakage power to total power about 30%. Extending the analysis,
the future VLSI design trend is discussed. The optimum VDD
coincides with the SIA roadmap and the optimum VTH for logic
blocks at the highest temperature and at the lowest process
variation corner is in the range of 0V~0.1V over generations.
-
A6.2 Compact yet High Performance (CyHP) Library for Short Time-to-Market with New
Technologies [p. 475]
- Nguyen Minh Duc, Takayasu Sakurai
Two compact yet high performance standard cell
libraries (CyHP libraries), which contain only 11 and 20 cells
respectively, are proposed. The first CyHP library leads to 5%
increase in delay compared to a library containing about 400 cells.
The second CyHP library suppresses delay increase up to 2%. The
compact nature of these libraries not only reduces the cost and time
for generation and maintenance substantially (thus enables new
technologies to be adopted immediately), but also shortens synthesis
time to about a half. Application results of these libraries to standard
benchmarks and an industry design of about 80K gates are
presented.
-
A6.3 A New CMAC Neural Network Architecture and Its ASIC Realization [p. 481]
- Yuan-Bao Hsu, Kao-Shing Hwang, Chien-Yuan Pao, Jinn-Shyan Wang
A new VLSI architecture of the CMAC neural network, called CCMAC, which
is composed of an embedded content addressable memory (CAM) is proposed.
The CCMAC can determine whether the addressing content resides in the
memory or not faster. If a match doesn't occur, the activated address
is stored in a vacant cell of the CAM. With this mechanism, memory
utilization rate can be improved to 100% and control noise can be suppressed.
Moreover, the associated mask function, which is triggered while the CAM
is full, can improve control performance. Comparison between CCMAC
and a conventional CMAC is examined on the truck backer-upper control
simulation. A CMAC ASIC chip has been implemented using a 0.6-um CMOS
technology under 3.3- or 2.5-V supply voltages after obtaining design
trade-off. The chip contains 0.33-million transistors and operates
at 20 MH for a 2.5-V supply with only 10-mW power dissipation.
Co-Chairs : Ichiro Kuroda, NEC Corp., Japan
Keshab K. Parhi, Univ. of Minnesota, USA
-
B6.1 Retargetable Estimation Scheme for DSP Architecture Selection [p. 485]
- Naji Ghazal, Richard Newton, Jan Rabaey
Given the recent wave of innovation and
diversification in digital signal processor (DSP) architecture, the
need for quickly evaluating the true potential of considered
architectural choices for a given application has been rising. We
propose a new scheme, called Retargetable Estimation, that
involves analysis of a high-level description of a DSP application,
with aggressive optimization search, to provide a performance
estimate of its optimal implementation on the architectures
considered. With this scheme, we present a new parameterized
architecture model that allows quick retargeting to a wide range
of architectural choices, and that emphasizes capturing an
architecture's salient optimizing features.
We show that for a set of DSP benchmarks and two full
applications, hand-optimized performance can be predicted
reliably. We applied this scheme to two different processors.
-
B6.2 Data Memory Minimization by Sharing Large Size Buffers [p. 491]
- Hyunok Oh, Soonhoi Ha
This paper presents software synthesis techniques to
deal with non-primitive data type from graphical dataflow
programs based on the synchronous dataflow (SDF) model.
Non-primitive data types, often used in multimedia and
graphics applications, require buffer memory of large size. To
minimize the buffer requirement, we separate global data buffers
and local pointer buffers. The proposed approach first
allocates the minimum size of global buffers and next binds the
local buffers to the global buffers by setting the pointers. Static
binding and dynamic binding techniques are devised. Experimental
results prove the significance of the proposed techniques.
-
B6.3 Array Allocation Taking into Account SDRAM Characteristics [p. 497]
- Hong-Kai Chang, Youn-Long Lin
Multimedia, image processing and other signal
processing applications often involve data stored in large
arrays. Due to chip area limitation, arrays are typically
assigned to off-chip memories, such as DRAM. This being the
case, we try to optimize off-chip memory accesses to improve
performance. We take the characteristics of the current
mainstream SDRAM memory into account. We propose an
algorithm to allocate arrays to different banks to increase the
probability of utilizing SDRAMās multi-bank characteristic.
Experimental results show significant improvement over
traditional approaches.
Co-Chairs : Cheng-Wen Wu, Univ. of California, Santa Barbara, USA
Tomoo Inoue, Hiroshima City Univ., Japan
-
C6.1 Causality Based Generation Of Directed Test Cases [p. 503]
- Nina Saxena, Jacob Abraham, Avijit Saha
Simulation is considered to be the irreplaceable part of design verification. However, the efficiency of this method depends greatly on the test cases
used. Random test cases can be generated quickly
but have poor coverage. Directed test cases on the
other hand require time and manual effort. This paper
presents a method for generating directed test cases
automatically by making use of signal relationships in
the specification. An algorithm is presented that was
applied to the GL85 microprocessor, a clone of Intel's
8085. The results are compared with other methods
to see the gain with the proposed method.
-
C6.2 Embedded Tutorial : Fault Models and Test Generation for IDDQ Testing [p. 509]
- Yoshinobu Higami, Yuzo Takamatsu, Kewal K. Saluja, Kozo Kinoshita
This paper surveys recent research related to IDDQ testing,
particularly focuses on fault
models and test generation methods.
(1) The paper provides a taxonomy of fault models
that have been studied in literature, and classifies
these models into a small set of faults.
(2) The paper describes efficient test generation
methods and fault simulation methods. Test compaction methods,
including reduction of the total
number of test vectors and selection of IDDQ measurement
vectors, are also described.
-
C6.3 Embedded Tutorial : Issues on SOC Testing in DSM Era [p. 515]
- Takashi Aikyo
Deep sub-micron technology is rapidly leading to exceedingly complex, billion-transistor chips. By these technology
evolutions, a system is integrated into a chip so called a system-on-a-chip (SOC). In order to bridge the productivity
gap between available transistors and able to be designed in SOC, higher-level behavioral language and re-use of
already designed intellectual property (IP) will become more common. However, these techniques affect test
methodologies and failure analysis of SOC.
Co-Chairs : Jun-Dong Cho, Sungkyunkwan Univ., Korea
Yoichi Shiraishi, Gunma Univ., Japan
-
D6.1 A Cell Synthesis Method for Salicide Process [p. 517]
- Kazuhisa Okada, Takayuki Yamanouchi, Takashi Kambe
In this paper, we propose a cell synthesis method
for a Salicide process. Our method utilizes the local interconnect
between adjacent transistors, which is available in
some Salicide processes, and optimizes the transistor placement
of a cell considering both area and the number of local
interconnects. In this way we reduce the number of
metal wires and contacts. The circuit model is not restricted
to conventional series-parallel CMOS logic, and our method
enables us to synthesize CMOS pass-transistor circuits. Experimental
results show that our method uses the local interconnect
effectively, and optimizes both cell area and metal
wire length.
-
D6.2 Monte-Carlo Algorithms for Layout Density Control [p. 523]
- Yu Chen, Andrew B. Kahng, Gabriel Robins, Alex Zelikovsky
Chemical-mechanical polishing (CMP) and
other manufacturing steps in very deep submicron VLSI have
varying effects on device and interconnect features, depending
on local characteristics of the layout. To enhance manufacturability
and performance predictability, we seek to make the
layout uniform with respect to prescribed density criteria, by
inserting "fill" geometries into the layout. We propose several
new Monte-Carlo based filling methods with fast dynamic
data structures and report the tradeoff between runtime and
accuracy for the suggested methods. Compared to existing linear
programming based approaches, our Monte-Carlo methods
seem very promising as they produce nearly-optimal solutions
within reasonable runtimes.
-
D6.3 Layout Generation of Array Cell for NMOS 4-phase Dynamic Logic (Short
Paper) [p. 529]
- Makoto Furuie, Bao-Yu Song, Yukihiro Yoshida, Takao Onoye, Isao Shirakawa
An array cell (AC) architecture for the
layout design is described, which is dedicated to low-power
design by means of the NMOS 4-phase dynamic
logic. An AC is constructed of (MxN)+2 transistors
so as to constitute each type of NMOS 4-phase logic
gates. A graph theoretic approach is exploited in the
layout design to reduce the layout area. A number of
experimental results demonstrate the practicability of
the proposed approach.
-
D6.4 A New Efficient Method for Substrate-Aware Device-Level Placement (Short
Paper) [p. 533]
- C. Lin, D. M. W. Leenaerts
The performance of high-frequently mixed-signal and
analog designs heavily relies on the actual layout of
the circuit components on device level. Substrate coupling
has been recognized as one of the major physical design
bottlenecks in this field. To deal with this problem
in an efficient way, a new technique is proposed based
on the sequence pair structure and corner stitching. Performance
gain is guaranteed at the cost of O(M) run-time complexity,
which is optimal. Simulations with randomly generated problem
instances show that practical run-times are indeed linear in the
size of the problem instance, thus enabling other design optimization
algorithms to incorporate the proposed method without increasing
run-time complexity.
Co-Chairs : Mineo Kaneko, Japan Advanced Inst. of Science and Technology, Japan
Peter M. Lee, Hitachi Ltd., Japan
-
E6.1 The Enhancing of Efficiency of the Harmonic Balance Analysis by Adaptation
of Preconditioner to Circuit Nonlinearity [p. 537]
- M.M. Gourary, S. G. Rusakov, S. L. Ulyanov, M.M. Zharov, K.K.
Gullapalli, B.J. Mulvaney
Krylov subspace techniques in harmonic balance
simulations become increasingly ineffective when applied to
strongly nonlinear circuits. This limitation is particularly
important in the simulation if the circuit it has components being
operated in a very nonlinear region. Even if the circuit contains
only a few very nonlinear components, Krylov methods using
standard preconditioners can become ineffective. To overcome
this problem, we present two adaptive preconditioners that
dynamically exploit the properties of the harmonic balance
Jacobian. With these techniques we have been able to retain the
advantages of Krylov methods even for strongly nonlinear circuits.
Some numerical experiments illustrating the techniques
are presented.
-
E6.2 Analog-Testability Analysis by Determinant-Decision-Diagrams based Symbolic
Analysis [p. 541]
- Tao Pi, C.-J. Richard Shi
The use of the column-rank of the sensitivity
matrix as a testability measure for parametric faults in
linear analog circuits was pioneered by Saeks in 1970s,
and later re-discovered by several others. Its practical
use has been, however, limited by how it can be
calculated. Numerical algorithms suffer from
inevitable round-off errors, while classical symbolic
techniques can only handle very small circuits. In this
paper , an innovative and efficient graph based
symbolic analysis approach, called Determinant
Decision Diagrams, is applied to testability
measurement and selection for optimum test vectors.
The new approach is promising in testability analysis
of much larger analog circuits.
-
E6.3 A Method for Linking Process-level Variability to System Performances [p. 547]
- Tomohiro Fujita, Ken-ichi Okada, Hiroaki Fujita, Hidetoshi Onodera, Keikichi
Tamaru
In this paper we present a statistical analysis method which
bridges the statistical information between process-level and
system-level. This enables us to evaluate the effect of process
variation at the system level. Also, we can derive constraints
on the process variation from a performance requirement. We
show an example of the hierarchical statistical analysis applied
to a Phase Locked Loop (PLL) circuit.
Co-Chairs : Kazuo Yano, Hitachi, Ltd., Japan
Takao Onoye, Kyoto Univ., Japan
-
A7.1 Embedded Tutorial : Design Challenges for 0.1um and Beyond [p. 553]
- Takayasu Sakurai
If we look into the scaling law carefully, we find that three crises
can be stringent in realizing LSI's for 0.1um and beyond: namely power
crisis, interconnection crisis, and complexity crisis. This paper
describes these crises and possible solutions to cope with the problems.
-
A7.2 A Hardware Accelerator for the Specular Intensity of Phong Illumination Model in
3-Dimensional Graphics [p. 559]
- Young-Su Kwon, In-Cheol Park, Chong-Min Kyung
This paper presents a special hardware implementation developed
for the computation of the specular term which is the most time
consuming part in the Phong's illumination. In the Phong shading, the
exponentiation
operation of two floating-point numbers is necessary for each
point inside a polygon. An approximation algorithm is developed to speed
up the exponentiation operation, and it is supported by simple hardware
that can be easily merged into a floating-point multiplier. The exponentiation
operation takes just 4 cycles in the proposed hardware while it takes
about 100-200 cycles in conventional floating-point units. Although an
approximation algorithm is employed for the exponentiation operation,
the amount of error is so minimal that the difference is virtually
indistinguishable.
-
A7.3 Radix-4 Modular Multiplication and Exponentiation Algorithms for the RSA Public-Key
Cryptosystem [p. 565]
- Jin-Hua Hong, Cheng-Wen Wu
We propose a radix-4 modular multiplication algorithm
based on Montgomery's algorithm, and a radix-4 cellular-array
modular multiplier based on Booth's multiplication
algorithm. The radix-4 modular multiplier can be used to
implement fast RSA cryptosystem. Due to reduced number
of iterations and pipelining, our modular multiplier is
four times faster than the cellular-array modular multiplier
based on the original Montgomery's algorithm. The time
to calculate a modular exponentiation is about n 2 clock
cycles, where n is the word length, and the clock cycle
is roughly equal to the delay time of a full adder. The
utilization of the multiplier is 100% by interleaving consecutive
exponentiations. Locality, regularity, and modularity
make the proposed architecture suitable for VLSI
implementation.
Keywords: cellular array, Montgomery algorithm, modular
multiplication, high radix modular multiplier, public-key
cryptography, RSA.
Chair : Masaharu Imai, Osaka Univ., Japan
-
B7.1 An Introduction to SLDL and Rosetta [p. 571]
- Steven E. Schulz, Texas Instruments, Inc., USA
The Rosetta Stone (permanently displayed at the British Museum of Art) has played a pivotal role
in enabling scientists to translate across a number of widely varying languages used around the
world throughout our history. Through this mapping process, linguists have been able to more
fully understand the semantic meaning embodied in the written word of our culture. Similarly, the
Rosetta language (as we will refer to it in this text) represents a modern-day effort to map across
semantic domains within the electronics-centric systems engineering world, motivated by the
trend towards System-Level Integration, or System-on-a-Chip (SoC). Rosetta plays a key role
in the System Level Design Language (SLDL) industry standards initiative.
-
B7.2 SystemC Standard [p. 573]
- Guido Arnout
The emergence and great popularity of system-on-chip
(SoC) designs has brought with it a variety of suggestions
for a single language that can describe all of the functional
requirements for those highly complex designs. This paper
takes a look at the requirements for system-level design
languages and evaluates what it will take for any of these
languages to be successful.
-
B7.3 Java Based Object Oriented Hardware Specification and Synthesis [p. 579]
- Tommy Kuhn, Wolfgang Rosenstiel
In this contribution we show how the object oriented
programming language Java can be used for the specification
of synthesizable hardware. In this context the JavaBeans
component model plays an important role. We show an integration
of the JavaBeans model which allows the specification of
hardware from different views and on different levels of abstraction.
Furthermore we point out restrictions of the language necessary
to perform high level synthesis.
-
B7.4 Superlog, A Unified Design Language for System-on-chip [p. 583]
- Peter L. Flake, Simon J. Davidmann
The design of systems consisting of custom software
controlling custom digital hardware is easier if a single
language can be used for system specification, software
development, hardware design and hardware verification.
Superlog takes features of existing languages for software
development and hardware design, adds features for system
specification and hardware verification, and blends them into a
single, coherent language.
Co-Chairs : Yukiya Miura, Tokyo Metropolitan Univ., Japan
Hiroshi Date, Inst. of Systems & Information Technologies Kyushu, Japan
-
C7.1 Performance Sensitivity Analysis Using Statistical Method and Its Applications to Delay Testing [p. 587]
- Jing-Jia Liou, Angela Krstic, Kwang-Ting Cheng, Deb Aditya Mukherjee, Sandip
Kundu
The performance of deep submicron designs can be
affected by various parametric variations, manufacturing defects,
noise or modeling errors that are all statistical in nature. We propose
a statistical framework for analyzing the performance sensitivity
of designs to various timing related defects/noise/variations.
The core engine of our approach is a highly efficient statistical timing
analysis tool. We describe the application of our framework
for delay fault modeling and analysis of resistive opens and shorts
and as well as interconnect crosstalk. We present experimental results
demonstrating the accuracy of our statistical framework as
compared to SPICE (for a given set of input patterns) and nominal
worst-case analysis. Experimental results for analysis of resistive
opens and shorts are also included.
-
C7.2 A Testability Metric for Path Delay Faults and Its Application [p. 593]
- Huan-Chih Tsai, Kwang-Ting Cheng, Vishwani D. Agrawal
In this paper, we propose a new testability metric
for path delay faults. The metric is computed efficiently using
a non-enumerative algorithm. It has been validated through extensive
experiments and the results indicate a strong correlation
between the proposed metric and the path delay fault testability
of the circuit. We further apply this metric to derive a path delay
fault test application scheme for scan-based BIST. The selection of
the test scheme is guided by the proposed metric. The experimental
results illustrate that the derived test application scheme can
achieve a higher path delay fault coverage in scan-based BIST.
Because of the effectiveness and efficient computation of this metric,
it can be used to derive other design-for-testability techniques
for path delay faults.
-
C7.3 A Non-Scan DFT Method at Register-Transfer Level to Achieve Complete Fault Efficiency [p. 599]
- Satoshi Ohtake, Hiroki Wada, Toshimitsu Masuzawa, Hideo Fujiwara
This paper presents a non-scan design-for-testability
(DFT) method for VLSIs designed at register-transfer
level (RTL) to achieve complete fault efficiency.
In RTL design, a VLSI generally consists of a controller
and a data path. The controller and the data path are connected
with internal signals: control signals and status signals.
The proposed method consists of the following two
steps. First, we apply our DFT methods [1] and [2, 3] to
the controller and the data path, respectively. Then, to
support at-speed testing, we append a test plan generator
which generates a sequence of test control vectors for the
modified data path. Our experimental results show that
the proposed method can reduce significantly both of test
generation time and test application time compared with
the full-scan design, though the hardware overhead of our
method is slightly larger than that of the full-scan design.
-
C7.4 A Sigma-Delta Modulation Based BIST Scheme for Mixed-Signal Circuits [p. 605]
- Jiun-Lang Huang, Kwang-Ting Cheng
In this work, we present the analysis of a built-in self-test
(BIST) scheme for mixed-signal circuits that is intended to
provide on-chip stimulus generation and response analysis.
Based on the sigma-delta modulation principle, the proposed
scheme can produce high-quality stimuli and obtain accurate
measurements without the need of precise analog circuitry.
Numerical simulations are conducted to validate our idea and
the results show that the scheme is a promising BIST approach
for mixed-signal circuits.
Organizer: Tokinori Kozawa, STARC, Japan
Moderator: Hiroto Yasuura, Kyushu Univ., Japan
Panelists:
Bill Joyner (SRC, USA)
Jan Rabaey (Univ. of California, Berkeley, USA)
Ivo Bolson (IMEC, Belgium)
Yoshiaki Masuhara (Hitachi Ltd., Japan)
Co-Chairs : Hidetoshi Onodera, Kyoto Univ., Japan
Nobuo Fujii, Tokyo Inst. of Technology, Japan
-
E7.1 A 12b 50 MHz 3.3V CMOS Acquisition Time Minimized A/D Converter [p. 613]
- Young-Deuk Jeon, Byeong-Lyeol Jeon, Seung-Chul Lee, Sang-Min Yoo,
Seung-Hoon Lee
A 12b 50 MHz CMOS analog-to-digital converter
(ADC) based on a pipelined architecture was designed to
demonstrate acquisition time minimization techniques for high-speed
two-stage amplifiers. The proposed techniques reduce
overshoots and undershoots of amplifier outputs and
acquisition time by controlling the bias currents of amplifiers.
The prototype ADC was fabricated in a 0.35 um double-poly
triple-metal n-well CMOS technology. The measured signal-to-noise-and-distortion ratio is improved by more than 5 dB
using the proposed techniques at a 50 MHz clock. The ADC
power consumption is 200 mW at 3.3 V and 50 MHz.
-
E7.2 A Benchmark Suite for Substrate Analysis [p. 617]
- Edoardo Charbon, Luis Miguel Silveira, Paolo Miliozzi
The paper proposes an initial benchmark set, suitable for substrate analysis and test. The
aim is to help accurately represent electrical noise injected into and picked up from substrate in a variety
of high performance circuits. Creating an accurate
image of such noise is becoming a critical requirement
with the expansion of real plug-and-play style designs.
Several important methods for the analysis of substrate parasitic coupling are reviewed in light of the
effect substrate noise has on the performance of analog and digital ICs over a wide frequency spectrum.
The requirements and formats for each benchmark are
described in full detail to allow possible algorithmic as
well as signal integrity tests.
-
E7.3 Embedded Tutorial : Substrate Crosstalk Analysis in Mixed Signal CMOS Integrated
Circuits [p. 623]
- Makoto Nagata, Atsushi Iwata
Substrate crosstalk modeling and analysis techniques for a reliable SoC oriented mixed signal
IC design are outlined. Substrate noise measurements
by use of a source follower and a latched comparator clearly indicate that
L_di/dt and R_i effects on
Vdd/Gnd wirings appear in the substrate. An experimental full chip
substrate crosstalk verification system based on a macroscopic substrate
noise model efficiently analyzes the analog performance degradation
relevant to the digital activity on the same chip.
Co-Chairs : Takayasu Sakurai, Univ. of Tokyo, Japan
Makoto Ikeda, Univ. of Tokyo, Japan
-
A8.1 Invited Talk: Importance of CAD Tools and Methodology in High Speed CPU
Design [p. 631]
- Haruyuki Tago, Kazuhiro Hashimoto, Nobuyuki Ikumi,
Masato Nagamatsu, Masakazu Suzuoki, Yasuyuki Yamamoto
Design methodologies and CAD for "Emotion
Engine" LSI are presented with emphasis on practical aspects
of verification and timing closure. A combination of simulation,
emulation and formal verification ensured the functional first
silicon for system evaluation. In order to control wire delay in
early design stage, floor-plan based synthesis and wire load
estimation are adopted for quick timing closure.
-
A8.2 300MHz Design Methodology of VU for Emotion Synthesis [p. 635]
- Takayuki Kamei, Hideki Takeda, Yukio Ootaguro, Takayoshi Shimazawa, Kazuhiko
Tachibana, Shin'ichi Kawakami, Seiji Norimatsu, Fujio Ishihara, Toshinori Sato,
Hiroaki Murakami, Nobuhiro Ide, Yukio Endo, Akira Aono, Atsushi Kunimatsu
With advance in silicon technology and system
integrating technology, utmost care for RC-delay must be taken
in high performance LSI design. This paper describes the new
design method applied to the Vector Unit(VU) implemented in
the recently announced high-performance 3D-graphics engine,
"Emotion Engine". The key features of the design method are the
followings:
careful functional design of the VU architecture
design hierarchy consistency between RTL descriptions and
floorplan plan
accurate estimation of wire load in pre-layout static timing
analysis
With the above design features, the VU is has achieved the
operating frequency of 300MHz.
-
A8.3 Repeater Insertion Method and its Application to a 300MHz 128-bit 2-way
Superscalar Microprocessor [p. 641]
- Norman Kojima, Yukiko Parameswar, Christian Klingner, Yukio Ohtaguro, Masataka
Matsui, Shigeaki Iwasa, Tatsuo Teruyama, Takayoshi Shimazawa, Hideki Takeda,
Kouji Hashizume, Haruyuki Tago, Masaaki Yamada
Interconnect delay is dominant in today's high
speed VLSI circuits and there have been various works to resolve
it [1],[2],[3]. A repeater insertion tool "RePertory" has been
newly developed to solve the interconnect timing problem on a
300MHz 128-bit 2-way Superscalar Microprocessor. Because of
its practical simple algorithm, the location of over 5700 repeaters
distributed over 5 modules is calculated in 26 minutes on 5
UltraSPARC-II 360MHz in parallel. This paper describes the
goals of RePertory for this project, the timing improvement flow
with RePertory, the results and the analysis of this repeater
insertion method.
-
A8.4 Clock Design of 300MHz 128-bit 2-way Superscalar Microprocessor [p. 647]
- Fujio Ishihara, Christian Klingner, Ken-ichi Agawa
Less than 116ps overall clock skew has been achieved across the 15.02mm
x 15.04mm die by balanced clock path routing and differential clock signal
distribution in the global clock tree of 300 MHz 128-bit 2-way superscalar
microprocessor. The shared clock wire configuration and clock buffer layout
patterns over the whole die enhance the clock skew insensitivity to process
fluctuation. Combination of three different clock tuning methods is successfully
applied to the entire clock tree and the clock skew is minimized efficiently
within a limited design period.
One Language or More? -How Can We Design an SoC at a System Level? - [p. 653]
Organizer: Masaharu Imai, Osaka Univ., Japan
Moderator: Gary Smith, Dataquest, USA
Panelists:
Steven Schulz - Texas Instruments, Inc., USA (SLDL)
Karen Bartleson - Synopsys, Inc., USA (SystemC)
Daniel D. Gajski - Univ. of Calfornia, Irvine, USA (SpecC)
Wolfgang Rosenstiel - Univ. of Tuebingen, Germany (Java)
Peter Flake - Co-Design Automation, Inc., USA (Superlog)
Hiroto Yasuura - Kyushu Univ., Japan (C,C++)
Masaharu Imai - Osaka Univ., Japan(VHDL)
The emerging Core Based Design Methodology is forcing engineers up into the
Electronic System level of design. It is only there that they can make the
hardware and software trade-offs needed for optimal design performance.
Unfortunately, as we move up into ES Level Design, our two hardware
description languages, VHDL and Verilog, are running out of steam.
The design language dilemma is, "Which language will fill our need
for a new ESL language ?" Some start-ups are pushing alternate
hardware languages, while others are approaching the problem with
modifications of software languages. The modifications are necessary,
as software languages are sequential languages. A hardware design
language must be concurrent.
This panel will discuss the merits of various languages and
various methodologies being presented to solve this problem.
Co-Chairs : Chung-Kuan Cheng, Univ. of California, San Diego, USA
Toshiyuki Shibuya, Fujitsu Laboratories, Ltd., Japan
-
D8.1 Circuit Partitioning with Coupled Logic Restructuring Techniques [p. 655]
- Yu-Liang Wu, Xiao-Long Yuan, David Ihsin Cheng
Traditionally, the circuit partitioning is done by modeling the circuit as
a graph and the partitioning is carried out without altering the circuit
itself. Applying the technique of circuit rewiring, the partitioning can be
further improved by doing some logic logic perturbation along the cut-line
to drag the solution out of some local minimal. In this paper, we propose
an effective coupling scheme of two powerful rewiring techniques to further
improve upon those already selected best partition results produced by other
conventional partition tools. The improvement is attributed to the additional
capability of exercising a guided circuit perturbation, which was not available
in the conventional schemes.
The known ATPG-based and the recently proposed graph-based rewiring techniques
compose our coupling scheme. The ATPG-based rewiring technique is very flexible;
however the graph-based rewiring technique is faster and can couple quite well
with the ATPG-based scheme to exploit a much larger room for logic perturbations.
Our encouraging experiment results show that these two techniques couple each
other quite well for this application without costing much CPU overhead. This
scheme is also quite efficient thus should be very useful for large circuits.
-
D8.2 Improved Algorithms for Hypergraph Bipartitioning [p. 661]
- Andrew E. Caldwell, Andrew B. Kahng, Igor L. Markov
Multilevel Fiduccia-Mattheyses (MLFM) hypergraph partitioning [3, 22, 24] is a fundamental optimization in VLSI
CAD physical design. The leading implementation, hMetis
[23], has since 1997 proved itself substantially superior in
both runtime and solution quality to even very recent works
(e.g., [13, 17, 25]). In this work, we present two sets of results: (i) new techniques for at FM-based hypergraph partitioning (which is the core of multilevel implementations),
and (ii) a new multilevel implementation that offers leading-edge performance.
Our new techniques for at partitioning confirm the conjecture from [10], suggesting that specialized partitioning
heuristics may be able to actively exploit fixed nodes in partitioning instances arising in the driving top-down placement
context. Our FM variant is competitive with traditional FM
on instances without terminals [1] and considerably superior
on instances with fixed nodes (i.e., arising during top-down
placement [8]).
Our multilevel FM variant avoids several complex heuristics and
non-trivial tunings that often lead to complex implementations; it
achieves trade-offs between solution quality and run time that are comparable or better than those
achieved by hMetis-1.5.3 (the latest available version).
Following [6], we attempt to provide algorithm descriptions
that are as detailed and unambiguous as possible, to allow
replicability and speed improvements in future research.
-
D8.3 Multi-way Partitioning Using Bi-partition Heuristics [p. 667]
- Maogang Wang, Sung Lim, Jason Cong, Majid Sarrafzadeh
The multi-way partition problem is very
important in various applications. In this paper, we
use analytical and experimental results to study the
k -way partition problem. We introduce the concept of
embedding graph for the the k -way partition problem.
Based on this concept, we explain different scenarios of
using a bi-partition heuristic to solve the k -way
partition problem. If C denote the optimal cut cost
for the k -way partition problem and the bi-partition
heuristics we use are %delta;-approzimation heuristics (defined in
Section 2), we prove that the cut cost from
the hierarchical approach has an approximate upper
bound of %delta;C_log k while the cut cost from the all-way bipartition,
or at approach, has an upper bound of %delta;Ck .
This is contrary to some claims made in recent literature
( and CAD tools designed based on it). Experimental
results strongly support our theoretical analysis.
Our results show that for large target graph, the
hierarchical approach is about 77% better than the
single-pass all-way bi-partition approach. The all-way
bi-partition approach will perform better in a multi pass
set-up. However, the hierarchical approach is still
on average 7 : 1% better in quality and 144 times faster
than the multi-partition all-way bi-partition approach.
The main conclusion of this paper is, contradicted to
what has been suggested in literature, hierarchical bi-partitioning is
a more effective multi-way partitioning scheme.
|