|
ASP-DAC 2003 ABSTRACTS
Sessions:
[1A]
[1B]
[1C]
[1D]
[2A]
[2B]
[2C]
[2D]
[3A]
[3B]
[3C]
[3D]
[4A]
[4B]
[4C]
[4D]
[5A]
[5B]
[5C]
[5D]
[6A]
[6B]
[6C]
[6D]
[7A]
[7B]
[7C]
[7D]
[8A]
[8B]
[8C]
[8D]
[9A]
[9B]
[9C]
[9D]
Co-Chairs: Luca Benini, Hiroshi Nakamura
-
1A-1 BEAM: Bus Encoding Based on Instruction-Set-Aware Memories [p. 3]
-
Y. Aghaghiri, F. Fallah, M. Pedram
This paper introduces a new approach for
minimizing power dissipation on the memory address bus. The
proposed approach relies on the availability of smart memories
that have certain awareness of the instruction format of one or
more architectures. Based on this knowledge, the memory
calculates or predicts the instruction and data addresses.
Hence, not all addresses are sent from the processor to the
memory. This, in turn, significantly reduces the activity on the
memory bus. The proposed method can eliminate up to 97% of
the transitions on the instruction address bus and 75% of the
transitions on the data address bus with a small hardware
overhead. The actual power savings of 85% for the instruction
bus and 64% for the data bus were achieved for a per-line bus
capacitance of 10pF.
-
1A-2 Irredundant Address Bus Encoding Techniques Based on Adaptive Codebooks for Low Power [p. 9]
-
S. Komatsu, M. Fujita
The power dissipation at the off-chip bus is a significant
part of the overall power dissipation in digital systems.
This paper presents irredundant address bus encoding methods
which reduce signal transitions on the instruction address buses
by using adaptive codebook methods. These methods are based
on the temporal locality and spatial locality of instruction address.
Since applications tend to JUMP / BRANCH to limited sets
of addresses, proposed encoding methods assign the least signal
transition codes to the addresses of JUMP / BRANCH operations
in the past. Our encoding methods reduce the signal transitions
on the instruction address buses by an average of 88%.
-
1A-3 Multi-Parametric Improvements for Embedded Systems Using Code-Placement and Address Bus Coding [p. 15]
-
S. Parameswaran, J. Henkel, H. Lekastas
Code placement techniques for instruction code have shown
to increase an SOCs performance mostly due to the increased
cache hit ratios and as such those techniques can be a major
optimization strategy for embedded systems. Little has been
investigated on the interdependencies between code placement
techniques and interconnect traffic (e.g. bus traffic) and optimization
techniques combining both. In this paper we show as
the first approach of its kind that a carefully designed known
code placement strategy combined and adapted to a known interconnect
encoding scheme does not only lead to a performance
increase but it does also lead to a significant reduction
of interconnect-related energy consumption. This becomes especially
interesting since future SOC bus systems (or more
general: änetworks on a chipä) are predicted to be a dominant
energy consumer of an SOC. We show that a high-level optimization
strategy like code placement and a lower-level optimization
strategy like interconnect encoding are NOT orthogonal.
Specifically, we report cache miss reduction ratios of 32%
in average combined with bus related energy savings of 50.4%
in average (with a maximum of up to 95.7%) by means of our
combined optimization strategy. The results have been verified
by means of diverse real-world SOC applications.
-
1A-4 Memory Access Pattern Analysis and Stream Cache Design for Multimedia Applications [p. 22]
-
J. Lee, C. Park, S. Ha
Memory system is a major performance and power bottleneck
in embedded systems especially for multimedia applications.
Most multimedia applications access stream type of data
structures with regular access patterns. It is observed that
conventional caches behave poorly for stream-type data
structure. Therefore, prediction-based prefetching techniques
have been extensively researched to exploit the regular access
patterns. Prefetching, however, may pollute the cache if the
prediction is not accurate and needs extra hardware
prediction logic. To overcome these problems, we propose a
novel hardware prefetching technique that is assisted by static
analysis of data access pattern with stream caches. With the
proposed stream cache architecture, we could achieve
significant performance improvement compared with the
conventional cache architecture.
Co-Chairs: Masanori Hashimoto, Kaushik Roy
-
1B-1 A Statistical Gate Delay Model for Intra-chip and Inter-chip Variabilities [p. 31]
-
K. Okada, K. Yamaoka, H. Onodera
This paper proposes a model to calculate statistical
gate-delay variation caused by intra-chip and inter-chip variabilities.
Our model consists of a statistical transistor model and a
gate-delay model. We present a modeling and extracting method
of transistor characteristics for the intra-chip variability and the
inter-chip variability. In the modeling of the intra-chip variability,
it is important to consider a gate-size dependence by which
the amount of intra-chip variation is affected. This effect is not
captured in a statistical delay analysis reported so far. Our gate-delay
model characterizes a statistical gate-delay variation using
a response surface method (RSM) according to the intra-chip and
inter-chip variability of each transistor in a gate. We evaluate the
accuracy of our model, and we show some simulated results of a
circuit delay variation characterized by the measured variances
of transistor currents.
-
1B-2 A Fast and Accurate Method for Interconnect Current Calculation [p. 37]
-
M. Shao, D.F. Wong, Y. Gao, H. Cao, L.-P. Yuan
As VLSI technology continues to scale down, electromigration
problem has become one of the dominant factors in
determining system reliability. This problem is caused by high
current density flowing in the metal interconnect. Therefore,
current evaluation is a crucial concern in IC design. SPICE level
circuit simulators are excellent for doing current calculation,
however, their running time are too expensive to be used
repeatedly in design synthesis loops. In this paper, we propose
an efficient approach for the interconnect current calculation.
This method is based on moment matching but does not need
high order moments. It only needs traversing the RC tree once
to get the mean current value of every segment, traversing the
tree once more is enough for the RMS current calculation, and
two more traversals is sufficient for the peak current calculation.
We apply our method to a larger number of interconnects
getting close-to-SPICE accuracy at significantly faster runtimes.
In particular, applying the method to 17,387 wire segments in
the clock tree of a commercial IC, we obtained that the average
deviation error of mean current is 0.0569%, average RMS
current error is 0.703% and average peak current error is
6.552%. It took 28 hours for HSPICE to get current value of all
the wire segments and it only took our method 156 seconds.
-
1B-3 Calculating the Effective Capacitance for the RC Interconnect in VDSM Technologies [p. 43]
-
S. Abbaspour, M. Pedram
In this paper, we present a new technique for calculating an
effective capacitance of an RC interconnect line in very deep
submicron design technologies. The calculation scheme
guarantees that the effective capacitance model simultaneously
matches both the 50% propagation delay and the 0-to-0.8Vdd
output transition behavior of a standard cell driving an RC
interconnect. Experimental results show that the new technique
exhibits high accuracy (less than 5% error) and high efficiency
(converges in two or at most three iterations). The paper also
includes extensions to handle complex cells as drivers of the RC
interconnect.
-
1B-4s Reduction of Crosstalk Noise by Optimizing 3-D Configuration of Routing Grid [p. 49]
-
A. Sakai, T. Yamada, Y. Matsushita, H. Yasuura
In this paper, we propose novel physical design
techniques for a sub-quarter micron system-on-a-chip (SoC). By
appropriately optimizing the routing grid space or the cell
utilization ratio, the coupling effects are almost eliminated. By
employing our proposed techniques on a 0.13µm six-layer physical
design, the longest path delay is significantly decreased by 15%
maximum without the need for process improvement. This
significant delay reduction, which corresponds to a half
generation of process progress, greatly accelerates the
performance of SoCs.
-
1B-5s Design Tools for 3-D Integrated Circuits [p. 53]
-
S. Das, A. Chandrakasan, R. Reif
We present a set of design tools for 3-D integration.
Using these tools - a 3-D standard-cell placement tool,
global routing tool, and layout editor - we have targeted existing
standard-cell circuit netlists for fabrication using wafer bonding.
We have analyzed the performance of several circuits using these
tools and find that 3-D integration provides significant benefits.
For example, relative to single-die placement, we observe on average
28% to 51% reduction in total wire length.
Co-Chairs: Nikil Dutt, Akira Fukuda
-
1C-1 An On-line Approach for Power Minimization in QoS Sensitive Systems [p. 59]
-
J. L. Wong, M. Potkonjak, G. Qu
Majority of modern mobile systems have two common
denominators: quality-of-service (QoS) requirements, such
as latency and synchronization, and strict energy constraints.
However, until now no synthesis techniques have been proposed
for the design and efficient use of such systems. We have two main
objectives: synthesis and conceptual. The synthesis goal is to introduce
the first design technique for quality-of-service (QoS) low
power synthesis. The conceptual objective is to develop a generic
technique for the automatic development of on-line algorithms
from efficient off-line algorithms using statistical techniques.
We first summarize a system of provably-optimal techniques
that minimize energy consumption of stream-oriented applications
under two main QoS metrics: latency and synchronization.
Specifically, we study how multiple voltages can be used to simultaneously
satisfy hardware requirements and minimize power
consumption, while preserving the requested level of QoS in terms
of latency and synchronization. The off-line algorithm is used as
input to statistical software used to identify important relevant
parameters of the processes, buffer occupancy rate indicators,
and a way how combine them to form a fast and efficient on-line
algorithm which decides which task to run at which voltage. The
effectiveness of the algorithms is demonstrated on a number of
standard multimedia benchmarks.
-
1C-2 Energy Minimization of Real-time Tasks on Variable Voltage Processors with Transition Energy Overhead [p. 65]
-
Y. Zhang, X. Hu, D. Chen
In this paper, we address the problem
of minimizing energy consumption of real-time tasks
on variable voltage processors whose transition energy
overhead is not negligible. Voltage settings with
minimum number of transitions are found first and
sequences of lower voltage cycles are evaluated to decide
voltage for each cycle of every task. Experimental
results demonstrate that our approach can reduce
energy consumed by transitions from 41% to 8% and
save more energy.
-
1C-3 Register Aware Scheduling for Distributed Cache Clustered Architecture [p. 71]
-
Z. Wang, X.S. Hu, E. Sha
Increasing wire delays have become a serious problem
for sophisticated VLSI designs. Clustered architecture offers
a promising alternative to alleviate the problem. In the clustered
architecture, the cache, register file and function units are
all partitioned into clusters such that short CPU cycle time can
be achieved. A key challenge is the arrangement of inter-cluster
communication. In this paper, we present a novel algorithm for
scheduling inter-cluster communication operations. Our algorithm
achieves better register resource utilization than the previous
methods. By judiciously putting the selected spilled variables
into their corresponding consumer's local cache, the costly cross-cache
transfer is minimized. Therefore, the distributed caches
are used more efficiently and the register constraint can be satisfied
without compromising the schedule performance. The experiments
shows that our technique outperforms the existing cluster-oriented
schedulers.
-
1C-4 Data Partitioning for Maximal Scratchpad Usage [p. 77]
-
M. Verma, S. Steinke, P. Marwedel
The energy consumption for Mobile Embedded
Systems is a limiting factor because of todayâs battery capacities.
The memory subsystem consumes a large chunk of the energy,
necessitating its efficient utilization. Energy efficient scratchpads
are thus becoming common, though unlike caches they require to
be explicitly utilized. In this paper, an algorithm integrated into
a compiler is presented which analyzes the application, partitions
an array variable whenever its beneficial, appropriately modifies
the application and selects the best set of variables and program
parts to be placed onto the scratchpad. Results show an energy
improvement between 5.7% and 17.6% for a variety of applications
against a previously known algorithm.
Co-Chairs: Kiyoharu Hamaguchi, Yuji Kukimoto
-
1D-1 SAT-based Sequential Depth Computation [p. 87]
-
M. Mneimneh, K. Sakallah
Determining the depth of sequential circuits is a crucial
step towards the completeness of bounded model checking
proofs in hardware verification. In this paper, we formulate sequential
depth computation as a logical inference problem for
Quantified Boolean Formulas. We introduce a novel technique to
simplify the complexity of the constructed formulas by applying
simple transformations to the circuit netlist. We also study the
structure of the resulting simplified QBFs and construct an efficient
SAT-based algorithm to check their satisfiability. We report
promising experimental results on some of the ISCAS 89 benchmarks.
-
1D-2 Logic Verification Based on Diagnosis Techniques [p. 93]
-
A. Veneris, A. Smith, M.S. Abadir
We present a formal logic verification methodology for
combinational circuits. The method uses simulation,
logic diagnosis and ATPG to identify circuit lines that
implement equivalent logic functions efficiently. One advantage
of the proposed technique is that it identifies
line equivalences under controllability and observability
don't care conditions, while not suffering from false negatives.
The method is easy to implement, and, due to
its general nature, existing techniques can benefit from
ideas described here. We also give implementation details
and present experiments to confirm its potential.
-
1D-3 Algorithms for Compacting Error Traces [p. 99]
-
Y.-A. Chen, F.-S. Chen
In this paper, we present a concept of compacting
the error traces generated by pseudo-random/random simulations.
The new shorter error trace not only decreases the time of
user's debugging process but also reduces the simulation time required
to verify the bug fixes. Two algorithms CET1 and CET2 are
presented to perform the task of compacting the error trace. Both
algorithms first use an efficient approach to eliminate the redundant
states to generate the unique states of the error trace. Then,
CET1 build the connected graph of these unique states by computing
the reachable states by one cycle for each unique state, and
then apply Dijkstra's shortest path algorithm to find out the shortest
error trace in the connected graph. Compared with CET1,
CET2 computes the reachable states by one cycle for those unique
states, when they are needed in Dijkstra's shortest path algorithm
to find the shortest error trace. After finding the shorter trace,
the corresponding input/output test vectors are generated. The
experimental results show that both algorithms can reduce the
length of error traces dramatically for most cases using reasonable
memory. For cases required longer CPU time to find the
shortest trace, CET2 is up to 37 times faster than CET1.
-
1D-4s Transaction-based Waveform Analysis for IP Selection [p. 104]
-
J. Liu, E. Shragowitz
In a process of IP selection, it is necessary to
establish whether a candidate IP is equivalent to a behavioral
model of a design proposed by a customer. It is desirable to
perform this verification to exclude IPs, which "don't match"
the rest of the design. This work combines a simulation
approach to establishing equivalence between models with
formal regular expression techniques to provide transaction-level
preliminary evaluation of IP suitability. Such evaluation
could precede a decision to acquire IP.
-
1D-5s An Automatic Interconnection Rectification Technique for SoC Design Integration [p. 108]
-
C.-Y. Wang, S.-W. Tung, J.-Y. Jou
This paper presents an automatic interconnection
rectification (AIR) technique to correct the misplaced interconnection
occurred in the integration of a SoC design automatically.
The experimental results show that the AIR can correct the misplaced
interconnection and therefore accelerates the integration
verification of a SoC design.
Co-Chairs: Nagisa Ishiura, Wolfgang Rosenstiel
-
2A-1 Typing Abstractions and Management in a Component Framework [p. 115]
-
F.J. Doucet, S.K. Shukla, R.K. Gupta
We consider the type inference problems in a compositional
design environment where the components are automatically
instantiated from pre-existing C++-based intellectual
property (IP) libraries. We present a component integration language
based on scripting for design specification. Our focus is architectural
aspects in specification that uses aggregation, as opposed
to the more commonly used inheritanceö for composition
of components. Our approach simplifies architectural specification
by employing a type inference and type-management-environment.
We show that the type inference problem is NP-complete.
We present a heuristic based on code generation and parameterization
to solve the type inference for IP selection in our C++-based
composition environment. We have implemented the composition
and type management in the BALBOA framework. The results
show the utility of our approach.
-
2A-2 Event-Driven Observability Enhanced Coverage Analysis of C Programs for Functional Validation [p. 123]
-
F. Fallah, I. Ghosh, M. Fujita
Software programs written in some programming languages like C, C++, Java, etc,
are mostly verified by functional simulation. Since exhaustive functional
simulation is impossible for even a small C program, it is important to
quantitatively measure the extent of design verification during simulation by
a set of test vectors. Various coverage metrics have been proposed for
measuring the degree of design verification. Most of them compute the extent
of design excitation (controllability) but are unable to say whether the
excitation responses have propagated to observable points in the program
(observability). In this paper we propose a metric for code coverage
analysis of C programs that addresses not only controllability but
tackles observability as well. Thus, this metric is able to tell what
percentage of the simulation responses have been propagated to observable
points in the program like primary outputs or printed variables. We improve
upon a recently proposed observability enhanced software coverage metric by
increasing the accuracy of the analysis as well as decreasing the simulation
runtime overhead by using an event-driven coverage analysis method. We report
some experimental results of using our coverage analysis tool for several
C programs.
-
2A-3 Trace-driven Rapid Pipeline Architecture Evaluation Scheme for ASIP Design [p. 129]
-
J.K. Kim, T.G. Kim
This paper proposes a rapid evaluation scheme of pipeline
architecture using phase-accurate simulation with only delay
model and trace. With latency information for every stage, we
can decide if an instruction in one stage can proceed to the next
stage or if an instruction can be issued for each cycle without
evaluating the value for registers. Branch target becomes
available with trace generated by fast instruction set
simulation. Fast verification time becomes possible because
instruction set simulation is performed only once.
-
2A-4 A Hardware/Software Partitioning Algorithm for SIMD Processor Cores [p. 135]
-
K. Tachikake, N. Togawa, Y. Miyaoka, J. Choi, M. Yanagisawa, T. Ohtsuki
This paper proposes a new hardware/software partitioning
algorithm for processor cores with SIMD instructions.
Given a compiled assembly code including SIMD
instructions, a timing constraint of execution time, and
available hardware units, the proposed algorithm synthesizes
an area-optimized processor core with a new assembly
code. Firstly, we assume an initial processor core on
which an input assembly code can run with the shortest
execution time. Secondly we reduce a hardware unit
added to a processor core one by one while the timing
constraint is satisfied. At the same time, we update the
assembly code so that it can run on the new processor
configuration. By repeating this process, we finally obtain
a processor core architecture with small area under
the given timing constraint. We expect that we can obtain
a processor core which has appropriate SIMD functional
units for running the input application program.
The promising experimental results are also shown.
Co-Chairs: Hidetoshi Onodera, Martin D. F. Wong
-
2B-1 Approximate Formulae Approach for Efficient Inductance Extraction [p. 143]
-
A. Kurokawa, T. Sato, H. Masuda
In this paper, we present a new and effective
approach to the extraction of on-chip inductance, in which we
apply approximate formulae. The equations are based on the
assumption of filaments or bars of finite width and zero
thickness and are derived through Taylorâs expansion of the
exact formula for mutual inductance between filaments.
Despite the assumption of uniform current density in each of
the bars, the model is sufficiently accurate for the
interconnections of current and future LSIs, in which most of
the wires are not affected by the skin and proximity effects.
Expression of the equations in polynomial form provides a
balance between accuracy and computational complexity.
These equations are mapped according to the geometric
structures for which they are most suitable in minimizing
runtime in the calculation of inductance while remaining
accurate to within 3%. Within the geometrical constraints, the
wires are of arbitrary specification.
From a comprehensive evaluation on the ITRS -specified
global wiring structure for 2003, the values for inductance
extracted through the proposed approach are within 3% of the
values obtained by commercial three-dimensional (3-D) field
solvers. The efficiency of the proposed approach is also
demonstrated by extraction from a real layout design that has
300-k interconnecting segments.
-
2B-2 Accurate Prediction of the Impact of On-chip Inductance on Interconnect
Delay using Electrical and Physical Parameter-based RSF [p. 149]
-
T. Sato, T. Kanamoto, A. Kurokawa, Y. Kawakami, H. Oka, T. Kitaura, A. Ikeuchi,
H. Kobayashi, M. Hashimoto
This paper proposes a new methodology to accurately
predict the impact of inductance on on-chip wire delay
using response surface functions (RSF). The proposed
methodology consists of two stages which involves first
calculating the delay difference between RC and RLC wire
models for a set of parameter variations, then building
RSFs using electrical parameters such as wire resistance,
capacitance, etc. , and physical parameters such as wire
width, pitch, etc. as variables. The proposed methodology
can help 1) to define design rules for avoiding inductance
effects, 2) to point out wires that require RLC delay calculation,
and 3) to estimate and correct the delay when
using an RC model. An example design rule for limiting
self inductance and accurate estimation of the delay
difference for a 100 nm technology node is also presented.
-
2B-3 A Metric for Analyzing Effective On-Chip Inductive Coupling [p. 156]
-
G. Zhong, C.-K. Koh, K. Roy
In this paper, we propose a metric for effective inductive
coupling: the matrix (R+ jwL) 1, where R and L are the
resistance and inductance matrices. We use this metric to analyze
the effectiveness of shields on reducing inductive coupling. Our
analysis shows how the resistances of shields affect the effective
inductive coupling between signal nets. SPICE simulations are
carried out to validate the proposed metric.
-
2B-4 Determination of Worst-Case Crosstalk Noise for Non-Switching Victims in GHz+ Interconnects [p. 162]
-
J. Chen, L. He
Considering RLC interconnect model and multiple
switching aggressors, we study switching pattern generation and
switching time alignment that leads to worst-case crosstalk noise
for a quiet victim or a noisy one. We assume that aggressors
can have arbitrary switching patterns and can switch at arbitrary
times. We show that the commonly used superposition algorithm
results in 15% underestimation on average, and propose a new
algorithm that has virtually the same complexity as the superposition
algorithm but approximates the exhaustive search very well
with only 4% underestimation on average. Further, we show that
applying RC model to GHz+ interconnects in IRTS 0.10um technology
underestimates crosstalk noise by up to 80%, and convincingly
conclude that RLC model is necessary to analyze such interconnects.
Co-Chairs: Hiroaki Ueno, Zhiping Yu
-
2C-1 Recent Developments in ESD Protection for RF
ICs [p. 171]
-
A. Wang
New challenge in ESD protection design for RF ICs
is to address the complex interactions between the ESD
protection network and the core circuit being protected in both
directions. This paper reviews recent developments in RF ESD
protection design, including switching and mis-triggering of
ESD protection networks; ESD-induced parasitic capacitive,
resistive, noise coupling and self-generated noise effects; RF
ESD evaluation techniques; and low-parasitic compact RF ESD
protection solutions.
-
2C-2 Temperature-Independence-Point Properties for 0.1um-Scale Pocket-Implant
Technologies and the Impact on Circuit Design [p. 179]
-
K. Hisamitsu, H. Ueno, M. Tanaka, D. Kitamaru, M. Miura-Mattausch,
H.J. Mattausch, S. Kumashiro, T. Yamaguchi, K. Yamashita, N. Nakayama
The temperature-independence point (TIP) of the
drain current for MOS transistors in a 0:1µm-scale pocket-implant
technology is gate-length (Lg) dependent and has different
magnitudes for n-MOSFET and p-MOSFET. Circuits such as
ring-oscillators have a TIP, lying between the values for n- and p-
MOSFET. The circuit TIP is close to the n-MOSFET TIP for long
Lg and gets closer to the p-MOSFET TIP for short Lg. The reason
is the different temperature dependence of electron and hole mobility
as a function of Lg. Due to the high field effect, oscillation
periods of ring-oscillators with short Lg hardly improve, when the
supply voltage is raised beyond the TIP. Therefore, an advantageous
supply-voltage (VDD) choice for pocket-implant technologies
is near the TIP of circuits, allowing a favorable combination
of short switching delay and minimized temperature dependence.
By designing the Vth;p closer to Vth;n, not only the low power dissipation,
due to the reduction of the TIP, but also the suppressed
TIP fluctuation can be realized.
-
2C-3 Behavioral Modeling of EM Devices by Selective Orthogonal Matrix Least-Squares Method [p. 184]
-
Y. Tanji, M. Suzuki, T. Watanabe, H. Asai
This paper presents a method for modeling EM
devices, where sample data obtained by numerical EM solver is
approximated into a rational matrix of complex s. The model is
described in verilog-AMS, thus, the EM devices can be
simulated with the digital/circuit mixed circuits described at
various abstraction levels. To generate the model, the selective
orthogonal matrix least-squares method is presented. The
computational efficiency of the proposed approach is confirmed
on a commercial simulator, compared with the numerical EM
method.
Co-Chairs: Giovanni De Micheli, Shin-Ichi Minato
-
2D-1 A BDD-based Fast Heuristic Algorithm for Disjoint Decomposition [p. 191]
-
T. Bengtsson, A. Martinelli, E. Dubrova
This paper presents a heuristic algorithm for disjoint
decomposition of a Boolean function based on its ROBDD
representation. Two distinct features make the algorithm feasible
for large functions. First, for an n-variable function, it checks only
O(n2) candidates for decomposition out of O(2n) possible ones. A
special strategy for selecting candidates makes it likely that all
other decompositions are encoded in the selected ones. Second,
the decompositions for the approved candidates are computed
using a novel IntervalCut algorithm. This algorithm does not
require re-ordering of ROBDD. The combination of both techniques
allows us to decompose the functions of size beyond that
possible with the exact algorithms. The experimental results on
582 benchmark functions show that the presented heuristic finds
95% of all decompositions on average. For 526 of those functions,
it finds 100% of the decompositions.
-
2D-2 Logic Optimization for Asynchronous Speed Independent Controllers Using Transduction Method [p. 197]
-
H. Saito, H. Nakamura, M. Fujita, T. Nanya
Asynchronous speed independent (SI) circuits
based on an unbounded gate delay model often suffer from high
area penalty. It happens due to the lack of efficient global optimization.
This paper presents a boolean optimization method
based on tranduction method to optimize asynchronous SI circuits
while preserving hazard-freeness.
-
2D-3 Technology Mapping for Low Leakage Power and High Speed with Hot-Carrier Effect Consideration [p. 203]
-
C.-w. Kang, M. Pedram
Leakage power and hot-carrier effects are emerging
as key concerns in deep sub-micron CMOS technologies with respect
to their effects on the total power dissipation and reliability of VLSI
circuits. Leakage power dissipation is rapidly becoming a
substantial contributor to the total power dissipation as threshold
voltage becomes small. Similarly, the hot-carrier effect is one of the
most significant failure mechanisms in high-density VLSI circuits. In
this paper, a technology mapping technique is presented for use in
reducing the leakage power dissipation of the circuit by utilizing a
dual-threshold voltage cell library and for minimizing the aged
delay of the circuit by considering the effect of hot carriers on the
cell speeds as the circuit ages. In addition, this paper presents two
methods to reduce delay during technology mapping: primary
output ordering and pin permutation. Experimental results show
that the total power dissipation and leakage power dissipation can
be reduced by up to 27% and 52% as a result of the leakage-aware
technology mapping and that the circuit aging phenomenon can be
reduced by up to 10.6% as a result of hot-carrier-aware technology
mapping. Delay was also reduced by up to 13% by using primary
output ordering and pin permutation.
-
2D-4s Synthesis of High Performance Low Power PTL Circuits [p. 209]
-
D. Samanta, M.C. Dharmadeep, A. Pal
Among the various CMOS logic families, PTL has been
recognized as one of the potential alternatives to static CMOS for the
synthesis of high performance and low power circuits. Moreover, as BDDs
can be readily mapped to PTL circuits, use of BDDs has been synonymous
with the synthesis of PTL circuits. Most of the reported works on PTL
synthesis are based on the Reduced Ordered BDDs (ROBDDs). We have
developed a novel heuristic-based technique for obtaining Reduced
Unordered BDDs (RUBDDs), which leads to circuits of smaller size having
lesser delay and smaller power consumption compared to the existing
results. We propose the technology mapping using the popular LEAP-like
cells, such that the PTL circuit synthesis flow has the same flavor as that of
the standard cell-based static CMOS circuit synthesis. We have also
developed models for the estimation of delay and power consumption of the
synthesized PTL circuits and compared those with the static CMOS and
other existing PTL-based circuit realizations.
-
2D-5s A Technology Mapping Algorithm for Heterogeneous FPGAs [p. 213]
-
C.-C. Kao, Y.-T. Lai
In this paper, a technology mapping algorithm is proposed for heterogeneous
FPGAs. The technology mapping problem is first formulated as a flow network
problem. Then, an algorithm based on the min-cost max-flow algorithm is
presented to select a proper set of feasible LUTs for various objectives.
The objective, the total area composed of LUTs and routing area, are discussed
in the paper. This algorithm has been tested on the MCNC benchmark circuits.
Compared with other existing LUT-based FPGA mapping algorithms, the algorithm
produces better characteristics.
Co-Chairs: Masahiro Fujita, Rajesh Gupta
-
3A-1 Combining Architecture Exploration and a Path to Implementation to Build
a Complete SoC Design Flow from System Specification to RTL [p. 219]
-
M.A. Dziri, F. Samet, F.R. Wagner, W.O. Cesário, A.A. Jerraya
This paper presents a full System-on-Chip (SoC)
design flow from system specification to RT-level. A new
approach to obtain a full path to implementation for SoC
design is proposed. This approach combines architecture
design space exploration using the VCC design environment
and system synthesis using the ROSES design flow, allowing a
true and complete system level design flow. The experiment
with a VDSL application shows a significant reduction of
design time.
-
3A-2 Towards On-Chip Fault-Tolerant Communication [p. 225]
-
T. Dumitras, S. Kerner, R. Marculescu
As CMOS technology scales down into the deep-submicron
(DSM) domain, devices and interconnects are subject
to new types of malfunctions and failures that are harder to predict
and avoid with the current system-on-chip (SoC) design methodologies.
Relaxing the requirement of 100% correctness in operation
drastically reduces the costs of design but, at the same time,
requires SoCs be designed with some degree of system-level
fault-tolerance. In this paper, we introduce a high-level
model of DSM failure patterns and propose a new communication paradigm
for SoCs, namely stochastic communication. Specifically, for a
generic tile-based architecture, we propose a randomized algorithm
which not only separates computation from communication, but also
provides the required fault-tolerance to on-chip failures.
This new technique is easy and cheap to implement in SoCs that
integrate a large number of communicating IP cores.
-
3A-3 Energy-Aware Mapping for Tile-based NoC Architectures under Performance Constraints [p. 233]
-
J. Hu, R. Marculescu
In this paper, we present an algorithm which automatically
maps the IPs/cores onto a generic regular Network
on Chip (NoC) architecture such that the total communication
energy is minimized. At the same time, the performance of the
mapped system is guaranteed to satisfy the specified constraints
through bandwidth reservation. As the main contribution, we
first formulate the problem of energy-aware mapping, in a topological
sense, and then propose an efficient branch-and-bound algorithm
to solve it. Experimental results show that the proposed
algorithm is very fast and robust, and significant energy savings
can be achieved. For instance, for a complex video/audio SoC design,
on average, 60.4% energy savings have been observed compared
to an ad-hoc implementation.
Co-Chairs: Masato Edahiro, Massoud Pedram
-
3B-1 Adaptive Wire Adjustment for Bounded Skew Clock Distribution Network [p. 243]
-
H. Saaied, D. Al-Khalili, A. Al-Khalili, M. Nekili
In this paper, we suggest an adaptive approach for
the Clock Distribution Network (CDN) to cope with a modification
in the VLSI system design. The CDN's wires are
adjusted iteratively to reduce the skew that is resulting from
a minor modification in the clock pins of a complex VLSI system.
Such skew can be remedied by selecting a Balancing
Node (BN) and adjust its edges so that the skew gets smaller.
The required edge adjustments are determined using the
Elmore delay model. The performance of the algorithm is
investigated using different random sets of clock pins. Also,
the algorithm is tested by altering some clock pins in a zero
skew CDN. For small modifications in a large number of
nodes in the CDN, our algorithm can achieve zero skew with
less iterations than linear order algorithms.
-
3B-2 Power Minimization by Clock Root Gating [p. 249]
-
Q. Wang, S. Roy
Clock root gating transformation targets power
savings on the clock tree by inserting gating logic at the root of
the clock. In this paper we propose an efficient graph-based
algorithm to solve the root clock gating optimization problem.
The algorithm is also tightly integrated with clock tree synthesis
tool so that real power savings can be achieved after clock tree is
generated. Experimental results on industrial circuits showed
that significant power savings can be achieved.
-
3B-3 BBE: Hierarchical Computation of 3-D Interconnect Capacitance with BEM Block Extraction [p. 255]
-
T. Lu, Z. Wang, X. Hong
Fast and accurate interconnect parasitic parameter
extraction has become increasingly critical for verification
and analysis in VLSI today. In this paper, a fast hierarchical extraction
approach based on Boundary Element Method (BEM) is
presented for 3-D parasitic capacitance computation. Hierarchical
partition of the 3-D field creates many parts which are called
3-D BEM blocks. After combining all the 3-D BEM blocks, the
capacitance matrix for a given set of nets can be computed by applying
the boundary conditions. Numerical results shows that this
method performs many times faster than the 3-D field solver, with
equal accuracy.
-
3B-4 Improving Boundary Element Methods for Parasitic Extraction [p. 261]
-
S. Yan, J. Liu, W. Shi
We improve the accuracy and speed of
boundary element method (BEM) or multipole accelerated
BEM for interconnect parasitic extraction.
Three techniques are presented and applied to capacitance
extraction:selective coefficient enhancement,
variable order multipole and multigrid. Experimental
results show that the techniques are effective for extracting
parasitics between all pairs of conductors, or
between selected pairs of conductors.
Co-Chairs: Naoyuki Shigyo, Albert Z. H. Wang
-
3C-1 Statistical Delay Computation Considering Spatial Correlations [p. 271]
-
A. Agarwal, D. Blaauw, V. Zolotov, S. Sundareswaran, M. Zhao,
K. Gala, R. Panda
Process variation has become a significant concern for
static timing analysis. In this paper, we present a new method for
path-based statistical timing analysis. We first propose a method for
modeling inter- and intra-die device length variations. Based on this
model, we then present an efficient method for computing the total
path delay probability distribution using a combination of device
length enumeration for inter-die variation and an analytical
approach for intra-die variation. We also propose a simple and
effective model of spatial correlation of intra-die device length variation.
The analysis is then extended to include spatial correlation. We
test the proposed methods on paths from an industrial high-performance
microprocessor and present comparisons with traditional
path analysis which does not distinguish between inter- and intradie
variations. The characteristics of the device length distributions
were obtained from measured data of 8 test chips with a total of
17688 device length measurements. Spatial correlation data was also
obtained from these measurements. We demonstrate the accuracy of
the proposed approach by comparing our results with Monte-Carlo
simulation.
-
3C-2 Predicting Short Circuit Power from Timing Models [p. 277]
-
E. Acar, R. Arunachalam, S.R. Nassif
Power dissipation is becoming a major show
stopper for integrated circuit design especially in the server
and pervasive computing technologies. Careful consideration
of power requirements is expected to bring major changes in
the way we design and analyze integrated circuit
performance. This paper proposes a practical methodology to
evaluate the short-circuit power of static CMOS gates via
effective use of timing information from timing analysis. We
introduce three methods to estimate short-circuit power of a
static CMOS circuit without requiring explicit circuit
simulation. Our proposed methodology offers practical
advantages over previous approaches, which heavily rely on
simple special device models. Proposed approach is
experimented with an extensive set of benchmark examples
and several device models and found very accurate.
-
3C-3 RCLK-VJ Network Reduction with Hurwitz Polynomial Approximation [p. 283]
-
Z. Qin, C.-K. Cheng
We propose a new linear network reduction algorithm
based on a generalized Y-Delta transformation technique in
s-domain. Resultant admittance is kept as a rational function of
s with a dramatically reduced order. Yet it preserves low-order
terms of exact admittance evaluated with traditional symbolic
analysis. Stability of transfer functions derived from reduced order
admittance is guaranteed via a Hurwitz polynomial approximation.
Such low-order transfer functions are used in pole
analysis and time domain waveform evaluation in response to any
input signal.
Co-Chairs: Yirng-An Chen, Shinji Kimura
-
3D-1 Gate-Level Simulation of Quantum Circuits [p. 295]
-
G.F. Viamontes, M. Rajagopalan, I.L. Markov, J.P. Hayes
Simulating quantum computation on a classical
computer is a difficult problem. The matrices representing quantum
gates, and vectors modeling qubit states grow exponentially
with an increase in the number of qubits. However, by using
a new data structure called the Quantum Information Decision
Diagram (QuIDD) that exploits the structure of quantum operators,
many of these matrices and vectors can be represented in a
form that grows polynomially. Using QuIDDs, we implemented
a general-purpose quantum computing simulator in C++ called
QuIDDPro and tested it on Groverâs algorithm. Our QuIDD technique
asymptotically outperforms other known simulation techniques.
-
3D-2 Enhanced Symbolic Simulation for Efficient Verification of Embedded Array Systems [p. 302]
-
T. Feng, L.-C. Wang, K.-T. Cheng, M. Pandey, M.S. Abadir
In the past, Symbolic Trajectory Evaluation (STE)
was shown to be effective for verifying individual array blocks.
However, when applying STE to verify multiple array blocks together
as a single system, the run-time OBDD sizes would often
blow up. In this paper, we propose using a "dual-rail" symbolic
simulation scheme to facilitate the application of STE proof
methodology for verifying array systems. The proposed scheme
implicitly partitions a given design into control domain and datapath
domain, and symbolic simulation is carried out on both domains.
With this scheme, the run-time OBDD sizes during the
symbolic simulation for each domain can be limited. We demonstrate
the effectiveness of our approach by verifying the Memory
Management Unit (MMU) in Motorola high-performance microprocessors.
The verification of MMU as a whole was not possible
before because of the OBDD size blow-up problem when an ordinary
symbolic simulator was used in the STE proof process.
-
3D-3s Hardware Verification Using ANSI-C Programs as a Reference [p. 308]
-
E. Clarke, D. Kroening
We describe an algorithm to verify a hardware design given
in Verilog using an ANSI-C program as a specification. We use
SAT based Bounded-Model Checking [1] in order to reduce the
equivalence problem to a bit vector logic decision problem. As
a case study, we describe experimental results on a hardware
and a software implementation of the data encryption standard
(DES) algorithm.
-
3D-4s Evaluation of Multiple-Output Logic Functions Using Decision Diagrams [p. 312]
-
Y. Iguchi, T. Sasao, M. Matsuura
This paper shows four different methods to evaluate
multiple-output logic functions using decision diagrams:
Shared BDD (SBDD), Multi-Terminal BDD (MTBDD), BDD for
characteristic functions (CF), and BDDs for Encoded Characteristic
Function for Non-zero outputs (ECFNs). Methods to compute
average evaluation time for each type of decision diagrams
are presented. By experimental analysis using benchmark functions,
the number of nodes and average evaluation time are compared.
Our results show that BDDs for ECFNs outperforms
MTBDDs, BDDs for CFs, and SBDDs with respect to both number
of nodes and computation time. The sizes of BDDs for ECFNs
are smaller than for MTBDDs, BDDs for CFs, and SBDDs.
Co-Chairs: Dinesh P. Mehta, Yasuhiro Takashima
-
4A-1 A Simulated Annealing Approach with Sequence-Pair Encoding Using a Penalty
Function for the Placement Problem with Boundary Constraints [p. 319]
-
S. Tayu
The module placement is one of the most important
problem in the VLSI design. A practical VLSI placement problem
often includes some constraints. In this paper, we propose
a penalty function approach for the efficient simulated annealing
search on the solution space of constrained problems. We apply
the penalty function approach to the placement problem with
boundary constraints. Experimental results show that our proposed
method can accomplish more effective simulated annealing
search than the conventional method proposed in [3] for two modules
sets, an MCNC benchmark ami49 and a randomly generated
module set.
-
4A-2 Multi-level Placement for Large-Scale Mixed-Size IC Designs [p. 325]
-
C.-C. Chang, J. Cong, X. Yuan
In this paper we study the large-scale mixed-size
placement problem where there is a significant size variation between
big and small placeable objects (the ratio can be as large as
10,000). We develop a multi-level optimization algorithm, MPGMS,
for this problem which can efficiently handle both large-scale
designs and large size variations. Compared with the recently
published work [1] on large-scale mixed macro and standard cell
placement benchmarks for wirelength minimization, our method
can achieve 13% wirelength reduction on average with comparable
runtime.
-
4A-3 Selected Sequence-Pair: An Efficient Decodable Packing Representation in Linear Time using Sequence-Pair [p. 331]
-
C. Kodama, K. Fujiyoshi
In this paper we study the large-scale mixed-size
placement problem where there is a significant size variation between
big and small placeable objects (the ratio can be as large as
10,000). We develop a multi-level optimization algorithm, MPGMS,
for this problem which can efficiently handle both large-scale
designs and large size variations. Compared with the recently
published work [1] on large-scale mixed macro and standard cell
placement benchmarks for wirelength minimization, our method
can achieve 13% wirelength reduction on average with comparable
runtime.
-
4A-4s An Extended Representation of Q-sequence for Optimizing Channel-Adjacency and Routing-Cost [p. 338]
-
C. Zhuang, K. Sakanushi, L. Jin, Y. Kajitani
This paper proposes a topological representation
for general floorplan, called the H-sequence, which can check
channel-adjacency and boundary-adjacency in a constant time.
Moreover, we define Routing-cost for the placement to measure its
routing difficulty. Experimental results indicate that H-sequence
based placement algorithm can optimize routing-cost effectively
in a short time.
-
4A-5s Non-slicing Floorplans with Boundary Constraints Using Generalized Polish Expression [p. 342]
-
D.-S. Chen, C.-T. Lin, Y.-W. Wang
In this paper, we address the problem of VLSI
floorplanning with considering boundary constraints. The
problem is practical and crucial in physical design since
architects decide to arrange some I/O involved modules along
the chip boundary to minimize both chip area and off-chip
connections. By using a new representation called Generalized
Polish Expression, we propose an efficient algorithm to handle
the boundary constraints on non-slicing floorplans. In addition,
a new fixing heuristic based on modular similarity is also
presented to effectively fix the generated infeasible floorplans
during the process. The experimental result is good in
commonly used MCNC benchmark circuits.
Organizers: Tadahiko Nakamura, Takahide Inoue
Moderator: Takahide Inoue
Panelists: Bob Altizer, Ken Chen, Jun Iwamura, Masasuke Kishi, Grant Martin,
Augusto De Oliveira
-
-
Ever increasing UDSM design/manufacturing complexity and cost have generated
serious ROI challenges. Numerous methods have been proposed and create various
opportunities in the SoC industry and academia. Although IP-reuse has believed
to be the-must-to-go approach but so far achieved limited success. Recently many
people have argued "Platform-Based Design" as the golden solution for SoC
design/manufacturing productivity. However the common definition of the Platform
is not well established yet. In this panel, a group of distinguished panelists
with various backgrounds will discuss what is their Platform and how PBD
improve their and your SoC design productivity.
[Points of Discussion:] Some topics in this panel will be: 1) What is the
Platform and Plat-form Based Design? 2) What is my Platform? What are attributes
, components and hierarchy of the Platform? 3) What I get and what I loose
(WYGWYL) with Platform Based Design. 4) Is my Platform same as yours? / Who
should own the Platform? 5) What are EDA role, opportunity and challenges for
the best Platform development?
Co-Chairs: Reiner Hartenstein, Kiyoshi Oguri
-
4C-1 Efficient LUT-Based FPGA Technology Mapping for Power Minimization [p. 353]
-
H. Li, W.-K. Mak, S. Katkoori
We study the technology mapping problem for
LUT-based FPGAs targeting at power minimization. The problem
has been proved to be NP-hard previously. Hence, we present
an efficient heuristic to compute low-power mapping solutions.
The major distinction of our work from previous ones is that
while generating a LUT, we look ahead at the impact of the mapping
selection of this LUT on the power consumption of the remaining
network. We choose the mapping that results in the least
estimated overall power consumption. The key idea is to compute
low-power -feasible cuts by an efficient incremental network
flow computation method. Experimental results show that
our algorithm reduces both power consumption and area over the
previous algorithms reported in the literature.
-
4C-2 Optimal Reconfiguration Sequence Management [p. 359]
-
S. Ghiasi, M. Sarrafzadeh
In this paper, we present an efficient optimal algorithm for
minimizing runtime reconfiguration (context switching) delay
of executing an application on a reconfigurable system. We
assume that the basic operations of the application are already
scheduled and each of them has to be realized on the reconfigurable
fabric in order to be executed. The modeling and algorithm are both
applicable to partially reconfigurable platforms as well as Multi-FPGA
systems. The algorithm can be directly applied to minimize the
application runtime for many typical classes of applications, where
the actual execution delay of basic operations is negligible compared
to reconfiguration delay. We prove the optimality and efficiency of
our algorithm and report experimental results, which demonstrate 40%
to 2.5% improvement in total runtime reconfiguration delay.
-
4C-3s On Improving FPGA Routability Applying Multi-level Switch Boxes [p. 366]
-
J. Liu, H. Fan, Y.-L. Wu
In this paper, we propose a new FPGA switch box
design style÷the extended switch boxes. An extended switch box
is multi-level in nature. It consists of a kernel and extension(s)
connected to the kernel, while many conventional switch boxes
only consist of the kernel part and are referred to as single-level
switch boxes. We show that with a much reduced total number of
manufactured switches in the chip, this new design has a guaranteed
complete mappability from any global routing to a feasible
detailed routing of the entire FPGA chip. The interesting results
seem to open a new avenue for designing FPGA routing structures.
-
4C-4s An Image Retrieval System Using FPGAs [p. 370]
-
K. Nakano, E. Takamichi
The main contribution of this paper is to present
an image retrieval system using FPGAs. Given a template image
T and a database of a number of images I1, I2,
..., our system lists all images that contain a subimage similar to T.
More specifically, a hardware generator in our system creates the Verilog
HDL source of a hardware that determines whether Ti
has a similar subimage to T for any image Ii
and a particular template T. The created Verilog HDL source is embed
in an FPGA using
the design tool provided by the FPGA vendor. Since the hardware
embedded in the FPGA is designed for a particular template T, it
is an instance-specific hardware that allows us to achieve extreme
acceleration. We evaluate the performance of our image matching
hardware using a PCI-connected Xilinx FPGA and a timing
analyzer. Since the generated hardware attains up to 3000 speedup
factor over the software solution, our approach is promising.
-
4C-5 Logic Foundry: Rapid Prototyping of FPGA-based DSP Systems [p. 374]
-
G. Spivey, S. Bhattacharyya, K. Nakajima
The Logic Foundry is a system for the creation
and integration of FPGA-based DSP systems. Recognizing
that some of the greatest challenges in creating FPGA-based
systems occur in the integration of the various components,
we have developed a system that addresses the following four
areas of integration: design flow integration, component
integration, platform integration, and software integration.
Using the Logic Foundry, a system can easily be specified,
and then automatically constructed and integrated with
system level software.
Co-Chairs: Tohru Ishihara, Radu Marculescu
-
4D-1 Advanced Power Management Techniques: Going Beyond Intelligent Shutdown [p. 385]
-
L. Benini
Well into the System-on-Chip era, power consumption
has emerged as one of the most critical challenges to design
complexity scaling. Moving from a critical assessment of current
technologies and architectures, we survey the distinguishing features
of a design methodology that aims at energy consumption
reduction, under guaranteed quality of service (QoS), as a main
objective in system design.
-
4D-2 Design Technologies for Low Power Microprocessors [p. 390]
-
T. Hattori
Low power is one of the most important
targets of embedded microprocessor design. To reduce the
dynamic power and static power, several design technologies
applied in real microprocessors. But the term "low power" is
not simple in real application. According to the system low
power, many varieties of low power features are required for
microprocessors. For example, an application processor for
cellular phone was designed using many low-power
technologies. Both device / circuits level and architecture/logic
level low power technologies are discussed.
-
4D-3 Design Methodology of Low-Power CMOS RF-ICs [p. 394]
-
T. Tsukahara, M. Harada, M. Ugajin, J. Kodate, A. Yamagishi
This paper presents design methodology for
CMOS RF-ICs in the 2-GHz band. After describing RF
transceiver architectures, it introduces some low-voltage,
low-power CMOS front-end circuits that use an LC-tank
folding technique. Finally, it presents a 1-V, 12-mW
image-rejection receiver in the 2-GHz band.
-
4D-4s Minimizing Total Power by Simultaneous Vdd/Vth Assignment [p. 400]
-
A. Srivastava, D. Sylvester
We investigate the effectiveness of simultaneous multiple
supply and threshold voltage assignment in minimizing the total
power (static + dynamic) for the first time. Achievable power reductions
under varying conditions are investigated, including static-power
limited designs and sub-1V processes. Rules of thumb are developed for optimal
Vdd's and Vth's to be used in future designs. These models show the optimal
second Vdd to be approximately half the nominal Vdd while the total
power savings is significantly greater than previously anticipated. We
describe the impact of level conversion delays and highlight the tradeoff
between power savings and critical path count.
-
4D-5s A Low Power CMOS Circuit with Variable Source Scheme (VSCMOS) [p. 404]
-
T. Yasuda, K. Hosokawa
This paper proposes a method to reduce the standby leakage current of MOSFET
by controlling the voltage of the source node. The method allows to use a
low threshold device for high performance speed and low power dissipation
in both active and standby periods. This method can be easily applied for
conventional ASIC library circuits because no additional process circuits,
or devices except slight modification for the body contact cell are required.
Co-Chairs: Yao-Weng Chang, Shinâichi Wakabayashi
-
5A-1 Fast Buffer Planning and Congestion Optimization in Interconnect-driven Floorplanning [p. 411]
-
K.W.C. Wong, E.F.Y. Young
In this paper, we study and implement a
routability-driven floorplanner with congestion estimation and
buffer block planning. We assume that buffers should be inserted
at flexible intervals from each other for long enough wires.
Under this buffer insertion constraint, our floorplanner will estimate
congestion by computing the best possible buffer locations
for each net and perform probabilistic analysis based on the solution.
Dynamic programming is used such that estimations can
be done very effectively. Nets are topologically grouped to consider
bus-based routing and to facilitate the estimation process.
We compare our results with those in paper [16] which are the
latest results for this problem, and show that our approach can
perform better in both quality and runtime.
-
5A-2 Interconnect-Driven Floorplanning by Searching Alternative Packings [p. 417]
-
C.-w. Sham, E.F.Y. Young, H. Zhou
In traditional floorplanners, area minimization
is an important issue. Due to the recent
advances in VLSI technology, the number of transistors
in a design and their switching speeds are increasing
rapidly. This results in the increasing importance
of interconnect delay and routability of a circuit. We
should consider interconnect planning and buffer planning
as soon as possible. In this paper, we propose a
method to reduce interconnect cost of a floorplan by
searching alternative packings. We found that if a
floorplan F contains some rectangular supermodules,
we can rearrange the blocks in the supermodule to obtain
a new floorplan with the same area as F but possibly
with a smaller interconnect cost. Experimental
results show that we can always reduce the interconnect
cost of a floorplan without any penalty in area
and runtime by using this method.
-
5A-3s Noise-Aware Buffer Planning for Interconnect-Driven Floorplanning [p. 423]
-
S.-M. Li, Y.-H. Cherng, Y.-W. Chang
Crosstalk-induced noise has become a key problem in interconnect optimization
when technology improves, spacing diminishes, and coupling
capacitance/inductance increases. Buffer insertion/sizing is one of the
most effective and popular techniques to reduce interconnect delay and
decouple coupling effects. It is traditionally applied to post-layout optimization.
However, it is obviously infeasible to insert/size hundreds of
thousands buffers during the post-layout stage when most routing regions
are occupied. Therefore, it is desirable to incorporate buffer planning into
floorplanning to ensure timing closure and design convergence. In this
paper, we first derive formulae of buffer insertion for timing and noise optimization,
and then apply the formulae to compute the feasible regions for
inserting buffers to meet both timing and noise constraints. Experimental
results show that our approach achieves an average success rate of 80.9%
(78.2%) of nets meeting timing constraints alone (both timing and noise
constraints) and consumes an average extra area of only 0.49% (0.66%)
over the given floorplan, compared with the average success rate of 75.6%
of nets meeting timing constraints alone and an extra area of 1.33% by the
BBP method [3].
-
5A-4s Floorplanning with Power Supply Noise Avoidance [p. 427]
-
H.-M. Chen, L.-D. Huang, I.-M. Liu, M. Lai, D.F. Wong
With todayâs advanced integrated circuits (ICs)
manufacturing technology in deep submicron (DSM) environment,
we can integrate entire electronic systems on a single chip
(SoC). However, without careful power supply planning in layout,
the design of chips will suffer from mostly signal integrity
problems including IR-drop, Delta I noise, and IC reliability. Postroute
methodologies in solving signal integrity problem have been
applied but they will cause a long turn-around time, which adds
costly delays to time-to-market. In this paper, we study the problem
of power supply noise avoidance as early as in floorplanning
stage. We show that the noise avoidance in power supply planning
problem can be formulated as a constrained maximum flow problem
and present an efficient yet effective heuristic to handle the
problem. Experimental results are encouraging. With slight increase
of total wirelength, we achieve almost no IR-drop requirement
violation and 46.6% of improvement on Delta I noise constraint
violation compared with a previous approach.
-
5A-5s Simultaneous Floorplanning and Buffer Block Planning [p. 431]
-
I.H.-R. Jiang, Y.-W. Chang, J.-Y. Jou, K.-Y. Chao
As technology advances and the number of interconnections
among modules rapidly increases, timing closure and design
convergence are the most important concerns. Hence, it is
desirable to consider interconnect optimization as early as possible.
In this paper, we first address simultaneous floorplanning and
buffer block planning (i.e., integrating buffer block planning into
floorplanning) for interconnect optimization. Experimental results
show that our method can significantly improve the interconnect
delay and reduce the number of buffers needed.
-
5A-6s A Buffer Planning Algorithm Based on Dead Space Redistribution [p. 435]
-
S. Chen, X. Hong, S. Dong, Y. Ma, Y. Cai, C.-K. Cheng, J. Gu
This paper studies the buffer planning problem for
interconnect-centric floorplanning for nanometer technologies.
The dead-spaces are the spaces within a placement that are not
held by any circuit block. In this paper, we proposed a buffer
planning algorithm based on dead space redistribution to make
good use of dead-spaces for buffer insertion. Associated with
circuit blocks under topological representations, the dead space
can be redistributed by freely moving some circuit blocks within
their rooms in the placement. The total area and the topology of
the placement keep unchanged while doing the dead space
redistribution. The number of nets satisfying the delay
constraint can be increased by redistributing the dead space all
over the placement, which has been demonstrated by the
experimental results. The increment of the number of nets that
satisfy delay constraints is 9% on an average.
Co-Chairs: Masahiro Fujita, Sandeep K. Shukla
-
5B-1 VCore Based Design Methodology [p. 441]
-
M. Muraoka, H. Hamada, H. Nishi, T. Tada, Y. Onishi, T. Hosokawa, K. Yoshida
The VCore [1](*) based design methodology, which
has been developed at the VCDS (**) Project, is a SoC
design methodology using VCores. A VCore is a reusable,
high level abstracted design component. We have
developed the VCore based design methodology and the
VCDS tool prototype. We used the developed tool and
did a trial SoC design. The design result showed that
SoC design productivity improved using the proposed
methodology.
(*) VCDS: Virtual Core based Design System
(**) VCore: Virtual Core
-
5B-2 Synthesis for SoC Architecture Using VCores [p. 446]
-
H. Nishi, M. Muraoka, R.K. Morizawa, H. Yokota, H. Hamada
In this paper, we propose a novel architecture synthesis
method for SoC using VCores. VCores are reusable and
configurable high-level descriptions. An initial SoC
architecture, which consists of a CPU, buses, and peripherals, is
generated based on an architecture template. The hardware
and software tradeoff is possible on the architecture model
after assignment of software VCores or hardware VCores. The
assignment is based on the results of the architecture's
performance estimation. We present a prototype of the
synthesis for SoC architecture using VCores and an
architecture level design experiment using this prototype.
-
5B-3 VCore Based Platform for SoC Design [p. 453]
-
Y. Onishi, M. Muraoka, M. Utsuki, N. Tsubaki
The reuse-based design paradigm is the key to improve the
design productivity of SoCs (System on a Chip). However, SoC
designers have difficulty in using conventional IPs (Intellectual
Property) because they don't have enough variability, it is
difficult to customize them. In this paper, we propose the
variability of VCores (Virtual Cores) and show VCores are
superior to IPs for system-level design property. Based on
VCores, we have developed a VCore-based platform. We show
the SoC design productivity will be improved by using VCores
and the VCore-based platform.
-
5B-4 VCDS Tool Demonstration [p. 459]
-
R.K. Morizawa, K. Tanaka, K. Watanabe, Y. Kaitsu, S. Hanamura,
T. Shinsha, M. Muraoka
The VCore(*) based Design Methodology developed in the
VCDS(**) project is a novel design methodology utilizing
VCores. VCores are reusable functional cores defined at high
level. We designed a SoC for Wearable Computer as a vehicle
application in the pilot project (Figure 1), and compared our
proposed methodology with a conventional RTL based design
methodology by measuring the design productivity. We
obtained very promising prospects that the design productivity
could be improved 20 times for the enhanced VCDS (Figure 2).
We will make a demonstration of the implemented pilot tool
to show how effective our proposed methodology is.
(*) VCore: Virtual Core (**) VCDS: Virtual Core based
Design System
Organizer: Robert Reuss
Moderators: Jose Munoz, Toshiaki Miyazaki
Panelists: Nader Bagherzadeh, Prith Banerjee, Brad Hutchings, Jose Munoz,
Brian Schott
The Adaptive Computing Systems (ACS) program was initiated by Defense Advanced
Research Projects Agency (DARPA) of the United States in 1996. With the advent
of FPGAs, has emerged a new class of computing systems that contain configurable
hardware. This session begins by a presentation by the first ACS program manager
of its motivation, original goals, and objectives. It is then followed by
presentations of four specific projects under the ACS program. Future activities
surrounding the ACS community will be discussed at the end.
-
5C-1 DARPA's Adaptive Computing Systems Program [p. 464]
-
J. Munoz
Motivation for DARPAâs ACS program will be presented along with original goals and objectives of the
program. A brief description of some of the efforts that were initiated and why. A report card of what I
think the program did well, and where I feel additional work is still required.
-
5C-2 Applications of Adaptive Computing Systems for Signal
Processing Challenges [p. 465]
-
B. Schott, P. Bellows, M. French, R. Parker
Adaptive computing systems use FPGAs for custom hardware
acceleration in high performance and real-time applications.
Unlike single purpose dedicated hardware approaches, the reusable
nature of the technology introduces system design tradeoffs
that must balance processing density, memory, and I/O bandwidth,
not to mention more subtle issues such as ease of programming,
debugging, and physical integration into real-world
systems. This paper describes results from the DARPA-funded
SLAAC project, which developed three generations of adaptive
computing systems for a diverse set of challenging signal processing
applications.
-
5C-3 Interactive Ray Tracing on Reconfigurable SIMD MorphoSys [p. 471]
-
H. Du, M. Sanchez-Elez, N. Tabrizi, N. Bagherzadeh, M.L. Anido, M. Fernandez
MorphoSys is a reconfigurable SIMD architecture.
In this paper, a BSP-based ray tracing is gracefully mapped
onto MorphoSys. The mapping highly exploits ray-tracing
parallelism. A straightforward mechanism is used to handle
irregularity among parallel rays in BSP. To support this
mechanism, a special data structure is established, in which no
intermediate data has to be saved. Moreover, optimizations
such as object reordering and merging are facilitated. Data
starvation is avoided by overlapping data transfer with
intensive computation so that applications with different
complexity can be managed efficiently. Since MorphoSys is
small in size and power efficient, we demonstrate that
MorphoSys is an economic platform for 3D animation
applications on portable devices.
-
5C-4 An Overview of a Compiler for Mapping MATLAB Programs onto FPGAs [p. 477]
-
P. Banerjee
This paper describes a behavioral synthesis tool called the
MATCH compiler developed as part of the DARPA
Adaptive Computing Systems program. The MATCH
compiler reads in high-level descriptions of DSP
applications written in MATLAB, and automatically
generates synthesizable RTL models in VHDL. The RTL
models can be synthesized using commercial logic
synthesis tools and place and route tools onto FPGAs. By
linking the two design domains of DSP and FPGA
hardware design, the MATCH compiler provides DSP
design teams a significant reduction in design labor and
time, elimination of misinterpretations and costly design
rework, automatic verification of the hardware
implementation, and the ability of systems engineers and
algorithm developers to perform architectural exploration
in the early phases of their development cycle. The paper
describes how powerful directives are used to provide
high-level architectural tradeoffs for the DSP designer.
The MATCH compiler has been transferred to a startup
company called AccelChip which has developed a
commercial version of the compiler called AccelFPGA.
Experimental results are reported using AccelFPGA on a
set of nine MATLAB benchmarks that are mapped onto
the recent Xilinx Virtex II and Altera Stratix FPGAs. The
benchmark programs range in complexity from 20 lines to
170 lines of MATLAB code and produce VHDL code
ranging from 1500 to 4500 lines of code. The
compilation times range from 3 seconds to 40 seconds.
-
5C-5 Issues in Debugging Highly Parallel FPGA-based Applications Derived from Source Code [p. 483]
-
K.S. Hemmert, B.L. Hutchings
Using high-level synthesis tools to map programs
written in general-purpose languages to FPGA hardware has
grown in popularity and it is becoming necessary to provide comprehensive
debugging tools in order to verify the correctness of
the synthesized hardware. Currently, post-synthesis debugging
is done at the circuit level. This paper discusses the issues, as
well as some early results, of creating a source level debugger for
hardware synthesized from source code. This study is meant to
provide some insight into what needs to be added or built into
synthesizing compilers in order to allow debug of a synthesized
circuit at the source level, which will provide the programmer
with a familiar view of the program being debugged.
Co-Chairs: Seongsoo Lee, Takashi Miyamori
-
5D-1s Implementation of the Super-Systolic Array for Convolution [p. 491]
-
J.-J. Lee, G.-Y. Song
High-performance computation on a large array of
cells has been an important feature of systolic array. To achieve
even higher degree of concurrency, it is desirable to make cells
of systolic array themselves systolic array as well. The
architecture of systolic array with its cells consisting of another
systolic array is to be called super-systolic array.
In this paper we propose a scalable super-systolic array
architecture which shows high-performance and can be
adopted in the VLSI design including regular interconnection
and functional primitives that are typical for a systolic
architecture.
-
5D-2s Design of a Scalable RSA and ECC Crypto-Processor [p. 495]
-
M.-C. Sun, C.-P. Su, C.-T. Huang, C.-W. Wu
In this paper, we propose a scalable word-based
crypto-processor that performs modular multiplication based
on modified Montgomery algorithm for finite fields GF(P) and
GF(2m). The unified crypto-processor supports scalable keys of
length up to 2048 bits for RSA and 512 bits for elliptic curve
cryptography (ECC). Further extension of the key length can be
done easily by enlarging the memory module or using the external
memory resource. With the proposed parity prediction technique,
our pipelined crypto-processor achieves a 512-bit RSA encryption
rate of 276 Kbps and a 160-bit ECC encryption rate of
73.3 Kbps for a 220MHz clock rate.
-
5D-3s A Reconfigurable, Power-Scalable Rake Receiver IP for W-CDMA [p. 499]
-
A. Bianco, A. Dassatti, M. Martina, A. Molino, F. Vacca
During the last years wireless market has experienced
an exponential growth. 2G systems are essentially voiceö
oriented: the main innovation expected from 3G ones is the ubiquitous
Internet and multimedia fruition. The transition from 2G
to 3G provides both opportunities and challenges: one way to
make this migration as smoother as possible relies on the employment
of reconfigurable architectures. In this paper a reconfigurable
Rake Receiver for W-CDMA is proposed. Very promising
results from the physical implementation on a XCV300E have
been obtained.
-
5D-4s Robust High-Performance Low-Power Carry Select Adder [p. 503]
-
W. Jeong, K. Roy
This paper proposes Dual Transition Skewed
Logic (DTSL) based Carry Select Adder (CSA) suitable for
processing units requiring low power and high performance
with high noise immunity. We implemented 31-bit Carry
Select Adders in three different logic styles: Dual Transition
Skewed Logic (DTSL), Domino, and conventional static
CMOS in TSMC 0.25um technology and compared them in
terms of performance, power consumption and layout area.
CSA using DTSL shows 36.7% and 17.7% improvements in
power dissipation and performance, respectively, over
domino, and 40.4% improvement in performance compared
to a static CMOS CSA.
-
5D-5s Full-Custom vs. Standard-Cell Design Flow - an Adder Case Study [p. 507]
-
H. Eriksson, T. Henriksson, P. Larsson-Edefors, C. Svensson
Full-custom design is considered superior to
standard-cell design when a high-performance circuit is requested.
The structured routing of critical wires is considered to
be the most important contributor to this performance gap. However,
this is only true for bitsliced designs, such as ripple-carry
adders, but not for designs with inter-bitslice interconnections
spanning several bitslices, such as tree adders and reduction-tree
multipliers. It is found that standard-cell design techniques scale
better with the data width than full-custom bitsliced layouts for
designs dominated by inter-bitslice interconnections.
-
5D-6s A 500-MHz Low-Power Five-Port CMOS Register File [p. 511]
-
J. Wang, Q. Zhang
Current-mode approach in register file has the
great advantage of high operation speed with low power consumption.
Under 0.35- m m CMOS technology, we present a
1.85-ns read access, 32« 32-bit, five-port register file in which a
modified current-sense amplifier is used. The delay match circuit
and TSPC-FF are also used to improve speed. Energy saving is
achieved by current mode and other key ideas. Simulation results
show it consumes 640mW at 500MHz 3.3V. It can still work
at 2.3V with 350MHz 300mW.
-
5D-7s An Effective SDRAM Power Mode Management Scheme for Performance and Energy Sensitive Embedded Systems [p. 515]
-
N.-Y. Ker, C.-H. Chen
We present an effective power mode management scheme used in
SDRAM memory controllers. The scheme employs a bus utilization
monitoring mechanism to initiate proper operations of SDRAM chips.
Our approach reduces energy consumption by actively switching
memories to low-power mode at low bus utilization. At higher bus
utilization, the scheme switches memories to open page mode to
reduce precharge energy as well as program execution time. This
bus utilization predictor reduces memory energy consumption without
the expense of increasing program execution time. It achieved the
performance level of open page policy by consuming 20% less of memory
energy.
-
5D-8s Branch Predictor Design and Performance Estimation for a High Performance Embedded Microprocessor [p. 519]
-
S.-H. Lee, I.-K. Kim, L. Choi
AE64000 is a 64-bit embedded processor targeting
high-end embedded applications such as HDTV, DVD, and 3D
graphics. To achieve a higher performance for the AE64000, we
design a branch predictor for the processor, and find the
optimum parameters for the design through cycle-accurate
simulations on SpecINT benchmarks and embedded
applications (Dhrystone and Whetstone).
In AE64000 branch prediction is complicated by the
Instruction Folding Unit (IFU) of the processor front-end. By
predicting on a Pre-PC in the IFU rather than using a PC in the
pipeline core, we can effectively eliminate branch misprediction
penalty on a correct prediction. We have developed the
AE64000 simulator to evaluate the performance of the designed
branch predictor, and selected the optimum branch predictor
configuration by considering cost-effectiveness as well as by
analyzing the results generated from AE64000 simulator. The
selected branch predictor has been implemented in Verilog and
is added to AE64000 pipeline.
Co-Chairs: Tohru Ishihara, Sri Parameswaran
-
6A-1 Accelerating Design Space Exploration Using Pareto-Front Arithmetics [p. 525]
-
C. Haubelt, J. Teich
In this paper, we propose an approach for the synthesis
of heterogeneous (embedded) systems, while exploiting a hierarchical
problem structure. Particular to our approach is that
we explore the set of so-called Pareto-optimal solutions, i.e., optimizing
multiple objectives simultaneously. Since system complexity
grows steadily leading to giant search spaces which demand
for new strategies in design space exploration, we propose Pareto-
Front Arithmetics (PFA) using results of subsystems to construct
implementations of the top-level system. This way, we are able
to reduce the exploration time dramatically. An example of an
MPEG4 coder is used to show the benefit of this approach in reallife
applications.
-
6A-2 Quality-Driven Design by Bitwidth Optimization for Video Applications [p. 532]
-
Y. Cao, H. Yasuura
This paper presents a novel system-level design
methodology, called quality-driven design for video applications
by bitwidth optimization (QDDV). An output quality adaptive approach
based on a forward and a backward propagation technique,
which are effective bitwidth analysis and output-quality-based
bitwidth analysis are also presented for variable bitwidth
optimization. In order to illustrate the potential of the proposed
methodology, MPEG-2 video is used as the driver application. Experimental
results show the effectiveness of the methodology.
-
6A-3 Arbitrary Long Digit Integer Sorter HW/SW Co-Design [p. 538]
-
S.-W. Cheng
The coming of multimedia era and information
security era indicates that must process longer digit integer
data. Previous sort researches focus on pure performance of
large amount of finite fixed digit/bit number. This paper
discusses on effectively solving arbitrary long digit integer
sorting problem by HW/SW co-design under the Area×Time2
(AT2) price-performance constraint. The work proposes
multi-level (two-level) sort architecture to attain the object:
an accomplished fixed-digit (k-bit) hardware sorter
implements the first or basic level sorting, software
programmed radix 2k sort implements the second or higher
level sorting. By Super Radix Sorting HW/SW co-design and
reuse techniques, the work makes fixed-digit HW sorters
more flexible and useful.
Index Term - HW/SW co-design, Reusable & Embedded
Cores, Sorting, Radix Sort, Technology Independent
Methodologies, System-on-a-Chip (SoC), VLSI design.
Organizer: Kazuo Nakajima
Moderators: Kazuo Nakajima, Brian Schott
Panelists: Tokinori Kozawa, Jose Munoz, Wolfgang Rosenstiel, Sakae Takahashi,
Chen-Wen Wu
-
-
In the past, government agencies played pivotal roles in the development of new
technologies. For example, Internet is an outgrowth of the ARPAnet project
sponsored by the Defense Advanced Research Projects Agency (DARPA) of the United
States government. Recently, consortia of private industries, often in
cooperation with government agencies, have entered into this picture. As a
consequence, researchers in academia have shifted their attentions much more
towards real-world applications. In the panel discussion, we first compare the
traditional roles of funding agencies in the USA, EU, and Asia. We then focus on
the new trends in this funding business in each of the three regions. We will
examine why and how such shifts have been made and discuss the roles of funding
agencies to be played in the coming years. In particular, we will attempt to
search for future roles and new forms/ways of the funding agencies from the
viewpoint of economic development driven by technology advancement.
Chair: Masaharu Imai
-
6C-1 Legal Protection for Semiconductor Intellectual Property (IP) [p. 551]
-
Y. Oshima
Considering the value of semiconductor IP (Intellectual Property),
fundamental knowledge of legal protection for IP is essential requirement
for both IP providers and IP users. Since IPs are recognized as one of the
intangible assets, present intellectual property system, such as patent,
copyright, trade secret, and mask work protection system should provide
certain legal protection by their own aspects.
This article describes how to protect IP effectively by each present
intellectual property system and contributes to realize the smooth
IP trade based on suitable legal protection.
Co-Chairs: Tomohisa Wada, Shoji Kawahito
-
6D-1 Design and Implementation of a Video-Oriented Network-Interface-Card System [p. 559]
-
M.-C. Chen, S.-F. Hsiao, C.-H. Yang
We design a specific Ethernet network interface
card (NIC) for accelerating the video delivery by offloading the
overheads of protocol headers identification/appending and
CRC/checksums calculation, and speeding video bit streams
with a dedicated video interface. Compared with the same
operations of a 50MHz ARM micro-controller, the NIC system
saves 47,000 ns per frame. This NIC card also supports the
coexistence of the IPv4 and IPv6 standard for the future
extension. Both FPGA prototyping and 0.35um cell-based
design of the specific NIC system are given.
-
6D-2 A Highly Efficient AES Cipher Chip [p. 561]
-
C.-P. Su, T.-F. Lin, C.-T. Huang, C.-W. Wu
We present an efficient hardware implementation of
the AES (Advanced Encryption Standard) algorithm, with key
expansion capability. Instead of the widely used table-lookup implementation
of S-box, the proposed basis transformation technique
reduces the hardware overhead of the S-box by 64% and is
easily pipelined to achieve high throughput rate. Using a typical
0:25µm CMOS technology, the throughput rate is 2.977 Gbps for
128-bit keys, 2.510 Gbps for 192-bit keys, and 2.169 Gbps for 256-bit
keys with a 250MHz clock. Testability of the design is also considered.
The area of the core circuit is about 1,279 x 1,271 µm2 .
-
6D-3 Implementation of Fast CRC Calculation [p. 563]
-
T. Henriksson, D. Liu
CRC is important for error detection in communication
systems. With transmission speeds of several Gb/s the high-speed
implementation is a bottleneck. A circuit with two parallel
calculation units has been implemented in a 0.35 micron process.
They use 32 bits and 64 bits parallel input respectively. Chip
measurements prove throughput higher than 5.76 Gb/s, which
indicates that 10 Gb/s throughput is possible in more modern
processes.
-
6D-4 Design of a CMOS Test Chip for Package Models and I/O Characteristics Verification [p. 565]
-
C. Despande, T. Chen
Reliable packages for modern chips are crucial for satisfactory system
performance. In order to ascertain package's performance, an equivalent
electrical model is plugged into circuit simulations and the package's
performance characteristics can be analyzed. However, one implicit assumption
designers make is that these equivalent electrical models are accurate for
performance characterization of packages. For modern high-speed designs,
verification of the accuracy of these models is imperative and can only be
carried out on real-silicon. In addition, signal integrity for inter-chip
communication is crucial from a systems perspective. To study the signal
integrity of data under different PVT (process variation, voltage, and
temperature) conditions, appropriate test structures are needed.
Data can be subjected to controlled variations through these
structures and its integrity can be studied. And finally, testing
package reliability at high temperatures is important since
on-chip temperatures are increasing dramatically as processes
continue to scale. The test chip presented in this design is intended
to allow 1) circuit and package designers to evaluate
and validate the accuracy of package models under a variety of
environment conditions; 2) system designers to evaluate package/system
level signal integrity issue before actual chips are
available; and 3) system designers to evaluate on-chip temperatures
and to study thermal integrity of packages. The test chip
includes a variety of features to achieve these goals. The test
chip is fabricated at TSMC in a 0.18 um CMOS process. The
measurement results and the die microphotograph will be sent
later as soon as they are available. This paper presents the layout
based simulation results.
-
6D-5 A Still Image Encoder Based on Adaptive Resolution Vector Quantization Employing Needless Calculation Elimination Architecture [p. 567]
-
M. Fujibayashi, T. Nozawa, T. Nakayama, K. Mochizuki, K. Kotani, S. Sugawa, T. Ohmi
We have developed an advanced vector quantization (VQ) encoding hardware for
still image encoding systems. By utilizing needless calculation elimination
method, computational cost of VQ encoding is reduced to 40% or less, while
maintaining the accuracy of full-search VQ. We have successfully implemented
the advanced encoding method and Adaptive resolution VQ (AR-VQ), which
realizes compression ratio over 1/200 while maintaining image quality, into
a still image encoding processor. The processor can compress still image of
1600x2400 pixels within one second, which is 60 times faster than software
implementation on current PCs.
-
6D-6 Speech Encoding and Encryption in VLSI [p. 569]
-
K.K. Chakravarthy, M.B. Srinivas
In this work, an attempt has been made to design and
synthesize speech encoding and encryption as a system-on-chip.
The novelty of this design is that it uses wavelet decomposition
for data compression and perpetual audio masking to keep
quantization noise level to a minimum. The encryption is done
by implementing RSA algorithm in hardware.
-
6D-7 The Design of a i8080A Instruction Compatible Processor with Extended Memory Address [p. 571]
-
C. Kon, N. Shimizu
Computer systems with 8080 or compatible processor
and CP/M were most popular from the end 70's to early in
80's. We designed an 8080 compatible processor with MMU and
implement it on an FPGA as a CP/M system.
Keyword: SFL, 8080, CP/M
-
6D-8 The Design of a USB Device Controller IYOYOYO [p. 573]
-
T. Kouyama, H. Nano, C. Kon, N. Shimizu
USB(Universal Serial Bus) is a very popular interface in recent computer
systems. USB replaces legacy interface, such as serial port, pararell port, and
so on. We designed USB device controller IYOYOYO, and USB Gamepad
device with it. This Gamepad device was implemented on an FPGA, with
IYOYOYO.
-
6D-9 MAPLE Chip: a Processing Element for a Static Scheduling Centric Multiprocessor [p. 575]
-
K. Yasufuku, R. Ogawa, K. Iwai, H. Amano
A custom processor called MAPLE which supports
static scheduling by automatic parallelizing compilers is implemented
and evaluated. MAPLE has a high performance floating
point arithmetic unit and low latency data transfer mechanism
for other MAPLE chips. The maximum operational frequency is
80MHz in simulation, and the operation on the prototype board
with 23MHz clock is confirmed. It requires about 0.56W at 23MHz
operation.
-
6D-10 Finding the Best System Design Flow for a High Speed JPEG Encoder [p. 577]
-
K. Sakiyama, P. Schaumont, I.M. Verbauwhede
26 students at the University of California, Los
Angeles (UCLA) studied system level design methodologies
through the design of a high-speed JPEG encoder. The results
produced by 5 different design flows onto various target
platforms demonstrate the high impact of tools on design
quality.
-
6D-11 The Design of PCI Bus Interface [p. 579]
-
H. Hayasaka, H. Haramiishi, N. Shimizu
These days, the PCI bus is the standard bus which
not only x86 architecture but also other architecture is equipped
with. However, the details are difficult to understand. We designed
the general-purpose PCI bus interface for developing PCI
devices, and implemented it. In this paper, the feature of design is
reported.
-
6D-12 Low-Power Digital CDMA Receiver [p. 581]
-
J.-S. Liu, I-H. Chen, Y.-C. Tsai, S.-J. Jou
The advanced design task of the digital CDMA receiver are presented in
this report. A biased number system and architecture is used to reduce
the switching activity to reduce power consumption. Carry-save adder
tree is used to speed up the summation of 127 data (3 bits) in the
synchronization and date extraction process. Verilog HDL is used to
describe this system and design compiler of Synopsys is used to synthesize
our design. Design results show that it can work at 155MHz (Chip rate)
with 9913 Gate counts by using Compass 0.35 um CMOS Cell library.
-
6D-13 Hardware Implementation of an EAN-13 Bar Code Decoder [p. 583]
-
J.D. Maeyer, H. Devos, W. Meeus, P. Verplaetse, D. Stroobandt
This paper presents an FPGA hardware implementation
of a bar code decoder using the EAN-13 standard.
The design was implemented on a FPGA situated on an ATMEL
FPLSIC (Field Programmable System Level Integrated Circuit,
AT94K40-25DQC). The design has comparable performance and
is in many ways more robust than commercially available devices.
However, to the authorsâ knowledge, these all use a microprocessor
while our design is purely dedicated hardware.
-
6D-14 Error Correction Circuit using Difference-set Cyclic Code [p. 585]
-
Y. Kato, T. Morita
An error correction receiver using difference-set
cyclic code has been designed. Highly reliable operation, short
critical path, and small circuit size are key issues. The
synchronization circuit is optimized in its circuit size by
detecting 10 kinds of bit sequences for synchronization. The
circuit action is governed by state machine combined with a
Johnson counter and a timer. The critical path length is
estimated to be 4.8, which is less than average value.
-
6D-15 Design of a Digital CDMA Receiver [p. 587]
-
I.V. Kumar, M.B. Srinivas
In this work, an attempt has been made to design a
Digital Signal Code Division Multiple Accesses Receiver using
VHDL [1] [2] and synthesize using Mentor Graphics tools. The
receiver is designed for a maximum of three users as per the
design specification given for the contest. The novelty of this
design is that it uses four independent processes to achieve the
concurrent operation for the entire functionality. The simulation
and synthesis are carried out using Mentor Graphicsâ FPGA
tools and an optimized circuit in terms of area and timing is
brought out.
-
6D-16 Standard Cell Libraries with Various Driving Strength Cells for 0.13, 0.18 and 0.35um Technologies [p. 589]
-
M. Hashimoto, K. Fujimori, H. Onodera
We developed standard cell libraries for three technologies(0.13, 0.18 and 0.35Őm)
using an automatic layout generation
tool that we have developed. The developed libraries are as
competitive as manually-designed libraries in layout density and
speed. We verify the functionalities of all cells and the speed of
basic combinational cells on the fabricated chips. The libraries
are currently public to educational organizations in Japan.
-
6D-17 A Nearest-Hamming-Distance Search Memory with Fully Parallel Mixed Digital-Analog Match Circuitry [p. 591]
-
T. Koide, H.J. Mattausch, Y. Yaho, T. Gyohten, Y. Soda
A fully-parallel minimum Hamming-distance search
memory has been designed, which uses digital circuitry for bit comparison
and fast analog circuitry for word-comparison as well
as winner-take-all (WTA) functionality. The approach allows compact
and high-performance integration in conventional CMOS
technology. The 0.6µm, 2-poly, 3-metal CMOS design with 32 rows
and 128 columns achieves <100ns search times at <260 mW power
dissipation. The approach is extendable to increased pattern length
and larger row numbers and enables high efficiency for pattern matching
applications such as network routers, code-book-based
data compression, or object recognition. As for conventional memories
high yield can be achieved with a redundancy concept for rows
(including WTA function) and columns.
Co-Chairs: Kazutoshi Kobayashi, Massoud Pedram
-
7A-1 Run-Time Energy Estimation in System-On-a-Chip Designs [p. 595]
-
J. Haid, G. Kaefer, C. Steger, R. Weiss
In this paper, a co-processor for run-time energy
estimation in system-on-a-chip designs is proposed. The
estimation process is done by using power macro-models, thus
making analogue measurement equipment obsolete to the
software engineer once the system-on-a-chip (SOC) design is
characterized. Compared to sampling-based profiling systems
[17], the performance overhead of energy profiling is less,
because the energy estimation is done completely parallel to the
functional units residing on the SOC. The proposed
methodology can be used for run-time power optimization and
in-system energy profiling. The co-processor was evaluated on
a SOC for MPEG layer III audio decoding and the
experimental results show a maximum relative error of <5%.
-
7A-2 SEA: Fast Power Estimation for Micro-Architectures [p. 600]
-
P. Kalla, J. Henkel, X.S. Hu
Various approaches for micro-architectural power/
energy estimation have been introduced, mainly driven by the
need to obtain fast power/energy estimates during early phases
of complex SOC designs. In contrast to previous approaches we
study power/energy estimation for highly optimized synthesizable
description of microprocessor cores. Under this real-world design
scenario, we found, unlike related previous research, that power
can hardly be estimated closer than around 15% using an instruction
level model. However, we can estimate the energy as close
as 5%. Our research has resulted in the SEA framework that
estimates energy/power consumed by a software program, taking
specific micro-architectural features of the underlying programmable
hardware core into consideration. With this high accuracy
in energy estimation we achieve around 5 orders of magnitude
faster estimations compared to state-of-the art high-level
(RTL) commercial energy/power estimation tool suites. Thus,
our framework is capable of reliably estimating the energy/power
consumption of future complex SOCs.
-
7A-3s HyPE: Hybrid Power Estimation for IP-Based Programmable Systems [p. 606]
-
X. Liu, M.C. Papaefthymiou
This paper presents a novel power estimation
scheme for programmable systems consisting of predesigned datapath
and memory components. The proposed hybrid methodology
yields highly accurate estimates within short runtimes by
combining high-level simulation with analytical macromodeling
of circuit characteristics. To assess its effectiveness in practice,
we implemented our hybrid scheme into a power estimation tool,
called HYPE, and applied it to explore various architectural alternatives
in the design of a 256-state Viterbi decoder and a Rijndael
encryptor. For designs with about 1 million transistors, our
estimator terminates within seconds. Compared with industrial
gate-level power estimators, our approach is up to 1,000 times
faster with 5.4% deviation on average.
-
7A-4s An Efficient IP-Level Power Model for Complex Digital Circuits [p. 610]
-
C.-Y. Hsu, C.-N.J. Liu, J.-Y. Jou
In this paper, we propose an efficient IP-Level power model
with a small lookup table for complex CMOS circuits. The
table has only one dimension that maps the zero-delay charging
and discharging capacitance into the real power consumption
of pattern pairs but still has high accuracy. In order to improve
the efficiency of the characterization process, the Monte Carlo
approach is used during the estimation of the average power to
skip the samples that will not increase the accuracy too much.
The experimental result shows the table sizes are only up to 107
entries for ISCASâ85 benchmark circuits and the estimation
error is only 2.99% on average using the lookup table.
-
7A-5 A Hierarchical Analysis Methodology for Chip-Level Power Delivery with Realizable Model Reduction [p. 614]
-
Y.-M. Lee, C.C.-P. Chen
In this paper, we propose a novel hierarchical
analysis methodology to facilitate efficient chip-level power
fluctuation analysis. With extreme efficiency and simplicity,
our design methodology first builds time-varying
multiport Norton equivalent circuits in a row-by-row or
block-by-block based followed by global analysis on the
integrated reduced models. After generating the Norton
equivalent sources at external ports, we apply realizable
model order reduction technologies to further reduce
model. Since the elements of our reduced model are also
RC devices, they are fully compatible with general circuit
simulation engines. The experimental results demonstrate
more than 4X speed up with the flat simulation while maintaining
within 5% of accuracy.
Co-Chairs: Jason Cong, Satoshi Matsushita
-
7B-1 Optimality and Scalability Study of Existing Placement Algorithms [p. 621]
-
C.-C. Chang, J. Cong, M. Xie
Placement is an important step in the overall IC
design process in DSM technologies, as it defines the on-chip
interconnects, which have become the bottleneck in determining
circuit performance. The rapidly increasing design complexity,
combined with the demand for the capability of handling nearly
flattened designs for physical hierarchy generation, poses
significant challenges to existing placement algorithms. There
are very few studies on understanding the optimality and
scalability of placement algorithms, due to the limited sizes of
existing benchmarks and limited knowledge of optimal solutions.
The contribution of this paper includes two parts: 1) We
implemented an algorithm for generating synthetic benchmarks
that have known optimal wirelengths and can match any given
net distribution vector. 2) Using benchmarks of 10K to 2M
placeable modules with known optimal solutions, we studied the
optimality and scalability of three state-of-the-art placers,
Dragon [4], Capo [1], mPL [24] from academia, and one leading
edge industrial placer, QPlace [5] from Cadence. For the first
time our study reveals the gap between the results produced by
these tools versus true optimal solutions. The wirelengths
produced by these tools are 1.66 to 2.53 times the optimal in the
worst cases, and are 1.46 to 2.38 times the optimal on the
average. As for scalability, the average solution quality of each
tool deteriorates by an additional 4% to 25% when the problem
size increases by a factor of 10. These results indicate significant
room for improvement in existing placement algorithms.
-
7B-2 IBM's 50 Million Gate ASICs [p. 628]
-
J. Koehl, D.E. Lackey, G. Doerre
There is no slowdown in the complexity increase for
ASIC and SoC designs. As we write this paper in August, 2002,
40M gate ASICs are nearing tape-out, and 50M gate designs
are likely to start before this conference takes place. This paper
describes the current tool and methodology development efforts
focused on enabling ASIC and SoC designs of these sizes and
complexity, centered around the reduction of design
turn-around-time, improvement of the quality of results and the
modeling and optimization of deep sub-micron electrical effects.
-
7B-3 Silicon Virtual Prototyping: The New Cockpit for Nanometer Chip Design [p. 635]
-
W.-J. Dai, D. Huang, C.-C. Chang, M. Courtoy
A design methodology for the implementation of multi-million gate
system-on-chip designs is described. The new methodology is based
on the creation of a silicon virtual prototype early in the back-end
design process. The prototype is generated in a fraction of the time
required to complete the traditional back-end flow but still maintains
very high correlation with the final design. The physical prototype
becomes the Îcockpitâ where many design implementation decisions can
be optimized by leveraging the short iteration times. Hierarchical
design methodologies benefit from the prototyping stage by enabling
a more optimal partitioning. The silicon virtual prototype also alters
the nature of the hand-off model between front-end and back-end designers.
The netlist can now be quickly validated using the prototype: the physical
reality is being injected early in the design process resulting in fewer
iterations between front-end and back-end.
-
7B-4 Design Flow and Methodology for 50M gate ASIC [p. 640]
-
A. Mehrotra, L. van Ginneken, Y. Trivedi
This paper presents a methodology for full
chip RTL timing closure for very large ASICâs. The
methodology is based on the concept of a "Silicon
Virtual Prototype". The methodology is based on the
scalable technique of clustering and cluster
placement and leverages the tight integration
between the algorithms by means of a common,
unified data model.
Co-Chairs: Kwang-Ting Cheng, Yasuo Sato
-
7C-1 Efficient Loop-back Testing of On-chip ADCs and DACs [p. 651]
-
H.-s. Yu, J.A. Abraham, S. Hwang, J. Roh
This paper presents an efficient approach to testing
on-chip Analog to Digital Converters (ADCs) and Digital to
Analog Converters (DACs) in loop-back mode. On-chip digital
signal processing units can be used to generate stimuli. With this
methodology, go/no-go tests as well as characterization of the individual
ADCs and DACs are possible. The proposed approach is
simple and overcomes the low parametric fault coverage of conventional
loop-back tests. Simulations on a Matlab model of loopbacked
converters are presented to validate the feasibility of the
method.
-
7C-2 A Novel LCD Driver Testing Technique Using Logic Test Channel [p. 657]
-
C. Su, W.-J. Wang, C.-H. Wang, I. Tseng
This paper proposes a novel voltage measurement
technique for LCD driver testing by the use of logic test
channel of an ATE. The method is able to achieve less
than 1mV error with the presence of 32mV RMS noise.
-
7C-3 An Implementation of Memory-based On-chip Analogue Test Signal Generation [p. 663]
-
S. Mir, L. Rolíndez, C. Domigues, L. Rufer
This paper presents a memory-based on-chip
analogue test signal generation approach that is suitable for the
test of an Analogue and Mixed-Signal (AMS) core. This core
contains programmable electronic interfaces for acoustic and
ultrasound transducers. The test signals that must be generated
on-chip have only low or moderate frequencies (10 Hz-10 MHz).
The test circuitry designed in a 0.18 mm CMOS technology
includes a programmable shift-register, a clock divider, and a
programmable switched-capacitor filter bank. By controlling
the shift-register length and the sampling frequency, the paper
shows that high quality single tone signals can be generated on
chip in the band of interest.
-
7C-4 Delta-sigma Modulator Based Mixed-signal BIST Architecture for SoC [p. 669]
-
C.-K. Ong, K.-T. Cheng, L.-C. Wang
This paper proposes a mixed-signal Built-In Self-Test (BIST)
architecture based on a second-order delta-sigma
modulator. This modulator, which incorporates a design-for-testability
(DfT) circuitry, is capable of testing/characterizing itself
using digital stimulus. This characteristic is attractive for implementing
the modulator as an on-chip analog signal analyzer.
When applied for mixed-signal BIST, the modulator-based analog
signal analyzer is first characterized using digital stimulus.
Then the analyzer is utilized to characterize the stimulus generator
in the BIST application. Some critical implementation issues
of the BIST architecture are also discussed.
Co-Chairs: Naehyuck Chang, Hiroaki Takada
-
7D-1 Capturing and Analyzing Requirement - In Case of Software and Applying
to Hardware [p. 677]
-
A. Kawaguchi
It introduces the technology of requirement capture and requirement analysis
that uses the UML in the software world, and considers applying to hardware
development.
-
7D-2 A Comparison of the RTU Hardware RTOS with a Hardware/Software RTOS [p. 683]
-
J. Lee, V. J. Mooney III, A. Daleby, K. Ingström, T. Klevin, L. Lindh
In this paper, we show the performance
comparison and analysis result among three RTOSes: the
Real-Time Unit (RTU) hardware RTOS, the pure software
Atalanta RTOS and a hardware/software RTOS composed of
part of Atalanta interfaced to the System-on-a-Chip Lock
Cache (SoCLC) hardware. We also present our RTOS
configuration framework that can automatically configure
these three RTOSes. The average-case simulation result of a
database application example on a three-processor system
running thirty tasks with RTU and the same system with
SoCLC showed 36% and 19% overall speedups, respectively,
as compared to the pure software RTOS system.
-
7D-3s Linux Kernel Customization for Embedded Systems by Using Call Graph Approach [p. 689]
-
C.-T. Lee, Z.-W. Hong, J.-M. Lin
It has been attracting attention to reconfigure a GPOS for
application-specific domain. Linux is currently one of the popular
candidate GPOSs. Although Linux has tools for kernelâs reconfiguration
by letting users add/remove desired function modules, we still lack
of good schemes of reconfiguring Linux according to a specific embedded
system. In this article, we are going to propose an approach to customize
an application-specific Linux. This approach derives from a "call graph"
based on re-engineering. By analyzing a graph-structure representation of
the target system, we could find the rules of removing the unnecessary codes of Linux.
-
7D-4s Topology Selection for Energy Minimization in Embedded Networks [p. 693]
-
D. Li, P. H. Chou, N. Bagherzadeh
The trend towards distributed, networked embedded systems is
changing the way power should be managed. Power consumed by
bus and network interfaces now matches if not surpasses that of the
CPU and is thus becoming a prime candidate for reduction. This paper
explores energy-efficient bus topologies as a new technique for
global power optimization of embedded systems that are interconnected
by high-speed serial network-like busses such as FireWire and
a new generation of SoC buses. Our grammar-based representation
for these networks enables selection of energy-efficient bus topologies.
Experimental results show 15ö20% energy saving on the network
interfaces without sacrificing system performance.
Co-Chairs: Vasily Moshnyaga, Hiroyuki Ochi
-
8A-1 Semi-Formal Test Generation and Resolving a Temporal Abstraction Problem in Practice: Industrial Application [p. 699]
-
J. Dushina, M. Benjamin, D. Geist
This document describes a successful application of a
semi-formal test generation technique to the verification of
Direct Memory Access Controller (DMAC) of ST50, a new general
purpose RISC microprocessor developed by STMicroelectronics
and Hitachi. Like other memory-related devices, the
DMA controller challenges formal techniques because of the
state explosion problem. To cope with the challenge, abstraction
mechanism is applied during test generation: several abstract
models are created in order to verify different functional aspects
of the design. We also propose a practical solution to overcome a
temporal abstraction problem that arises when tests issued from
an abstract model have to be applied during real design simulation.
-
8A-2 Scan-chain Based Watch-points for Efficient Run-Time Debugging and Verification of FPGA Designs [p. 705]
-
A. Tiwari, K.A. Tomko
This paper describes a structured and area efficient approach for in-situ
debugging of application for FPGA based reconfigurable systems. A scan chain
is inserted into the hardware design running on the FPGA, which helps in
debugging and verification by providing watch-point capability. The scan
chain technique proposed is easy to use and has very low overhead. The
scan-chain based implementation capitalizes on the capability of newer FPGAs
to connect several LUTs serially and configure them as shift registers. The
hardware debugging procedure proposed using the shift register LUTs does not
require any recompilation of the design to change the watch-point conditions
and thus, is very fast. In this paper the area overhead resulting from addition
of a scan-chain based watch-point logic is discussed and is compared with other
proposed debugging techniques. We observed that this technique has an average
area overhead of 46% for the ITC benchmark circuits with varying widths of watch-point signals.
-
8A-3s A Novel Approach for Digital Waveform Compression [p. 712]
-
E. Naroska, S.-J. Ruan, C.-L. Ho, S. Mchaalia, F. Lai, U. Schwiegelshohn
The waveform data, which is gathered during digital simulation
of large and complex systems easily fill up huge amounts of
disk space on the simulation computer. This is specially true for
many deep-submicron designs. Hence, we developed a set of algorithms
and techniques which can be used to efficiently compress
digital waveforms without losing any significant information from
the original data. Experimental results show that compression ratios
of up to 300 compared to a original VCD formatted file can
be achieved.
-
8A-4s A Deep Submicron Power Estimation Methodology Adaptable to Variations between Power Characterization and Estimation [p. 716]
-
D. Eckerbert, P. Larsson-Edefors
Traditionally, RTL power estimation
techniques are characterizing a component for a fixed
environment (most importantly load capacitance, activity,
and operating frequency). This article presents
a solution to problems originating from the ineluctably
changing operating conditions such as differing load
capacitance due to different applications; different activity
and operating frequency as power reduction
techniques are more frequently employed.
Co-Chairs: Masaaki Yamada, Xin Yuan
-
8B-1 Congestion Driven Incremental Placement Algorithm for Standard Cell Layout [p. 723]
-
Z. Li, W. Wu, X. Hong
Congestion minimization is the least understood
in placement objectives, however, it models routability most
accurately. In this paper, a new incremental placement
algorithm C-ECOP for standard cell layout is presented to
reduce routing congestion. Congestion estimation is based on
a new routing model and a more accurate cost function. An
integer linear programming (ILP) problem is formulated to
determine cell flow direction and avoid the conflictions
between adjacent congestion areas. Experimental results
show that the algorithm can considerably reduce routing
congestion and preserve the performance of the initial
placement with high speed.
Key words: congestion, standard cell, incremental placement
-
8B-2 Performance-Driven Multi-Level Clustering for Combinational Circuits [p. 729]
-
C.N. Sze, T.-C. Wang
In this paper, an effective algorithm is presented for performance
driven multi-level clustering for combinational circuits, and
is applicable to hierarchical FPGAs.
With a novel graph contraction technique, which allows some
crucial delay information of a lower-level clustering to be
maintained in the contracted graph, our algorithm recursively
divides the lower-level clustering into the next higher-level
one in a way that each recursive clustering step is accomplished
by applying a modified single-level circuit clustering algorithm
based on [1]. We test our algorithm on the two-level clustering problem and compare it with the
latest algorithm in [2]. Experimental results show that our
algorithm achieves, on average, 12% more delay reduction
when compared to the best results (from TLC with full node-duplication)
in [2]. In fact, our algorithm is the first one for
the general multi-level circuit clustering problem with more
than two levels.
-
8B-3 Cross Talk Driven Placement [p. 735]
-
J. Lou, W. Chen
Due to the fast shrinking of process geometries, signal
integrity issues are becoming increasingly critical to the
performance and reliability of electronic systems. Traditional
post-layout based fixing methodologies quickly break down when
hundreds of thousands of nets need to be tailored for signal
integrity concerns. In order to achieve signal integrity closure,
cross talk, electro-migration, IR drop and other effects must be
addressed earlier in the design cycle. In this paper, a new
placement algorithm considering cross talk minimization is
presented. First, a probabilistic technique is presented to estimate
the coupling capacitance during coarse placement. Based on this
technique, an effective placement flow is introduced to remove the
highly coupled spots via placement density control. By addressing
cross talk at placement level with this algorithm, the value and
count of the highest post route peak noise were successfully
reduced in a set of industrial benchmarks. Furthermore, due to
reduced coupling capacitance, the design speed is 8% faster on
average in comparison with traditional timing-driven,
congestion-aware placement flow.
-
8B-4s VLSI Module Placement with Pre-placed Modules and Considering Congestion Using Solution Space Smoothing [p. 741]
-
S. Dong, X. Hong, X. Qi, R. Wang, S. Chen, J. Gu
Solution space smoothing allows a local search
heuristic to escape from a poor, local minimum. In this paper,
we propose a technique that can smooth the rugged terrain
surface of the solution space of a placement problem. We design
the solution space smoothing algorithms for VLSI placement
with pre-placed modules and placement with consideration of
congestion. Experiment results demonstrated that solution
space smoothing is very efficient for VLSI module placement,
and it can be applied to all floorplanning representations
proposed so far.
-
8B-5s A Path-based Timing-driven Quadratic Placement Algorithm [p. 745]
-
W. Hou, X. Hong, W. Wu, Y. Cai
This paper presents a path-based timing-driven
quadratic placement algorithm. The delay of the path acts as the
timing constraints. In the global optimization step, it tries to
satisfy the timing constraints. In the partition step, it tries to
decrease the cut number of critical paths. It has some special
skills, such as decrease the delay on the longest path, pad assign,
to decrease the delay further. Results show this algorithm can
make the timing behavior improve more than 20%.
Co-Chairs: Tomoo Inoue, Rochit Rajsuman
-
8C-1 Experience in Critical Path Selection for Deep Sub-Micron Delay Test and Timing Validation [p. 751]
-
J.-J. Liu, L.-C. Wang, A. Krstic, K.-T. Cheng
Critical path selection is an indispensable step for
AC delay test and timing validation. Traditionally, this step relies
on the construction of a set of worse-case paths based upon
discrete timing models. However, the assumption of discrete timing
models can be invalidated by timing defects and process variation
in the deep sub-micron domain, which are often continuous
in nature. As a result, critical paths defined in a traditional
timing analysis approach may not be truly critical in reality. In
this paper, we propose using a statistical delay evaluation framework
for estimating the quality of a path set. Based upon the
new framework, we demonstrate how the traditional definition of
a critical path set may deviate from the true critical path set in
the deep sub-micron domain. To remedy the problem, we discuss
improvements to the existing path selection strategies by including
new objectives. We then compare statistical approaches with
traditional approaches based upon experimental analysis of both
defect-free and defect-injected cases.
-
8C-2 On Effective Criterion of Path Selection for Delay Testing [p. 757]
-
M. Fukunaga, S. Kajihara, S. Takeoka, S. Yoshimura
Since a logic circuit often has too many paths to test
delay of all paths in the circuit, it is necessary for path delay
testing to limit the number of paths to be tested. Paths to be
tested should be ones with large delay that more likely cause a
fault. In addition, a test set for the paths are required to detect
other models of faults as many as possible. In this paper, we
investigate criteria of path selection for path delay testing. We
first define typical two criteria to be investigated here, and then
experimentally show the feature of paths selected with each
criterion, with respect to fault coverage of other delay fault
models. From our experiments, we observe that test patterns
for the longest paths cannot cover many other faults such as
gate delay faults or segment delay faults.
-
8C-3 DFT Timing Design Methodology for At-Speed BIST [p. 763]
-
Y. Sato, M. Sato, K. Tsutsumida, M. Kawashima, K. Hatayama, K. Nomoto
Logic BIST is well known as an effective method for low cost
testing. However, it is difficult to realize at-speed testing, as it
requires a deliberate timing design in regard to logic design
and layout of the chip. This paper presents a timing design
methodology for at-speed BIST, using a multiple-clock domain
scheme. Some experimental test results of large industrial
designs using our custom tool "Singen", will also be shown.
-
8C-4 SLV/ATPG: an Automated Flow for Test Model Generation from Switch Level Custom Circuits [p. 769]
-
T. McDougall, A. Parashkevov, S. Jolly, J. Zhu, J. Zeng, M. Abadir, C. Pyron
Custom VLSI design at the switch level is commonly
applied when a chip is required to meet stringent operating
requirements in terms of speed, power, or area. ATPG
requires gate level models, which are verified for correctness
against switch level models. Typically, test models are created
manually from the switch level models - a tedious, error-prone
process requiring experienced DFT engineers. This paper presents
an automated flow for creating gate level test models from
circuits at the switch level. The proposed flow utilizes Motorola's
Switch Level Verification (SLV) tool, which employs
detailed switch level analysis to model the behavior of MOS
transistors and represent them at a higher level of abstraction.
We present experimental results, which demonstrate that the
automated flow is capable of producing gate models that meet
the ATPG requirements and are comparable to manually created
ones.
Co-Chairs: Yu-Chung Hung, Makoto Nagata
-
8D-1 Using Red-Black Interval Trees in Device-Level Analog Placement with Symmetry Constraints [p. 777]
-
F. Balasa, S.C. Maruvada, K. Krishnamoorthy
The traditional way of approaching device-level
placement problems for analog layout is to explore a huge search
space of absolute placement representations, where cells are allowed
to illegally overlap during their moves [3, 10]. This paper
presents a novel exploration technique for analog placement,
operating on the set of tree representations of the layout [6, 2],
where the typical presence of an arbitrary number of symmetry
groups of devices is directly taken into account during the search
of the solution space. The efficiency of the novel approach is
due to the use of red-black interval trees [4], data structures employed
to support operations on dynamic sets of intervals.
-
8D-2 Current-Driven Wire Planning for Electromigration Avoidance in Analog Circuits [p. 783]
-
J. Lienig, G. Jerke
Electromigration due to insufficient wire width
can cause the premature failure of a circuit. The ongoing
reduction of circuit feature sizes has aggravated the problem
over the last couple of years, especially with analog circuits. It
is therefore an important reliability issue to consider current
densities already in the physical design stage. We present a new
methodology capable of routing analog multi-terminal signal
nets with current-dependent wire widths. It is based on
current-driven wire planning which effectively determines all
branch currents prior to detailed routing. We also discuss
successful applications of our methodology in commercial
analog circuit design.
-
8D-3 Efficient DDD-based Term Generation Algorithm for Analog Circuit Behavioral Modeling [p. 789]
-
S.X.-D. Tan, C.-J.R. Shi
An efficient approach to generating symbolic product
terms for behavioral modeling of large linear analog circuits
is presented. The approach is based on a compact determinant
decision diagram (DDD) representation of transfer functions and
characteristics of analog circuits. The new algorithm is based on
the concept that a dominant term in a DDD graph can be found by
searching the shortest path in the graph. But instead of traversing
a whole DDD graph each time, we show that a shortest path
can be found by just updating a small number of the newly added
vertices after the first shortest path is found. Experimental results
indicate that the new symbolic term generation algorithm outperforms
both pure shortest path based algorithm and dynamic
programming based algorithm, which is the fastest symbolic term
generation algorithm published so far.
-
8D-4 5Gbps Serial Link Transmitter with Pre-emphasis [p. 795]
-
C.-H. Lin, C.-H. Wang, S.-J. Jou
High-speed serial link that achieves Gbps has the
advantage of low cost and thus become popular. In this paper,
we will implement the high-speed data serial link transceiver
and demonstrate the pre-emphasis circuit. The overall circuit
is implemented in TSMC 0.18um 1P6M 1.8v CMOS process.
The performance of the transceiver can reach 5Gbps over the
10-meter long cable.
Co-Chairs: Yutaka Tamiya, Li-C. Wang
-
9A-1 Low Power Synthesis of Finite State Machines with Mixed D and T Flip-Flops [p. 803]
-
A. Iranli, P. Rezvani, M. Pedram
This paper presents a state assignment technique to reduce
dynamic power consumption in finite state machines (FSM).
The key idea is to decompose the state machine into a set of
cycles that are collectively equivalent to the original FSM, and
perform state assignment based on the cycle realization of the
state machine using Gray codes. A new implementation of
state machines by using a combination of D and T flip-flops is
thereby proposed, which in conjunction with the proposed
encoding algorithm, reduces power consumption by an average
of 15%.
-
9A-2 Don't Cares in Logic Minimization of Extended Finite State Machines [p. 809]
-
Y. Jiang, R. K. Brayton
Extended Finite State Machines (EFSMs) have
been proposed to model control oriented systems. A version of
this, with the data portion modeled by Presburger arithmetic, has
been used in formal verification and test pattern generation. This
paper proposes a general logic minimization scheme using don't
care derived from both control and data path. It consists of methods
to transfer donât cares through the data path and to generate
logic donât cares from the data path using quantifier-free Presburger
inequalities. Potential applications are discussed and preliminary
results validate the scheme on reasonable examples.
-
9A-3s Performance Optimization of Synchronous Control Units for Datapaths with Variable Delay Arithmetic Units [p. 816]
-
E. Kim, H. Saito, J.-G. Lee, D.-I. Lee, H. Nakamura, T. Nanya
Nowadays, variable delay arithmetic units have
been used for implementing a datapath of a target system in pursuit
of performance improvement. However, adoption of variable
delay arithmetic units requires modification of a typical synchronous
control unit design methodology. A telescopic arithmetic
unit based methodology is one of representative methodologies
to design synchronous control units for variable delay datapaths.
In this paper, we propose two optimization methods for
it. Proposed optimization techniques will be analyzed in order to
show their performance improvement effects explicitly.
-
9A-4s Integer Linear Programming-Based Synthesis of Skewed Logic Circuits [p. 820]
-
A. Cao, N. Sirisantana, C.-K. Koh, K. Roy
We present an integer linear programming-based
approach for solving the logic reconvergence problem
in skewed logic circuits with minimal logic duplication cost.
A simplification technique is applied to reduce the complexity
of the ILP problem greatly so that the run time is more
affordable. Experimental results show that an average of
18% of original gates are duplicated in skewed logic circuits,
whereas 65% in Domino logic circuits are duplicated. The
average power saving over Domino logic circuits is 40.9%.
Co-Chairs: Xinlong Hong, Yoichi Shiraiashi
-
9B-1 Highly Scalable Algorithms for Rectilinear and Octilinear Steiner Trees [p. 827]
-
A.B. Kahng, I.I. Mándoiu, A.Z. Zelikovsky
The rectilinear Steiner minimum tree (RSMT)
problem, which asks for a minimum-length interconnection of a
given set of terminals in the rectilinear plane, is one of the fundamental
problems in electronic design automation. Recently there
has been renewed interest in this problem due to the need for
highly scalable algorithms able to handle nets with tens of thousands
of terminals. In this paper we give a practical O(n log2 n)
heuristic for computing near-optimal rectilinear Steiner trees
based on a batched version of the greedy triple contraction algorithm
of Zelikovsky [21834]. Experiments conducted on both random
and industry testcases show that our heuristic matches or
exceeds the quality of best known RSMT heuristics, e.g., on random
instances with more than 100 terminals our heuristic improves
over the rectilinear minimum spanning tree by an average
of 11%. Moreover, our heuristic has very well scaling runtime,
e.g., it can route a 34k-terminals net extracted from a real design
in less than 25 seconds compared to over 86 minutes needed
by the O(n2) edge-based heuristic of Borah, Owens, and Irwin
[3]. Since our heuristic is graph-based, it can be easily modified
to handle practical considerations such as routing obstacles, preferred
directions, via costs, and octilinear routing ö indeed, experimental
results show only a small factor increase in runtime when
switching from rectilinear to octilinear routing.
-
9B-2 UTACO: a Unified Timing and Congestion Optimizing Algorithm for Standard Cell Global Routing [p. 834]
-
T. Jing, X. Hong, H. Bao, Y. Cai, J. Xu, C. Cheng, J. Gu
Timing performance and routability are two main
issues of global routing. In this paper, we adopt a shadow price
mechanism to incorporate the two issues into one unified
objective function. The shadow price of a net is the sum of its
congestion price and timing price. Based on the new formulation,
this paper presents the UTACO algorithm for standard cell (SC)
global routing. The experimental results show that UTACO is
efficient for both timing and congestion optimization.
-
9B-3 The Y-Architecture: Yet Another On-Chip Interconnect Solution [p. 840]
-
H. Chen, B. Yao, F. Zhou, C.-K. Cheng
In this paper, we propose a new on-chip
interconnect scheme called Y-architecture, which can utilize the
on-chip routing resources more efficiently than traditional
Manhattan interconnect architecture by allowing wires routed
in three directions (0ˇ, 60ˇ, and 120ˇ). To evaluate the efficiency
of different interconnect architectures, we assume mesh
structures with uniform communication demand and develop a
multi-commodity flow (MCF) approach to model the on-chip
communication traffic. We also extend the combinatorial MCF
algorithm in [5] to compute the optimal routing resource
allocations for different interconnect architectures. The
experiments show that: (1) Compared with Manhattan
architecture, the Y-architecture demonstrates a throughput
improvement of 30.7% for square chip. The throughput of the
Y-architecture is only 2.5% smaller than that of X-architecture.
(2) A chip with the shape of a convex polygon produces better
throughput than a rectangular chip: For Y-architecture, a
hexagonal chip provides 41% more throughput than a squared
chip using the Manhattan architecture. For Manhattan
architecture, a diamond chip achieves a throughput
improvement of 19.5% over the squared chip using the same
interconnect architecture. (3) Compared with Manhattan
architecture, the Y-architecture reduces the wire length of a
randomly distributed two pin net by 13.4% and the average
wire length of Y-architecture is only 4.3% more than that of the
X-architecture.
-
9B-4s A Novel Timing-Driven Global Routing Algorithm Considering Coupling Effects for High Performance Circuit Design [p. 847]
-
J. Xu, X. Hong, T. Jing, Y. Cai, J. Gu
As the CMOS technology enters the very deep
submicron era, inter-wire coupling capacitance becomes the
dominant part of load capacitance. The coupling effects have
brought new challenges to routing algorithms on both delay
estimation and optimization. In this paper, we propose a
timing-driven global routing algorithm with consideration of
coupling effects. The two-phase algorithm based on timing-relax
method includes a heuristic Steiner tree algorithm and an
optimization algorithm. Experimental results are given to
demonstrate the efficiency and accuracy of the algorithm.
-
9B-5s Graph Matching-Based Algorithms for Array-Based FPGA Segmentation Design and Routing [p. 851]
-
J.-M. Lin, S.-R. Pan, Y.-W. Chang
Architecture and CAD are closely related issues in FPGA design. Routing architecture
design shall optimize routability and facilitate router development; on
the other hand, router design shall consider the specific properties of routing architectures
to optimize the performance of the router. In this paper, we propose
effective and efficient unified matching-based algorithms for array-based FPGA
routing and segmentation design. For the segmentation design, we consider the
similarity of input routing instances and formulate a net-matching problem to construct
the optimal segmentation architecture. For the router design, we present a
matching-based timing-driven routing algorithm which can consider a versatile set
of routing segments. Experimental results show that our designed segmentations
significantly outperform those used in commercially available FPGAs. For example,
our designed segmentations achieve, on average, 14.6% and 19.7% improvements
in routability, compared with those used in the Lucent Technologies ORCA
2C-series and the Xilinx XC4000E-series FPGAs, respectively.
Co-Chairs: Alfred Crouch, Hiroshi Date
-
9C-1 Routing-Aware Scan Chain Ordering [p. 857]
-
P. Gupta, A.B. Kahng, S. Mantik
Scan chain insertion can have a large impact on routability, wirelength
and timing of the design. We present a routing-driven
methodology for scan chain ordering with minimum wirelength objective.
A routing-based approach to scan chain ordering, while
potentially more accurate, can result in TSP (Traveling Salesman
Problem) instances which are asymmetric and highly non-metric;
this may require a careful choice of solvers. We evaluate our new
methodology on recent industry place-and-route blocks with 1200
to 5000 scan cells. We show substantial wirelength reductions for
the routing-based flow, versus the traditional placement-based flow:
in a number of our test cases, over 86% of scan routing overhead is
saved. Even though our experiments are so far timing-oblivious, the
routing-based flow does also improve evaluated timing, and practical
timing-driven extensions appear feasible.
-
9C-2 Multiple Test Set Generation Method for LFSR-Based BIST [p. 863]
-
Y. Shi, Z. Zhang
In this paper we propose a new reseeding method for LFSR-based test
pattern generation suitable for circuits with random pattern resistant
faults. The character of our method is that the proposed test pattern
generator (TPG) can work both in normal LFSR mode, to generate pseudorandom
test vectors, and in jumping mode to make the TPG jump from a state to
the required state (seed of next group). Experimental results indicate
that its superiority against other known reseeding techniques with
respect to the length of the test sequence and the required area overhead.
-
9C-3 A Seed Selection Procedure for LFSR-Based Random Pattern Generators [p. 869]
-
K. Ichino, K.-i. Watanabe, M. Arai, S. Fukumoto, K. Iwasaki
We propose a technique of selecting seeds for the
LFSR-based test pattern generators that are used in VLSI
BISTs. By setting the computed seed as an initial value, target
fault coverage, for example 100%, can be accomplished with
minimum test length. We can also maximize fault coverage for a
given test length. Our method can be used for both
test-per-clock and test-per-scan BISTs. The procedure is based
on vector representations over GF (2m), where m is the
number of LFSR stages. The results indicate that test lengths
derived through selected seeds are about sixty percent shorter
than those derived by conventionally selected seeds for a given
fault coverage. We also show that seeds obtained through this
technique accomplish higher fault coverage than the
conventional selection procedure. In terms of the c7552
benchmark, taking a test-per-scan architecture with a 20-bit
LFSR as an example, the number of undetected faults can be
decreased from 304 to 227 for 10,000 LFSR patterns using our
proposed technique.
-
9C-4s Efficient BIST Design for Sequential Machines Using FiF-FoF Values in Machines States [p. 875]
-
S. Roy, U. Maulik, S. Bandyopadhyay, S. Basu, B.K. Sikdar
This paper introduces a novel BIST-quality metric termed as the
FiF - FoF (Fun-in-Factor & Fan-and-Factor) defined on FSM-states. Based
on the FiF - FoF analysis, an efficient scheme is presented that ensures all
state codes appear with uniform likelyhood at the present state (PS)
lines during the test phase. This results in higher fault efficiency in a
BIST structure. Experimental results on MCNC benchmarks show that the
scheme improves fault efficiency of sequential circuits significantly,
with marginal area overhead.
-
9C-5s A New Design-for-Test Technique for Reducing SOC Test Time [p. 879]
-
C.V.G. Rao, D.R. Chowdhury
This paper introduces a new design-for-test(DFT)
technique for system-on-chip(SOC) designs. It aims to provide the
test designer with details of test scheduling, test access mechanism
(TAM) design and an integrated test strategy in order to implement
an efficient test solution. Post-synthesis simulations are carried
out on the net lists of ISCAS'89 benchmark SOCs to prove
the allegiance of the proposed algorithm and to realize the DFT.
Experiments resulted in a significant reduction of the test time.
Co-Chairs: Wing-Hung Ki, Tsuneo Tsukahara
-
9D-1 Periodic Steady-State Analysis of Coupled ODE-AE-CGE Systems for MOS RF Autonomous Circuit Simulation [p. 885]
-
X. Wu, Z. Chen, J. Lai, Q. Zhang, O. Wing, J. Ren
This paper studies the steady-state analysis of a system of
ODE-AE-CGE arising from simulation of MOS RF autonomous circuits
in which the transistor plays as current generator and its current
equations could adequately represent the transistor large-signal
dynamic behavior. Gauss-Seidel relaxation is applied to decouple the
system so that shooting method, which is used to find the periodic
steady-state response, can be performed with low-order sensitivity
matrix. This simplifies the solution process and results in a fast
algorithm. We illustrate the new algorithm with the simulation of
a typically voltage-controlled oscillator (VCO), the simulation
results show good agreement with those obtained by MEDICI, a device simulator.
-
9D-2 A Frequency Separation Macromodel for System-Level Simulation of RF Circuits [p. 891]
-
X. Li, P. Li, Y. Xu, R. Dimaggio, L. Pileggi
In this paper we propose a frequency-separation
methodology to generate system-level macromodels for analog
and RF circuits. The proposed macromodels are similar in form
to those based on Volterra kernel calculations, but are much
simpler in terms of characterization and overall model
complexity, and can be derived from existing device models. This
simplicity is realized by applying some basic assumptions on the
form of the input excitations, and via separation of the
nonlinearities from the dynamic behavior. In addition, by
further separating the ideal model functionality, this
macromodel is applicable to strongly nonlinear components such
as mixers. While time-varying Volterra series models have been
proposed for mixers with a fixed local oscillation (LO) signal, the
proposed frequency separation model is completely general and
can capture the variations of the LO input during a system-level
simulation. The proposed macromodels are demonstrated in a
system-level simulation tool based on Simulink for efficient
evaluation of the entire RF system and associated components. A
GSM receiver system in 0.25µm CMOS process is used to
demonstrate the efficacy of these macromodels in our system-level
simulation environment.
-
9D-3 Nonlinear Distortion Analysis via Linear-Centric Models [p. 897]
-
P. Li, L.T. Pileggi
An efficient distortion analysis methodology is presented
for analog and RF circuits that utilizes linear-centric circuit
models to generate individual distortion contributions due
to the various circuit nonlinearities. The per-nonlinearity distortion
results are obtained via a straightforward post-simulation
step that is simpler and more efficient than the Volterra series
based approaches and do not require the high order device model
derivatives. For this reason the order of analysis can be significantly
higher than that for a Volterra series implementation
while fully accounting for all nonlinearity effects. The proposed
methodology is not restricted to weakly nonlinear circuits, but
can also analyze per-nonlinearity distortion for active switching
mixers and switch capacitor circuits when they are modeled as
periodically time-varying weakly nonlinear systems.While Volterra
series have also been attempted for this same class of circuits,
the requirement of device models for all of the high order
model derivatives makes such analysis somewhat impractical.
The proposed methodology provides important design insights
regarding the relationships between design parameters and circuit
linearity, hence the overall system performance. Circuit examples
are used to demonstrate the efficacy of the proposed
approach, and interesting insights are observed for RF switching
mixers in particular.
-
9D-4 Parasitic-Aware Design and Optimization of a Fully Integrated CMOS Wideband Amplifier [p. 904]
-
J. Park, K. Choi, D.J. Allstot
A custom CAD synthesis tool based on particle swarm optimization, and
results from the design of an RF CMOS distributed amplifier optimized
to overcome non-idealities associated with parasitic-laden passives,
are presented. The particle swarm synthesis approach is shown to be
more than an order of magnitude faster than the simulated annealing
design and optimization algorithm.
|