ASP-DAC 2003 ABSTRACTS

Sessions: [1A] [1B] [1C] [1D] [2A] [2B] [2C] [2D] [3A] [3B] [3C] [3D] [4A] [4B] [4C] [4D] [5A] [5B] [5C] [5D] [6A] [6B] [6C] [6D] [7A] [7B] [7C] [7D] [8A] [8B] [8C] [8D] [9A] [9B] [9C] [9D]


Session 1A: Bus Encoding and Memory Optimization

Co-Chairs: Luca Benini, Hiroshi Nakamura

1A-1 BEAM: Bus Encoding Based on Instruction-Set-Aware Memories [p. 3]
Y. Aghaghiri, F. Fallah, M. Pedram

This paper introduces a new approach for minimizing power dissipation on the memory address bus. The proposed approach relies on the availability of smart memories that have certain awareness of the instruction format of one or more architectures. Based on this knowledge, the memory calculates or predicts the instruction and data addresses. Hence, not all addresses are sent from the processor to the memory. This, in turn, significantly reduces the activity on the memory bus. The proposed method can eliminate up to 97% of the transitions on the instruction address bus and 75% of the transitions on the data address bus with a small hardware overhead. The actual power savings of 85% for the instruction bus and 64% for the data bus were achieved for a per-line bus capacitance of 10pF.

1A-2 Irredundant Address Bus Encoding Techniques Based on Adaptive Codebooks for Low Power [p. 9]
S. Komatsu, M. Fujita

The power dissipation at the off-chip bus is a significant part of the overall power dissipation in digital systems. This paper presents irredundant address bus encoding methods which reduce signal transitions on the instruction address buses by using adaptive codebook methods. These methods are based on the temporal locality and spatial locality of instruction address. Since applications tend to JUMP / BRANCH to limited sets of addresses, proposed encoding methods assign the least signal transition codes to the addresses of JUMP / BRANCH operations in the past. Our encoding methods reduce the signal transitions on the instruction address buses by an average of 88%.

1A-3 Multi-Parametric Improvements for Embedded Systems Using Code-Placement and Address Bus Coding [p. 15]
S. Parameswaran, J. Henkel, H. Lekastas

Code placement techniques for instruction code have shown to increase an SOCs performance mostly due to the increased cache hit ratios and as such those techniques can be a major optimization strategy for embedded systems. Little has been investigated on the interdependencies between code placement techniques and interconnect traffic (e.g. bus traffic) and optimization techniques combining both. In this paper we show as the first approach of its kind that a carefully designed known code placement strategy combined and adapted to a known interconnect encoding scheme does not only lead to a performance increase but it does also lead to a significant reduction of interconnect-related energy consumption. This becomes especially interesting since future SOC bus systems (or more general: änetworks on a chipä) are predicted to be a dominant energy consumer of an SOC. We show that a high-level optimization strategy like code placement and a lower-level optimization strategy like interconnect encoding are NOT orthogonal. Specifically, we report cache miss reduction ratios of 32% in average combined with bus related energy savings of 50.4% in average (with a maximum of up to 95.7%) by means of our combined optimization strategy. The results have been verified by means of diverse real-world SOC applications.

1A-4 Memory Access Pattern Analysis and Stream Cache Design for Multimedia Applications [p. 22]
J. Lee, C. Park, S. Ha

Memory system is a major performance and power bottleneck in embedded systems especially for multimedia applications. Most multimedia applications access stream type of data structures with regular access patterns. It is observed that conventional caches behave poorly for stream-type data structure. Therefore, prediction-based prefetching techniques have been extensively researched to exploit the regular access patterns. Prefetching, however, may pollute the cache if the prediction is not accurate and needs extra hardware prediction logic. To overcome these problems, we propose a novel hardware prefetching technique that is assisted by static analysis of data access pattern with stream caches. With the proposed stream cache architecture, we could achieve significant performance improvement compared with the conventional cache architecture.


Session 1B: DSM Interconnect and Gate Issues

Co-Chairs: Masanori Hashimoto, Kaushik Roy

1B-1 A Statistical Gate Delay Model for Intra-chip and Inter-chip Variabilities [p. 31]
K. Okada, K. Yamaoka, H. Onodera

This paper proposes a model to calculate statistical gate-delay variation caused by intra-chip and inter-chip variabilities. Our model consists of a statistical transistor model and a gate-delay model. We present a modeling and extracting method of transistor characteristics for the intra-chip variability and the inter-chip variability. In the modeling of the intra-chip variability, it is important to consider a gate-size dependence by which the amount of intra-chip variation is affected. This effect is not captured in a statistical delay analysis reported so far. Our gate-delay model characterizes a statistical gate-delay variation using a response surface method (RSM) according to the intra-chip and inter-chip variability of each transistor in a gate. We evaluate the accuracy of our model, and we show some simulated results of a circuit delay variation characterized by the measured variances of transistor currents.

1B-2 A Fast and Accurate Method for Interconnect Current Calculation [p. 37]
M. Shao, D.F. Wong, Y. Gao, H. Cao, L.-P. Yuan

As VLSI technology continues to scale down, electromigration problem has become one of the dominant factors in determining system reliability. This problem is caused by high current density flowing in the metal interconnect. Therefore, current evaluation is a crucial concern in IC design. SPICE level circuit simulators are excellent for doing current calculation, however, their running time are too expensive to be used repeatedly in design synthesis loops. In this paper, we propose an efficient approach for the interconnect current calculation. This method is based on moment matching but does not need high order moments. It only needs traversing the RC tree once to get the mean current value of every segment, traversing the tree once more is enough for the RMS current calculation, and two more traversals is sufficient for the peak current calculation. We apply our method to a larger number of interconnects getting close-to-SPICE accuracy at significantly faster runtimes. In particular, applying the method to 17,387 wire segments in the clock tree of a commercial IC, we obtained that the average deviation error of mean current is 0.0569%, average RMS current error is 0.703% and average peak current error is 6.552%. It took 28 hours for HSPICE to get current value of all the wire segments and it only took our method 156 seconds.

1B-3 Calculating the Effective Capacitance for the RC Interconnect in VDSM Technologies [p. 43]
S. Abbaspour, M. Pedram

In this paper, we present a new technique for calculating an effective capacitance of an RC interconnect line in very deep submicron design technologies. The calculation scheme guarantees that the effective capacitance model simultaneously matches both the 50% propagation delay and the 0-to-0.8Vdd output transition behavior of a standard cell driving an RC interconnect. Experimental results show that the new technique exhibits high accuracy (less than 5% error) and high efficiency (converges in two or at most three iterations). The paper also includes extensions to handle complex cells as drivers of the RC interconnect.

1B-4s Reduction of Crosstalk Noise by Optimizing 3-D Configuration of Routing Grid [p. 49]
A. Sakai, T. Yamada, Y. Matsushita, H. Yasuura

In this paper, we propose novel physical design techniques for a sub-quarter micron system-on-a-chip (SoC). By appropriately optimizing the routing grid space or the cell utilization ratio, the coupling effects are almost eliminated. By employing our proposed techniques on a 0.13µm six-layer physical design, the longest path delay is significantly decreased by 15% maximum without the need for process improvement. This significant delay reduction, which corresponds to a half generation of process progress, greatly accelerates the performance of SoCs.

1B-5s Design Tools for 3-D Integrated Circuits [p. 53]
S. Das, A. Chandrakasan, R. Reif

We present a set of design tools for 3-D integration. Using these tools - a 3-D standard-cell placement tool, global routing tool, and layout editor - we have targeted existing standard-cell circuit netlists for fabrication using wafer bonding. We have analyzed the performance of several circuits using these tools and find that 3-D integration provides significant benefits. For example, relative to single-die placement, we observe on average 28% to 51% reduction in total wire length.


Session 1C: Embedded Software: Task Scheduling and Compilation

Co-Chairs: Nikil Dutt, Akira Fukuda

1C-1 An On-line Approach for Power Minimization in QoS Sensitive Systems [p. 59]
J. L. Wong, M. Potkonjak, G. Qu

Majority of modern mobile systems have two common denominators: quality-of-service (QoS) requirements, such as latency and synchronization, and strict energy constraints. However, until now no synthesis techniques have been proposed for the design and efficient use of such systems. We have two main objectives: synthesis and conceptual. The synthesis goal is to introduce the first design technique for quality-of-service (QoS) low power synthesis. The conceptual objective is to develop a generic technique for the automatic development of on-line algorithms from efficient off-line algorithms using statistical techniques. We first summarize a system of provably-optimal techniques that minimize energy consumption of stream-oriented applications under two main QoS metrics: latency and synchronization. Specifically, we study how multiple voltages can be used to simultaneously satisfy hardware requirements and minimize power consumption, while preserving the requested level of QoS in terms of latency and synchronization. The off-line algorithm is used as input to statistical software used to identify important relevant parameters of the processes, buffer occupancy rate indicators, and a way how combine them to form a fast and efficient on-line algorithm which decides which task to run at which voltage. The effectiveness of the algorithms is demonstrated on a number of standard multimedia benchmarks.

1C-2 Energy Minimization of Real-time Tasks on Variable Voltage Processors with Transition Energy Overhead [p. 65]
Y. Zhang, X. Hu, D. Chen

In this paper, we address the problem of minimizing energy consumption of real-time tasks on variable voltage processors whose transition energy overhead is not negligible. Voltage settings with minimum number of transitions are found first and sequences of lower voltage cycles are evaluated to decide voltage for each cycle of every task. Experimental results demonstrate that our approach can reduce energy consumed by transitions from 41% to 8% and save more energy.

1C-3 Register Aware Scheduling for Distributed Cache Clustered Architecture [p. 71]
Z. Wang, X.S. Hu, E. Sha

Increasing wire delays have become a serious problem for sophisticated VLSI designs. Clustered architecture offers a promising alternative to alleviate the problem. In the clustered architecture, the cache, register file and function units are all partitioned into clusters such that short CPU cycle time can be achieved. A key challenge is the arrangement of inter-cluster communication. In this paper, we present a novel algorithm for scheduling inter-cluster communication operations. Our algorithm achieves better register resource utilization than the previous methods. By judiciously putting the selected spilled variables into their corresponding consumer's local cache, the costly cross-cache transfer is minimized. Therefore, the distributed caches are used more efficiently and the register constraint can be satisfied without compromising the schedule performance. The experiments shows that our technique outperforms the existing cluster-oriented schedulers.

1C-4 Data Partitioning for Maximal Scratchpad Usage [p. 77]
M. Verma, S. Steinke, P. Marwedel

The energy consumption for Mobile Embedded Systems is a limiting factor because of todayâs battery capacities. The memory subsystem consumes a large chunk of the energy, necessitating its efficient utilization. Energy efficient scratchpads are thus becoming common, though unlike caches they require to be explicitly utilized. In this paper, an algorithm integrated into a compiler is presented which analyzes the application, partitions an array variable whenever its beneficial, appropriately modifies the application and selects the best set of variables and program parts to be placed onto the scratchpad. Results show an energy improvement between 5.7% and 17.6% for a variety of applications against a previously known algorithm.


Session 1D: Combinational and Sequential Verification

Co-Chairs: Kiyoharu Hamaguchi, Yuji Kukimoto

1D-1 SAT-based Sequential Depth Computation [p. 87]
M. Mneimneh, K. Sakallah

Determining the depth of sequential circuits is a crucial step towards the completeness of bounded model checking proofs in hardware verification. In this paper, we formulate sequential depth computation as a logical inference problem for Quantified Boolean Formulas. We introduce a novel technique to simplify the complexity of the constructed formulas by applying simple transformations to the circuit netlist. We also study the structure of the resulting simplified QBFs and construct an efficient SAT-based algorithm to check their satisfiability. We report promising experimental results on some of the ISCAS 89 benchmarks.

1D-2 Logic Verification Based on Diagnosis Techniques [p. 93]
A. Veneris, A. Smith, M.S. Abadir

We present a formal logic verification methodology for combinational circuits. The method uses simulation, logic diagnosis and ATPG to identify circuit lines that implement equivalent logic functions efficiently. One advantage of the proposed technique is that it identifies line equivalences under controllability and observability don't care conditions, while not suffering from false negatives. The method is easy to implement, and, due to its general nature, existing techniques can benefit from ideas described here. We also give implementation details and present experiments to confirm its potential.

1D-3 Algorithms for Compacting Error Traces [p. 99]
Y.-A. Chen, F.-S. Chen

In this paper, we present a concept of compacting the error traces generated by pseudo-random/random simulations. The new shorter error trace not only decreases the time of user's debugging process but also reduces the simulation time required to verify the bug fixes. Two algorithms CET1 and CET2 are presented to perform the task of compacting the error trace. Both algorithms first use an efficient approach to eliminate the redundant states to generate the unique states of the error trace. Then, CET1 build the connected graph of these unique states by computing the reachable states by one cycle for each unique state, and then apply Dijkstra's shortest path algorithm to find out the shortest error trace in the connected graph. Compared with CET1, CET2 computes the reachable states by one cycle for those unique states, when they are needed in Dijkstra's shortest path algorithm to find the shortest error trace. After finding the shorter trace, the corresponding input/output test vectors are generated. The experimental results show that both algorithms can reduce the length of error traces dramatically for most cases using reasonable memory. For cases required longer CPU time to find the shortest trace, CET2 is up to 37 times faster than CET1.

1D-4s Transaction-based Waveform Analysis for IP Selection [p. 104]
J. Liu, E. Shragowitz

In a process of IP selection, it is necessary to establish whether a candidate IP is equivalent to a behavioral model of a design proposed by a customer. It is desirable to perform this verification to exclude IPs, which "don't match" the rest of the design. This work combines a simulation approach to establishing equivalence between models with formal regular expression techniques to provide transaction-level preliminary evaluation of IP suitability. Such evaluation could precede a decision to acquire IP.

1D-5s An Automatic Interconnection Rectification Technique for SoC Design Integration [p. 108]
C.-Y. Wang, S.-W. Tung, J.-Y. Jou

This paper presents an automatic interconnection rectification (AIR) technique to correct the misplaced interconnection occurred in the integration of a SoC design automatically. The experimental results show that the AIR can correct the misplaced interconnection and therefore accelerates the integration verification of a SoC design.


Session 2A: C-Based Specification and ASIP Design

Co-Chairs: Nagisa Ishiura, Wolfgang Rosenstiel

2A-1 Typing Abstractions and Management in a Component Framework [p. 115]
F.J. Doucet, S.K. Shukla, R.K. Gupta

We consider the type inference problems in a compositional design environment where the components are automatically instantiated from pre-existing C++-based intellectual property (IP) libraries. We present a component integration language based on scripting for design specification. Our focus is architectural aspects in specification that uses aggregation, as opposed to the more commonly used inheritanceö for composition of components. Our approach simplifies architectural specification by employing a type inference and type-management-environment. We show that the type inference problem is NP-complete. We present a heuristic based on code generation and parameterization to solve the type inference for IP selection in our C++-based composition environment. We have implemented the composition and type management in the BALBOA framework. The results show the utility of our approach.

2A-2 Event-Driven Observability Enhanced Coverage Analysis of C Programs for Functional Validation [p. 123]
F. Fallah, I. Ghosh, M. Fujita

Software programs written in some programming languages like C, C++, Java, etc, are mostly verified by functional simulation. Since exhaustive functional simulation is impossible for even a small C program, it is important to quantitatively measure the extent of design verification during simulation by a set of test vectors. Various coverage metrics have been proposed for measuring the degree of design verification. Most of them compute the extent of design excitation (controllability) but are unable to say whether the excitation responses have propagated to observable points in the program (observability). In this paper we propose a metric for code coverage analysis of C programs that addresses not only controllability but tackles observability as well. Thus, this metric is able to tell what percentage of the simulation responses have been propagated to observable points in the program like primary outputs or printed variables. We improve upon a recently proposed observability enhanced software coverage metric by increasing the accuracy of the analysis as well as decreasing the simulation runtime overhead by using an event-driven coverage analysis method. We report some experimental results of using our coverage analysis tool for several C programs.

2A-3 Trace-driven Rapid Pipeline Architecture Evaluation Scheme for ASIP Design [p. 129]
J.K. Kim, T.G. Kim

This paper proposes a rapid evaluation scheme of pipeline architecture using phase-accurate simulation with only delay model and trace. With latency information for every stage, we can decide if an instruction in one stage can proceed to the next stage or if an instruction can be issued for each cycle without evaluating the value for registers. Branch target becomes available with trace generated by fast instruction set simulation. Fast verification time becomes possible because instruction set simulation is performed only once.

2A-4 A Hardware/Software Partitioning Algorithm for SIMD Processor Cores [p. 135]
K. Tachikake, N. Togawa, Y. Miyaoka, J. Choi, M. Yanagisawa, T. Ohtsuki

This paper proposes a new hardware/software partitioning algorithm for processor cores with SIMD instructions. Given a compiled assembly code including SIMD instructions, a timing constraint of execution time, and available hardware units, the proposed algorithm synthesizes an area-optimized processor core with a new assembly code. Firstly, we assume an initial processor core on which an input assembly code can run with the shortest execution time. Secondly we reduce a hardware unit added to a processor core one by one while the timing constraint is satisfied. At the same time, we update the assembly code so that it can run on the new processor configuration. By repeating this process, we finally obtain a processor core architecture with small area under the given timing constraint. We expect that we can obtain a processor core which has appropriate SIMD functional units for running the input application program. The promising experimental results are also shown.


Session 2B: On-Chip Inductance

Co-Chairs: Hidetoshi Onodera, Martin D. F. Wong

2B-1 Approximate Formulae Approach for Efficient Inductance Extraction [p. 143]
A. Kurokawa, T. Sato, H. Masuda

In this paper, we present a new and effective approach to the extraction of on-chip inductance, in which we apply approximate formulae. The equations are based on the assumption of filaments or bars of finite width and zero thickness and are derived through Taylorâs expansion of the exact formula for mutual inductance between filaments. Despite the assumption of uniform current density in each of the bars, the model is sufficiently accurate for the interconnections of current and future LSIs, in which most of the wires are not affected by the skin and proximity effects. Expression of the equations in polynomial form provides a balance between accuracy and computational complexity. These equations are mapped according to the geometric structures for which they are most suitable in minimizing runtime in the calculation of inductance while remaining accurate to within 3%. Within the geometrical constraints, the wires are of arbitrary specification. From a comprehensive evaluation on the ITRS -specified global wiring structure for 2003, the values for inductance extracted through the proposed approach are within 3% of the values obtained by commercial three-dimensional (3-D) field solvers. The efficiency of the proposed approach is also demonstrated by extraction from a real layout design that has 300-k interconnecting segments.

2B-2 Accurate Prediction of the Impact of On-chip Inductance on Interconnect Delay using Electrical and Physical Parameter-based RSF [p. 149]
T. Sato, T. Kanamoto, A. Kurokawa, Y. Kawakami, H. Oka, T. Kitaura, A. Ikeuchi, H. Kobayashi, M. Hashimoto

This paper proposes a new methodology to accurately predict the impact of inductance on on-chip wire delay using response surface functions (RSF). The proposed methodology consists of two stages which involves first calculating the delay difference between RC and RLC wire models for a set of parameter variations, then building RSFs using electrical parameters such as wire resistance, capacitance, etc. , and physical parameters such as wire width, pitch, etc. as variables. The proposed methodology can help 1) to define design rules for avoiding inductance effects, 2) to point out wires that require RLC delay calculation, and 3) to estimate and correct the delay when using an RC model. An example design rule for limiting self inductance and accurate estimation of the delay difference for a 100 nm technology node is also presented.

2B-3 A Metric for Analyzing Effective On-Chip Inductive Coupling [p. 156]
G. Zhong, C.-K. Koh, K. Roy

In this paper, we propose a metric for effective inductive coupling: the matrix (R+ jwL) 1, where R and L are the resistance and inductance matrices. We use this metric to analyze the effectiveness of shields on reducing inductive coupling. Our analysis shows how the resistances of shields affect the effective inductive coupling between signal nets. SPICE simulations are carried out to validate the proposed metric.

2B-4 Determination of Worst-Case Crosstalk Noise for Non-Switching Victims in GHz+ Interconnects [p. 162]
J. Chen, L. He

Considering RLC interconnect model and multiple switching aggressors, we study switching pattern generation and switching time alignment that leads to worst-case crosstalk noise for a quiet victim or a noisy one. We assume that aggressors can have arbitrary switching patterns and can switch at arbitrary times. We show that the commonly used superposition algorithm results in 15% underestimation on average, and propose a new algorithm that has virtually the same complexity as the superposition algorithm but approximates the exhaustive search very well with only 4% underestimation on average. Further, we show that applying RC model to GHz+ interconnects in IRTS 0.10um technology underestimates crosstalk noise by up to 80%, and convincingly conclude that RLC model is necessary to analyze such interconnects.


Session 2C: Circuit and Modeling

Co-Chairs: Hiroaki Ueno, Zhiping Yu

2C-1 Recent Developments in ESD Protection for RF ICs [p. 171]
A. Wang

New challenge in ESD protection design for RF ICs is to address the complex interactions between the ESD protection network and the core circuit being protected in both directions. This paper reviews recent developments in RF ESD protection design, including switching and mis-triggering of ESD protection networks; ESD-induced parasitic capacitive, resistive, noise coupling and self-generated noise effects; RF ESD evaluation techniques; and low-parasitic compact RF ESD protection solutions.

2C-2 Temperature-Independence-Point Properties for 0.1um-Scale Pocket-Implant Technologies and the Impact on Circuit Design [p. 179]
K. Hisamitsu, H. Ueno, M. Tanaka, D. Kitamaru, M. Miura-Mattausch, H.J. Mattausch, S. Kumashiro, T. Yamaguchi, K. Yamashita, N. Nakayama

The temperature-independence point (TIP) of the drain current for MOS transistors in a 0:1µm-scale pocket-implant technology is gate-length (Lg) dependent and has different magnitudes for n-MOSFET and p-MOSFET. Circuits such as ring-oscillators have a TIP, lying between the values for n- and p- MOSFET. The circuit TIP is close to the n-MOSFET TIP for long Lg and gets closer to the p-MOSFET TIP for short Lg. The reason is the different temperature dependence of electron and hole mobility as a function of Lg. Due to the high field effect, oscillation periods of ring-oscillators with short Lg hardly improve, when the supply voltage is raised beyond the TIP. Therefore, an advantageous supply-voltage (VDD) choice for pocket-implant technologies is near the TIP of circuits, allowing a favorable combination of short switching delay and minimized temperature dependence. By designing the Vth;p closer to Vth;n, not only the low power dissipation, due to the reduction of the TIP, but also the suppressed TIP fluctuation can be realized.

2C-3 Behavioral Modeling of EM Devices by Selective Orthogonal Matrix Least-Squares Method [p. 184]
Y. Tanji, M. Suzuki, T. Watanabe, H. Asai

This paper presents a method for modeling EM devices, where sample data obtained by numerical EM solver is approximated into a rational matrix of complex s. The model is described in verilog-AMS, thus, the EM devices can be simulated with the digital/circuit mixed circuits described at various abstraction levels. To generate the model, the selective orthogonal matrix least-squares method is presented. The computational efficiency of the proposed approach is confirmed on a commercial simulator, compared with the numerical EM method.


Session 2D: Logic Optimization and Technology Mapping

Co-Chairs: Giovanni De Micheli, Shin-Ichi Minato

2D-1 A BDD-based Fast Heuristic Algorithm for Disjoint Decomposition [p. 191]
T. Bengtsson, A. Martinelli, E. Dubrova

This paper presents a heuristic algorithm for disjoint decomposition of a Boolean function based on its ROBDD representation. Two distinct features make the algorithm feasible for large functions. First, for an n-variable function, it checks only O(n2) candidates for decomposition out of O(2n) possible ones. A special strategy for selecting candidates makes it likely that all other decompositions are encoded in the selected ones. Second, the decompositions for the approved candidates are computed using a novel IntervalCut algorithm. This algorithm does not require re-ordering of ROBDD. The combination of both techniques allows us to decompose the functions of size beyond that possible with the exact algorithms. The experimental results on 582 benchmark functions show that the presented heuristic finds 95% of all decompositions on average. For 526 of those functions, it finds 100% of the decompositions.

2D-2 Logic Optimization for Asynchronous Speed Independent Controllers Using Transduction Method [p. 197]
H. Saito, H. Nakamura, M. Fujita, T. Nanya

Asynchronous speed independent (SI) circuits based on an unbounded gate delay model often suffer from high area penalty. It happens due to the lack of efficient global optimization. This paper presents a boolean optimization method based on tranduction method to optimize asynchronous SI circuits while preserving hazard-freeness.

2D-3 Technology Mapping for Low Leakage Power and High Speed with Hot-Carrier Effect Consideration [p. 203]
C.-w. Kang, M. Pedram

Leakage power and hot-carrier effects are emerging as key concerns in deep sub-micron CMOS technologies with respect to their effects on the total power dissipation and reliability of VLSI circuits. Leakage power dissipation is rapidly becoming a substantial contributor to the total power dissipation as threshold voltage becomes small. Similarly, the hot-carrier effect is one of the most significant failure mechanisms in high-density VLSI circuits. In this paper, a technology mapping technique is presented for use in reducing the leakage power dissipation of the circuit by utilizing a dual-threshold voltage cell library and for minimizing the aged delay of the circuit by considering the effect of hot carriers on the cell speeds as the circuit ages. In addition, this paper presents two methods to reduce delay during technology mapping: primary output ordering and pin permutation. Experimental results show that the total power dissipation and leakage power dissipation can be reduced by up to 27% and 52% as a result of the leakage-aware technology mapping and that the circuit aging phenomenon can be reduced by up to 10.6% as a result of hot-carrier-aware technology mapping. Delay was also reduced by up to 13% by using primary output ordering and pin permutation.

2D-4s Synthesis of High Performance Low Power PTL Circuits [p. 209]
D. Samanta, M.C. Dharmadeep, A. Pal

Among the various CMOS logic families, PTL has been recognized as one of the potential alternatives to static CMOS for the synthesis of high performance and low power circuits. Moreover, as BDDs can be readily mapped to PTL circuits, use of BDDs has been synonymous with the synthesis of PTL circuits. Most of the reported works on PTL synthesis are based on the Reduced Ordered BDDs (ROBDDs). We have developed a novel heuristic-based technique for obtaining Reduced Unordered BDDs (RUBDDs), which leads to circuits of smaller size having lesser delay and smaller power consumption compared to the existing results. We propose the technology mapping using the popular LEAP-like cells, such that the PTL circuit synthesis flow has the same flavor as that of the standard cell-based static CMOS circuit synthesis. We have also developed models for the estimation of delay and power consumption of the synthesized PTL circuits and compared those with the static CMOS and other existing PTL-based circuit realizations.

2D-5s A Technology Mapping Algorithm for Heterogeneous FPGAs [p. 213]
C.-C. Kao, Y.-T. Lai

In this paper, a technology mapping algorithm is proposed for heterogeneous FPGAs. The technology mapping problem is first formulated as a flow network problem. Then, an algorithm based on the min-cost max-flow algorithm is presented to select a proper set of feasible LUTs for various objectives. The objective, the total area composed of LUTs and routing area, are discussed in the paper. This algorithm has been tested on the MCNC benchmark circuits. Compared with other existing LUT-based FPGA mapping algorithms, the algorithm produces better characteristics.


Session 3A: SoC and NoC

Co-Chairs: Masahiro Fujita, Rajesh Gupta

3A-1 Combining Architecture Exploration and a Path to Implementation to Build a Complete SoC Design Flow from System Specification to RTL [p. 219]
M.A. Dziri, F. Samet, F.R. Wagner, W.O. Cesário, A.A. Jerraya

This paper presents a full System-on-Chip (SoC) design flow from system specification to RT-level. A new approach to obtain a full path to implementation for SoC design is proposed. This approach combines architecture design space exploration using the VCC design environment and system synthesis using the ROSES design flow, allowing a true and complete system level design flow. The experiment with a VDSL application shows a significant reduction of design time.

3A-2 Towards On-Chip Fault-Tolerant Communication [p. 225]
T. Dumitras, S. Kerner, R. Marculescu

As CMOS technology scales down into the deep-submicron (DSM) domain, devices and interconnects are subject to new types of malfunctions and failures that are harder to predict and avoid with the current system-on-chip (SoC) design methodologies. Relaxing the requirement of 100% correctness in operation drastically reduces the costs of design but, at the same time, requires SoCs be designed with some degree of system-level fault-tolerance. In this paper, we introduce a high-level model of DSM failure patterns and propose a new communication paradigm for SoCs, namely stochastic communication. Specifically, for a generic tile-based architecture, we propose a randomized algorithm which not only separates computation from communication, but also provides the required fault-tolerance to on-chip failures. This new technique is easy and cheap to implement in SoCs that integrate a large number of communicating IP cores.

3A-3 Energy-Aware Mapping for Tile-based NoC Architectures under Performance Constraints [p. 233]
J. Hu, R. Marculescu

In this paper, we present an algorithm which automatically maps the IPs/cores onto a generic regular Network on Chip (NoC) architecture such that the total communication energy is minimized. At the same time, the performance of the mapped system is guaranteed to satisfy the specified constraints through bandwidth reservation. As the main contribution, we first formulate the problem of energy-aware mapping, in a topological sense, and then propose an efficient branch-and-bound algorithm to solve it. Experimental results show that the proposed algorithm is very fast and robust, and significant energy savings can be achieved. For instance, for a complex video/audio SoC design, on average, 60.4% energy savings have been observed compared to an ad-hoc implementation.


Session 3B: Clock Synthesis and Capacitance Extraction

Co-Chairs: Masato Edahiro, Massoud Pedram

3B-1 Adaptive Wire Adjustment for Bounded Skew Clock Distribution Network [p. 243]
H. Saaied, D. Al-Khalili, A. Al-Khalili, M. Nekili

In this paper, we suggest an adaptive approach for the Clock Distribution Network (CDN) to cope with a modification in the VLSI system design. The CDN's wires are adjusted iteratively to reduce the skew that is resulting from a minor modification in the clock pins of a complex VLSI system. Such skew can be remedied by selecting a Balancing Node (BN) and adjust its edges so that the skew gets smaller. The required edge adjustments are determined using the Elmore delay model. The performance of the algorithm is investigated using different random sets of clock pins. Also, the algorithm is tested by altering some clock pins in a zero skew CDN. For small modifications in a large number of nodes in the CDN, our algorithm can achieve zero skew with less iterations than linear order algorithms.

3B-2 Power Minimization by Clock Root Gating [p. 249]
Q. Wang, S. Roy

Clock root gating transformation targets power savings on the clock tree by inserting gating logic at the root of the clock. In this paper we propose an efficient graph-based algorithm to solve the root clock gating optimization problem. The algorithm is also tightly integrated with clock tree synthesis tool so that real power savings can be achieved after clock tree is generated. Experimental results on industrial circuits showed that significant power savings can be achieved.

3B-3 BBE: Hierarchical Computation of 3-D Interconnect Capacitance with BEM Block Extraction [p. 255]
T. Lu, Z. Wang, X. Hong

Fast and accurate interconnect parasitic parameter extraction has become increasingly critical for verification and analysis in VLSI today. In this paper, a fast hierarchical extraction approach based on Boundary Element Method (BEM) is presented for 3-D parasitic capacitance computation. Hierarchical partition of the 3-D field creates many parts which are called 3-D BEM blocks. After combining all the 3-D BEM blocks, the capacitance matrix for a given set of nets can be computed by applying the boundary conditions. Numerical results shows that this method performs many times faster than the 3-D field solver, with equal accuracy.

3B-4 Improving Boundary Element Methods for Parasitic Extraction [p. 261]
S. Yan, J. Liu, W. Shi

We improve the accuracy and speed of boundary element method (BEM) or multipole accelerated BEM for interconnect parasitic extraction. Three techniques are presented and applied to capacitance extraction:selective coefficient enhancement, variable order multipole and multigrid. Experimental results show that the techniques are effective for extracting parasitics between all pairs of conductors, or between selected pairs of conductors.


Session 3C: Analysis Methodologies for Circuits

Co-Chairs: Naoyuki Shigyo, Albert Z. H. Wang

3C-1 Statistical Delay Computation Considering Spatial Correlations [p. 271]
A. Agarwal, D. Blaauw, V. Zolotov, S. Sundareswaran, M. Zhao, K. Gala, R. Panda

Process variation has become a significant concern for static timing analysis. In this paper, we present a new method for path-based statistical timing analysis. We first propose a method for modeling inter- and intra-die device length variations. Based on this model, we then present an efficient method for computing the total path delay probability distribution using a combination of device length enumeration for inter-die variation and an analytical approach for intra-die variation. We also propose a simple and effective model of spatial correlation of intra-die device length variation. The analysis is then extended to include spatial correlation. We test the proposed methods on paths from an industrial high-performance microprocessor and present comparisons with traditional path analysis which does not distinguish between inter- and intradie variations. The characteristics of the device length distributions were obtained from measured data of 8 test chips with a total of 17688 device length measurements. Spatial correlation data was also obtained from these measurements. We demonstrate the accuracy of the proposed approach by comparing our results with Monte-Carlo simulation.

3C-2 Predicting Short Circuit Power from Timing Models [p. 277]
E. Acar, R. Arunachalam, S.R. Nassif

Power dissipation is becoming a major show stopper for integrated circuit design especially in the server and pervasive computing technologies. Careful consideration of power requirements is expected to bring major changes in the way we design and analyze integrated circuit performance. This paper proposes a practical methodology to evaluate the short-circuit power of static CMOS gates via effective use of timing information from timing analysis. We introduce three methods to estimate short-circuit power of a static CMOS circuit without requiring explicit circuit simulation. Our proposed methodology offers practical advantages over previous approaches, which heavily rely on simple special device models. Proposed approach is experimented with an extensive set of benchmark examples and several device models and found very accurate.

3C-3 RCLK-VJ Network Reduction with Hurwitz Polynomial Approximation [p. 283]
Z. Qin, C.-K. Cheng

We propose a new linear network reduction algorithm based on a generalized Y-Delta transformation technique in s-domain. Resultant admittance is kept as a rational function of s with a dramatically reduced order. Yet it preserves low-order terms of exact admittance evaluated with traditional symbolic analysis. Stability of transfer functions derived from reduced order admittance is guaranteed via a Hurwitz polynomial approximation. Such low-order transfer functions are used in pole analysis and time domain waveform evaluation in response to any input signal.


Session 3D: Symbolic Simulation and Verification

Co-Chairs: Yirng-An Chen, Shinji Kimura

3D-1 Gate-Level Simulation of Quantum Circuits [p. 295]
G.F. Viamontes, M. Rajagopalan, I.L. Markov, J.P. Hayes

Simulating quantum computation on a classical computer is a difficult problem. The matrices representing quantum gates, and vectors modeling qubit states grow exponentially with an increase in the number of qubits. However, by using a new data structure called the Quantum Information Decision Diagram (QuIDD) that exploits the structure of quantum operators, many of these matrices and vectors can be represented in a form that grows polynomially. Using QuIDDs, we implemented a general-purpose quantum computing simulator in C++ called QuIDDPro and tested it on Groverâs algorithm. Our QuIDD technique asymptotically outperforms other known simulation techniques.

3D-2 Enhanced Symbolic Simulation for Efficient Verification of Embedded Array Systems [p. 302]
T. Feng, L.-C. Wang, K.-T. Cheng, M. Pandey, M.S. Abadir

In the past, Symbolic Trajectory Evaluation (STE) was shown to be effective for verifying individual array blocks. However, when applying STE to verify multiple array blocks together as a single system, the run-time OBDD sizes would often blow up. In this paper, we propose using a "dual-rail" symbolic simulation scheme to facilitate the application of STE proof methodology for verifying array systems. The proposed scheme implicitly partitions a given design into control domain and datapath domain, and symbolic simulation is carried out on both domains. With this scheme, the run-time OBDD sizes during the symbolic simulation for each domain can be limited. We demonstrate the effectiveness of our approach by verifying the Memory Management Unit (MMU) in Motorola high-performance microprocessors. The verification of MMU as a whole was not possible before because of the OBDD size blow-up problem when an ordinary symbolic simulator was used in the STE proof process.

3D-3s Hardware Verification Using ANSI-C Programs as a Reference [p. 308]
E. Clarke, D. Kroening

We describe an algorithm to verify a hardware design given in Verilog using an ANSI-C program as a specification. We use SAT based Bounded-Model Checking [1] in order to reduce the equivalence problem to a bit vector logic decision problem. As a case study, we describe experimental results on a hardware and a software implementation of the data encryption standard (DES) algorithm.

3D-4s Evaluation of Multiple-Output Logic Functions Using Decision Diagrams [p. 312]
Y. Iguchi, T. Sasao, M. Matsuura

This paper shows four different methods to evaluate multiple-output logic functions using decision diagrams: Shared BDD (SBDD), Multi-Terminal BDD (MTBDD), BDD for characteristic functions (CF), and BDDs for Encoded Characteristic Function for Non-zero outputs (ECFNs). Methods to compute average evaluation time for each type of decision diagrams are presented. By experimental analysis using benchmark functions, the number of nodes and average evaluation time are compared. Our results show that BDDs for ECFNs outperforms MTBDDs, BDDs for CFs, and SBDDs with respect to both number of nodes and computation time. The sizes of BDDs for ECFNs are smaller than for MTBDDs, BDDs for CFs, and SBDDs.


Session 4A: Modeling for Floorplan

Co-Chairs: Dinesh P. Mehta, Yasuhiro Takashima

4A-1 A Simulated Annealing Approach with Sequence-Pair Encoding Using a Penalty Function for the Placement Problem with Boundary Constraints [p. 319]
S. Tayu

The module placement is one of the most important problem in the VLSI design. A practical VLSI placement problem often includes some constraints. In this paper, we propose a penalty function approach for the efficient simulated annealing search on the solution space of constrained problems. We apply the penalty function approach to the placement problem with boundary constraints. Experimental results show that our proposed method can accomplish more effective simulated annealing search than the conventional method proposed in [3] for two modules sets, an MCNC benchmark ami49 and a randomly generated module set.

4A-2 Multi-level Placement for Large-Scale Mixed-Size IC Designs [p. 325]
C.-C. Chang, J. Cong, X. Yuan

In this paper we study the large-scale mixed-size placement problem where there is a significant size variation between big and small placeable objects (the ratio can be as large as 10,000). We develop a multi-level optimization algorithm, MPGMS, for this problem which can efficiently handle both large-scale designs and large size variations. Compared with the recently published work [1] on large-scale mixed macro and standard cell placement benchmarks for wirelength minimization, our method can achieve 13% wirelength reduction on average with comparable runtime.

4A-3 Selected Sequence-Pair: An Efficient Decodable Packing Representation in Linear Time using Sequence-Pair [p. 331]
C. Kodama, K. Fujiyoshi

In this paper we study the large-scale mixed-size placement problem where there is a significant size variation between big and small placeable objects (the ratio can be as large as 10,000). We develop a multi-level optimization algorithm, MPGMS, for this problem which can efficiently handle both large-scale designs and large size variations. Compared with the recently published work [1] on large-scale mixed macro and standard cell placement benchmarks for wirelength minimization, our method can achieve 13% wirelength reduction on average with comparable runtime.

4A-4s An Extended Representation of Q-sequence for Optimizing Channel-Adjacency and Routing-Cost [p. 338]
C. Zhuang, K. Sakanushi, L. Jin, Y. Kajitani

This paper proposes a topological representation for general floorplan, called the H-sequence, which can check channel-adjacency and boundary-adjacency in a constant time. Moreover, we define Routing-cost for the placement to measure its routing difficulty. Experimental results indicate that H-sequence based placement algorithm can optimize routing-cost effectively in a short time.

4A-5s Non-slicing Floorplans with Boundary Constraints Using Generalized Polish Expression [p. 342]
D.-S. Chen, C.-T. Lin, Y.-W. Wang

In this paper, we address the problem of VLSI floorplanning with considering boundary constraints. The problem is practical and crucial in physical design since architects decide to arrange some I/O involved modules along the chip boundary to minimize both chip area and off-chip connections. By using a new representation called Generalized Polish Expression, we propose an efficient algorithm to handle the boundary constraints on non-slicing floorplans. In addition, a new fixing heuristic based on modular similarity is also presented to effectively fix the generated infeasible floorplans during the process. The experimental result is good in commonly used MCNC benchmark circuits.


Session 4B: (Special Session) Panel Discussion: Anatomy of Platform-Based Design: Is It the Savior of UDSM SoC Design Crisis ? [p. 349]

Organizers: Tadahiko Nakamura, Takahide Inoue
Moderator: Takahide Inoue
Panelists: Bob Altizer, Ken Chen, Jun Iwamura, Masasuke Kishi, Grant Martin, Augusto De Oliveira

Ever increasing UDSM design/manufacturing complexity and cost have generated serious ROI challenges. Numerous methods have been proposed and create various opportunities in the SoC industry and academia. Although IP-reuse has believed to be the-must-to-go approach but so far achieved limited success. Recently many people have argued "Platform-Based Design" as the golden solution for SoC design/manufacturing productivity. However the common definition of the Platform is not well established yet. In this panel, a group of distinguished panelists with various backgrounds will discuss what is their Platform and how PBD improve their and your SoC design productivity.
[Points of Discussion:] Some topics in this panel will be: 1) What is the Platform and Plat-form Based Design? 2) What is my Platform? What are attributes , components and hierarchy of the Platform? 3) What I get and what I loose (WYGWYL) with Platform Based Design. 4) Is my Platform same as yours? / Who should own the Platform? 5) What are EDA role, opportunity and challenges for the best Platform development?


Session 4C: Reconfigurable Systems

Co-Chairs: Reiner Hartenstein, Kiyoshi Oguri

4C-1 Efficient LUT-Based FPGA Technology Mapping for Power Minimization [p. 353]
H. Li, W.-K. Mak, S. Katkoori

We study the technology mapping problem for LUT-based FPGAs targeting at power minimization. The problem has been proved to be NP-hard previously. Hence, we present an efficient heuristic to compute low-power mapping solutions. The major distinction of our work from previous ones is that while generating a LUT, we look ahead at the impact of the mapping selection of this LUT on the power consumption of the remaining network. We choose the mapping that results in the least estimated overall power consumption. The key idea is to compute low-power -feasible cuts by an efficient incremental network flow computation method. Experimental results show that our algorithm reduces both power consumption and area over the previous algorithms reported in the literature.

4C-2 Optimal Reconfiguration Sequence Management [p. 359]
S. Ghiasi, M. Sarrafzadeh

In this paper, we present an efficient optimal algorithm for minimizing runtime reconfiguration (context switching) delay of executing an application on a reconfigurable system. We assume that the basic operations of the application are already scheduled and each of them has to be realized on the reconfigurable fabric in order to be executed. The modeling and algorithm are both applicable to partially reconfigurable platforms as well as Multi-FPGA systems. The algorithm can be directly applied to minimize the application runtime for many typical classes of applications, where the actual execution delay of basic operations is negligible compared to reconfiguration delay. We prove the optimality and efficiency of our algorithm and report experimental results, which demonstrate 40% to 2.5% improvement in total runtime reconfiguration delay.

4C-3s On Improving FPGA Routability Applying Multi-level Switch Boxes [p. 366]
J. Liu, H. Fan, Y.-L. Wu

In this paper, we propose a new FPGA switch box design style÷the extended switch boxes. An extended switch box is multi-level in nature. It consists of a kernel and extension(s) connected to the kernel, while many conventional switch boxes only consist of the kernel part and are referred to as single-level switch boxes. We show that with a much reduced total number of manufactured switches in the chip, this new design has a guaranteed complete mappability from any global routing to a feasible detailed routing of the entire FPGA chip. The interesting results seem to open a new avenue for designing FPGA routing structures.

4C-4s An Image Retrieval System Using FPGAs [p. 370]
K. Nakano, E. Takamichi

The main contribution of this paper is to present an image retrieval system using FPGAs. Given a template image T and a database of a number of images I1, I2, ..., our system lists all images that contain a subimage similar to T. More specifically, a hardware generator in our system creates the Verilog HDL source of a hardware that determines whether Ti has a similar subimage to T for any image Ii and a particular template T. The created Verilog HDL source is embed in an FPGA using the design tool provided by the FPGA vendor. Since the hardware embedded in the FPGA is designed for a particular template T, it is an instance-specific hardware that allows us to achieve extreme acceleration. We evaluate the performance of our image matching hardware using a PCI-connected Xilinx FPGA and a timing analyzer. Since the generated hardware attains up to 3000 speedup factor over the software solution, our approach is promising.

4C-5 Logic Foundry: Rapid Prototyping of FPGA-based DSP Systems [p. 374]
G. Spivey, S. Bhattacharyya, K. Nakajima

The Logic Foundry is a system for the creation and integration of FPGA-based DSP systems. Recognizing that some of the greatest challenges in creating FPGA-based systems occur in the integration of the various components, we have developed a system that addresses the following four areas of integration: design flow integration, component integration, platform integration, and software integration. Using the Logic Foundry, a system can easily be specified, and then automatically constructed and integrated with system level software.


Session 4D: Design Methodologies for Leading Edge Low-Power Design

Co-Chairs: Tohru Ishihara, Radu Marculescu

4D-1 Advanced Power Management Techniques: Going Beyond Intelligent Shutdown [p. 385]
L. Benini

Well into the System-on-Chip era, power consumption has emerged as one of the most critical challenges to design complexity scaling. Moving from a critical assessment of current technologies and architectures, we survey the distinguishing features of a design methodology that aims at energy consumption reduction, under guaranteed quality of service (QoS), as a main objective in system design.

4D-2 Design Technologies for Low Power Microprocessors [p. 390]
T. Hattori

Low power is one of the most important targets of embedded microprocessor design. To reduce the dynamic power and static power, several design technologies applied in real microprocessors. But the term "low power" is not simple in real application. According to the system low power, many varieties of low power features are required for microprocessors. For example, an application processor for cellular phone was designed using many low-power technologies. Both device / circuits level and architecture/logic level low power technologies are discussed.

4D-3 Design Methodology of Low-Power CMOS RF-ICs [p. 394]
T. Tsukahara, M. Harada, M. Ugajin, J. Kodate, A. Yamagishi

This paper presents design methodology for CMOS RF-ICs in the 2-GHz band. After describing RF transceiver architectures, it introduces some low-voltage, low-power CMOS front-end circuits that use an LC-tank folding technique. Finally, it presents a 1-V, 12-mW image-rejection receiver in the 2-GHz band.

4D-4s Minimizing Total Power by Simultaneous Vdd/Vth Assignment [p. 400]
A. Srivastava, D. Sylvester

We investigate the effectiveness of simultaneous multiple supply and threshold voltage assignment in minimizing the total power (static + dynamic) for the first time. Achievable power reductions under varying conditions are investigated, including static-power limited designs and sub-1V processes. Rules of thumb are developed for optimal Vdd's and Vth's to be used in future designs. These models show the optimal second Vdd to be approximately half the nominal Vdd while the total power savings is significantly greater than previously anticipated. We describe the impact of level conversion delays and highlight the tradeoff between power savings and critical path count.

4D-5s A Low Power CMOS Circuit with Variable Source Scheme (VSCMOS) [p. 404]
T. Yasuda, K. Hosokawa

This paper proposes a method to reduce the standby leakage current of MOSFET by controlling the voltage of the source node. The method allows to use a low threshold device for high performance speed and low power dissipation in both active and standby periods. This method can be easily applied for conventional ASIC library circuits because no additional process circuits, or devices except slight modification for the body contact cell are required.


Session 5A: Performance Driven Floorplan

Co-Chairs: Yao-Weng Chang, Shinâichi Wakabayashi

5A-1 Fast Buffer Planning and Congestion Optimization in Interconnect-driven Floorplanning [p. 411]
K.W.C. Wong, E.F.Y. Young

In this paper, we study and implement a routability-driven floorplanner with congestion estimation and buffer block planning. We assume that buffers should be inserted at flexible intervals from each other for long enough wires. Under this buffer insertion constraint, our floorplanner will estimate congestion by computing the best possible buffer locations for each net and perform probabilistic analysis based on the solution. Dynamic programming is used such that estimations can be done very effectively. Nets are topologically grouped to consider bus-based routing and to facilitate the estimation process. We compare our results with those in paper [16] which are the latest results for this problem, and show that our approach can perform better in both quality and runtime.

5A-2 Interconnect-Driven Floorplanning by Searching Alternative Packings [p. 417]
C.-w. Sham, E.F.Y. Young, H. Zhou

In traditional floorplanners, area minimization is an important issue. Due to the recent advances in VLSI technology, the number of transistors in a design and their switching speeds are increasing rapidly. This results in the increasing importance of interconnect delay and routability of a circuit. We should consider interconnect planning and buffer planning as soon as possible. In this paper, we propose a method to reduce interconnect cost of a floorplan by searching alternative packings. We found that if a floorplan F contains some rectangular supermodules, we can rearrange the blocks in the supermodule to obtain a new floorplan with the same area as F but possibly with a smaller interconnect cost. Experimental results show that we can always reduce the interconnect cost of a floorplan without any penalty in area and runtime by using this method.

5A-3s Noise-Aware Buffer Planning for Interconnect-Driven Floorplanning [p. 423]
S.-M. Li, Y.-H. Cherng, Y.-W. Chang

Crosstalk-induced noise has become a key problem in interconnect optimization when technology improves, spacing diminishes, and coupling capacitance/inductance increases. Buffer insertion/sizing is one of the most effective and popular techniques to reduce interconnect delay and decouple coupling effects. It is traditionally applied to post-layout optimization. However, it is obviously infeasible to insert/size hundreds of thousands buffers during the post-layout stage when most routing regions are occupied. Therefore, it is desirable to incorporate buffer planning into floorplanning to ensure timing closure and design convergence. In this paper, we first derive formulae of buffer insertion for timing and noise optimization, and then apply the formulae to compute the feasible regions for inserting buffers to meet both timing and noise constraints. Experimental results show that our approach achieves an average success rate of 80.9% (78.2%) of nets meeting timing constraints alone (both timing and noise constraints) and consumes an average extra area of only 0.49% (0.66%) over the given floorplan, compared with the average success rate of 75.6% of nets meeting timing constraints alone and an extra area of 1.33% by the BBP method [3].

5A-4s Floorplanning with Power Supply Noise Avoidance [p. 427]
H.-M. Chen, L.-D. Huang, I.-M. Liu, M. Lai, D.F. Wong

With todayâs advanced integrated circuits (ICs) manufacturing technology in deep submicron (DSM) environment, we can integrate entire electronic systems on a single chip (SoC). However, without careful power supply planning in layout, the design of chips will suffer from mostly signal integrity problems including IR-drop, Delta I noise, and IC reliability. Postroute methodologies in solving signal integrity problem have been applied but they will cause a long turn-around time, which adds costly delays to time-to-market. In this paper, we study the problem of power supply noise avoidance as early as in floorplanning stage. We show that the noise avoidance in power supply planning problem can be formulated as a constrained maximum flow problem and present an efficient yet effective heuristic to handle the problem. Experimental results are encouraging. With slight increase of total wirelength, we achieve almost no IR-drop requirement violation and 46.6% of improvement on Delta I noise constraint violation compared with a previous approach.

5A-5s Simultaneous Floorplanning and Buffer Block Planning [p. 431]
I.H.-R. Jiang, Y.-W. Chang, J.-Y. Jou, K.-Y. Chao

As technology advances and the number of interconnections among modules rapidly increases, timing closure and design convergence are the most important concerns. Hence, it is desirable to consider interconnect optimization as early as possible. In this paper, we first address simultaneous floorplanning and buffer block planning (i.e., integrating buffer block planning into floorplanning) for interconnect optimization. Experimental results show that our method can significantly improve the interconnect delay and reduce the number of buffers needed.

5A-6s A Buffer Planning Algorithm Based on Dead Space Redistribution [p. 435]
S. Chen, X. Hong, S. Dong, Y. Ma, Y. Cai, C.-K. Cheng, J. Gu

This paper studies the buffer planning problem for interconnect-centric floorplanning for nanometer technologies. The dead-spaces are the spaces within a placement that are not held by any circuit block. In this paper, we proposed a buffer planning algorithm based on dead space redistribution to make good use of dead-spaces for buffer insertion. Associated with circuit blocks under topological representations, the dead space can be redistributed by freely moving some circuit blocks within their rooms in the placement. The total area and the topology of the placement keep unchanged while doing the dead space redistribution. The number of nets satisfying the delay constraint can be increased by redistributing the dead space all over the placement, which has been demonstrated by the experimental results. The increment of the number of nets that satisfy delay constraints is 9% on an average.


Session 5B: (Special Session) Invited Talks: Virtual Core Based Reuse Methodology for SoC Design

Co-Chairs: Masahiro Fujita, Sandeep K. Shukla

5B-1 VCore Based Design Methodology [p. 441]
M. Muraoka, H. Hamada, H. Nishi, T. Tada, Y. Onishi, T. Hosokawa, K. Yoshida

The VCore [1](*) based design methodology, which has been developed at the VCDS (**) Project, is a SoC design methodology using VCores. A VCore is a reusable, high level abstracted design component. We have developed the VCore based design methodology and the VCDS tool prototype. We used the developed tool and did a trial SoC design. The design result showed that SoC design productivity improved using the proposed methodology. (*) VCDS: Virtual Core based Design System (**) VCore: Virtual Core

5B-2 Synthesis for SoC Architecture Using VCores [p. 446]
H. Nishi, M. Muraoka, R.K. Morizawa, H. Yokota, H. Hamada

In this paper, we propose a novel architecture synthesis method for SoC using VCores. VCores are reusable and configurable high-level descriptions. An initial SoC architecture, which consists of a CPU, buses, and peripherals, is generated based on an architecture template. The hardware and software tradeoff is possible on the architecture model after assignment of software VCores or hardware VCores. The assignment is based on the results of the architecture's performance estimation. We present a prototype of the synthesis for SoC architecture using VCores and an architecture level design experiment using this prototype.

5B-3 VCore Based Platform for SoC Design [p. 453]
Y. Onishi, M. Muraoka, M. Utsuki, N. Tsubaki

The reuse-based design paradigm is the key to improve the design productivity of SoCs (System on a Chip). However, SoC designers have difficulty in using conventional IPs (Intellectual Property) because they don't have enough variability, it is difficult to customize them. In this paper, we propose the variability of VCores (Virtual Cores) and show VCores are superior to IPs for system-level design property. Based on VCores, we have developed a VCore-based platform. We show the SoC design productivity will be improved by using VCores and the VCore-based platform.

5B-4 VCDS Tool Demonstration [p. 459]
R.K. Morizawa, K. Tanaka, K. Watanabe, Y. Kaitsu, S. Hanamura, T. Shinsha, M. Muraoka

The VCore(*) based Design Methodology developed in the VCDS(**) project is a novel design methodology utilizing VCores. VCores are reusable functional cores defined at high level. We designed a SoC for Wearable Computer as a vehicle application in the pilot project (Figure 1), and compared our proposed methodology with a conventional RTL based design methodology by measuring the design productivity. We obtained very promising prospects that the design productivity could be improved 20 times for the enhanced VCDS (Figure 2). We will make a demonstration of the implemented pilot tool to show how effective our proposed methodology is. (*) VCore: Virtual Core (**) VCDS: Virtual Core based Design System


Session 5C: (Special Session) Invited Talks + Panel Discussion: Adaptive Computing: What Can It Do, Where Can It Go? [p. 463]

Organizer: Robert Reuss
Moderators: Jose Munoz, Toshiaki Miyazaki
Panelists: Nader Bagherzadeh, Prith Banerjee, Brad Hutchings, Jose Munoz, Brian Schott

The Adaptive Computing Systems (ACS) program was initiated by Defense Advanced Research Projects Agency (DARPA) of the United States in 1996. With the advent of FPGAs, has emerged a new class of computing systems that contain configurable hardware. This session begins by a presentation by the first ACS program manager of its motivation, original goals, and objectives. It is then followed by presentations of four specific projects under the ACS program. Future activities surrounding the ACS community will be discussed at the end.

5C-1 DARPA's Adaptive Computing Systems Program [p. 464]
J. Munoz

Motivation for DARPAâs ACS program will be presented along with original goals and objectives of the program. A brief description of some of the efforts that were initiated and why. A report card of what I think the program did well, and where I feel additional work is still required.

5C-2 Applications of Adaptive Computing Systems for Signal Processing Challenges [p. 465]
B. Schott, P. Bellows, M. French, R. Parker

Adaptive computing systems use FPGAs for custom hardware acceleration in high performance and real-time applications. Unlike single purpose dedicated hardware approaches, the reusable nature of the technology introduces system design tradeoffs that must balance processing density, memory, and I/O bandwidth, not to mention more subtle issues such as ease of programming, debugging, and physical integration into real-world systems. This paper describes results from the DARPA-funded SLAAC project, which developed three generations of adaptive computing systems for a diverse set of challenging signal processing applications.

5C-3 Interactive Ray Tracing on Reconfigurable SIMD MorphoSys [p. 471]
H. Du, M. Sanchez-Elez, N. Tabrizi, N. Bagherzadeh, M.L. Anido, M. Fernandez

MorphoSys is a reconfigurable SIMD architecture. In this paper, a BSP-based ray tracing is gracefully mapped onto MorphoSys. The mapping highly exploits ray-tracing parallelism. A straightforward mechanism is used to handle irregularity among parallel rays in BSP. To support this mechanism, a special data structure is established, in which no intermediate data has to be saved. Moreover, optimizations such as object reordering and merging are facilitated. Data starvation is avoided by overlapping data transfer with intensive computation so that applications with different complexity can be managed efficiently. Since MorphoSys is small in size and power efficient, we demonstrate that MorphoSys is an economic platform for 3D animation applications on portable devices.

5C-4 An Overview of a Compiler for Mapping MATLAB Programs onto FPGAs [p. 477]
P. Banerjee

This paper describes a behavioral synthesis tool called the MATCH compiler developed as part of the DARPA Adaptive Computing Systems program. The MATCH compiler reads in high-level descriptions of DSP applications written in MATLAB, and automatically generates synthesizable RTL models in VHDL. The RTL models can be synthesized using commercial logic synthesis tools and place and route tools onto FPGAs. By linking the two design domains of DSP and FPGA hardware design, the MATCH compiler provides DSP design teams a significant reduction in design labor and time, elimination of misinterpretations and costly design rework, automatic verification of the hardware implementation, and the ability of systems engineers and algorithm developers to perform architectural exploration in the early phases of their development cycle. The paper describes how powerful directives are used to provide high-level architectural tradeoffs for the DSP designer. The MATCH compiler has been transferred to a startup company called AccelChip which has developed a commercial version of the compiler called AccelFPGA. Experimental results are reported using AccelFPGA on a set of nine MATLAB benchmarks that are mapped onto the recent Xilinx Virtex II and Altera Stratix FPGAs. The benchmark programs range in complexity from 20 lines to 170 lines of MATLAB code and produce VHDL code ranging from 1500 to 4500 lines of code. The compilation times range from 3 seconds to 40 seconds.

5C-5 Issues in Debugging Highly Parallel FPGA-based Applications Derived from Source Code [p. 483]
K.S. Hemmert, B.L. Hutchings

Using high-level synthesis tools to map programs written in general-purpose languages to FPGA hardware has grown in popularity and it is becoming necessary to provide comprehensive debugging tools in order to verify the correctness of the synthesized hardware. Currently, post-synthesis debugging is done at the circuit level. This paper discusses the issues, as well as some early results, of creating a source level debugger for hardware synthesized from source code. This study is meant to provide some insight into what needs to be added or built into synthesizing compilers in order to allow debug of a synthesized circuit at the source level, which will provide the programmer with a familiar view of the program being debugged.


Session 5D: Leading Edge Design Examples

Co-Chairs: Seongsoo Lee, Takashi Miyamori

5D-1s Implementation of the Super-Systolic Array for Convolution [p. 491]
J.-J. Lee, G.-Y. Song

High-performance computation on a large array of cells has been an important feature of systolic array. To achieve even higher degree of concurrency, it is desirable to make cells of systolic array themselves systolic array as well. The architecture of systolic array with its cells consisting of another systolic array is to be called super-systolic array. In this paper we propose a scalable super-systolic array architecture which shows high-performance and can be adopted in the VLSI design including regular interconnection and functional primitives that are typical for a systolic architecture.

5D-2s Design of a Scalable RSA and ECC Crypto-Processor [p. 495]
M.-C. Sun, C.-P. Su, C.-T. Huang, C.-W. Wu

In this paper, we propose a scalable word-based crypto-processor that performs modular multiplication based on modified Montgomery algorithm for finite fields GF(P) and GF(2m). The unified crypto-processor supports scalable keys of length up to 2048 bits for RSA and 512 bits for elliptic curve cryptography (ECC). Further extension of the key length can be done easily by enlarging the memory module or using the external memory resource. With the proposed parity prediction technique, our pipelined crypto-processor achieves a 512-bit RSA encryption rate of 276 Kbps and a 160-bit ECC encryption rate of 73.3 Kbps for a 220MHz clock rate.

5D-3s A Reconfigurable, Power-Scalable Rake Receiver IP for W-CDMA [p. 499]
A. Bianco, A. Dassatti, M. Martina, A. Molino, F. Vacca

During the last years wireless market has experienced an exponential growth. 2G systems are essentially voiceö oriented: the main innovation expected from 3G ones is the ubiquitous Internet and multimedia fruition. The transition from 2G to 3G provides both opportunities and challenges: one way to make this migration as smoother as possible relies on the employment of reconfigurable architectures. In this paper a reconfigurable Rake Receiver for W-CDMA is proposed. Very promising results from the physical implementation on a XCV300E have been obtained.

5D-4s Robust High-Performance Low-Power Carry Select Adder [p. 503]
W. Jeong, K. Roy

This paper proposes Dual Transition Skewed Logic (DTSL) based Carry Select Adder (CSA) suitable for processing units requiring low power and high performance with high noise immunity. We implemented 31-bit Carry Select Adders in three different logic styles: Dual Transition Skewed Logic (DTSL), Domino, and conventional static CMOS in TSMC 0.25um technology and compared them in terms of performance, power consumption and layout area. CSA using DTSL shows 36.7% and 17.7% improvements in power dissipation and performance, respectively, over domino, and 40.4% improvement in performance compared to a static CMOS CSA.

5D-5s Full-Custom vs. Standard-Cell Design Flow - an Adder Case Study [p. 507]
H. Eriksson, T. Henriksson, P. Larsson-Edefors, C. Svensson

Full-custom design is considered superior to standard-cell design when a high-performance circuit is requested. The structured routing of critical wires is considered to be the most important contributor to this performance gap. However, this is only true for bitsliced designs, such as ripple-carry adders, but not for designs with inter-bitslice interconnections spanning several bitslices, such as tree adders and reduction-tree multipliers. It is found that standard-cell design techniques scale better with the data width than full-custom bitsliced layouts for designs dominated by inter-bitslice interconnections.

5D-6s A 500-MHz Low-Power Five-Port CMOS Register File [p. 511]
J. Wang, Q. Zhang

Current-mode approach in register file has the great advantage of high operation speed with low power consumption. Under 0.35- m m CMOS technology, we present a 1.85-ns read access, 32« 32-bit, five-port register file in which a modified current-sense amplifier is used. The delay match circuit and TSPC-FF are also used to improve speed. Energy saving is achieved by current mode and other key ideas. Simulation results show it consumes 640mW at 500MHz 3.3V. It can still work at 2.3V with 350MHz 300mW.

5D-7s An Effective SDRAM Power Mode Management Scheme for Performance and Energy Sensitive Embedded Systems [p. 515]
N.-Y. Ker, C.-H. Chen

We present an effective power mode management scheme used in SDRAM memory controllers. The scheme employs a bus utilization monitoring mechanism to initiate proper operations of SDRAM chips. Our approach reduces energy consumption by actively switching memories to low-power mode at low bus utilization. At higher bus utilization, the scheme switches memories to open page mode to reduce precharge energy as well as program execution time. This bus utilization predictor reduces memory energy consumption without the expense of increasing program execution time. It achieved the performance level of open page policy by consuming 20% less of memory energy.

5D-8s Branch Predictor Design and Performance Estimation for a High Performance Embedded Microprocessor [p. 519]
S.-H. Lee, I.-K. Kim, L. Choi

AE64000 is a 64-bit embedded processor targeting high-end embedded applications such as HDTV, DVD, and 3D graphics. To achieve a higher performance for the AE64000, we design a branch predictor for the processor, and find the optimum parameters for the design through cycle-accurate simulations on SpecINT benchmarks and embedded applications (Dhrystone and Whetstone). In AE64000 branch prediction is complicated by the Instruction Folding Unit (IFU) of the processor front-end. By predicting on a Pre-PC in the IFU rather than using a PC in the pipeline core, we can effectively eliminate branch misprediction penalty on a correct prediction. We have developed the AE64000 simulator to evaluate the performance of the designed branch predictor, and selected the optimum branch predictor configuration by considering cost-effectiveness as well as by analyzing the results generated from AE64000 simulator. The selected branch predictor has been implemented in Verilog and is added to AE64000 pipeline.


Session 6A: Design Space Exploration

Co-Chairs: Tohru Ishihara, Sri Parameswaran

6A-1 Accelerating Design Space Exploration Using Pareto-Front Arithmetics [p. 525]
C. Haubelt, J. Teich

In this paper, we propose an approach for the synthesis of heterogeneous (embedded) systems, while exploiting a hierarchical problem structure. Particular to our approach is that we explore the set of so-called Pareto-optimal solutions, i.e., optimizing multiple objectives simultaneously. Since system complexity grows steadily leading to giant search spaces which demand for new strategies in design space exploration, we propose Pareto- Front Arithmetics (PFA) using results of subsystems to construct implementations of the top-level system. This way, we are able to reduce the exploration time dramatically. An example of an MPEG4 coder is used to show the benefit of this approach in reallife applications.

6A-2 Quality-Driven Design by Bitwidth Optimization for Video Applications [p. 532]
Y. Cao, H. Yasuura

This paper presents a novel system-level design methodology, called quality-driven design for video applications by bitwidth optimization (QDDV). An output quality adaptive approach based on a forward and a backward propagation technique, which are effective bitwidth analysis and output-quality-based bitwidth analysis are also presented for variable bitwidth optimization. In order to illustrate the potential of the proposed methodology, MPEG-2 video is used as the driver application. Experimental results show the effectiveness of the methodology.

6A-3 Arbitrary Long Digit Integer Sorter HW/SW Co-Design [p. 538]
S.-W. Cheng

The coming of multimedia era and information security era indicates that must process longer digit integer data. Previous sort researches focus on pure performance of large amount of finite fixed digit/bit number. This paper discusses on effectively solving arbitrary long digit integer sorting problem by HW/SW co-design under the Area×Time2 (AT2) price-performance constraint. The work proposes multi-level (two-level) sort architecture to attain the object: an accomplished fixed-digit (k-bit) hardware sorter implements the first or basic level sorting, software programmed radix 2k sort implements the second or higher level sorting. By Super Radix Sorting HW/SW co-design and reuse techniques, the work makes fixed-digit HW sorters more flexible and useful.
Index Term - HW/SW co-design, Reusable & Embedded Cores, Sorting, Radix Sort, Technology Independent Methodologies, System-on-a-Chip (SoC), VLSI design.


Session 6B: (Special Session) Panel Discussion: Roles of Funding Agencies in Technology-driven Economic Development [p. 547]

Organizer: Kazuo Nakajima
Moderators: Kazuo Nakajima, Brian Schott
Panelists: Tokinori Kozawa, Jose Munoz, Wolfgang Rosenstiel, Sakae Takahashi, Chen-Wen Wu

In the past, government agencies played pivotal roles in the development of new technologies. For example, Internet is an outgrowth of the ARPAnet project sponsored by the Defense Advanced Research Projects Agency (DARPA) of the United States government. Recently, consortia of private industries, often in cooperation with government agencies, have entered into this picture. As a consequence, researchers in academia have shifted their attentions much more towards real-world applications. In the panel discussion, we first compare the traditional roles of funding agencies in the USA, EU, and Asia. We then focus on the new trends in this funding business in each of the three regions. We will examine why and how such shifts have been made and discuss the roles of funding agencies to be played in the coming years. In particular, we will attempt to search for future roles and new forms/ways of the funding agencies from the viewpoint of economic development driven by technology advancement.


Session 6C: (Special Session) Invited Talk: Legal Protection for Semiconductor Intellectual Property

Chair: Masaharu Imai

6C-1 Legal Protection for Semiconductor Intellectual Property (IP) [p. 551]
Y. Oshima

Considering the value of semiconductor IP (Intellectual Property), fundamental knowledge of legal protection for IP is essential requirement for both IP providers and IP users. Since IPs are recognized as one of the intangible assets, present intellectual property system, such as patent, copyright, trade secret, and mask work protection system should provide certain legal protection by their own aspects. This article describes how to protect IP effectively by each present intellectual property system and contributes to realize the smooth IP trade based on suitable legal protection.


Session 6D: (Special Session) Presentation and Poster Session: University LSI Design Contest

Co-Chairs: Tomohisa Wada, Shoji Kawahito

6D-1 Design and Implementation of a Video-Oriented Network-Interface-Card System [p. 559]
M.-C. Chen, S.-F. Hsiao, C.-H. Yang

We design a specific Ethernet network interface card (NIC) for accelerating the video delivery by offloading the overheads of protocol headers identification/appending and CRC/checksums calculation, and speeding video bit streams with a dedicated video interface. Compared with the same operations of a 50MHz ARM micro-controller, the NIC system saves 47,000 ns per frame. This NIC card also supports the coexistence of the IPv4 and IPv6 standard for the future extension. Both FPGA prototyping and 0.35um cell-based design of the specific NIC system are given.

6D-2 A Highly Efficient AES Cipher Chip [p. 561]
C.-P. Su, T.-F. Lin, C.-T. Huang, C.-W. Wu

We present an efficient hardware implementation of the AES (Advanced Encryption Standard) algorithm, with key expansion capability. Instead of the widely used table-lookup implementation of S-box, the proposed basis transformation technique reduces the hardware overhead of the S-box by 64% and is easily pipelined to achieve high throughput rate. Using a typical 0:25µm CMOS technology, the throughput rate is 2.977 Gbps for 128-bit keys, 2.510 Gbps for 192-bit keys, and 2.169 Gbps for 256-bit keys with a 250MHz clock. Testability of the design is also considered. The area of the core circuit is about 1,279 x 1,271 µm2 .

6D-3 Implementation of Fast CRC Calculation [p. 563]
T. Henriksson, D. Liu

CRC is important for error detection in communication systems. With transmission speeds of several Gb/s the high-speed implementation is a bottleneck. A circuit with two parallel calculation units has been implemented in a 0.35 micron process. They use 32 bits and 64 bits parallel input respectively. Chip measurements prove throughput higher than 5.76 Gb/s, which indicates that 10 Gb/s throughput is possible in more modern processes.

6D-4 Design of a CMOS Test Chip for Package Models and I/O Characteristics Verification [p. 565]
C. Despande, T. Chen

Reliable packages for modern chips are crucial for satisfactory system performance. In order to ascertain package's performance, an equivalent electrical model is plugged into circuit simulations and the package's performance characteristics can be analyzed. However, one implicit assumption designers make is that these equivalent electrical models are accurate for performance characterization of packages. For modern high-speed designs, verification of the accuracy of these models is imperative and can only be carried out on real-silicon. In addition, signal integrity for inter-chip communication is crucial from a systems perspective. To study the signal integrity of data under different PVT (process variation, voltage, and temperature) conditions, appropriate test structures are needed. Data can be subjected to controlled variations through these structures and its integrity can be studied. And finally, testing package reliability at high temperatures is important since on-chip temperatures are increasing dramatically as processes continue to scale. The test chip presented in this design is intended to allow 1) circuit and package designers to evaluate and validate the accuracy of package models under a variety of environment conditions; 2) system designers to evaluate package/system level signal integrity issue before actual chips are available; and 3) system designers to evaluate on-chip temperatures and to study thermal integrity of packages. The test chip includes a variety of features to achieve these goals. The test chip is fabricated at TSMC in a 0.18 um CMOS process. The measurement results and the die microphotograph will be sent later as soon as they are available. This paper presents the layout based simulation results.

6D-5 A Still Image Encoder Based on Adaptive Resolution Vector Quantization Employing Needless Calculation Elimination Architecture [p. 567]
M. Fujibayashi, T. Nozawa, T. Nakayama, K. Mochizuki, K. Kotani, S. Sugawa, T. Ohmi

We have developed an advanced vector quantization (VQ) encoding hardware for still image encoding systems. By utilizing needless calculation elimination method, computational cost of VQ encoding is reduced to 40% or less, while maintaining the accuracy of full-search VQ. We have successfully implemented the advanced encoding method and Adaptive resolution VQ (AR-VQ), which realizes compression ratio over 1/200 while maintaining image quality, into a still image encoding processor. The processor can compress still image of 1600x2400 pixels within one second, which is 60 times faster than software implementation on current PCs.

6D-6 Speech Encoding and Encryption in VLSI [p. 569]
K.K. Chakravarthy, M.B. Srinivas

In this work, an attempt has been made to design and synthesize speech encoding and encryption as a system-on-chip. The novelty of this design is that it uses wavelet decomposition for data compression and perpetual audio masking to keep quantization noise level to a minimum. The encryption is done by implementing RSA algorithm in hardware.

6D-7 The Design of a i8080A Instruction Compatible Processor with Extended Memory Address [p. 571]
C. Kon, N. Shimizu

Computer systems with 8080 or compatible processor and CP/M were most popular from the end 70's to early in 80's. We designed an 8080 compatible processor with MMU and implement it on an FPGA as a CP/M system.
Keyword: SFL, 8080, CP/M

6D-8 The Design of a USB Device Controller IYOYOYO [p. 573]
T. Kouyama, H. Nano, C. Kon, N. Shimizu

USB(Universal Serial Bus) is a very popular interface in recent computer systems. USB replaces legacy interface, such as serial port, pararell port, and so on. We designed USB device controller IYOYOYO, and USB Gamepad device with it. This Gamepad device was implemented on an FPGA, with IYOYOYO.

6D-9 MAPLE Chip: a Processing Element for a Static Scheduling Centric Multiprocessor [p. 575]
K. Yasufuku, R. Ogawa, K. Iwai, H. Amano

A custom processor called MAPLE which supports static scheduling by automatic parallelizing compilers is implemented and evaluated. MAPLE has a high performance floating point arithmetic unit and low latency data transfer mechanism for other MAPLE chips. The maximum operational frequency is 80MHz in simulation, and the operation on the prototype board with 23MHz clock is confirmed. It requires about 0.56W at 23MHz operation.

6D-10 Finding the Best System Design Flow for a High Speed JPEG Encoder [p. 577]
K. Sakiyama, P. Schaumont, I.M. Verbauwhede

26 students at the University of California, Los Angeles (UCLA) studied system level design methodologies through the design of a high-speed JPEG encoder. The results produced by 5 different design flows onto various target platforms demonstrate the high impact of tools on design quality.

6D-11 The Design of PCI Bus Interface [p. 579]
H. Hayasaka, H. Haramiishi, N. Shimizu

These days, the PCI bus is the standard bus which not only x86 architecture but also other architecture is equipped with. However, the details are difficult to understand. We designed the general-purpose PCI bus interface for developing PCI devices, and implemented it. In this paper, the feature of design is reported.

6D-12 Low-Power Digital CDMA Receiver [p. 581]
J.-S. Liu, I-H. Chen, Y.-C. Tsai, S.-J. Jou

The advanced design task of the digital CDMA receiver are presented in this report. A biased number system and architecture is used to reduce the switching activity to reduce power consumption. Carry-save adder tree is used to speed up the summation of 127 data (3 bits) in the synchronization and date extraction process. Verilog HDL is used to describe this system and design compiler of Synopsys is used to synthesize our design. Design results show that it can work at 155MHz (Chip rate) with 9913 Gate counts by using Compass 0.35 um CMOS Cell library.

6D-13 Hardware Implementation of an EAN-13 Bar Code Decoder [p. 583]
J.D. Maeyer, H. Devos, W. Meeus, P. Verplaetse, D. Stroobandt

This paper presents an FPGA hardware implementation of a bar code decoder using the EAN-13 standard. The design was implemented on a FPGA situated on an ATMEL FPLSIC (Field Programmable System Level Integrated Circuit, AT94K40-25DQC). The design has comparable performance and is in many ways more robust than commercially available devices. However, to the authorsâ knowledge, these all use a microprocessor while our design is purely dedicated hardware.

6D-14 Error Correction Circuit using Difference-set Cyclic Code [p. 585]
Y. Kato, T. Morita

An error correction receiver using difference-set cyclic code has been designed. Highly reliable operation, short critical path, and small circuit size are key issues. The synchronization circuit is optimized in its circuit size by detecting 10 kinds of bit sequences for synchronization. The circuit action is governed by state machine combined with a Johnson counter and a timer. The critical path length is estimated to be 4.8, which is less than average value.

6D-15 Design of a Digital CDMA Receiver [p. 587]
I.V. Kumar, M.B. Srinivas

In this work, an attempt has been made to design a Digital Signal Code Division Multiple Accesses Receiver using VHDL [1] [2] and synthesize using Mentor Graphics tools. The receiver is designed for a maximum of three users as per the design specification given for the contest. The novelty of this design is that it uses four independent processes to achieve the concurrent operation for the entire functionality. The simulation and synthesis are carried out using Mentor Graphicsâ FPGA tools and an optimized circuit in terms of area and timing is brought out.

6D-16 Standard Cell Libraries with Various Driving Strength Cells for 0.13, 0.18 and 0.35um Technologies [p. 589]
M. Hashimoto, K. Fujimori, H. Onodera

We developed standard cell libraries for three technologies(0.13, 0.18 and 0.35Őm) using an automatic layout generation tool that we have developed. The developed libraries are as competitive as manually-designed libraries in layout density and speed. We verify the functionalities of all cells and the speed of basic combinational cells on the fabricated chips. The libraries are currently public to educational organizations in Japan.

6D-17 A Nearest-Hamming-Distance Search Memory with Fully Parallel Mixed Digital-Analog Match Circuitry [p. 591]
T. Koide, H.J. Mattausch, Y. Yaho, T. Gyohten, Y. Soda

A fully-parallel minimum Hamming-distance search memory has been designed, which uses digital circuitry for bit comparison and fast analog circuitry for word-comparison as well as winner-take-all (WTA) functionality. The approach allows compact and high-performance integration in conventional CMOS technology. The 0.6µm, 2-poly, 3-metal CMOS design with 32 rows and 128 columns achieves <100ns search times at <260 mW power dissipation. The approach is extendable to increased pattern length and larger row numbers and enables high efficiency for pattern matching applications such as network routers, code-book-based data compression, or object recognition. As for conventional memories high yield can be achieved with a redundancy concept for rows (including WTA function) and columns.


Session 7A: System-Level Power Issues

Co-Chairs: Kazutoshi Kobayashi, Massoud Pedram

7A-1 Run-Time Energy Estimation in System-On-a-Chip Designs [p. 595]
J. Haid, G. Kaefer, C. Steger, R. Weiss

In this paper, a co-processor for run-time energy estimation in system-on-a-chip designs is proposed. The estimation process is done by using power macro-models, thus making analogue measurement equipment obsolete to the software engineer once the system-on-a-chip (SOC) design is characterized. Compared to sampling-based profiling systems [17], the performance overhead of energy profiling is less, because the energy estimation is done completely parallel to the functional units residing on the SOC. The proposed methodology can be used for run-time power optimization and in-system energy profiling. The co-processor was evaluated on a SOC for MPEG layer III audio decoding and the experimental results show a maximum relative error of <5%.

7A-2 SEA: Fast Power Estimation for Micro-Architectures [p. 600]
P. Kalla, J. Henkel, X.S. Hu

Various approaches for micro-architectural power/ energy estimation have been introduced, mainly driven by the need to obtain fast power/energy estimates during early phases of complex SOC designs. In contrast to previous approaches we study power/energy estimation for highly optimized synthesizable description of microprocessor cores. Under this real-world design scenario, we found, unlike related previous research, that power can hardly be estimated closer than around 15% using an instruction level model. However, we can estimate the energy as close as 5%. Our research has resulted in the SEA framework that estimates energy/power consumed by a software program, taking specific micro-architectural features of the underlying programmable hardware core into consideration. With this high accuracy in energy estimation we achieve around 5 orders of magnitude faster estimations compared to state-of-the art high-level (RTL) commercial energy/power estimation tool suites. Thus, our framework is capable of reliably estimating the energy/power consumption of future complex SOCs.

7A-3s HyPE: Hybrid Power Estimation for IP-Based Programmable Systems [p. 606]
X. Liu, M.C. Papaefthymiou

This paper presents a novel power estimation scheme for programmable systems consisting of predesigned datapath and memory components. The proposed hybrid methodology yields highly accurate estimates within short runtimes by combining high-level simulation with analytical macromodeling of circuit characteristics. To assess its effectiveness in practice, we implemented our hybrid scheme into a power estimation tool, called HYPE, and applied it to explore various architectural alternatives in the design of a 256-state Viterbi decoder and a Rijndael encryptor. For designs with about 1 million transistors, our estimator terminates within seconds. Compared with industrial gate-level power estimators, our approach is up to 1,000 times faster with 5.4% deviation on average.

7A-4s An Efficient IP-Level Power Model for Complex Digital Circuits [p. 610]
C.-Y. Hsu, C.-N.J. Liu, J.-Y. Jou

In this paper, we propose an efficient IP-Level power model with a small lookup table for complex CMOS circuits. The table has only one dimension that maps the zero-delay charging and discharging capacitance into the real power consumption of pattern pairs but still has high accuracy. In order to improve the efficiency of the characterization process, the Monte Carlo approach is used during the estimation of the average power to skip the samples that will not increase the accuracy too much. The experimental result shows the table sizes are only up to 107 entries for ISCASâ85 benchmark circuits and the estimation error is only 2.99% on average using the lookup table.

7A-5 A Hierarchical Analysis Methodology for Chip-Level Power Delivery with Realizable Model Reduction [p. 614]
Y.-M. Lee, C.C.-P. Chen

In this paper, we propose a novel hierarchical analysis methodology to facilitate efficient chip-level power fluctuation analysis. With extreme efficiency and simplicity, our design methodology first builds time-varying multiport Norton equivalent circuits in a row-by-row or block-by-block based followed by global analysis on the integrated reduced models. After generating the Norton equivalent sources at external ports, we apply realizable model order reduction technologies to further reduce model. Since the elements of our reduced model are also RC devices, they are fully compatible with general circuit simulation engines. The experimental results demonstrate more than 4X speed up with the flat simulation while maintaining within 5% of accuracy.


Session 7B: (Special Session) Invited Talks: Design Methodologies for 50M Gate ASICs

Co-Chairs: Jason Cong, Satoshi Matsushita

7B-1 Optimality and Scalability Study of Existing Placement Algorithms [p. 621]
C.-C. Chang, J. Cong, M. Xie

Placement is an important step in the overall IC design process in DSM technologies, as it defines the on-chip interconnects, which have become the bottleneck in determining circuit performance. The rapidly increasing design complexity, combined with the demand for the capability of handling nearly flattened designs for physical hierarchy generation, poses significant challenges to existing placement algorithms. There are very few studies on understanding the optimality and scalability of placement algorithms, due to the limited sizes of existing benchmarks and limited knowledge of optimal solutions. The contribution of this paper includes two parts: 1) We implemented an algorithm for generating synthetic benchmarks that have known optimal wirelengths and can match any given net distribution vector. 2) Using benchmarks of 10K to 2M placeable modules with known optimal solutions, we studied the optimality and scalability of three state-of-the-art placers, Dragon [4], Capo [1], mPL [24] from academia, and one leading edge industrial placer, QPlace [5] from Cadence. For the first time our study reveals the gap between the results produced by these tools versus true optimal solutions. The wirelengths produced by these tools are 1.66 to 2.53 times the optimal in the worst cases, and are 1.46 to 2.38 times the optimal on the average. As for scalability, the average solution quality of each tool deteriorates by an additional 4% to 25% when the problem size increases by a factor of 10. These results indicate significant room for improvement in existing placement algorithms.

7B-2 IBM's 50 Million Gate ASICs [p. 628]
J. Koehl, D.E. Lackey, G. Doerre

There is no slowdown in the complexity increase for ASIC and SoC designs. As we write this paper in August, 2002, 40M gate ASICs are nearing tape-out, and 50M gate designs are likely to start before this conference takes place. This paper describes the current tool and methodology development efforts focused on enabling ASIC and SoC designs of these sizes and complexity, centered around the reduction of design turn-around-time, improvement of the quality of results and the modeling and optimization of deep sub-micron electrical effects.

7B-3 Silicon Virtual Prototyping: The New Cockpit for Nanometer Chip Design [p. 635]
W.-J. Dai, D. Huang, C.-C. Chang, M. Courtoy

A design methodology for the implementation of multi-million gate system-on-chip designs is described. The new methodology is based on the creation of a silicon virtual prototype early in the back-end design process. The prototype is generated in a fraction of the time required to complete the traditional back-end flow but still maintains very high correlation with the final design. The physical prototype becomes the Îcockpitâ where many design implementation decisions can be optimized by leveraging the short iteration times. Hierarchical design methodologies benefit from the prototyping stage by enabling a more optimal partitioning. The silicon virtual prototype also alters the nature of the hand-off model between front-end and back-end designers. The netlist can now be quickly validated using the prototype: the physical reality is being injected early in the design process resulting in fewer iterations between front-end and back-end.

7B-4 Design Flow and Methodology for 50M gate ASIC [p. 640]
A. Mehrotra, L. van Ginneken, Y. Trivedi

This paper presents a methodology for full chip RTL timing closure for very large ASICâs. The methodology is based on the concept of a "Silicon Virtual Prototype". The methodology is based on the scalable technique of clustering and cluster placement and leverages the tight integration between the algorithms by means of a common, unified data model.


Session 7C: (Special Session) Invited Talks: Mixed Signal Test

Co-Chairs: Kwang-Ting Cheng, Yasuo Sato

7C-1 Efficient Loop-back Testing of On-chip ADCs and DACs [p. 651]
H.-s. Yu, J.A. Abraham, S. Hwang, J. Roh

This paper presents an efficient approach to testing on-chip Analog to Digital Converters (ADCs) and Digital to Analog Converters (DACs) in loop-back mode. On-chip digital signal processing units can be used to generate stimuli. With this methodology, go/no-go tests as well as characterization of the individual ADCs and DACs are possible. The proposed approach is simple and overcomes the low parametric fault coverage of conventional loop-back tests. Simulations on a Matlab model of loopbacked converters are presented to validate the feasibility of the method.

7C-2 A Novel LCD Driver Testing Technique Using Logic Test Channel [p. 657]
C. Su, W.-J. Wang, C.-H. Wang, I. Tseng

This paper proposes a novel voltage measurement technique for LCD driver testing by the use of logic test channel of an ATE. The method is able to achieve less than 1mV error with the presence of 32mV RMS noise.

7C-3 An Implementation of Memory-based On-chip Analogue Test Signal Generation [p. 663]
S. Mir, L. Rolíndez, C. Domigues, L. Rufer

This paper presents a memory-based on-chip analogue test signal generation approach that is suitable for the test of an Analogue and Mixed-Signal (AMS) core. This core contains programmable electronic interfaces for acoustic and ultrasound transducers. The test signals that must be generated on-chip have only low or moderate frequencies (10 Hz-10 MHz). The test circuitry designed in a 0.18 mm CMOS technology includes a programmable shift-register, a clock divider, and a programmable switched-capacitor filter bank. By controlling the shift-register length and the sampling frequency, the paper shows that high quality single tone signals can be generated on chip in the band of interest.

7C-4 Delta-sigma Modulator Based Mixed-signal BIST Architecture for SoC [p. 669]
C.-K. Ong, K.-T. Cheng, L.-C. Wang

This paper proposes a mixed-signal Built-In Self-Test (BIST) architecture based on a second-order delta-sigma modulator. This modulator, which incorporates a design-for-testability (DfT) circuitry, is capable of testing/characterizing itself using digital stimulus. This characteristic is attractive for implementing the modulator as an on-chip analog signal analyzer. When applied for mixed-signal BIST, the modulator-based analog signal analyzer is first characterized using digital stimulus. Then the analyzer is utilized to characterize the stimulus generator in the BIST application. Some critical implementation issues of the BIST architecture are also discussed.


Session 7D: Embedded Systems: Hardware / Software Design Methodology and Optimization

Co-Chairs: Naehyuck Chang, Hiroaki Takada

7D-1 Capturing and Analyzing Requirement - In Case of Software and Applying to Hardware [p. 677]
A. Kawaguchi

It introduces the technology of requirement capture and requirement analysis that uses the UML in the software world, and considers applying to hardware development.

7D-2 A Comparison of the RTU Hardware RTOS with a Hardware/Software RTOS [p. 683]
J. Lee, V. J. Mooney III, A. Daleby, K. Ingström, T. Klevin, L. Lindh

In this paper, we show the performance comparison and analysis result among three RTOSes: the Real-Time Unit (RTU) hardware RTOS, the pure software Atalanta RTOS and a hardware/software RTOS composed of part of Atalanta interfaced to the System-on-a-Chip Lock Cache (SoCLC) hardware. We also present our RTOS configuration framework that can automatically configure these three RTOSes. The average-case simulation result of a database application example on a three-processor system running thirty tasks with RTU and the same system with SoCLC showed 36% and 19% overall speedups, respectively, as compared to the pure software RTOS system.

7D-3s Linux Kernel Customization for Embedded Systems by Using Call Graph Approach [p. 689]
C.-T. Lee, Z.-W. Hong, J.-M. Lin

It has been attracting attention to reconfigure a GPOS for application-specific domain. Linux is currently one of the popular candidate GPOSs. Although Linux has tools for kernelâs reconfiguration by letting users add/remove desired function modules, we still lack of good schemes of reconfiguring Linux according to a specific embedded system. In this article, we are going to propose an approach to customize an application-specific Linux. This approach derives from a "call graph" based on re-engineering. By analyzing a graph-structure representation of the target system, we could find the rules of removing the unnecessary codes of Linux.

7D-4s Topology Selection for Energy Minimization in Embedded Networks [p. 693]
D. Li, P. H. Chou, N. Bagherzadeh

The trend towards distributed, networked embedded systems is changing the way power should be managed. Power consumed by bus and network interfaces now matches if not surpasses that of the CPU and is thus becoming a prime candidate for reduction. This paper explores energy-efficient bus topologies as a new technique for global power optimization of embedded systems that are interconnected by high-speed serial network-like busses such as FireWire and a new generation of SoC buses. Our grammar-based representation for these networks enables selection of energy-efficient bus topologies. Experimental results show 15ö20% energy saving on the network interfaces without sacrificing system performance.


Session 8A: Design Validation Techniques

Co-Chairs: Vasily Moshnyaga, Hiroyuki Ochi

8A-1 Semi-Formal Test Generation and Resolving a Temporal Abstraction Problem in Practice: Industrial Application [p. 699]
J. Dushina, M. Benjamin, D. Geist

This document describes a successful application of a semi-formal test generation technique to the verification of Direct Memory Access Controller (DMAC) of ST50, a new general purpose RISC microprocessor developed by STMicroelectronics and Hitachi. Like other memory-related devices, the DMA controller challenges formal techniques because of the state explosion problem. To cope with the challenge, abstraction mechanism is applied during test generation: several abstract models are created in order to verify different functional aspects of the design. We also propose a practical solution to overcome a temporal abstraction problem that arises when tests issued from an abstract model have to be applied during real design simulation.

8A-2 Scan-chain Based Watch-points for Efficient Run-Time Debugging and Verification of FPGA Designs [p. 705]
A. Tiwari, K.A. Tomko

This paper describes a structured and area efficient approach for in-situ debugging of application for FPGA based reconfigurable systems. A scan chain is inserted into the hardware design running on the FPGA, which helps in debugging and verification by providing watch-point capability. The scan chain technique proposed is easy to use and has very low overhead. The scan-chain based implementation capitalizes on the capability of newer FPGAs to connect several LUTs serially and configure them as shift registers. The hardware debugging procedure proposed using the shift register LUTs does not require any recompilation of the design to change the watch-point conditions and thus, is very fast. In this paper the area overhead resulting from addition of a scan-chain based watch-point logic is discussed and is compared with other proposed debugging techniques. We observed that this technique has an average area overhead of 46% for the ITC benchmark circuits with varying widths of watch-point signals.

8A-3s A Novel Approach for Digital Waveform Compression [p. 712]
E. Naroska, S.-J. Ruan, C.-L. Ho, S. Mchaalia, F. Lai, U. Schwiegelshohn

The waveform data, which is gathered during digital simulation of large and complex systems easily fill up huge amounts of disk space on the simulation computer. This is specially true for many deep-submicron designs. Hence, we developed a set of algorithms and techniques which can be used to efficiently compress digital waveforms without losing any significant information from the original data. Experimental results show that compression ratios of up to 300 compared to a original VCD formatted file can be achieved.

8A-4s A Deep Submicron Power Estimation Methodology Adaptable to Variations between Power Characterization and Estimation [p. 716]
D. Eckerbert, P. Larsson-Edefors

Traditionally, RTL power estimation techniques are characterizing a component for a fixed environment (most importantly load capacitance, activity, and operating frequency). This article presents a solution to problems originating from the ineluctably changing operating conditions such as differing load capacitance due to different applications; different activity and operating frequency as power reduction techniques are more frequently employed.


Session 8B: Placement

Co-Chairs: Masaaki Yamada, Xin Yuan

8B-1 Congestion Driven Incremental Placement Algorithm for Standard Cell Layout [p. 723]
Z. Li, W. Wu, X. Hong

Congestion minimization is the least understood in placement objectives, however, it models routability most accurately. In this paper, a new incremental placement algorithm C-ECOP for standard cell layout is presented to reduce routing congestion. Congestion estimation is based on a new routing model and a more accurate cost function. An integer linear programming (ILP) problem is formulated to determine cell flow direction and avoid the conflictions between adjacent congestion areas. Experimental results show that the algorithm can considerably reduce routing congestion and preserve the performance of the initial placement with high speed.
Key words: congestion, standard cell, incremental placement

8B-2 Performance-Driven Multi-Level Clustering for Combinational Circuits [p. 729]
C.N. Sze, T.-C. Wang

In this paper, an effective algorithm is presented for performance driven multi-level clustering for combinational circuits, and is applicable to hierarchical FPGAs. With a novel graph contraction technique, which allows some crucial delay information of a lower-level clustering to be maintained in the contracted graph, our algorithm recursively divides the lower-level clustering into the next higher-level one in a way that each recursive clustering step is accomplished by applying a modified single-level circuit clustering algorithm based on [1]. We test our algorithm on the two-level clustering problem and compare it with the latest algorithm in [2]. Experimental results show that our algorithm achieves, on average, 12% more delay reduction when compared to the best results (from TLC with full node-duplication) in [2]. In fact, our algorithm is the first one for the general multi-level circuit clustering problem with more than two levels.

8B-3 Cross Talk Driven Placement [p. 735]
J. Lou, W. Chen

Due to the fast shrinking of process geometries, signal integrity issues are becoming increasingly critical to the performance and reliability of electronic systems. Traditional post-layout based fixing methodologies quickly break down when hundreds of thousands of nets need to be tailored for signal integrity concerns. In order to achieve signal integrity closure, cross talk, electro-migration, IR drop and other effects must be addressed earlier in the design cycle. In this paper, a new placement algorithm considering cross talk minimization is presented. First, a probabilistic technique is presented to estimate the coupling capacitance during coarse placement. Based on this technique, an effective placement flow is introduced to remove the highly coupled spots via placement density control. By addressing cross talk at placement level with this algorithm, the value and count of the highest post route peak noise were successfully reduced in a set of industrial benchmarks. Furthermore, due to reduced coupling capacitance, the design speed is 8% faster on average in comparison with traditional timing-driven, congestion-aware placement flow.

8B-4s VLSI Module Placement with Pre-placed Modules and Considering Congestion Using Solution Space Smoothing [p. 741]
S. Dong, X. Hong, X. Qi, R. Wang, S. Chen, J. Gu

Solution space smoothing allows a local search heuristic to escape from a poor, local minimum. In this paper, we propose a technique that can smooth the rugged terrain surface of the solution space of a placement problem. We design the solution space smoothing algorithms for VLSI placement with pre-placed modules and placement with consideration of congestion. Experiment results demonstrated that solution space smoothing is very efficient for VLSI module placement, and it can be applied to all floorplanning representations proposed so far.

8B-5s A Path-based Timing-driven Quadratic Placement Algorithm [p. 745]
W. Hou, X. Hong, W. Wu, Y. Cai

This paper presents a path-based timing-driven quadratic placement algorithm. The delay of the path acts as the timing constraints. In the global optimization step, it tries to satisfy the timing constraints. In the partition step, it tries to decrease the cut number of critical paths. It has some special skills, such as decrease the delay on the longest path, pad assign, to decrease the delay further. Results show this algorithm can make the timing behavior improve more than 20%.


Session 8C: Test issues for deep sub-micron design

Co-Chairs: Tomoo Inoue, Rochit Rajsuman

8C-1 Experience in Critical Path Selection for Deep Sub-Micron Delay Test and Timing Validation [p. 751]
J.-J. Liu, L.-C. Wang, A. Krstic, K.-T. Cheng

Critical path selection is an indispensable step for AC delay test and timing validation. Traditionally, this step relies on the construction of a set of worse-case paths based upon discrete timing models. However, the assumption of discrete timing models can be invalidated by timing defects and process variation in the deep sub-micron domain, which are often continuous in nature. As a result, critical paths defined in a traditional timing analysis approach may not be truly critical in reality. In this paper, we propose using a statistical delay evaluation framework for estimating the quality of a path set. Based upon the new framework, we demonstrate how the traditional definition of a critical path set may deviate from the true critical path set in the deep sub-micron domain. To remedy the problem, we discuss improvements to the existing path selection strategies by including new objectives. We then compare statistical approaches with traditional approaches based upon experimental analysis of both defect-free and defect-injected cases.

8C-2 On Effective Criterion of Path Selection for Delay Testing [p. 757]
M. Fukunaga, S. Kajihara, S. Takeoka, S. Yoshimura

Since a logic circuit often has too many paths to test delay of all paths in the circuit, it is necessary for path delay testing to limit the number of paths to be tested. Paths to be tested should be ones with large delay that more likely cause a fault. In addition, a test set for the paths are required to detect other models of faults as many as possible. In this paper, we investigate criteria of path selection for path delay testing. We first define typical two criteria to be investigated here, and then experimentally show the feature of paths selected with each criterion, with respect to fault coverage of other delay fault models. From our experiments, we observe that test patterns for the longest paths cannot cover many other faults such as gate delay faults or segment delay faults.

8C-3 DFT Timing Design Methodology for At-Speed BIST [p. 763]
Y. Sato, M. Sato, K. Tsutsumida, M. Kawashima, K. Hatayama, K. Nomoto

Logic BIST is well known as an effective method for low cost testing. However, it is difficult to realize at-speed testing, as it requires a deliberate timing design in regard to logic design and layout of the chip. This paper presents a timing design methodology for at-speed BIST, using a multiple-clock domain scheme. Some experimental test results of large industrial designs using our custom tool "Singen", will also be shown.

8C-4 SLV/ATPG: an Automated Flow for Test Model Generation from Switch Level Custom Circuits [p. 769]
T. McDougall, A. Parashkevov, S. Jolly, J. Zhu, J. Zeng, M. Abadir, C. Pyron

Custom VLSI design at the switch level is commonly applied when a chip is required to meet stringent operating requirements in terms of speed, power, or area. ATPG requires gate level models, which are verified for correctness against switch level models. Typically, test models are created manually from the switch level models - a tedious, error-prone process requiring experienced DFT engineers. This paper presents an automated flow for creating gate level test models from circuits at the switch level. The proposed flow utilizes Motorola's Switch Level Verification (SLV) tool, which employs detailed switch level analysis to model the behavior of MOS transistors and represent them at a higher level of abstraction. We present experimental results, which demonstrate that the automated flow is capable of producing gate models that meet the ATPG requirements and are comparable to manually created ones.


Session 8D: Analog Circuits Design and Methodology

Co-Chairs: Yu-Chung Hung, Makoto Nagata

8D-1 Using Red-Black Interval Trees in Device-Level Analog Placement with Symmetry Constraints [p. 777]
F. Balasa, S.C. Maruvada, K. Krishnamoorthy

The traditional way of approaching device-level placement problems for analog layout is to explore a huge search space of absolute placement representations, where cells are allowed to illegally overlap during their moves [3, 10]. This paper presents a novel exploration technique for analog placement, operating on the set of tree representations of the layout [6, 2], where the typical presence of an arbitrary number of symmetry groups of devices is directly taken into account during the search of the solution space. The efficiency of the novel approach is due to the use of red-black interval trees [4], data structures employed to support operations on dynamic sets of intervals.

8D-2 Current-Driven Wire Planning for Electromigration Avoidance in Analog Circuits [p. 783]
J. Lienig, G. Jerke

Electromigration due to insufficient wire width can cause the premature failure of a circuit. The ongoing reduction of circuit feature sizes has aggravated the problem over the last couple of years, especially with analog circuits. It is therefore an important reliability issue to consider current densities already in the physical design stage. We present a new methodology capable of routing analog multi-terminal signal nets with current-dependent wire widths. It is based on current-driven wire planning which effectively determines all branch currents prior to detailed routing. We also discuss successful applications of our methodology in commercial analog circuit design.

8D-3 Efficient DDD-based Term Generation Algorithm for Analog Circuit Behavioral Modeling [p. 789]
S.X.-D. Tan, C.-J.R. Shi

An efficient approach to generating symbolic product terms for behavioral modeling of large linear analog circuits is presented. The approach is based on a compact determinant decision diagram (DDD) representation of transfer functions and characteristics of analog circuits. The new algorithm is based on the concept that a dominant term in a DDD graph can be found by searching the shortest path in the graph. But instead of traversing a whole DDD graph each time, we show that a shortest path can be found by just updating a small number of the newly added vertices after the first shortest path is found. Experimental results indicate that the new symbolic term generation algorithm outperforms both pure shortest path based algorithm and dynamic programming based algorithm, which is the fastest symbolic term generation algorithm published so far.

8D-4 5Gbps Serial Link Transmitter with Pre-emphasis [p. 795]
C.-H. Lin, C.-H. Wang, S.-J. Jou

High-speed serial link that achieves Gbps has the advantage of low cost and thus become popular. In this paper, we will implement the high-speed data serial link transceiver and demonstrate the pre-emphasis circuit. The overall circuit is implemented in TSMC 0.18um 1P6M 1.8v CMOS process. The performance of the transceiver can reach 5Gbps over the 10-meter long cable.


Session 9A: Synthesis for Power Performance Optimization

Co-Chairs: Yutaka Tamiya, Li-C. Wang

9A-1 Low Power Synthesis of Finite State Machines with Mixed D and T Flip-Flops [p. 803]
A. Iranli, P. Rezvani, M. Pedram

This paper presents a state assignment technique to reduce dynamic power consumption in finite state machines (FSM). The key idea is to decompose the state machine into a set of cycles that are collectively equivalent to the original FSM, and perform state assignment based on the cycle realization of the state machine using Gray codes. A new implementation of state machines by using a combination of D and T flip-flops is thereby proposed, which in conjunction with the proposed encoding algorithm, reduces power consumption by an average of 15%.

9A-2 Don't Cares in Logic Minimization of Extended Finite State Machines [p. 809]
Y. Jiang, R. K. Brayton

Extended Finite State Machines (EFSMs) have been proposed to model control oriented systems. A version of this, with the data portion modeled by Presburger arithmetic, has been used in formal verification and test pattern generation. This paper proposes a general logic minimization scheme using don't care derived from both control and data path. It consists of methods to transfer donât cares through the data path and to generate logic donât cares from the data path using quantifier-free Presburger inequalities. Potential applications are discussed and preliminary results validate the scheme on reasonable examples.

9A-3s Performance Optimization of Synchronous Control Units for Datapaths with Variable Delay Arithmetic Units [p. 816]
E. Kim, H. Saito, J.-G. Lee, D.-I. Lee, H. Nakamura, T. Nanya

Nowadays, variable delay arithmetic units have been used for implementing a datapath of a target system in pursuit of performance improvement. However, adoption of variable delay arithmetic units requires modification of a typical synchronous control unit design methodology. A telescopic arithmetic unit based methodology is one of representative methodologies to design synchronous control units for variable delay datapaths. In this paper, we propose two optimization methods for it. Proposed optimization techniques will be analyzed in order to show their performance improvement effects explicitly.

9A-4s Integer Linear Programming-Based Synthesis of Skewed Logic Circuits [p. 820]
A. Cao, N. Sirisantana, C.-K. Koh, K. Roy

We present an integer linear programming-based approach for solving the logic reconvergence problem in skewed logic circuits with minimal logic duplication cost. A simplification technique is applied to reduce the complexity of the ILP problem greatly so that the run time is more affordable. Experimental results show that an average of 18% of original gates are duplicated in skewed logic circuits, whereas 65% in Domino logic circuits are duplicated. The average power saving over Domino logic circuits is 40.9%.


Session 9B: Routing

Co-Chairs: Xinlong Hong, Yoichi Shiraiashi

9B-1 Highly Scalable Algorithms for Rectilinear and Octilinear Steiner Trees [p. 827]
A.B. Kahng, I.I. Mándoiu, A.Z. Zelikovsky

The rectilinear Steiner minimum tree (RSMT) problem, which asks for a minimum-length interconnection of a given set of terminals in the rectilinear plane, is one of the fundamental problems in electronic design automation. Recently there has been renewed interest in this problem due to the need for highly scalable algorithms able to handle nets with tens of thousands of terminals. In this paper we give a practical O(n log2 n) heuristic for computing near-optimal rectilinear Steiner trees based on a batched version of the greedy triple contraction algorithm of Zelikovsky [21834]. Experiments conducted on both random and industry testcases show that our heuristic matches or exceeds the quality of best known RSMT heuristics, e.g., on random instances with more than 100 terminals our heuristic improves over the rectilinear minimum spanning tree by an average of 11%. Moreover, our heuristic has very well scaling runtime, e.g., it can route a 34k-terminals net extracted from a real design in less than 25 seconds compared to over 86 minutes needed by the O(n2) edge-based heuristic of Borah, Owens, and Irwin [3]. Since our heuristic is graph-based, it can be easily modified to handle practical considerations such as routing obstacles, preferred directions, via costs, and octilinear routing ö indeed, experimental results show only a small factor increase in runtime when switching from rectilinear to octilinear routing.

9B-2 UTACO: a Unified Timing and Congestion Optimizing Algorithm for Standard Cell Global Routing [p. 834]
T. Jing, X. Hong, H. Bao, Y. Cai, J. Xu, C. Cheng, J. Gu

Timing performance and routability are two main issues of global routing. In this paper, we adopt a shadow price mechanism to incorporate the two issues into one unified objective function. The shadow price of a net is the sum of its congestion price and timing price. Based on the new formulation, this paper presents the UTACO algorithm for standard cell (SC) global routing. The experimental results show that UTACO is efficient for both timing and congestion optimization.

9B-3 The Y-Architecture: Yet Another On-Chip Interconnect Solution [p. 840]
H. Chen, B. Yao, F. Zhou, C.-K. Cheng

In this paper, we propose a new on-chip interconnect scheme called Y-architecture, which can utilize the on-chip routing resources more efficiently than traditional Manhattan interconnect architecture by allowing wires routed in three directions (0ˇ, 60ˇ, and 120ˇ). To evaluate the efficiency of different interconnect architectures, we assume mesh structures with uniform communication demand and develop a multi-commodity flow (MCF) approach to model the on-chip communication traffic. We also extend the combinatorial MCF algorithm in [5] to compute the optimal routing resource allocations for different interconnect architectures. The experiments show that: (1) Compared with Manhattan architecture, the Y-architecture demonstrates a throughput improvement of 30.7% for square chip. The throughput of the Y-architecture is only 2.5% smaller than that of X-architecture. (2) A chip with the shape of a convex polygon produces better throughput than a rectangular chip: For Y-architecture, a hexagonal chip provides 41% more throughput than a squared chip using the Manhattan architecture. For Manhattan architecture, a diamond chip achieves a throughput improvement of 19.5% over the squared chip using the same interconnect architecture. (3) Compared with Manhattan architecture, the Y-architecture reduces the wire length of a randomly distributed two pin net by 13.4% and the average wire length of Y-architecture is only 4.3% more than that of the X-architecture.

9B-4s A Novel Timing-Driven Global Routing Algorithm Considering Coupling Effects for High Performance Circuit Design [p. 847]
J. Xu, X. Hong, T. Jing, Y. Cai, J. Gu

As the CMOS technology enters the very deep submicron era, inter-wire coupling capacitance becomes the dominant part of load capacitance. The coupling effects have brought new challenges to routing algorithms on both delay estimation and optimization. In this paper, we propose a timing-driven global routing algorithm with consideration of coupling effects. The two-phase algorithm based on timing-relax method includes a heuristic Steiner tree algorithm and an optimization algorithm. Experimental results are given to demonstrate the efficiency and accuracy of the algorithm.

9B-5s Graph Matching-Based Algorithms for Array-Based FPGA Segmentation Design and Routing [p. 851]
J.-M. Lin, S.-R. Pan, Y.-W. Chang

Architecture and CAD are closely related issues in FPGA design. Routing architecture design shall optimize routability and facilitate router development; on the other hand, router design shall consider the specific properties of routing architectures to optimize the performance of the router. In this paper, we propose effective and efficient unified matching-based algorithms for array-based FPGA routing and segmentation design. For the segmentation design, we consider the similarity of input routing instances and formulate a net-matching problem to construct the optimal segmentation architecture. For the router design, we present a matching-based timing-driven routing algorithm which can consider a versatile set of routing segments. Experimental results show that our designed segmentations significantly outperform those used in commercially available FPGAs. For example, our designed segmentations achieve, on average, 14.6% and 19.7% improvements in routability, compared with those used in the Lucent Technologies ORCA 2C-series and the Xilinx XC4000E-series FPGAs, respectively.


Session 9C: DFT Optimizations

Co-Chairs: Alfred Crouch, Hiroshi Date

9C-1 Routing-Aware Scan Chain Ordering [p. 857]
P. Gupta, A.B. Kahng, S. Mantik

Scan chain insertion can have a large impact on routability, wirelength and timing of the design. We present a routing-driven methodology for scan chain ordering with minimum wirelength objective. A routing-based approach to scan chain ordering, while potentially more accurate, can result in TSP (Traveling Salesman Problem) instances which are asymmetric and highly non-metric; this may require a careful choice of solvers. We evaluate our new methodology on recent industry place-and-route blocks with 1200 to 5000 scan cells. We show substantial wirelength reductions for the routing-based flow, versus the traditional placement-based flow: in a number of our test cases, over 86% of scan routing overhead is saved. Even though our experiments are so far timing-oblivious, the routing-based flow does also improve evaluated timing, and practical timing-driven extensions appear feasible.

9C-2 Multiple Test Set Generation Method for LFSR-Based BIST [p. 863]
Y. Shi, Z. Zhang

In this paper we propose a new reseeding method for LFSR-based test pattern generation suitable for circuits with random pattern resistant faults. The character of our method is that the proposed test pattern generator (TPG) can work both in normal LFSR mode, to generate pseudorandom test vectors, and in jumping mode to make the TPG jump from a state to the required state (seed of next group). Experimental results indicate that its superiority against other known reseeding techniques with respect to the length of the test sequence and the required area overhead.

9C-3 A Seed Selection Procedure for LFSR-Based Random Pattern Generators [p. 869]
K. Ichino, K.-i. Watanabe, M. Arai, S. Fukumoto, K. Iwasaki

We propose a technique of selecting seeds for the LFSR-based test pattern generators that are used in VLSI BISTs. By setting the computed seed as an initial value, target fault coverage, for example 100%, can be accomplished with minimum test length. We can also maximize fault coverage for a given test length. Our method can be used for both test-per-clock and test-per-scan BISTs. The procedure is based on vector representations over GF (2m), where m is the number of LFSR stages. The results indicate that test lengths derived through selected seeds are about sixty percent shorter than those derived by conventionally selected seeds for a given fault coverage. We also show that seeds obtained through this technique accomplish higher fault coverage than the conventional selection procedure. In terms of the c7552 benchmark, taking a test-per-scan architecture with a 20-bit LFSR as an example, the number of undetected faults can be decreased from 304 to 227 for 10,000 LFSR patterns using our proposed technique.

9C-4s Efficient BIST Design for Sequential Machines Using FiF-FoF Values in Machines States [p. 875]
S. Roy, U. Maulik, S. Bandyopadhyay, S. Basu, B.K. Sikdar

This paper introduces a novel BIST-quality metric termed as the FiF - FoF (Fun-in-Factor & Fan-and-Factor) defined on FSM-states. Based on the FiF - FoF analysis, an efficient scheme is presented that ensures all state codes appear with uniform likelyhood at the present state (PS) lines during the test phase. This results in higher fault efficiency in a BIST structure. Experimental results on MCNC benchmarks show that the scheme improves fault efficiency of sequential circuits significantly, with marginal area overhead.

9C-5s A New Design-for-Test Technique for Reducing SOC Test Time [p. 879]
C.V.G. Rao, D.R. Chowdhury

This paper introduces a new design-for-test(DFT) technique for system-on-chip(SOC) designs. It aims to provide the test designer with details of test scheduling, test access mechanism (TAM) design and an integrated test strategy in order to implement an efficient test solution. Post-synthesis simulations are carried out on the net lists of ISCAS'89 benchmark SOCs to prove the allegiance of the proposed algorithm and to realize the DFT. Experiments resulted in a significant reduction of the test time.


Session 9D: RF Circuits Design and Methodology

Co-Chairs: Wing-Hung Ki, Tsuneo Tsukahara

9D-1 Periodic Steady-State Analysis of Coupled ODE-AE-CGE Systems for MOS RF Autonomous Circuit Simulation [p. 885]
X. Wu, Z. Chen, J. Lai, Q. Zhang, O. Wing, J. Ren

This paper studies the steady-state analysis of a system of ODE-AE-CGE arising from simulation of MOS RF autonomous circuits in which the transistor plays as current generator and its current equations could adequately represent the transistor large-signal dynamic behavior. Gauss-Seidel relaxation is applied to decouple the system so that shooting method, which is used to find the periodic steady-state response, can be performed with low-order sensitivity matrix. This simplifies the solution process and results in a fast algorithm. We illustrate the new algorithm with the simulation of a typically voltage-controlled oscillator (VCO), the simulation results show good agreement with those obtained by MEDICI, a device simulator.

9D-2 A Frequency Separation Macromodel for System-Level Simulation of RF Circuits [p. 891]
X. Li, P. Li, Y. Xu, R. Dimaggio, L. Pileggi

In this paper we propose a frequency-separation methodology to generate system-level macromodels for analog and RF circuits. The proposed macromodels are similar in form to those based on Volterra kernel calculations, but are much simpler in terms of characterization and overall model complexity, and can be derived from existing device models. This simplicity is realized by applying some basic assumptions on the form of the input excitations, and via separation of the nonlinearities from the dynamic behavior. In addition, by further separating the ideal model functionality, this macromodel is applicable to strongly nonlinear components such as mixers. While time-varying Volterra series models have been proposed for mixers with a fixed local oscillation (LO) signal, the proposed frequency separation model is completely general and can capture the variations of the LO input during a system-level simulation. The proposed macromodels are demonstrated in a system-level simulation tool based on Simulink for efficient evaluation of the entire RF system and associated components. A GSM receiver system in 0.25µm CMOS process is used to demonstrate the efficacy of these macromodels in our system-level simulation environment.

9D-3 Nonlinear Distortion Analysis via Linear-Centric Models [p. 897]
P. Li, L.T. Pileggi

An efficient distortion analysis methodology is presented for analog and RF circuits that utilizes linear-centric circuit models to generate individual distortion contributions due to the various circuit nonlinearities. The per-nonlinearity distortion results are obtained via a straightforward post-simulation step that is simpler and more efficient than the Volterra series based approaches and do not require the high order device model derivatives. For this reason the order of analysis can be significantly higher than that for a Volterra series implementation while fully accounting for all nonlinearity effects. The proposed methodology is not restricted to weakly nonlinear circuits, but can also analyze per-nonlinearity distortion for active switching mixers and switch capacitor circuits when they are modeled as periodically time-varying weakly nonlinear systems.While Volterra series have also been attempted for this same class of circuits, the requirement of device models for all of the high order model derivatives makes such analysis somewhat impractical. The proposed methodology provides important design insights regarding the relationships between design parameters and circuit linearity, hence the overall system performance. Circuit examples are used to demonstrate the efficacy of the proposed approach, and interesting insights are observed for RF switching mixers in particular.

9D-4 Parasitic-Aware Design and Optimization of a Fully Integrated CMOS Wideband Amplifier [p. 904]
J. Park, K. Choi, D.J. Allstot

A custom CAD synthesis tool based on particle swarm optimization, and results from the design of an RF CMOS distributed amplifier optimized to overcome non-idealities associated with parasitic-laden passives, are presented. The particle swarm synthesis approach is shown to be more than an order of magnitude faster than the simulated annealing design and optimization algorithm.