Selling and using IPs seems to be the perfect solution for improving design productivity. However, IP business models may not have adequate infrastructures with which to flourish. This talk will elaborate on new design styles, modeling guidelines, methodologies, and CAD tools needed for successful reuse.
A number of different algorithms for optimized offset assignment in DSP code generation have been developed recently. These algorithms aim at constructing a layout of local variables in memory, such that the addresses of variables can be computed efficiently in most cases. This is achieved by maximizing the use of auto-increment operations on address registers. However, the algorithms published in previous work only consider special cases of offset assignment problems, characterized by fixed parameters such as register file sizes and auto-increment ranges. In contrast, this paper presents a genetic optimization technique capable of simultaneously handling arbitrary register file sizes and auto-increment ranges. Moreover, this technique is the first that integrates the allocation of modify registers into offset assignment. Experimental evaluation indicates a significant improvement in the quality of constructed offset assignments, as compared to previous work.
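As a rough illustration of the cost model behind offset assignment, the following Python sketch counts how many consecutive variable accesses fall outside the auto-increment range for a given memory layout. It assumes a single address register, ignores the initial register load, and uses exhaustive search on a toy access sequence rather than the genetic technique described above. The names `addressing_cost` and `best_layout` are hypothetical.

```python
from itertools import permutations


def addressing_cost(access_seq, layout, auto_inc_range=1):
    """Count transitions that need an explicit address computation.

    Assumption: one address register; a transition is free when the
    address distance is at most auto_inc_range.
    """
    address = {var: offset for offset, var in enumerate(layout)}
    cost = 0
    for prev, cur in zip(access_seq, access_seq[1:]):
        if abs(address[cur] - address[prev]) > auto_inc_range:
            cost += 1
    return cost


def best_layout(access_seq, auto_inc_range=1):
    """Brute-force search over all layouts (only feasible for tiny inputs)."""
    variables = sorted(set(access_seq))
    return min(
        permutations(variables),
        key=lambda layout: addressing_cost(access_seq, layout, auto_inc_range),
    )


if __name__ == "__main__":
    seq = list("abacad")  # toy access sequence, not taken from the paper
    print(addressing_cost(seq, "abcd"))  # naive alphabetical layout
    layout = best_layout(seq)
    print(layout, addressing_cost(seq, layout))
```

Placing the frequently accessed variable `a` between its neighbours lowers the cost, and widening `auto_inc_range` lowers it further, which is why a technique that handles arbitrary ranges can find better layouts.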
Bit-true simulation verifies the finite word length choices in the VLSI implementation of a DSP application. Present-day bit-true simulation tools are time consuming. We elaborate a new approach in which the signal flow graph of the application is analyzed and then transformed utilizing the flexibility available on the simulation target. This global approach outperforms current tools by an order of magnitude in simulation time.
Since most DSP applications access large amounts of data stored in memory, a DSP code generator must minimize the addressing overhead. In this paper, we propose a method for addressing optimization in loop execution targeted toward DSP processors with an auto-increment/decrement feature in their address generation unit. Our optimization methods include a multi-phase data ordering and a graph-based address register allocation. The proposed approaches have been evaluated using a set of core algorithms targeted towards the TI TMS320C40 DSP processor. Experimental results show that our system is more effective than a commercial optimizing DSP compiler.
After being niche markets for several years, application markets for one-chip integration of large DRAMs and logic circuits are growing very rapidly, as the transition to 0.25 µm technologies will offer customers up to 128 Mbit of embedded DRAM and 500 Kgates of logic. However, embedded DRAM poses many technical challenges that remain to be solved. In this paper we address some of these technical issues in more detail.
Many ISA-level machine description languages have been introduced to support the automated development and retargeting of digital signal processor (DSP) software development tools. These languages have yet to move below the ISA-level and adequately address DSP pipeline issues. ISA-level bit-accurate models may be reasonable for small micro-controllers, but are inadequate when applied to complex high-performance DSPs. We introduce a new machine description language, RADL, which supports the automated generation of DSP programming tools. From RADL, we can generate production-quality tools including cycle- and phase-accurate simulators. RADL has explicit support for pipeline modeling, including delay slots, interrupts, hardware loops, hazards, and multiple interacting pipelines in a natural and intuitive way. RADL can represent both SIMD and MIMD instruction styles. We have coupled our language to an in-house tool-chain generator which is used to create production assemblers, simulators and compilers.
Design of large systems on a chip would be infeasible without the capability to flexibly adapt the system architecture to the application and the re-use of existing Intellectual Property (IP). This in turn requires the use of an appropriate methodology for system specification, architecture selection, IP integration and implementation generation. The goals of this work are: a) verification of the effectiveness of the POLIS HW/SW co-design methodology for the design of embedded systems for telecom applications; b) definition of a methodology for integrating system level IP libraries in this HW/SW co-design framework. Methodology evaluations have been carried out through the development of an industrial telecom system design, an ATM node server.
We describe an approach for incorporating cores into a system-level specification. The goal is to allow a designer to specify both custom behavior and pre-designed cores at the earliest design stages, and to refine both into implementations in a unified manner. The approach is based on experience with an actual application of a GPS-based navigation system. We use an object-oriented language for specification, representing each core as an object. We define three specification levels, and we evaluate the appropriateness of existing inter-object communication methods for cores. The approach forms the specification basis for the Dalton project.
The concept of retargetability enables compiler technology to keep pace with the increasing variety of domain-specific embedded processors. In order to achieve user retargetability, powerful processor modeling formalisms are required. Most of the recent modeling formalisms concentrate on horizontal, VLIW-like instruction formats. However, for encoded instruction formats with restricted instruction-level parallelism (ILP), a large number of ILP constraints might need to be specified, resulting in less concise processor models. This paper presents an HDL-based approach to processor modeling for retargetable compilation, in which ILP may be implicitly constrained. As a consequence, the formalism allows for concise models also for encoded instruction formats. The practical applicability of the modeling formalism is demonstrated by means of a case study for a complex DSP.
False path analysis has applications in a variety of computer science and engineering domains, for instance high-level synthesis, worst-case execution time estimation, and software testing. This paper describes a method to automate false path analysis, based on a control flow graph connected to a hierarchical BDD-based control representation. Through its ability to reason about predicate expressions involving arithmetic inequalities, this method overcomes certain limitations of previous approaches. Preliminary experimental results confirm its effectiveness.
Commonly used scheduling algorithms in high-level synthesis are not capable of sharing resources across process boundaries. This results in the usage of at least one resource per operation type and process. A new method is proposed to overcome these restrictions and to share high-cost or limited resources within a process group. This allows the use of fewer than one resource per operation type and process, while keeping the mutual independence of the involved processes. The method, which represents an extension of general scheduling algorithms, is not tied to a specific algorithm. It is explained using the common list scheduling algorithm and then applied to examples.
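For readers unfamiliar with the baseline, here is a minimal Python sketch of resource-constrained list scheduling applied to the merged operations of two processes. It only illustrates the effect of a shared resource pool on the schedule; it does not reproduce the proposed extension, in particular not how the mutual independence of the processes is preserved. The assumptions are unit-latency operations, priority given by insertion order, and an acyclic dependence graph whose operation types all have at least one unit; the name `list_schedule` and the example operations are hypothetical.

```python
def list_schedule(ops, resources):
    """Simplified list scheduler.

    ops:       {name: (op_type, [predecessor names])}, unit latency assumed
    resources: {op_type: units available in each control step}
    Returns    {name: control step}
    """
    start = {}
    step = 0
    while len(start) < len(ops):
        free = dict(resources)  # units still unused in this control step
        for name, (op_type, preds) in ops.items():
            if name in start:
                continue
            # An operation is ready once all predecessors finished earlier.
            ready = all(p in start and start[p] < step for p in preds)
            if ready and free.get(op_type, 0) > 0:
                start[name] = step
                free[op_type] -= 1
        step += 1
    return start


if __name__ == "__main__":
    # Two independent processes, each a multiply followed by an add.
    ops = {
        "p1_mul": ("mul", []),
        "p1_add": ("add", ["p1_mul"]),
        "p2_mul": ("mul", []),
        "p2_add": ("add", ["p2_mul"]),
    }
    print(list_schedule(ops, {"mul": 1, "add": 1}))  # one shared unit per type
    print(list_schedule(ops, {"mul": 2, "add": 2}))  # one unit per process
```

With one multiplier and one adder for both processes, the second process is delayed by one control step, which is the latency cost paid for the saved resources in this toy setting.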
The inevitable fluctuation in fabrication processes results in LSI chips with varying critical path delays even though all the chips are fabricated from the same design. Therefore, in LSI design, it is important to estimate what percentage of the fabricated chips will achieve the required performance level and to maximize that percentage. This paper presents a model and a method to analyze statistical delay of RT-level datapath designs. The method predicts the probability that the fabricated circuits will work at a user-specified clock period. Using the method, we can estimate a tight bound on the worst-case critical path delay of the circuits. Based on the delay analysis method, a high-level module binding algorithm that maximizes this probability is also proposed. Experimental results demonstrate that the proposed statistical delay analysis method leads to lower-cost or higher-performance designs than conventional delay analysis methods.
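To make the notion of "probability that a circuit works at a given clock period" concrete, the following Python sketch computes a timing yield under strong simplifying assumptions that are not taken from the paper: every path delay is Gaussian, and path delays are statistically independent. The function names and the example numbers are hypothetical.

```python
import math


def normal_cdf(x, mean, std):
    """P(X <= x) for a Gaussian random variable X."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def timing_yield(paths, clock_period):
    """Probability that every path meets the clock period.

    paths: list of (mean_delay, std_delay) pairs.
    Assumption: independent Gaussian path delays, so the joint
    probability is the product of the per-path probabilities.
    """
    probability = 1.0
    for mean, std in paths:
        probability *= normal_cdf(clock_period, mean, std)
    return probability


if __name__ == "__main__":
    paths = [(9.0, 0.5), (9.5, 0.4), (8.0, 1.0)]  # illustrative values in ns
    for period in (9.5, 10.0, 10.5):
        print(period, round(timing_yield(paths, period), 4))
```

A worst-case analysis would instead require the clock period to cover every path at its slowest corner; the statistical view shows how much yield is actually lost when the clock is tightened.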
A methodology for designing systems with concurrent error detection capability is introduced. The proposed approach consists of a functional architecture and a checking architecture to verify data computed by the functional one. The methodology reduces both redundancy and latency through hardware resources and data sharing, respectively.
In this paper, we propose the target board architecture of a rapid prototyping embedded system based on hardware/software codesign. The target board contains a TMS320C30 DSP processor and up to four Xilinx XC4025E FPGAs. Various communication channels between the C30 and the FPGAs are provided and a master-master computing paradigm is supported. HW/SW communication protocols, ranging from handshaking and batch to queue-controlled, as well as the corresponding interfaces, are described in VHDL and C, respectively, and can easily be added to the mapped design. A codesign implementation example based on a G.728 LD-CELP speech decoder shows that the proposed communication protocols and interfaces lead to very small time and circuitry overhead.
As it becomes possible to integrate an entire system on a chip, the processor architect is presented with an unprecedented opportunity to tailor the processor to the application at hand. To fully realize the potential of this technology, it is critical for us to be able to quickly produce an optimized processor design and associated programming tools. Languages and compilers have an important role in the revolution of processor IP in ASIC designs. Many of the advanced architectural ideas (parallelism, memory subsystem optimizations) originally developed for high-end processors are directly applicable to this environment. These ideas often require nontrivial compiler support. Opportunities for application-specific optimizations can be exposed by means of more expressive programming languages. Finally, program analyses can extract program characteristics that can be used directly in guiding the architecture customization process.
This paper presents an application-specific, heterogeneous multiprocessor synthesis system, named HeMPS, that combines a form of Evolutionary Computation known as Differential Evolution with a scheduling heuristic to search the design space efficiently. We demonstrate the effectiveness of our technique by comparing it to similar existing systems. The proposed strategy is shown to be faster than recent systems on large problems while providing equivalent or improved final solutions.
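For orientation, the following Python sketch shows the generic Differential Evolution loop (the classic DE/rand/1/bin variant) on a continuous toy cost function. It is not HeMPS: the encoding of multiprocessor designs, the scheduling heuristic, and the parameter values used there are not given in the abstract, so everything here (`differential_evolution`, `f`, `cr`, the sphere function) is an illustrative assumption.

```python
import random


def differential_evolution(cost, bounds, pop_size=20, generations=200,
                           f=0.5, cr=0.9, seed=0):
    """Minimize cost(x) over box bounds with DE/rand/1/bin."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [cost(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than the target i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one gene comes from the mutant
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < cr:
                    value = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    lo, hi = bounds[d]
                    value = min(max(value, lo), hi)  # clip to the search box
                else:
                    value = pop[i][d]
                trial.append(value)
            trial_score = cost(trial)
            # Greedy selection: the trial replaces the target if it is no worse.
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


if __name__ == "__main__":
    sphere = lambda x: sum(v * v for v in x)  # toy cost, minimum 0 at the origin
    solution, value = differential_evolution(sphere, [(-5.0, 5.0)] * 3)
    print(solution, value)
```

In a synthesis setting the cost function would typically decode a candidate vector into an architecture and mapping and then evaluate it, for example by running a scheduling heuristic; that coupling is what the abstract describes, while the loop above is only the search skeleton.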
This paper describes an attempt to bring together the many different system design flows existing in architecture and system design technology research into a more abstract but unifying meta flow. Many existing system and architecture design flows have a strong resemblance and unnecessary overlap. Mainly due to the lack of a common and consistent terminology coupled to a common reference basis, it is now nearly impossible to compare and reuse (sub)steps. In addition, there is too strong a separation between research in different communities. To alleviate this problem, we introduce a more abstract but unifying meta flow which attempts to bridge the gap between the existing flows. From this meta flow, a particular design flow can be instantiated for a given application (domain) by leaving out the non-required stages/steps, by selecting a (sub)step sequence which is compatible with the partial meta flow order, and by selecting the appropriate technique for all remaining (sub)steps (e.g. the type of scheduler). This paper focuses on the principles at the task- and instruction-level abstractions. It also provides an illustration of the power of the meta flow principles for a realistic multi-media compression demonstrator from the MPEG4 context.
This paper describes a methodology for synthesizing the data-path of a Very Long Instruction Word (VLIW) based Video Signal Processor (VSP). Offering both performance and programmability, VSPs are important for their roles in digital video applications, which are omnipresent in today's world. Among many different architectures, VLIW is becoming increasingly popular and widely used due to its efficiency in exploiting the high degree of parallelism inherent in multimedia applications. While architectural synthesis of embedded systems has been studied in depth, little literature has addressed similar issues for VLIW-based VSPs. Using an MPEG-2 video encoder as a case study, we present a combined application of trace-driven simulation and performance estimation in the data-path synthesis of a VLIW VSP. Results show that our estimations are quite precise and helpful, and that they are orders of magnitude faster than simulation.
Complex system specifications are often hierarchically composed of several subsystems. Each subsystem contains one or more processes. In order to provide optimization across different levels of hierarchy, a synchronicity analysis of the concerned processes has to be performed during high-level synthesis. The first step is the generation of a condensed graph representation of the inter-process communication. This graph is then utilized to detect inter-process communication which can be used to represent synchronization points between two or more processes. A synchronization point represents the starting point of an interval in which the communicating processes run synchronously. This interval is limited by unbounded data-dependent loops, denoted as de-synchronization points. As a result, different processes can only share resources in such an interval.
This paper presents a codesign approach which incorporates communication protocol selection as a design parameter within hardware/software partitioning. The presented approach takes into account data transfer rates depending on communication protocol types and configurations, and different operating frequencies of system components, i.e. CPUs, ASICs, and busses. It also takes into account the timing and area influences of drivers and driver calls needed to perform the communication. The approach is illustrated by a number of design space exploration experiments which use models of the PCI and USB communication protocols.
Reducing power dissipation is becoming more important in the design of embedded systems. Core-based system design opens up the opportunity for exploring different bus interfaces in order to optimize for reduced power. We give a first approach for exploring a range of possible bus configurations, such as width and coding schemes, for a given set of communication channels. Our approach uses power estimation formulas so that each configuration can be evaluated quickly. We use this approach to explore different bus interfaces for a real GPS navigation system in order to select the optimal bus interface for minimum power consumption.
In this paper, we propose instruction encoding techniques for embedded system design, which encode immediate fields of instructions to reduce the size of an instruction memory. Although our proposed techniques require an additional decoder for the encoded immediate values, experimental results demonstrate the effectiveness of our techniques to reduce the chip area.
Designing a cost-effective superscalar architecture for x86-compatible microprocessors is a challenging task in terms of both technical difficulty and commercial value. One of the important design issues is the measurement of the distribution of functional unit usage and the micro-operation level parallelism (MLP), which together determine the proper allocation of functional units in the superscalar architecture. To obtain such measurements, an x86 instruction set CAD system, x86 Workshop, has been developed, which consists of both instruction set analysis and optimization tools. x86 Workshop has been applied to analyze several popular Windows 95 applications such as Word, Excel, and Communicator. The MLP and distribution of functional unit usage are measured for these applications. The measurements are used to evaluate several existing x86 superscalar processors and to suggest future extensions.
Due to the limited amount of memory resources in embedded systems, minimizing the memory requirements is an important goal of software synthesis. This paper presents a set of techniques to reduce the code and data size for software synthesis from graphical DSP programs based on the synchronous dataflow (SDF) model. By sharing the kernel code among multiple instances of a block, we can reduce the code size below that of the single appearance schedule. In addition, a systematic approach is presented for giving up single appearance schedules in order to reduce the data buffer requirements. Experimental results from two real examples demonstrate the significance of the proposed techniques.
We present a tool for synthesis of pipelined implementations of hardware-software systems. The tool uses iterative hardware-software partitioning and pipelined scheduling to obtain optimal partitions which satisfy the timing and area constraints. The partitioner uses a branch and bound approach with a unique objective function which minimizes the initiation interval of the final design. It takes communication time and hardware sharing into account. This paper also presents techniques for generating a good initial solution and for bounding the search space of the partitioning algorithm. A candidate partition is evaluated by generating its pipelined schedule. The scheduler uses list-based scheduling and a retiming transformation to optimize the initiation interval, number of pipeline stages and memory requirements of a particular design alternative. The effectiveness of the tool is demonstrated by experimentation.
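The trade-off between single appearance schedules and buffer size can be seen on a single SDF edge. The Python sketch below simulates a firing sequence for a producer `A` and a consumer `B` and reports the peak number of tokens on the edge. The rates (2 produced, 3 consumed), the two schedules, and the name `max_buffer` are illustrative assumptions and are not taken from the paper.

```python
def max_buffer(schedule, produce, consume):
    """Peak token count on one edge A -> B for a given firing sequence.

    schedule: string of 'A' (producer fires) and 'B' (consumer fires)
    produce:  tokens written per firing of A
    consume:  tokens read per firing of B
    """
    tokens = 0
    peak = 0
    for actor in schedule:
        if actor == "A":
            tokens += produce
        elif actor == "B":
            if tokens < consume:
                raise ValueError("B fired without enough input tokens")
            tokens -= consume
        else:
            raise ValueError("unknown actor: " + actor)
        peak = max(peak, tokens)
    return peak


if __name__ == "__main__":
    # Balance equation 2 * r_A = 3 * r_B gives r_A = 3, r_B = 2 per iteration.
    print(max_buffer("AAABB", 2, 3))  # single appearance schedule (3A)(2B)
    print(max_buffer("AABAB", 2, 3))  # interleaved schedule, A and B appear twice
```

The single appearance schedule needs room for six tokens, while the interleaved one needs only four, at the price of each actor appearing more than once in the schedule. Sharing one kernel body among those appearances is one way to limit the resulting code growth.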
Earlier work has demonstrated that partitioning one large behavioral process into smaller ones before synthesis can yield numerous advantages, such as reduced synthesis runtime, easier package constraint satisfaction, reduced power consumption, improved performance, and hardware/software tradeoffs. In this paper, we describe a novel three-step functional partitioning methodology for automatically dividing a large behavioral process into mutually-exclusive subprocesses, and we define the problems and our solutions for each step. The three steps are granularity selection, pre-clustering, and N-way assignment. We refer to experiments throughout that demonstrate the effectiveness of the solutions.
With the decreasing feature sizes during VLSI fabrication and the dominance of interconnect delay over that of gates, control logic and wiring no longer have a negligible impact on delay and area. The need thus arises for developing techniques and tools to redesign incrementally to eliminate performance bottlenecks. Such a redesign effort corresponds to incrementally modifying an existing schedule obtained via high-level synthesis. In this paper we demonstrate that applying architectural retiming, a technique for pipelining latency-constrained circuits, results in incrementally modifying an existing schedule. Architectural retiming reschedules fine grain operations (ones that have a delay equal to or less than one clock cycle) to occur in earlier time steps, while modifying the design to preserve its correctness.