CODES 2002 ABSTRACTS

Sessions: [1] [2] [3] [4] [5] [6] [7]


Session 1: Advances in System Specification and System Design Frameworks

Codesign-Extended Applications [p. 1]
Brian Grattan, Greg Stitt, and Frank Vahid

We challenge the widespread assumption that an embedded system's functionality can be captured in a single specification and then partitioned among software and custom hardware processors. The specification of some functions in software is very different from the specification of the same function in hardware - too different to conceive of automatically deriving one from the other. We illustrate this concept using a digital camera example. We introduce the idea of codesign-extended applications to deal with the situation, wherein critical functions are written in multiple versions, and integrated such that simple compiler/synthesis flags instantiate a particular version along with the necessary control and communication behavior. By capturing a specification as a codesign-extended application, a designer enables smooth migration among platforms with increasing amounts of on-chip configurable logic.
Keywords
Hardware/software partitioning, hardware/software cospecification, configurable logic, system-on-a-chip, platform-based design.

Algorithmic Transformation Techniques for Efficient Exploration of Alternative Application Instances [p. 7]
Todor Stefanov, Bart Kienhuis, and Ed Deprettere

Following the Y-chart paradigm for designing a system, an application and an architecture are modeled separately and mapped onto each other in an explicit design step. Next, a performance analysis for alternative application instances, architecture instances and mappings has to be done, thereby exploring the design space of the target system. Deriving alternative application instances is not trivially done. Nevertheless, many instances of a single application exist that are worth to be derived for exploration. In this paper, we present algorithmic transformation techniques for systematic and fast generation of alternative application instances that express task-level concurrency hidden in an application in some degree of explicitness. These techniques help a system designer to speedup significantly the design space exploration process.
Keywords
system-level design, design space exploration, application instances, algorithmic transformations

Concurrent Execution Semantics and Sequential Simulation Algorithms for Metropolis Meta-Model [p. 13]
Felice Balarin, Luciano Lavagno, Claudio Passerone, Alberto Sangiovanni-Vincentelli, Yosinori Watanabe, and Guang Yang

This paper presents the simulation techniques that are available in Metropolis, an inter-disciplinary research project that develops a design methodology, supported by a comprehensive design environment and tool set, for embedded systems. System behavior is non-deterministic in general, especially in the beginning of the design process, when several key decision, such as the mapping on an implementation platform, have not yet been made, and thus the traces obtainable by simulation are not unique even under the same input sequence. One may want to visit as many traces as possible for regression tests at the final stage of designs, or may just need one valid trace for a quick validation of the design at an early stage. Our techniques can adapt to these different objectives easily. They are also platform-independent in that simulation using different languages, such as SystemC 2.0, Java, and C++ with a thread library, are possible. This feature is important for co-simulation between designs captured in Metropolis and those that have been already designed in other languages.

The Design Context of Concurrent Computation Systems [p. 19]
JoAnn M. Paul, Christopher M. Eatedali, and Donald E. Thomas

Design for performance-optimization of programmable, semicustom SoCs requires the ability to model and optimize the behavior of the system as a whole. Neither the hardware-testbench style nor the software-benchmark style is adequate to capture completely the design interactions required in concurrent software-on-hardware systems. We use a formal relationship between a computer system design content and its external context to motivate the need to consider a more effective modeling framework to which concurrent software-on-hardware computer systems are designed.
Keywords
Hardware/Software Codesign, Modeling, Simulation, Concurrent Computation, Digital System Design

A Language for Multiple Models of Computation [p. 25]
Dag Björklund and Johan Lilius

We introduce a new kernel language for modeling hardware/software systems, adopting multiple heterogeneous models of computation. The language has formal operational semantics, and is well suited for model checking, code synthesis etc. For different blocks of code, different scheduling policies can be applied, to reflect the different interpretations of e.g. parallelism in different models of computation. The user can add his own scheduling policies, to use or explore different models of computation.


Session 2: System Design Methods: Analysis and Verification

FPGA Resource and Timing Estimation from Matlab Execution Traces [p. 31]
Per Bjuréus, Mikael Millberg, and Axel Jantsch

We present a simulation-based technique to estimate area and latency of an FPGA implementation of a Matlab specification. During simulation of the Matlab model, a trace is generated that can be used for multiple estimations. For estimation the user provides some design constraints such as the rate and bit width of data streams. In our experience the runtime of the estimator is approximately only 1/10 of the simulation time, which is typically fast enough to generate dozens of estimates within a few hours and to build cost-performance trade-off curves for a particular algorithm and input data. In addition, the estimator reports on the scheduling and resource binding used for estimation. This information can be utilized not only to assess the estimation quality, but also as first starting point for the final implementation.
Keywords
Design exploration, Matlab, FPGA, Estimation

Worst-Case Performance Analysis of Parallel, Communicating Software Processes [p. 37]
A. Siebenborn, O. Bringmann, and W. Rosenstiel

In this paper we present a method to perform static timing analysis of SystemC models, that describe parallel, communicating software processes.The paper combines a worst-case execution time (WCET) analysis with an analysis of the communication behavior. The communication analysis allows the detection of points, where the program ow of two or more concurrent processes are synchronized. This knowledge allows the determination of the worst-case response time (WCRT). The method does not rely on restrictions on the system design to prevent deadlocks or data loss. Furthermore possible deadlocks and data loss can be detected during the analysis.

Symbolic Model Checking of Dual Transition Petri Nets [p. 43]
Mauricio Varea, Bashir M. Al-Hashimi, Luis A. Cortés, Petru Eles, and Zebo Peng

This paper describes the formal verification of the recently introduced Dual Transition Petri Net (DTPN) models [12], using model checking techniques. The methodology presented addresses the symbolic model checking of embedded systems behavioural properties, expressed in either computation tree logics (CTL) or linear temporal logics (LTL). The embedded system specification is given in terms of DTPN models, where elements of the model are captured in a four-module library which implements the behaviour of the model. Key issues in the development of the methodology are the heterogeneity and the nondeterministic nature of the model. This is handled by introducing some modifications in both structure and behaviour of the model, thus reducing the points of nondeterminism. Several features of the methodology are discussed and two examples are given in order to show the validity of the model.

Simulation Bridge: A Framework for Multi-Processor Simulation [p. 49]
G.D. Nagendra, V.G. Prem Kumar, and B.S. Sheshadri Chakravarthy

Multi-processor solutions in the embedded world are being designed to meet the ever increasing computational demands of the emerging applications. Such architectures comprise two or more processors (often a mix of general purpose and digital signal processors) together with a rich peripheral mix to provide a high performance computational platform. While there are many simulation solutions in the industry available to address the system partitioning issues and also the verification of HW-SW interactions in these complex systems, there are very few solutions targeted towards the SW application developers' needs.
The primary concern of the SW application developers is to debug and optimize their code. Hence, cycle accuracy and performance of the simulation solution becomes the key enablers. Desired observability and controllability of the models are additional careabouts. Secondly, application developers are more comfortable at instruction level simulations than they are with RTL or gate level simulation. These specific requirements have a bearing on the choices in the simulation solutions.
This paper describes the design of a generic, C based multi-processor instruction set simulator framework in the context of the above parameters. This framework, termed the "simulation bridge", facilitates highly accurate, yet efficient simulation. The SimBridge performs clock correct lock-step simulation of the models in the architecture using a global simulation engine that handles both intra-processor and inter-processor communication in a homogeneous fashion. It addresses the multiple key issues of execution control, synchronization, connectivity and communication.
The paper concludes with the performance analysis of the SimBridge in an experimental test setup as well as in the Texas Instruments (TI) TMS320C54x-based simulators.
Keywords: Multiprocessor simulation, Instruction Set Simulator, Simulation Framework.


Session 3: Design Space Exploration and Architectural Design of HW/SW Systems

Metrics for Design Space Exploration of Heterogeneous Multiprocessor Embedded Systems [p. 55]
Donatella Sciuto, Fabio Salice, Luigi Pomante, and William Fornaciari

This paper considers the problem of designing heterogeneous multiprocessor embedded systems. The focus is on a step of the design flow: the definition of innovative metrics for the analysis of the system specification to statically identify the most suitable processing elements class for each system functionality. Experimental results are also included, to show the applicability and effectiveness of the proposed methodology.
Keywords
Heterogeneous Multiprocessor Embedded Systems, Metrics for Hw/Sw Partitioning, System-Level Design.

Fast Processor Core Selection for WLAN Modem Using Mappability Estimation [p. 61]
Juha-Pekka Soininen, Jari Kreku, Yang Qu, and Martti Forsel

Mappability metric and a novel method for evaluating the goodness of processor core and algorithm combinations are introduced. The new mappability concept is an addition to performance and cost metrics used in existing codesign and system synthesis approaches. The mappability estimation is based on the analysis of the correlation or similarity of algorithm and core architecture characteristics. It allows fast design space exploration of core architectures and mappings with little modeling effort. The method is demonstrated by analyzing suitable processor core architectures for baseband algorithms of the WLAN modem. 140400 architecture-algorithm pairs were analyzed in total and the estimated results were similar to the results of more detailed evaluations. The method is not, however, limited to the WLAN modem, but is applicable for digital signal processing in general.
Keywords
mappability estimation, processor architecture evaluation, codesign, cost function

Multi-Objective Design Space Exploration Using Genetic Algorithms [p. 67]
Tony Givargis and Maurizio Palesi

In this work, we provide a technique for efficiently exploring a parameterized system-on-a-chip (SoC) architecture to find all Paretooptimal configurations in a multi-objective design space. Globally, our approach uses a parameter dependency model of our target parameterized SoC architecture to extensively prune non-optimal subspaces. Locally, our approach applies Genetic Algorithms (GAs) to discover Pareto-optimal configurations within the remaining design points. The computed Pareto-optimal configurations will represent the range of performance (e.g., timing and power) tradeoffs that are obtainable by adjusting parameter values for a fixed application that is mapped on the parameterized SoC architecture. We have successfully applied our technique to explore Pareto-optimal configurations for a number of applications mapped on a parameterized SoC architecture.
Keywords
Design space exploration, genetic algorithms, low power design, Pareto-optimal configurations, and system-on-a-chip architectures

Scratchpad Memory : A Design Alternative for Cache On-Chip Memory in Embedded Systems [p. 73]
Rajeshwari Banakar, Stefan Steinke, Bo-Sik Lee, M. Balakrishnan, and Peter Marwedel

In this paper we address the problem of on-chip memory selection for computationally intensive applications, by proposing scratch pad memory as an alternative to cache. Area and energy for different scratch pad and cache sizes are computed using the CACTI tool while performance was evaluated using the trace results of the simulator. The target processor chosen for evaluation was AT91M40400. The results clearly establish scratchpad memory as a low power alternative in most situations with an average energy reduction of 40%. Further the average area-time reduction for the scratchpad memory was 46% of the cache memory.

Hardware Support for Real-Time Embedded Multiprocessor System-on-a-Chip Memory Management [p. 79]
Mohamed Shalan and Vincent J. Mooney III

The aggressive evolution of the semiconductor industry - smaller process geometries, higher densities, and greater chip complexity Ö has provided design engineers the means to create complex, high-performance Systems-on-a-Chip (SoC) designs. Such SoC designs typically have more than one processor and huge memory, all on the same chip. Dealing with the global on-chip memory allocation/de-allocation in a dynamic yet deterministic way is an important issue for the upcoming billion transistor multiprocessor SoC designs. To achieve this, we propose a memory management hierarchy we call Two-Level Memory Management. To implement this memory management scheme - which presents a paradigm shift in the way designers look at on-chip dynamic memory allocation - we present a System-on-a-Chip Dynamic Memory Management Unit (SoCDMMU) for allocation of the global on-chip memory, which we refer to as Level Two memory management (Level One is the operating system management of memory allocated to a particular on-chip Processing Element). In this way, processing elements (heterogeneous or non-heterogeneous hardware or software) in an SoC can request and be granted portions of the global memory in a fast and deterministic time (for an example of a four processing element SoC, the dynamic memory allocation of the global onchip memory takes sixteen cycles per allocation/deallocation in the worst case). In this paper, we show how to modify an existing Real-Time Operating System (RTOS) to support the new proposed SoCDMMU. Our example shows a multiprocessor SoC that utilizes the SoCDMMU has 440% overall speedup of the application transition time over fully shared memory that does not utilize the SoCDMMU.
Keywords
System-on-a-Chip, dynamic memory management, real-time systems, embedded systems, SoCDMMU, two-level memory management, Atalanta, real-time operating systems.

Large Exploration for HW/SW Partitioning of Multirate and Aperiodic Real-Time Systems [p. 85]
Abdenour Azzedine, Jean-Philippe Diguet, and Jean-Luc Philippe

This paper addresses the domain of fine and coarse grain HW / SW codesign for Real-Time System On-Chip. We propose a new method for the real-time scheduling and the HW / SW partitioning of multi-rate or aperiodic tasks. The large design space exploration is based on parallelism/delay trade-off curves.
Keywords
HW / SW Codesign, system design exploration, RT scheduling


Session 4: Co-Design Architecture and Synthesis

Program Slicing for Codesign [p. 91]
Jeffry T. Russell

Program slicing is a software analysis technique that computes the set of operations in a program that may affect the computation at a particular operation. Interprocedural slicing techniques have separately addressed concurrent programs and hardware description languages. However, application of slicing to codesign of embedded systems requires dependence analysis across the hardware-software interface. We extend program slicing for a codesign environment. Hardware-software interactions common in component-based systems are mapped to previously introduced dependences, including the interference and signal dependences. We introduce a novel access dependence that models a memory access side effect that results in activation of a process. A slicing algorithm that incorporates this variety of dependences is described.

Compiler-Directed Customization of ASIP Cores [p. 97]
T. Vinod Kumar Gupta, Roberto E. Ko, and Rajeev Barua

This paper presents an automatic method to customize embedded application-specific instruction processors (ASIPs) based on compiler analysis. ASIPs, also known as embedded soft cores, allow certain hardware parameters in the processor to be customized for a specific application domain. They offer low design cost as they use pre-designed and verified components. Our design goal is choosing parameter values for fastest runtime within a given silicon area budget for a particular application set. Present-day technologies for choosing parameter values rely on exhaustive simulation of the application set on all possible combinations of parameter values š a time-consuming and non-scalable procedure. We propose a compiler-based method that automatically derives the optimal values of parameters without simulating any configuration. Further, we expand the space of parameters that can be changed from the limited set today, and evaluate the importance of each. Results show that for our benchmarks, the runtimes for different configurations are predicted with an average error of 2.5%. In the two area constrained customization problem we evaluate, our method is able to recommend the same configuration that is recommended by brute force exhaustive simulation 1
Keywords: customization, embedded, soft cores, ASIP

A Study of CodePack: Optimizing Embedded Code Space [p. 103]
Avishay Orpaz and Shlomo Weiss

CodePack is a code compression system used by IBM in its PowerPC family of embedded processors. CodePac combines high compression capability along with fast and simple decoding hardware. IBM did not release much information about the design of the system and the influence of various design parameters on its performance. In our work we will present the system and its design parameters and investigate how each affects its performance on the compression rate and decoder complexity. We also present a novel efficient algorithm to optimize the class structure of the system.
Keywords: Embedded Systems, CodePack, Code Compression, Optimization, Embedded Software

A Novel Codesign Approach Based on Distributed Virtual Machines [p. 109]
Christian Kreiner, Christian Steger, Egon Teiniker, and Reinhold Weiss

This paper describes a hardware/software codesign approach for the design of embedded systems based on digital signal processors and FPGAs. Our approach is based on distributed virtual machines for simulation and verification of the application on a Linux cluster and for running the application on different target architectures (DSPs, FPGAs) as well. The main focus is the description of the virtual machine, which was designed to make DSP applications portable across different platforms while maintaining optimal code.

Optimization and Synthesis for Complex Reactive Embedded Systems by Incremental Collapsing [p. 115]
Massimiliano L. Chiodo

We propose a software synthesis procedure for reactive real-time embedded systems. In our approach, control parts of the system are represented in a decomposed form enabling more complex control structures to be represented. We propose a synthesis procedure for this representation that incrementally aggregates elements of the representation while keeping the resulting code size under tight control. This method combined with heuristic strategies works very well on real-life designs and demonstrates the potential to produce results that challenge or beat hand-written implementations.
Keywords
Software Synthesis, Finite-state Machines, Embedded Systems, Real-time Systems


Session 5: System Partitioning and Timing Analysis

Transformation of SDL Specifications for System-Level Timing Analysis [p. 121]
Marek Jersak, Kai Richter, Rafik Henia, Rolf Ernst, and Frank Slomka

Complex embedded systems are typically specified using multiple domain-specific languages. After code-generation, the implementation is simulated and tested. Validation of non-functional properties, in particular timing, remains a problem because full test coverage cannot be achieved for realistic designs. The alternative, formal timing analysis, requires a system representation based on key application and architecture properties. These properties must first be extracted from a system specification to enable analysis. In this paper we present a suitable transformation of SDL specifications for system-level timing analysis. We show ways to vary modeling accuracy in order to apply available formal techniques. A practical approach utilizing a recently developed system model is presented.

A Strongly Polynomial-Time Algorithm for Over-Constraint Resolution [p. 127]
Ali Dasdan

A system of binary linear constraints or difference constraints (SDC) contains a set of variables that are constrained by a set of unary or binary linear inequalities. In such diverse applications as scheduling, interface timing verification, real-time systems, multimedia systems, layout compaction, and constraint satisfaction, SDCs have successfully been used to model systems of both temporal and spatial constraints. Formally, SDCs are modeled by weighted, directed (constraint) graphs. The consistency of an SDC means that there is at least one instantiation of its variables that satisfies all its constraints. It is well known that the absence of positive cycles in a graph implies the consistency of the corresponding SDC, so the consistency can be decided in strongly polynomial time. If a SDC is found to be inconsistent, it has to be repaired to make it consistent. This task is equivalent to removing positive cycles from the corresponding graph. All the previous algorithms for this task take time proportional to the number of positive cycles in the graph, which can grow exponentially. In this paper, we propose a strongly polynomial-time algorithm, i.e., an algorithm whose time complexity is polynomial in the size of the graph. Our algorithm takes in a graph and returns a list of edges and the changes in their weights to remove all the positive cycles from the graph. We experimentally quantify the length of the edge list and the running time of the algorithm on large benchmark graphs. We show that both are very small, so our algorithm is practical.
Keywords
Behavioral synthesis, high-level synthesis, scheduling, timing constraints, rate analysis, constraint satisfaction.

Hardware-Software Cosynthesis of Multi-Mode Multi-Task Embedded Systems with Real-Time Constraints [p. 133]
Hyunok Oh and Soonhoi Ha

An embedded system is called multi-mode when it supports multiple applications by dynamically reconfiguring the system functionality. This paper proposes a hardware-software cosynthesis technique for multi-mode multi-task embedded systems with real-time constraints. The cosynthesis problem involves three subproblems: selection of appropriate processing elements, mapping and scheduling of function modules to the selected processing elements, and schedule analysis. The proposed cosynthesis framework defines an iteration loop of three steps that solve the subproblems separately. One of the key benefits of such a modular approach is extensibility and adaptability. Moreover, unlike the previous approaches, the proposed technique considers task sharing between modes and hardware sharing between tasks at the same time. We demonstrate the usefulness of the proposed technique with a realistic multimode embedded system that supports three modes of operation with 5 different tasks.
Keywords
Hardware-software cosynthesis, multi-mode, multi-task

Design of Multi-Tasking Coprocessor Control for Eclipse [p. 139]
Martijn J. Rutten, Jos T.J. van Eijndhoven, and Evert-Jan D. Pol

Eclipse defines a heterogeneous multiprocessor architecture template for data-dependent stream processing. Intended as a scalable and flexible subsystem of forthcoming media-processing systems-on-a-chip, Eclipse combines application configuration flexibility with the efficiency of function-specific hardware, or coprocessors. To facilitate reuse, Eclipse separates coprocessor functionality from generic support that addresses multi-tasking, inter-task synchronization , and data transport. Five interface primitives accomplish this separation. The interface facilitates the design of coprocessors that require complex control to handle data-dependent I/O, saving/restoring task state upon task switches, and pipelined processing. This paper presents how this interface enables the design of such reusable yet cost-effective coprocessors.

Hardware-Software Bipartitioning for Dynamically Reconfigurable Systems [p. 145]
Daler N. Rakhmatov and Sarma B.K. Vrudhula

The main unique feature of dynamically reconfigurable systems is the ability to time-share the same reconfigurable hardware resources. However, the energy-delay cost associated with reconfiguration must be accounted for during hardware-software partitioning. We propose a method for mapping nodes of an application control flow graph either to software or reconfigurable hardware, explicitly targeting minimization of the energy-delay cost due to both computation and configuration. The addressed problems are energy-delay product minimization, delay-constrained energy minimization, and energy-constrained delay minimization. We show how these problems can be tackled by using network flow techniques, after transforming the original control flow graph into an equivalent network. If there are no constraints, as in the case of the energy-delay product minimization, we are able to generate an optimal solution in polynomial time.
Keywords: hardware-software partitioning, reconfigurable systems, network flows.

HW/SW Partitioning and Code Generation of Embedded Control Applications on a Reconfigurable Architecture Platform [p. 151]
Massimo Baleani, Frank Gennari, Yunjian Jiang, Yatish Patel, Robert K. Brayton, and Alberto Sangiovanni-Vincentelli

This paper studies the use of a reconfigurable architecture platform for embedded control applications aimed at improving real time performance. The hw/sw codesign methodology from POLIS is used. It starts from high-level specifications, optimizes an intermediate model of computation (Extended Finite State Machines) and derives both hardware and software, based on performance constraints. We study a particular architecture platform, which consists of a general purpose processor core, augmented with a reconfigurable function unit and data-path to improve run time performance. A new mapping flow and algorithms to partition hardware and software are proposed to generate implementations that best utilize this architecture. Encouraging preliminary results are shown for automotive electronic control examples.
Keywords
CSoC, hw/sw co-design, code generation


Session 6: Energy Efficiency in System Design

Fast System-Level Power Profiling for Battery-Efficient System Design [p. 157]
Kanishka Lahiri, Anand Raghunathan, and Sujit Dey

An increasing disparity between the energy requirements of portable electronic devices and available battery capacities is driving the development of new design methodologies for battery-efficient systems. A crucial requirement for battery efficient system design is to be able to efficiently and accurately estimate battery life for candidate system architectures. Recently, efficient techniques have been developed to estimate battery life under given profiles of system power consumption over time. However, techniques for generating the power profiles themselves are either too cumbersome for system level exploration, or too inaccurate for battery life estimation. In this paper, we present a new methodology for efficiently and accurately generating power profiles for different system-level architectures. The designer can specify the manner in which (i) system tasks are mapped to a set of available implementations, and (ii) system communications are mapped to a specified communication architecture. For a given architecture, a power profile is automatically generated by analyzing an abstract representation of the system execution traces, while taking into account the selected implementations of the system‰s computations and communications. Experiments conducted on the design of an IEEE 802.11 MAC processor indicate that the power profiling approach offers run times that are several orders of magnitude lower than a simulation based power profiling technique, while sustaining negligible loss of accuracy (average profiling error was observed to be less than 3.4%).

Energy Savings Through Compression in Embedded Java Environments [p. 163]
G. Chen, M. Kandemir, N. Vijaykrishnan, M.J. Irwin, and W. Wolf

Limited energy and memory resources are important constraints in the design of an embedded system. Compression is an useful and widely employed mechanism to reduce the memory requirements of the system. As the leakage energy of a memory system increases with its size and because of the increasing contribution of leakage to overall system energy, compression also has a significant effect on reducing energy consumption. However, storing compressed data / instructions has a performance and energy overhead associated with decompression at runtime. The underlying compression algorithm, the corresponding implementation of the decompression and the ability to reuse decompressed information critically impact this overhead. In this paper, we explore the influence of compression on overall memory energy using a commercial embedded Java virtual machine (JVM) and a customized compression algorithm. Our results show that compression is effective in reducing energy even when considering the runtime decompression overheads for most applications.
Keywords
Compression, Leakage Energy, Embedded Java

Communication Speed Selection for Embedded Systems with Networked Voltage-Scalable Processors [p. 169]
Jinfeng Liu, Pai H. Chou, and Nader Bagherzadeh

High-speed serial network interfaces are gaining wide use in connecting multiple processors and peripherals in modern embedded systems, thanks to their size advantage and power efficiency. Many such interfaces also support multiple data rates, and this ability is opening a new dimension in the power/performance trade-offs between communication and computation on voltage scalable embedded processors. To minimize energy consumption in these networked architectures, designers must not only perform functional partitioning but also carefully balance the speeds between communication and computation, which compete for time and energy. Minimizing communication power without considering computation may actually lead to higher energy consumption at the system level due to elongated on-time as well as lost opportunities for dynamic voltage scaling on the processors. We propose a speed selection methodology for globally optimizing the energy consumption in embedded networked architectures. We formulate a multidimensional optimization problem by modeling communication dependencies between processors and their timing budgets. This enables engineers to systematically solve the problem of optimal speed selection for global energy reduction. We demonstrate the effectiveness of our speed selection approach with an image processing application mapped onto a multi-processor architecture with a multi-speed Ethernet.

Pruning-Based Energy-Optimal Device Scheduling for Hard Real-Time Systems [p. 175]
Vishnu Swaminathan and Krishnendu Chakrabarty

Dynamic Power Management (DPM) provides a simple, elegant and flexible method for reducing energy consumption in embedded real-time systems. However, I/O-centric DPM techniques have been studied largely for non-real-time environments. We present an offline device scheduling technique for real-time systems that generates an energy-optimal device schedule for a given task set while guaranteeing that all real-time deadlines are met. Our method takes as inputs a task set and a device-usage list for each task, and it schedules the tasks such that the energy consumed by the set of I/O devices is minimized. We compare our algorithm to an exhaustive enumeration method and show that the proposed algorithm is very efficient in terms of memory usage and computation time. We also present case studies to show that I/O-centric DPM methods can result in significant energy savings.

Energy Frugal Tags in Reprogrammable I-Caches for Application-Specific Embedded Processors [p. 181]
Peter Petrov and Alex Orailoglu

In this paper we present a software-directed customization methodology for minimizing the energy dissipation in the instruction cache, one of the most power consuming microarchitectural components of high-end embedded processors. We target particularly the instruction cache tag operations and show how an exceedingly small number of tag bits, if any, are needed to compute the miss/hit behavior for the most frequently executed application loops, thus minimizing the energy needed to perform the tag reads and comparisons. The proposed methodology exploits the fact that the code layout structure of the program loops can be identified after compile and link, and that it typically resides in a very confined memory location, for which very few bits from the effective address can be utilized as a tag. Subsequently, we present an efficient, programmable implementation to apply the suggested energy minimization technique. The experimental results show a significant decrease in energy dissipation for a set of real-life applications.


Session 7: System Design Methods: Scheduling Advances

Holistic Scheduling and Analysis of Mixed Time/Event-Triggered Distributed Embedded Systems [p. 187]
Traian Pop, Petru Eles, and Zebo Peng

This paper deals with specific issues related to the design of distributed embedded systems implemented with mixed, event-triggered and time-triggered task sets, which communicate over bus protocols consisting of both static and dynamic phases. Such systems are emerging as the new standard for automotive applications. We have developed a holistic timing analysis and scheduling approach for this category of systems. We have also identified several new design problems characteristic to such hybrid systems. An example related to bus access optimization in the context of a mixed static/dynamic bus protocol is presented. Experimental results prove the efficiency of such an optimization approach.

Locality-Conscious Process Scheduling in Embedded Systems [p. 193]
I. Kadayif, M. Kandemir, I. Kolcu and G. Chen

In many embedded systems, existence of a data cache might influence the effectiveness of process scheduling policy significantly. Consequently, a scheduling policy that takes inter-process data reuse into account might result in large performance benefits. In this paper, we focus on array-intensive embedded applications and present a locality-conscious scheduling strategy where we first evaluate the potential data reuse between processes, and then, using the results of this evaluation, select an order for process executions. We also show how process codes can be transformed by an optimizing compiler for increasing inter-process data reuse, thereby making locality-conscious scheduling more effective. Our experimental results obtained using two large, multi-process application codes indicate significant runtime benefits.

Reconfigurable SoC Design with Hierarchical FSM and Synchronous Dataflow Model [p. 199]
Sunghyun Lee, Sungjoo Yoo, and Kiyoung Choi

We present a method of runtime configuration scheduling in reconfigurable SoC design. As a model of computation in system representation, we use a popular formal model of computation, hierarchical FSM (HFSM) with synchronous dataflow (SDF) model, in short, HFSM-SDF model. In reconfigurable SoC design with the HFSM-SDF model, the problem of configuration scheduling is challenging due to the dynamic behavior of the system such as concurrent execution of state transitions (by AND relation), complex control flow (in the HFSM), and complex schedules of SDF actor firing. Thus, compile-time static configuration scheduling may not efficiently hide configuration latency. To resolve the problem, it is necessary to know the exact order of required configurations during runtime and to perform runtime configuration scheduling. To obtain the exact order of configurations, we exploit the inherent property of HFSM-SDF that the execution order of SDF actors can be determined before the execution of state transition of top FSM. After obtaining the order information in a queue called ready configuration queue, we execute the state transition. During the execution, whenever there is new available FPGA resource, a new configuration is selected from the queue and fetched by the runtime configuration scheduler. We applied the method to an MPEG4 decoder design and obtained up to 21.8% improvement in system runtime with a negligible overhead of runtime (1.4%) and memory usage (0.94%).

Dynamic Run-Time HW/SW Scheduling Techniques for Reconfigurable Architectures [p. 205]
Juanjo Noguera and Rosa M. Badia

Dynamic run-time scheduling in System-on-Chip platforms has become recently an active area of research because of the performance and power requirements of new applications. Moreover, dynamically reconfigurable logic (DRL) architectures are an exciting alternative for embedded systems design. However, all previous approaches to DRL multi-context scheduling and HW/SW scheduling for DRL architectures are based on static scheduling techniques. In this paper, we address this problem and present: (1) a dynamic scheduler hardware architecture, and (2) four dynamic run-time scheduling algorithms for DRL-based multi-context platforms. The scheduling algorithms have been integrated in our codesign environment, where a large number of experiments have been carried out. Results demonstrate the benefits of our approach.
Keywords
Dynamic run-time scheduling, reconfigurable architectures.

Extended Quasi-Static Scheduling for Formal Synthesis and Code Generation of Embedded Software [p. 211]
Feng-Shi Su and Pao-Ann Hsiung

With the computerization of most daily-life amenities such as home appliances, the software in a real-time embedded system now accounts for as much as 70% of a system design. On one hand, this increase in software has made embedded systems more accessible and easy to use, while on the other hand, it has also necessitated further research on how complex embedded software can be designed automatically and correctly. Enhancing recent advances in this research, we propose an Extended Quasi-Static Scheduling (EQSS) method for formally synthesizing and automatically generating code for embedded software, using the Complex-Choice Petri Nets (CCPN) model. Our method improves on previous work in three ways: (1) by removing model restrictions to cover a much wider range of applications, (2) by proposing an extended algorithm to schedule the more unrestricted model, and (3) by implementing a code generator that can produce multi-threaded embedded software programs. The requirements of an embedded software are specified by a set of CCPN, which is scheduled using EQSS such that the schedules satisfy limited embedded memory requirements and task precedence constraints. Finally, a POSIXbased multi-threaded embedded software program is generated in the C programming language. Through an example, we illustrate the feasibility and advantages of the proposed EQSS method.