# MAPLE chip: a processing element for a static scheduling centric multiprocessor

Kenta Yasufuku †

Riku Ogawa †

†Keio University

Dept. of Information and Computer Science 3-14-1, Hiyoshi Yokohama, 223-8522 Japan

Tel: +81-45-560-1063 Fax: +81-45-560-1064

e-mail: {yasufuku, riku, hunga}@am.ics.keio.ac.jp

Keisuke Iwai †† Hideharu Amano †

††National Defense Academy Dept. of computer science 1-10-20, Hashirimizu Yokosuka, 239-8686 Japan

Tel: +81-468-41-3810 Fax: +81-468-44-5911 e-mail: iwai@nda.ac.jp

Abstract—A custom processor called MAPLE which supports static scheduling by automatic parallelizing compilers is implemented and evaluated. MAPLE has a high performance floating point arithmetic unit and low latency data transfer mechanism for other MAPLE chips. The maximum operational frequency is 80MHz in simulation, and the operation on the prototype board with 23MHz clock is confirmed. It requires about 0.56W at 23MHz operation.

## I. INTRODUCTION

Although the peak performance of a multiprocessor can easily be raised simply by adding processing units, it is difficult for users to obtain effective performance without the support of automatic parallelizing compilers. However, such compilers have been tailored to existing multiprocessors, which were designed with little consideration for them. The multiprocessor system ASCA (Advanced Scheduling oriented Computer Architecture) has been proposed based on the idea that, rather than tailoring the parallelizing software to the machine, a multiprocessor system should be designed to make the best use of parallelizing software[1].

In ASCA, a multi-grain parallelizing compilation scheme is adopted[2]. The scheme can exploit the parallelism of a user program at several levels of granularity: coarse-grain parallelism (macro-data flow computation), medium-grain parallelism (loop-level parallelism), which is used in most current compilers, and near-fine-grain parallelism (statement-level parallelism). Since near-fine-grain parallelizing compilation in particular requires precise static scheduling between operations, unpredictable behavior of the processor must be completely excluded. To solve this problem, we have proposed the custom processor MAPLE (Multiprocessor system ASCA Processing eLEment)[1].

## II. CHIP FEATURES

MAPLE is a 32-bit RISC processor with a simple structure and highly predictable operation. Its instruction set is an extension of that of DLX[3]. The five-stage pipeline of MAPLE is designed to execute every operation in a fixed number of clock cycles. 32-bit and 64-bit floating-point units conforming to IEEE Std 754-1985 are provided[4].

There are 32 integer registers and 32 floating point registers.

Furthermore, sixteen 32-bit special registers, called receive registers, are provided for low-latency data transfer between MAPLE chips. The receive registers are directly connected to the instruction decode stage of the MAPLE pipeline. When the source processor executes a register-to-register transfer operation, the data is sent out directly from the memory access stage of its pipeline and is received directly by a receive register of the destination processor. Fig. 1 illustrates data transfer between MAPLE chips using receive registers.



Fig. 1. Data Transfer with Receive Register(RR)
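The register-level view of this transfer can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the class `PE` and the methods `send` and `read_rr` are hypothetical names introduced here, and the model ignores pipeline timing, flow control, and the actual MAPLE instruction encoding. Only the register counts and widths are taken from the text.

```python
# Toy behavioral model of a transfer through receive registers (RR).
# Assumptions (not from the paper): the names PE, send, read_rr, and the
# absence of any timing or flow-control model.

NUM_INT_REGS = 32        # from the text: 32 integer registers
NUM_RECEIVE_REGS = 16    # from the text: 16 receive registers per chip
WORD_MASK = 0xFFFFFFFF   # from the text: each receive register is 32 bits wide


class PE:
    def __init__(self, name):
        self.name = name
        self.int_regs = [0] * NUM_INT_REGS
        self.recv_regs = [0] * NUM_RECEIVE_REGS

    def send(self, dest, src_reg, rr_index):
        """Write one of this PE's integer registers into an RR of `dest`."""
        dest.recv_regs[rr_index] = self.int_regs[src_reg] & WORD_MASK

    def read_rr(self, rr_index, dst_reg):
        """Use a receive register as an operand source on the receiving PE."""
        self.int_regs[dst_reg] = self.recv_regs[rr_index]


pe0, pe1 = PE("PE0"), PE("PE1")
pe0.int_regs[5] = 1234
pe0.send(pe1, src_reg=5, rr_index=3)   # PE0 writes into RR3 of PE1
pe1.read_rr(rr_index=3, dst_reg=7)     # PE1 consumes the value as an operand
print(pe1.int_regs[7])
```

The point of the model is that the sender writes straight into a register of the receiver, with no memory access on either side. A statically scheduled program can therefore count on the transfer taking a fixed time.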

## III. IMPLEMENTATION

The MAPLE chip in this study was fabricated through the chip fabrication program of the VLSI Design and Education Center (VDEC), the University of Tokyo, in collaboration with Rohm Corporation and Toppan Printing Corporation.

The chip specification and the gate counts are shown in Table I and Table II, respectively. Fig. 2 shows the layout and the packaged die. The results of static delay path analysis show that the operating frequency of this MAPLE chip is 80MHz. However, we could not confirm this with a chip tester, because the chip is too large for the tester available at VDEC.

TABLE I
THE SPECIFICATION OF MAPLE

| Technology                                                  | Die size        | Package    |
|-------------------------------------------------------------|-----------------|------------|
| Rohm CMOS 0.35 $\mu$m Std. Cell, Poly 2, Metal 3, Vdd=3.3V | 14.2mm × 14.2mm | PGA 572pin |

TABLE II
NUMBER OF GATES

| Module           | Number of Gates |
|------------------|-----------------|
| Integer Unit     | 21,486          |
| Floating Unit    | 145,745         |
| Receive Register | 5,254           |
| Others           | 1,525           |
| Total            | 174,010         |





Fig. 2. The MAPLE Cell Layout and packaged Die

## IV. EVALUATION

### A. Prototype PE board

A prototype PE (Processing Element) board with one MAPLE chip has been developed as a target for near-fine-grain scheduling (Fig. 3). It provides a software cache control system, 512kbyte main memory, 32kbyte instruction RAM, 32kbyte flash ROM, and a serial interface. At startup, a monitor program in the flash ROM runs. Under the control of the monitor program, the user program code is loaded from the host computer through the serial interface and executed.



Fig. 3. Prototype PE board

This board operates with a 23MHz clock, far below the 80MHz target frequency of the MAPLE chip. The main reason for this degradation is that an I/O pin assignment error was found after the board had been fabricated. To reroute the pin connections, the MAPLE chip is mounted on a large daughter board, which introduces various electrical problems.

### B. Performance

The performance of the MAPLE chip with a 23MHz clock is evaluated with a $\pi$-series calculation of 30,000 iterations, and the results are shown in Table III. For comparison, the execution result on UltraSPARC-II, a chip built with a similar level of technology, is also shown in the table[6].

| CPU                        | Clock rate [MHz] | Power [W] | Time [ms] |
|----------------------------|------------------|-----------|-----------|
| UltraSPARC-II <sup>a</sup> | 300              | 31.20     | 48.6      |
| MAPLE 1PE <sup>b</sup>     | 23               | 0.56      | 750.0     |

<sup>a</sup> OS: SunOS 5.8, Compiler: gcc 2.95.3, Compile option: -O3
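The text does not state which series the $\pi$-series calculation uses. As an illustration only, the sketch below assumes the Leibniz series $\pi = 4\sum_{k \ge 0} (-1)^k/(2k+1)$ truncated at 30,000 terms; the function name `pi_series` is hypothetical. The measured benchmark was compiled with gcc, so this Python version shows the shape of the kernel and is not the code that was timed.

```python
import math


def pi_series(iterations=30_000):
    """Approximate pi with the Leibniz series (assumed, not from the paper)."""
    total = 0.0
    sign = 1.0
    for k in range(iterations):
        # One floating-point divide and one add per iteration.
        total += sign / (2 * k + 1)
        sign = -sign
    return 4.0 * total


approx = pi_series()
error = abs(approx - math.pi)
print(approx, error)
```

A kernel of this kind is dominated by floating-point division and addition, so under this assumption it mainly exercises the floating-point unit.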

The MAPLE chip is designed as a PE for a multiprocessor. We simulated 4 PEs as one cluster and found that the $\pi$-series calculation performance of the cluster using receive registers is about 2.25 times that of a single PE[5]. If the MAPLE chip ran at the designed 80MHz clock in a 4-PE cluster, the execution time would be 95.8ms, 7.83 times shorter than the result in the table. A single MAPLE chip with a 23MHz clock consumes 0.56W. Since power reduction techniques are not used in this design, this value could be reduced considerably.
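The projected execution time can be reproduced from the figures already given. The sketch below assumes that execution time scales inversely with clock frequency and that the simulated 2.25-times cluster speedup carries over unchanged to 80MHz. The variable names are illustrative.

```python
# Projection of the 4-PE, 80 MHz execution time from Table III.
# Assumptions: time scales as 1/frequency, and the simulated cluster
# speedup of 2.25 applies unchanged at the higher clock rate.

measured_ms = 750.0      # Table III: MAPLE 1PE at 23 MHz
measured_mhz = 23.0
design_mhz = 80.0        # operating frequency from static delay path analysis
cluster_speedup = 2.25   # simulated 4-PE cluster versus a single PE

projected_ms = measured_ms * (measured_mhz / design_mhz) / cluster_speedup
gain = measured_ms / projected_ms

print(f"{projected_ms:.1f} ms, {gain:.2f}x")  # prints "95.8 ms, 7.83x"
```

Under these assumptions the projected cluster is still about twice as slow as the 48.6ms measured on UltraSPARC-II.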

## V. CONCLUSIONS

The MAPLE chip is a processing element for the static scheduling centric multiprocessor ASCA. Although its performance is lower than that of UltraSPARC-II, the gate count and power consumption of the MAPLE chip are so small that the cost/performance and power/performance of a MAPLE cluster could potentially compete with recent superscalar processors.
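One rough way to look at the power/performance argument is energy per benchmark run, computed as power times execution time from the Table III values. This metric is not reported in the text. It is a derived illustration that assumes the listed power figures apply for the whole run, and it covers only the measured single-PE configuration.

```python
# Energy per benchmark run derived from Table III (power x time).
# Assumption: the listed power applies for the entire execution time.

table_iii = {
    "UltraSPARC-II": {"power_w": 31.20, "time_ms": 48.6},
    "MAPLE 1PE": {"power_w": 0.56, "time_ms": 750.0},
}

energy_j = {
    cpu: row["power_w"] * row["time_ms"] / 1000.0
    for cpu, row in table_iii.items()
}
ratio = energy_j["UltraSPARC-II"] / energy_j["MAPLE 1PE"]

for cpu, joules in energy_j.items():
    print(f"{cpu}: {joules:.2f} J")
print(f"ratio: {ratio:.2f}x")
```

With these inputs the single MAPLE PE uses about 0.42J per run against about 1.52J for UltraSPARC-II, although it takes far longer to finish.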

## ACKNOWLEDGEMENTS

This study is supported by STARC (Semiconductor Technology Academic Research Center, Japan) under the program "A multiprocessor system for the perfect static scheduling."

## REFERENCES

- [1] T.Fujiwara, K.Sakamoto, T.Kawaguchi, K.Iwai, H.Amano, "A Custom Processor for the Multiprocessor System ASCA," *Applied Informatics* '98, 1998, pp258-261.
- [2] H.Kasahara, H.Honda, A.Mogi, A.Ogura, K.Fujiwara, S.Narita, "A Multi-Grain Parallelizing Compilation Scheme for OSCAR," 4th Workshop on Languages and Compilers for Parallel Computing, 1991.
- [3] J.L.Hennessy and D.A. Patterson, COMPUTER ARCHITECTURE A QUANTITATIVE APPROACH SECOND EDITION, Morgan Kaufmann Publishers, 1996.
- [4] T.Kawaguchi, T.Fujiwara, K.Sakamoto, K.Iwai, H.Amano, "Floating Point Arithmetic Unit for the Custom Processor MAPLE," *Applied Informatics* '99, 1999, pp578-580.
- [5] T.Abe, T.Morimura, T.Suzuki, K.Tanaka, M.Koibuchi, K.Iwai, H.Amano, "ASCA chip set: Key components of multiprocessor architecture for multi-grain parallel processing," *Proc. of COOL Chips IV*, 2001, pp223-247.
- [6] G.Goldman, P.Tirumalai, "UltraSPARC-II: The Advancement of Ultra-Computing," CompCon 96, 1996.

<sup>b</sup> Compiler: gcc 2.7.2.3, Compile option: -O3