# Area-Efficient and Reusable VLSI Architecture of Decision Feedback Equalizer for QAM Modem

Hyeongseok Yu\*, Byung Wook Kim\*, Yeon Gon Cho\*, Jun Dong Cho\*, Jea Woo Kim\*\*, Jae Kon Lee\*\*, Hyun Cheol Park\*\*, and Ki Won Lee\*\*

\*Department of Electrical and Computer Engineering, SungKyunKwan University; \*\*Corporate R&D Center, Samsung Electronics Co., Ltd.

**Abstract: This paper derives an area-efficient VLSI architecture of a decision feedback equalizer that accommodates 64/256 QAM modulation. Owing to its regular structure, the architecture is implemented efficiently as a reusable VLSI structure using EDA tools. The main idea is to employ a time-multiplexed design scheme that groups adjacent filter taps with correlated internal dataflow, such that data transferred between blocks share the same processing sequence. We simulated the proposed design scheme using SYNOPSYS™ and SPW™.**

**Index Terms—Decision feedback equalizer, QAM, reusable VLSI implementation, FIR filter** 

# **I. Introduction**

The demand for high-rate data communication services grows in proportion to the increase in internet users and multimedia applications. Providing these services requires a stable, reliable, high-performance modem with an efficient structure. In addition, CATV has become a solution for realizing bidirectional multimedia as a high-speed infrastructure network, and it plays an important role in joining broadcasting and communication. QAM is now the standard modulation technique for CATV applications and is well suited to high-rate data communication. The European Digital Audio-Visual Council (DAVIC) and Digital Video Broadcasting (DVB) standards and the Multimedia Cable Network System (MCNS) specifications for the Data-Over-Cable Interface have selected quadrature amplitude modulation (QAM) as the modulation format for downstream delivery of video and data services through coaxial cable networks. Moreover, this modulation format is suitable for high-rate wireless communication environments.

In high-rate communication, a significant source of performance degradation is the inter-symbol interference (ISI) introduced by the channel. This ISI can be greatly reduced if the receiver includes a channel equalizer. In the presence of ISI caused by a multipath channel, many equalizers have been used to enhance the data rate and reduce the ISI through channel compensation. While linear equalizers tend to require a long tapped delay line (TDL) when built on an FIR filter structure, the class of nonlinear equalizers has proven very useful. One such equalizer is the decision feedback equalizer (DFE), a combination of a feedforward and a feedback equalizer, which has a small TDL size and a short convergence time. Moreover, a decision feedback equalizer can offer optimal steady-state operation in a hostile environment. Thus, DFEs are finding increasing use in many areas of high-rate data communication, including wireless communication, where the QAM modulation format is usually employed.

However, the equalizer is a computationally intensive functional block and consumes a large portion of the area and power of the demodulator. Most channel-equalizing algorithms for hardware implementation are based on the adaptive finite impulse response (FIR) filter [1]-[3][6][7]. The adaptive FIR filter presented in [6] is pipelined for hardware efficiency, and its throughput is independent of the filter length. However, a DFE has a feedback loop, which makes pipelining difficult or impossible and increases delay and hardware complexity if the loop is pipelined. While the Canonic Signed Digit (CSD) representation in [5] can be employed to reduce fixed numerical operations such as multiplication, it is not suitable for coefficient adaptation. The design in [3] utilizes bit-plane folding architectures, but the area reduction is limited to the layout level. In [1] and [2], equalizers are implemented in a time-multiplexed structure. Though this technique reduces area complexity efficiently in VLSI implementation, the structures in [1][2] suffer from multiple clocks, complicated control, and a limited time-multiplexing level. These are less than ideal for VLSI implementation when matched against accepted criteria for design elegance and efficient implementation.

To provide an optimized structure, in this paper we build a bit-optimized equalizer with reduced operations and a regular structure obtained by time multiplexing.

Because the new time-multiplexed design scheme yields a regular structure, the building block of the proposed architecture can be reused efficiently in VLSI implementation, and the equalizer can be implemented by cascading building blocks. The time multiplexing in this paper is performed by grouping adjacent filter taps with correlated internal dataflow, such that data transferred between blocks share the same processing sequence. The approach is implemented efficiently in hardware.

Section II describes the design specification of our decision feedback equalizer. Section III presents the algorithmic optimization of our DFE structure. Section IV reviews previous time-multiplexed structures and proposes a new time-multiplexed design scheme and its hardware implementation. Finally, Section V presents our experimental results and conclusion.

# **II. Decision Feedback Equalizer**

A decision feedback equalizer consists of a feedforward section and a feedback section connected as shown in Fig. 1, together with a decision device, a coefficient adaptation block, and an error function block [8].

#### *A. Basic Structure*

Fig. 1 shows the block diagram of the decision feedback equalizer. While the feedforward adaptive FIR section takes the received signal as its input, the feedback adaptive FIR section receives the decision values of previously detected equalizer outputs. Thus, a DFE can suffer from error propagation. However, this effect is less significant than the performance gain obtained by subtracting the portion of the ISI produced by previously detected symbols from the estimates of future symbols [8].



**Fig. 1 Decision Feedback Equalizer Block Diagram**

#### *B. Selection of FIR Filter Structure*

The speed of an FIR filter is defined by the input symbol rate and is limited by the critical path delay. There are two classes of FIR filter structure, the transverse form and the transposed form, as shown in Fig. 2. In Fig. 2 (a), the accumulator of the filter is implemented as an adder tree, which has a shorter delay than an adder chain. If $T_m$ is the multiplier delay, $T_a$ is the adder delay, and $N$ is the filter length, the critical path delay of this structure is given by Eq. (1) [9].

$$
T_{transverse} = T_m + (\log_{1.5} N) \times T_a \tag{1}
$$

By repositioning the TDL registers of the transverse structure, we obtain the transposed structure shown in Fig. 2 (b). This structure reduces the critical path delay further, as given by Eq. (2).

$$
T_{transposed} = T_m + T_a \tag{2}
$$

Though the delay-line register size of the transposed-form filter has to be larger than that of the transverse filter, the transposed filter has a constant critical path delay that is independent of the filter length, and good timing stability. This property is important: in a time-multiplexed design, the multiplier of the equalizer must operate at a higher rate than the received symbol rate, so the path delay of the multiplier is more critical. Pipelining the accumulator of the transverse form can reduce the critical path delay, but it increases the number of registers and hence the area when implemented in VLSI. Moreover, the adder tree has the drawback of an irregular structure. Thus, we focus primarily on the transposed-form FIR filter in this paper.
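The two forms compute the same output; only the register placement, and hence the critical path, differs. The following Python sketch is an illustrative behavioural model, not the hardware described in this paper. The names `fir_direct`, `fir_transposed`, and `critical_path` are hypothetical, and the delays passed to `critical_path` are arbitrary unit values rather than cell-library figures.

```python
import math


def fir_direct(coeffs, samples):
    """Transverse form: the delay line holds past inputs; all products are summed each cycle."""
    delay_line = [0.0] * len(coeffs)
    out = []
    for x in samples:
        delay_line = [x] + delay_line[:-1]
        out.append(sum(c * d for c, d in zip(coeffs, delay_line)))
    return out


def fir_transposed(coeffs, samples):
    """Transposed form: the input is broadcast to every tap; registers hold partial sums."""
    n = len(coeffs)
    partial = [0.0] * n  # partial[k] feeds tap k; the last entry stays at zero
    out = []
    for x in samples:
        out.append(coeffs[0] * x + partial[0])
        # Each register is loaded with one product plus its neighbour: one multiply, one add.
        partial = [coeffs[k] * x + partial[k] for k in range(1, n)] + [0.0]
    return out


def critical_path(n_taps, t_mult, t_add, form):
    """Critical path delay following Eq. (1) for "transverse" and Eq. (2) for "transposed"."""
    if form == "transverse":
        return t_mult + math.log(n_taps, 1.5) * t_add
    if form == "transposed":
        return t_mult + t_add
    raise ValueError("form must be 'transverse' or 'transposed'")
```

Calling both filter functions with the same coefficients and samples returns identical sequences, while `critical_path` grows with `n_taps` only for the transverse form.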



#### *C. Simplification of Coefficient Adaptation Block.*

An adaptive equalizer makes it possible for the filter to perform satisfactorily in an environment where complete knowledge of the relevant signal characteristics is not available [8]. This operation is performed by an adaptive coefficient-control mechanism such as RLS (Recursive Least Squares) or LMS (Least Mean Square). In this paper, the equalizer coefficient adaptation is based on the LMS algorithm, which is an important member of the family of stochastic gradient algorithms [8].

$$
c(k, n+1) = c(k, n) + \mu \cdot e(n) \cdot x^*(n-k) \tag{3}
$$

LMS coefficient adaptation is specified by Eq. (3), where $\mu$ is the adaptation step size, $c(k, n)$ is the $k$-th tap coefficient, $e(n)$ is the estimated error, and $x^*(n-k)$ is the conjugate of the input signal delayed by $k$ samples, all at the $n$-th processing sequence. Since the LMS coefficient adaptation block needs two complex multipliers and one complex adder, this block by itself occupies a large portion of the equalizer. Thus, in this paper, we employ the signed LMS algorithm in Eq. (4) for simplification.

$$
c(k, n+1) = c(k, n) + \mu \cdot e(n) \cdot \text{sgn}[x^{*}(n-k)] \tag{4}
$$

Eq. (4) shows the signed LMS algorithm, which differs slightly from the LMS algorithm in Eq. (3) in that the input signal enters the product only through its sign $(+1, -1)$. In Eq. (4), one complex multiplier can therefore be replaced by exclusive-OR gates. Furthermore, by restricting the step size $\mu$ to a power of two, the hardware implementation of the coefficient adaptation block is greatly simplified, since the multiplication by $\mu$ reduces to a bit shift.

#### *D. Error Function Block.*

Common equalizers have two modes of operation: a training mode and a decision-directed mode. However, in some situations it may be costly to send a training sequence and the training sequence may be unavailable at the receiver. In these cases, blind training is needed, in which we make use of some known statistics of the transmitted data, but not the exact data values, to adapt the equalizer coefficients. Sato's and Godard's techniques are perhaps the most referenced blind training (or blind equalization) methods [5].

$$
e(n) = y(n) \Big[ R^2 - |y(n)|^2 \Big], \quad R^2 = \frac{E[|a(n)|^4]}{E[|a(n)|^2]} \tag{5}
$$

$$
e(n) = \text{Decision}[y(n)] - y(n) \tag{6}
$$

Eq. (5) and Eq. (6) show the Constant Modulus Algorithm (CMA) for blind equalization and the Decision Directed Algorithm (DDA), respectively. The equalizer initially operates in blind mode and is then switched to decision-directed mode. In blind mode, the CMA is used to update the tap coefficients; in decision-directed mode, the DDA is used.
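The two error functions of Eq. (5) and Eq. (6) can be sketched in Python as follows. The constellation scaling (odd-integer levels $\pm 1, \pm 3, \ldots$), the nearest-level slicer, and all function names are assumptions made for illustration; the text does not specify them.

```python
import numpy as np


def qam_constellation(m):
    """Square M-QAM points on odd-integer levels (assumed scaling)."""
    side = int(round(np.sqrt(m)))
    levels = np.arange(-(side - 1), side, 2)
    return (levels[:, None] + 1j * levels[None, :]).ravel()


def cma_radius_sq(points):
    """R^2 = E[|a|^4] / E[|a|^2] from Eq. (5), for equiprobable symbols."""
    return np.mean(np.abs(points) ** 4) / np.mean(np.abs(points) ** 2)


def cma_error(y, r2):
    """Blind-mode error of Eq. (5)."""
    return y * (r2 - abs(y) ** 2)


def slicer(y, side):
    """Decision device: nearest odd-integer level per rail, clipped to the constellation."""
    def rail(v):
        level = 2 * np.floor(v / 2) + 1
        return float(np.clip(level, -(side - 1), side - 1))
    return complex(rail(y.real), rail(y.imag))


def dd_error(y, side):
    """Decision-directed error of Eq. (6)."""
    return slicer(y, side) - y
```

With this scaling, `cma_radius_sq(qam_constellation(64))` evaluates to 58, because $E[|a|^2] = 42$ and $E[|a|^4] = 2436$ for levels $\pm 1, \pm 3, \pm 5, \pm 7$ on each rail.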

#### *E. Channel Environment.*

The decision feedback equalizer accommodates the 64-QAM and 256-QAM modes for the cable channel. The time span of the ISI depends on the channel characteristics and the signaling scheme. For successful equalizer operation, the FIR filter needs to cover the ISI time span [7], and the filter length must be chosen to achieve the required performance. The equalizer is designed to equalize the multipath channel given by Stanford Telecom based on the MCNS specifications. The channel environment is summarized in Table 1. We choose the optimum tap lengths of the FFE and FBE and the step size $\mu$ of the CMA and the decision directed algorithm (DDA) for this channel.

**Table 1. Summarization of Channel Environment** 

| Impairment       | 64 QAM                  | 256 QAM                 |
|------------------|-------------------------|-------------------------|
| Symbol Rate      | 5.056941 MHz            | 5.360537 MHz            |
| Roll-off Factor  | 0.18                    | 0.12                    |
| Frequency Offset | $\pm 0.07/T$            | $\pm 0.05/T$            |
| Input Noise SNR  | 23.5 dB                 | 30.0 dB                 |
| Multipath        | $-5$ dB @ 500 nsec      | $-5$ dB @ 500 nsec      |
|                  | $-10$ dB @ 1000 nsec    | $-10$ dB @ 1000 nsec    |
|                  | $-15$ dB @ 1500 nsec    | $-15$ dB @ 1500 nsec    |
|                  | $-25$ dB @ $>1500$ nsec | $-25$ dB @ $>1500$ nsec |

# **III. Requirements of Optimization of DFE Structure**

The equalizer consists of a 16-tap fractionally spaced (*T/2*) feedforward equalizer (FFE), a 16-tap feedback equalizer (FBE), and an error monitor.

Fig. 3 shows the 16-tap transposed-form adaptive filter with which both the FFE and the FBE are implemented.

#### *A. Fractionally Spaced Equalizer.*

Considerably better performance is possible on severely delay-distorted channels by combining the matched filter and the equalizer [2]. A fractionally spaced equalizer has its taps at a spacing $T_m = T/M$, which is smaller than the symbol period *T*; such an equalizer is said to be *T/M*-spaced. The samples for symbol $I_k$ are $x_{Mk-m}$ for $m = 0, \ldots, M-1$. Consequently, a filter with a span of *N* symbols and a spacing of *T/M* has *NM* taps.

$$
y_k^m = \sum_{n=0}^{NM-1} w(k, n)\, x(kM - m - n) \quad \text{for } m = 0, \ldots, M-1 \tag{7}
$$

$$
y_k^m = y_k^s \quad \text{where } s \text{ is one of } \{0, \ldots, M-1\} \tag{8}
$$

The filter output is given by Eq. (7), and, as Eq. (8) states, one of the *M* filter outputs is arbitrarily selected for each symbol. Thus, the TDL registers of the feedforward filter in Fig. 3 operate at twice the symbol rate. Since this higher operation rate of the *T/2*-spaced equalizer reduces the achievable time-multiplexing factor by a factor of 2, we implement the FFE as two parallel 8-tap filters operating at the symbol rate.
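The split of a 16-tap *T/2*-spaced filter into two 8-tap filters running at the symbol rate can be checked numerically. The sketch below assumes a standard two-phase (polyphase) decomposition with fixed, real-valued coefficients and the sampling phase $m = 0$; the function names are hypothetical, and the paper's exact hardware partitioning may differ.

```python
import numpy as np


def ffe_full_rate(w, x):
    """Run the T/2-spaced FIR at twice the symbol rate and keep one output per symbol."""
    y = np.convolve(x, w)[: len(x)]
    return y[::2]  # outputs at sample indices 2k, i.e. phase m = 0 in Eq. (7)


def ffe_two_phase(w, x):
    """Same outputs from two half-length filters, each clocked at the symbol rate."""
    x_even, x_odd = x[0::2], x[1::2]   # the two input sample phases
    w_even, w_odd = w[0::2], w[1::2]   # even- and odd-indexed coefficients
    k = len(x_even)
    y_even = np.convolve(x_even, w_even)[:k]
    y_odd = np.convolve(x_odd, w_odd)[:k]
    # x(2k - 2i - 1) = x_odd(k - i - 1): the odd branch is one symbol late.
    y_odd = np.concatenate(([0], y_odd[:-1]))
    return y_even + y_odd
```

Both functions expect an even number of input samples so that the two phases have equal length. For any such input they return the same symbol-rate sequence, which is why the two-branch form can replace the double-rate filter.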



#### *B. Delayed LMS Algorithm.*

Substituting Eq. (5) and Eq. (6) into Eq. (4), we obtain the critical path delay of the adaptation process as Eq. (9).

$$
T_{adaptation} = 2T_m + T_{sub} + 2T_a + T_{XOR} \tag{9}
$$

Here $T_m$ is the multiplier delay, $T_a$ is the adder delay, $T_{sub}$ is the subtraction delay, and $T_{XOR}$ is the exclusive-OR gate delay. This path delay is very critical because of the higher operation rate of the time-multiplexed structure. Thus, we delay the error signal and the data by one operation clock cycle to reduce the adaptation path delay. Consequently, the current coefficient update is performed using the error from the previous symbol period, as in Eq. (10).



**Fig. 4 Performance Estimation of Bit Precision**



**Fig. 7 4-time Multiplexed 16 Tap Equalizer Architecture using Proposed Multiplexing** 

$$
c(k, n+1) = c(k, n) + \mu \cdot e(n-1) \cdot \text{sgn}[x^{*}(n-k-1)] \tag{10}
$$

This modification is implemented along the dashed line of Fig. 3 by adding delay elements, and it does not affect the convergence of the algorithm. Moreover, the delay is necessary for the FFE to perform the LMS coefficient update in parallel with the FIR filter computation.
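The effect of the one-cycle delay in Eq. (10) can be explored with a toy experiment. The sketch below is not the equalizer itself: it is a real-valued, noise-free system-identification setup chosen for illustration, with a hypothetical function name, in which the update at each cycle uses the error and the data signs registered in the previous cycle.

```python
import numpy as np


def delayed_sign_lms(x, d, n_taps, shift):
    """Adapt an FIR filter with the one-cycle-delayed signed LMS update of Eq. (10)."""
    mu = 2.0 ** -shift                 # power-of-two step size
    w = np.zeros(n_taps)
    taps = np.zeros(n_taps)            # tapped-delay-line contents
    prev_err = 0.0                     # e(n-1), registered from the last cycle
    prev_sign = np.zeros(n_taps)       # sgn[x(n-k-1)], registered from the last cycle
    for xn, dn in zip(x, d):
        taps = np.concatenate(([xn], taps[:-1]))
        err = dn - w @ taps            # error computed with the coefficients in use now
        w = w + mu * prev_err * prev_sign
        prev_err = err
        prev_sign = np.where(taps >= 0, 1.0, -1.0)
    return w
```

Driving this function with white Gaussian input and a desired signal produced by a short FIR response lets one compare the adapted coefficients against that response for a chosen `shift`.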

#### *C. Optimization of Bit Precision.*

Fig. 4 shows plots of the MSE performance at the slicer versus word length for signals within the DFE. The experimental results suggest that 12 bits are sufficient for the coefficients used in the filtering operation, 21 bits for the coefficient-update accumulator, 16 bits for the multiply-accumulate section of the filter, and 12 bits for the CMA or DDA error in the error generation section. The final MSE obtained in the simulation was -10.3 dB with finite-precision signals, close to the -10.45 dB obtained in floating-point simulation.
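A word-length study of this kind is typically run by inserting a quantizer at each signal of interest. The following Python sketch shows one such quantizer using the word lengths quoted above. The split between integer and fractional bits is not stated in the text, so the fractional widths used here, like the names `WORD_LENGTHS` and `quantize`, are purely illustrative assumptions.

```python
import math

# Word lengths quoted in the text (total bits, including sign).
WORD_LENGTHS = {
    "coefficient": 12,
    "update_accumulator": 21,
    "multiply_accumulate": 16,
    "error": 12,
}


def quantize(value, total_bits, frac_bits):
    """Round to a signed two's-complement grid with saturation at the word limits."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    code = math.floor(value * scale + 0.5)   # round half up
    code = max(lo, min(hi, code))            # saturate instead of wrapping
    return code / scale


def quantize_coefficient(value, frac_bits=11):
    """Coefficient quantizer; 11 fractional bits is an assumed format."""
    return quantize(value, WORD_LENGTHS["coefficient"], frac_bits)
```

Sweeping `total_bits` for one signal while holding the others fixed, and recording the resulting MSE, is one way to obtain curves of the kind described for Fig. 4.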

# **IV. Efficient VLSI Implementation of DFE**

#### *A. Previous Time Multiplexing Architecture.*



**Fig. 5 Time-Multiplexed Architecture Sharing Multiplier by factor of 4 in [1]**

The time-multiplexing techniques in [1][2] are very attractive, since an adaptive filter can be implemented and extended by simply cascading time-multiplexed macro blocks. The resulting filters have a regular structure, so the macro block can be reused in an automated design tool.

Fig. 5 shows the 4-times multiplexed architecture in [1]. The time-multiplexed design in [1] shares the complex multiplier and adder, and the filter length can be extended one tap at a time. However, this scheme uses multiple clocks for the control unit, and the area reduction is limited by the multiplexing factor of 4.

Fig. 6 shows the 4-times multiplexed architecture in [2]. Let the number of filter taps be *N*, the multiplexing factor be *M*, and let the $k$-th tap delay register store $y(k,n)$, where $k = 0, 1, \ldots, N-1$ and $n$ is the processing sequence. This scheme separates the registers of the *N* filter taps into groups of *M* taps as follows:

{ $y(0,n)$, $y(1,n)$, ..., $y(M-1,n)$ }  
{ $y(M,n)$, $y(M+1,n)$, ..., $y(2M-1,n)$ }  
$\vdots$  
{ $y(N-M,n)$, $y(N-M+1,n)$, ..., $y(N-1,n)$ }
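The grouping used in [2], as listed above, places consecutive tap registers in the same block. A short Python sketch makes the index pattern concrete; the function name is hypothetical and the code only enumerates register indices, without modelling the datapath.

```python
def contiguous_groups(n_taps, factor):
    """Split tap indices 0..N-1 into consecutive groups of M taps each."""
    if factor <= 0 or n_taps % factor:
        raise ValueError("the number of taps must be a positive multiple of the factor")
    return [list(range(start, start + factor)) for start in range(0, n_taps, factor)]
```

For the 16-tap, 4-times multiplexed case this gives four blocks: taps 0 to 3, 4 to 7, 8 to 11, and 12 to 15.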





**Fig. 6 Time-Multiplexed Architecture Sharing Tap by factor of 4 in [2]**

In each block of *M* taps, a single tap operates *M* times faster, and the data transferred between blocks are processed in different sequences. In [2], area reduction is achieved more efficiently by sharing a group of filter taps, and the time-multiplexing factor can be increased above 4. Moreover, unlike [1], this block operates with a single-phase clock. The multiplexed structure in [2], shown in Fig. 6, has two more adders than that in [1], shown in Fig. 5. This drawback is overcome by using a carry-save adder (CSA) between the complex multiplier and the complex carry-propagation adder. Unfortunately, the structure in [2] has complicated control units (e.g., MUXes), and these control units grow in proportion to the filter length.

#### *B. A New Time Multiplexing Architecture.*

We present a new time-multiplexed design scheme for sharing filter taps. This scheme groups adjacent filter taps with correlated internal dataflow, such that data transferred between blocks share the same processing sequence.

In this scheme, the registers of the *N*-tap filter are separated into groups of *M* taps as follows:

{ $y(0,n)$, $y(M-1,n)$, ..., $y(N-M,n)$ }  
{ $y(1,n)$, $y(M,n)$, ..., $y(N-M+1,n)$ }  
$\vdots$  
{ $y(N/M-1,n)$, $y(N/M+M-2,n)$, ..., $y(N-1,n)$ }

Thus, the data transferred within one grouped block are processed in different processing sequences *n* and are correlated, while the data transferred between blocks are processed in the same processing sequence. Fig. 7 shows an example of a 4-times multiplexed 16-tap equalizer using the proposed multiplexing scheme. The dashed line marks a group of taps in this scheme. The block of grouped taps has a regular structure. As shown in Fig. 7, the proposed time-multiplexed architecture has less control logic and uses a single-phase clock. Moreover, the filter length can be easily adjusted by cascading blocks.

# **V. Experimental Result and Conclusion**



**(a) MSE Plot of Transpose Form Equalizer in dB scale** 



**(b) 256 QAM Signal Eye Pattern and Constellation** 

**Fig. 8 SPW™ Simulation Result**

Fig. 8 illustrates (a) the MSE convergence curve and (b) the 256 QAM signal eye pattern and constellation in the converged state under multipath distortion of -10 dB at 0.6, -20 dB at 1.2, -30 dB at 1.6, and -50 dB at 1.2. A representative plot of symbol error rate (SER) versus Eb/No for 64 and 256 QAM is shown in Fig. 9. The implementation loss is measured to be 1.1 dB and 1.3 dB for 64 and 256 QAM, respectively, at an SER of $10^{-6}$; our result shows a 1.5 dB improvement for 256 QAM and almost the same quality for 64 QAM.

In this paper, we derived an area-efficient VLSI architecture of a decision feedback equalizer that accommodates 64 and 256 QAM demodulators and consists of a 16-tap fractionally spaced (T/2) FFE and a 16-tap FBE. Owing to the regularity provided by the new time-multiplexed design scheme, the architecture was implemented efficiently as a reusable VLSI structure. Table 2 summarizes the comparison with other designs, where each equalizer has 16 taps and signed LMS coefficient adaptation. The Reg. item in Table 2 counts, in turn, the equalizer input data registers for LMS, the registers for the LMS update, and the registers for the tapped delay line.

**Fig. 9. 64 and 256 QAM SER Performance Plot**

The equalizer operates at a symbol rate of 5.056941 Mbaud in 64 QAM mode and 5.360537 Mbaud in 256 QAM mode. Thus, if we employ the 4-times multiplexing scheme in Fig. 7 and asynchronous clocking, the equalizer has to operate at a maximum frequency of 35 MHz.

This equalizer is synthesized with SYNOPSYS™ using the Samsung Electronics STD-90 0.35 µm standard-cell ASIC library and requires about 85,000 gates.

**Table 2. Architecture Comparison Results** 



# **REFERENCES**

- [1] H. Samueli, "A 70-Mb/s Variable-Rate 1024-QAM Cable Receiver IC with Integrated 10-b ADC and FEC Decoder", *IEEE Journal of Solid-State Circuits*, vol. 33, no. 12, Dec. 1998.
- [2] C. J. Nicol, et al., "A Low-Power 128-Tap Digital Adaptive Equalizer for Broadband Modems", *IEEE Journal of Solid-State Circuits*, vol. 32, no. 11, Nov. 1997.
- [3] S. R. Meier, et al., "Efficient and Reusable Time-Sharing Architectures for Equalizer Structure", in *Proceedings of the IEEE 2000 Custom Integrated Circuits Conference*, pp. 477-480, 2000.
- [4] H. Samueli, et al., "Design Techniques for Silicon Compiler Implementations of High-Speed FIR Digital Filters", *IEEE JSSC*, vol. 31, May 1996.
- [5] S. Kinjo, et al., "A Design of FIR Filter Using CSD with Minimum Number of Registers", *Proc. of APCCS*, 1996.
- [6] S. C. Douglas, et al., "A pipelined LMS adaptive FIR filter architecture without adaptation delay", *IEEE Transactions on Signal Processing*, vol. 46, no. 3, Mar. 1998.
- [7] K. Azadet, C. J. Nicol, "Low-power equalizer architectures for high speed modems", *IEEE Communications Magazine*, vol. 36, Oct. 1998.
- [8] S. Haykin, *Adaptive Filter Theory*, third edition, Prentice-Hall, NJ, 1996.
- [9] R. Jain, P. T. Yang, T. Yoshino, "FIRGEN: a computer-aided design system for high performance FIR filter integrated circuits", *IEEE Transactions on Signal Processing*, vol. 39, no. 7, Jul. 1991.
- [10] T. C. Denk, C. J. Nicol, P. Larsson, K. Azadet, "Reconfigurable hardware for efficient implementation of programmable FIR filters", *Proc. of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing*, vol. 5, 1998.
- [11] C.-I Hwang and D. W. Lin, "Joint Low-Complexity Blind Equalization, Carrier Recovery and Timing Recovery with Application to Cable Modem Transmission", *IEICE Trans. Commun.*, vol. E82-B, no. 1, Jan. 1999.