## Digitally Enhanced Information Efficient Wireline Signaling

by

Shovon Dey

A thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science

in

Integrated Circuit and Systems

Department of Electrical and Computer Engineering

University of Alberta

© Shovon Dey, 2020

## Abstract

With the increase in integrated functionality within a single chip, driven by the continuous scaling of transistor gate length, off-chip bandwidth must catch up to make that functionality accessible. According to the ITRS (International Technology Roadmap for Semiconductors), future off-chip bandwidth might reach 100 Tbps. However, the electrical channel introduces latency and frequency-dependent attenuation, which significantly hinder higher off-chip bandwidth. Power efficiency is also one of the main concerns in a high-speed environment. This work introduces several digital techniques that increase the power and bandwidth efficiency of high-speed wireline systems.

The first work in this dissertation explains a digital technique that uses a generated alternated sequence to detect and correct any burst errors in Decision Feedback Equalization (DFE). This scheme takes advantage of the correlation between Inter-Symbol Interferences (ISI) to correct propagated error in DFE. The proposed architecture achieves a BER of less than 10<sup>-12</sup> while working at 16 Gbps.

The second work in this thesis describes a complete low-power 16 Gbps SerDes transceiver with 0.0375 pJ/bit link efficiency while transmitting 90% sparse data. In this work, the transmitted bits are transition encoded, which results in a significant power efficiency improvement for random as well as sparse data. The receiver architecture includes a CTLE (Continuous Time Linear Equalizer), and the rest of the architecture is similar to a 1-tap speculative DFE.

The third work describes a binary search ADC (Analog-to-Digital Converter) with a novel digital interpolation technique. Concurrent Binary Search was used in this work to reduce the conversion time of the SAR (Successive Approximation Register) type ADC. The 4-way time-interleaved architecture achieved 4 GS/s speed with an 8-bit resolution having 39.56 dB SNDR and 48.21 dB SFDR at the Nyquist frequency.

## Acknowledgments

Although this dissertation has only one author listed, much of this work would not have been possible without the help I received during my almost two and a half years as a graduate student at the University of Alberta. Among all the people who helped and supported me, I must first acknowledge Dr. Masum Hossain. My utmost respect and gratitude go to him. Without his guidance, encouragement, and tolerance when I made mistakes, not a single work of this thesis could have been successful. I am and will remain grateful to Dr. Masum Hossain for the rest of my life for providing me the beacon of knowledge in this very field.

I would also like to thank Dr. Kambiz Moez and Dr. Bruce F. Cockburn for being on the final exam committee. Sincere gratitude goes to CMC Microsystems for lending us equipment to test the circuits.

During my first one and a half years at the Mixed-Signal Integrated Circuits in Nano-Technology (MINT) lab at the University of Alberta, Aurangozeb provided me with technical insights and encouragement. I am extremely grateful to Aurangozeb for all of his help and friendship. I would also like to thank Carson, Zarraf, Alberta, Shakib, and Behdad for their help and for making the workplace fun while being productive.

Finally, thanks to my parents for believing in me and providing me with moral support whenever it was needed.

# Contents

| Section |
|---------|
| Abstract |
| Acknowledgments |
| Contents |
| List of Tables |
| List of Figures |
| List of Abbreviations |
| Chapter 1: Introduction |
| 1.1: Motivation |
| 1.2: High-Speed Wireline Communication System |
| 1.3: Contribution of the Thesis |
| 1.4: Organization of the Thesis |
| Chapter 2: Low Latency Burst Error Detection and Correction in Decision Feedback Equalization |
| 2.1: Introduction |
| 2.2: DFE Error Propagation |
| 2.3: Precursor Equalization Challenges |
| 2.4: Proposed DFE Error Detection and Correction |
| 2.5: Implementation and Measurement |
| A. Digital Burst Error Detection and Correction Unit |
| B. Implemented Prototype |
| 2.6: Conclusion |
| Chapter 3: Low Power 16-Gbps High-speed Wireline Transceiver in 65nm CMOS |
| 3.1: Background |
| 3.2: Introduction |
| 3.3: Proposed Transmitter Architecture |
| 3.4: Receiver and Decoder Architecture |
| 3.5: Implementation and Measurement |
| Chapter 4: A 4-GS/s Digitally Interpolated 8-Bit Concurrent Binary Search ADC |
| 4.1: Background |
| 4.2: Introduction |
| 4.3: SAR Based Concurrent Binary Search ADC |
| A. Interpolation Logic |
| B. Redundancy |
| C. Power and Area Consideration |
| 4.4: Implementation and Measurement Results |
| A. Reference Calibration |
| 4.5: Conclusion |
| Chapter 5: Concluding Remarks |
| 5.1: Future Works |
| Bibliography |

# **List of Tables**

| Table |
|-------|
| Table 2.1: Performance Summary and Comparison |
| Table 3.1: Link Performance Summary and Comparison |
| Table 4.1: Performance Summary of the ADC |

# **List of Figures**

| Figure |
|--------|
| Figure 1.1: System level view of a MAC accelerator with memory interface |
| Figure 1.2: A General Mixed-Signal Wireline Transceiver |
| Figure 1.3: General Wireline Transceiver with ADC-based Receiver |
| Figure 1.4: Contribution of the Thesis |
| Figure 2.1: Two-tap decision feedback equalizer implementation (a) direct feedback structure (b) loop-unrolled or speculative structure |
| Figure 2.2: Error propagation mechanism in loop-unrolled DFE. Four comparator references are shown with blue lines. Selected comparators during each bit decision are highlighted using red arrows. Detected bits and error bits are marked in red. |
| Figure 2.3: Markov model of error propagation in DFE with traceback path with corrected sequence detection based on the relative conditional probability distribution function |
| Figure 2.4: SNR penalty in pre-cursor equalization using a peak-power-constrained 8-tap transmit FIR (3 pre-taps, main and 4 post-taps) |
| Figure 2.5: Source degeneration equalizer and its frequency and time domain performance that demonstrates under-equalized pre-cursor and over-equalized post-cursor in the time domain single bit response and resultant eye diagram. |
| Figure 2.6: Proposed sequence DFE with error detection and correction technique. The lines on the transient response are associated with the 4-bit sequences. The decoding process is described sequentially as likely sequence generation and selection based on previous decision. Figure also shows the error detection and correction based on next bit (B+1) and finally the initial DFE decisions along with corrected bits. |
| Figure 2.7: Receiver architecture based on two-tap sequence DFE followed by burst error correction using Store-Compare-Select (SCS) unit |
| Figure 2.8: Implemented prototype in 65-nm CMOS with power breakdown |
| Figure 2.9: Operation of a SCS Unit |
| Figure 2.10: Synthesized Layout of the Digital Error Detection and Correction Unit |
| Figure 2.11: (a) BER bathtub curve of the receiver. (b) Eye diagram at AFE output at 16 Gb/s and after the sequence decoder at 4 Gb/s. |
| Figure 2.12: Measured SBR at CTLE output and BER performance as a function of signal to noise ratio. Conv. and proposed raw BER are measured and these error sequences are post processed to generate post FEC performance with Reed Solomon encoding (RS 528, 514) |
| Figure 2.13: Measured SBR at CTLE output and BER performance as a function of signal to noise ratio. Conv. and proposed raw BER are measured for moderate to high loss channels in (a) and (b) respectively. The measured raw error sequences are post processed to generate post FEC performance with Reed Solomon encoding (RS 528, 514). |
| Figure 3.1: CML Driver |
| Figure 3.2: Conventional Voltage mode NRZ transmitter |
| Figure 3.3: Proposed transition encoded signaling transmitter |
| Figure 3.4: Simulation waveforms of conventional and proposed voltage mode transmitter |
| Figure 3.5: VDD Ripple Voltage vs. Supply node Capacitance |
| Figure 3.6: Common gate amplifier and its small signal equivalent model |
| Figure 3.7: Overall Receiver and Decoder Architecture |
| Figure 3.8: Source-Degenerated Differential Pair |
| Figure 3.9: Implemented prototype in TSMC 65nm CMOS |
| Figure 3.10: Equalizer Output Eye |
| Figure 3.11: Link BER Bathtub Curve |
| Figure 4.1: 3-Bit Binary Search ADC |
| Figure 4.2: (a) Time-interleaved SAR ADC architecture with CDAC (b) An example of transient operation of the Asynchronous SAR |
| Figure 4.3: Comparator power as a function of ADC resolution in 65 nm CMOS to achieve decision time less than 125 ps. Here the dynamic range of the ADC is assumed to be 600 mVpp |
| Figure 4.4: Proposed concurrent binary search ADC (a) single-channel ADC with interpolation logic (b) reference generation; blue and red lines show potential references in each cycle of the conversion |
| Figure 4.5: An example of the reference update in each cycle of the concurrent search and generated digital code via interpolation |
| Figure 4.6: An example of the error event during conversion |
| Figure 4.7: An example of the ADC operation where comparators 1 and 2 retire after the 1st cycle and comparators 3 and 4 continue CBS |
| Figure 4.8: Implemented quad-channel ADC in 65nm CMOS. The single channel is also shown in detail |
| Figure 4.9: Code density-based reference offset correction |
| Figure 4.10: Measured SNDR and SFDR of a single-channel SAR ADC as a function of input frequency. |
| Figure 4.11: Measured output spectrum of 4GS/s ADC at Nyquist Frequency |
| Figure 4.12: Measured output spectrum of 4GS/s ADC at 10-MHz Frequency |
| Figure 4.13: Measured INL and DNL profile of the ADC prototype |
| Figure 5.1: Channel Performance Impact [26] |

# **List of Abbreviations**

| Abbreviation | Meaning                              |
|--------------|--------------------------------------|
| MAC          | Multiplication and Accumulation      |
| PLL          | Phase-Locked Loop                    |
| ISI          | Inter Symbol Interference            |
| CTLE         | Continuous Time Linear Equalizer     |
| SerDes       | Serializer Deserializer              |
| CDR          | Clock and Data Recovery              |
| Transceiver  | Transmitter and Receiver             |
| DSP          | Digital Signal Processing            |
| DFE          | Decision Feedback Equalization       |
| SNR          | Signal to Noise Ratio                |
| AFE          | Analog Front End                     |
| BER          | Bit Error Rate                       |
| FEC          | Forward Error Correction             |
| SAR          | Successive Approximation Register    |
| ADC          | Analog to Digital Converter          |
| INL          | Integral Non-Linearity               |
| DNL          | Differential Non-Linearity           |
| SNDR         | Signal to Noise and Distortion Ratio |
| SFDR         | Spurious-Free Dynamic Range          |

# Chapter 1.

# Introduction

#### **1.1. Motivation**

The renewed interest in wireline signaling is motivated by AI-driven applications that require faster processing, learning, and decision making. Accelerators for AI applications are specifically designed to replicate 'brain neurons' in hardware form. In a human brain, external stimulations are summed up by the synaptic connections in neurons, and the combined output of these neurons makes decisions. In hardware emulation, these neurons are described as Multiplication and Accumulation (MAC) units (Figure 1.1). In this analogous world, external stimulations are represented by 'input excitations' and synaptic connections are represented by weight vectors. Therefore, when the input excitations are multiplied by the 'weight vector' and the products are summed, the result reflects a single neuron's output. By combining different neuron outputs, ML hardware can make a decision, which is known as 'inference'. Like the synaptic connections, these weight vectors also get updated during the learning process. Note that the size of the weight vectors requires them to be stored in a memory. On the other hand, computation is most efficient in advanced CMOS nodes, specifically in processors. Therefore, the learning process mandates significant processor-memory bandwidth. In fact, this is one of the key challenges that we must address to develop ML hardware.

Figure 1.1: System level view of a MAC accelerator with memory interface
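The multiply-and-accumulate operation described above can be sketched in a few lines of Python. This is only an illustration: the function name `mac_neuron` and the numeric values are assumptions made for this example and do not come from the accelerator in Figure 1.1.

```python
def mac_neuron(excitations, weights):
    """Multiply each input excitation by its weight and accumulate the products."""
    if len(excitations) != len(weights):
        raise ValueError("excitations and weights must have the same length")
    acc = 0.0
    for x, w in zip(excitations, weights):
        acc += x * w  # one multiply-accumulate (MAC) step
    return acc


# Hypothetical values, chosen only for illustration.
inputs = [1.0, 0.5, -2.0]
weights = [0.2, 0.4, 0.1]
neuron_output = mac_neuron(inputs, weights)
```

Every weight has to be fetched once per evaluation, which is why the size of the weight vectors translates directly into processor-memory bandwidth.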

The number of transistors on a single chip has increased over the years in accordance with Moore's law, and the computing capability of a chip has increased at the same rate. The widespread adoption of hardware accelerators has further increased computational throughput. At the system level, however, this increase in computing capability must be matched by the interconnect bandwidth needed to read from and write to these devices with low latency. Today, chip-to-chip wireline communication is the main bottleneck of the scaled system and prevents us from achieving higher communication bandwidth. On top of that, as process technology has scaled down, the power density of chips has also increased. As we try to achieve higher data rates, the power consumed in wireline data transmission has become an important factor. The energy efficiency of wireline transmission must be increased to stay within the package power budget, even while working at a much faster data rate.

The motivation of this thesis is to enhance the information efficiency in wireline signaling to obtain high performance (bandwidth and power) for the communication system. The approaches adopted in this thesis to obtain the desired performance are mostly digital. Before introducing these techniques in full detail, a brief introduction of high-speed wireline communication system architecture is given for background.

#### **1.2. High-Speed Wireline Communication System**

Figure 1.2 shows a general mixed-signal wireline transceiver. A serializer (SER) serializes the parallel data it receives using a high-speed clock generated by the Phase-Locked Loop (PLL) and feeds it to the TX driver. The driver transmits the data through the channel. Both the transmitter and the receiver use proper termination to avoid reflections at microwave frequencies. Frequency-dependent loss in the channel translates to ISI (Inter Symbol Interference), which, together with noise, corrupts the received signal. To equalize the distorted data at the receiving end, a CTLE (Continuous Time Linear Equalizer) is often used; it essentially inverts the channel response and is implemented as a peaking filter [57]. The receiver (RX) recovers the data by sampling the received signal at the peak of the eye opening, where the signal-to-noise ratio (SNR) is at a maximum. Once digitized, the data is deserialized to parallel paths.

Figure 1.2: A General Mixed-Signal Wireline Transceiver

For optimal sampling, the receiver needs to know the correct sampling phase and frequency. This can be done in two ways. The first is to use a CDR (Clock Data Recovery) circuit [58], which detects the transmitter operating frequency from the received samples, generates a clock at that frequency, and feeds it to the RX. The other approach is to forward the transmitter clock over another channel and use that forwarded clock in the RX [59] to detect the data. Although the latter method requires another channel, it relieves us from the complexities of a CDR circuit. This kind of mixed-signal transceiver is usually good enough for short and medium-length channels, where the amount of ISI and noise is comparatively small and the CTLE is enough in most cases to equalize the received samples. For longer channels, the loss is higher, and the residual ISI significantly degrades SNR. In such cases, ADC-based receivers become more practical, as they enable strong digital equalization using the Digital Signal Processing (DSP) block. Figure 1.3 shows a general wireline transceiver with an ADC-based receiver.

Figure 1.3: General Wireline Transceiver with ADC-based Receiver

The evolving landscape of wireline signaling has reached an interesting point after decades of evolution. In the first decade (1990), data rates increased from 1 Gb/s to 28 Gb/s consistently on a relatively predictable path by adopting equalization and making it more affordable through CMOS scaling [44]. However, more sophisticated equalization, such as a Maximum Likelihood Sequence Detector (MLSD), still remains unaffordable for SerDes applications. In recent years, more disruptive approaches were introduced for 56 Gb/s and higher data rates. Multilevel signaling is adopted for higher spectral efficiency, but its SNR penalty mandates Forward Error Correction (FEC) to meet the BER requirement [51]. As the analog performance of transistors has saturated, ADC-based receivers are becoming widely adopted, with DSP-based equalization enabling longer reach. Meanwhile, custom links such as NVLink or Chord signaling are finding their way towards energy efficiency [47]. As quantum computers enter the commercial space through the IBM Q-System [48] and D-Wave, their interface requires more attention. Although CMOS operation at mK temperatures has been validated, the link energy efficiency requires significant improvement.

#### **1.3. Contribution of the Thesis**

The main contributions of the thesis are the digital enhancements shown in Figure 1.4. To optimize energy efficiency and reduce latency, GPU-GPU custom links focus on the aggregate bandwidth of the interface rather than increasing individual link BW. To achieve the projected BW increase (2x in 4 years), we either need to compensate for more loss or improve the spectral efficiency. Unfortunately, both approaches degrade energy efficiency. Intuitively, compensating for high-frequency channel loss requires more gain at high frequency, and while amplifying the high-frequency signal, noise is also amplified. In other words, linear equalizers correct the ISI at the cost of random and crosstalk noise. An alternate approach is therefore to use decision feedback equalization (DFE), which does not amplify the random noise. But DFE can only correct post-cursor ISI; pre-cursor ISI must still be compensated by a linear equalizer. One limitation of the DFE is error propagation, which usually increases with lower SNR. To combat SNR degradation, FEC is adopted to improve the raw BER from 10<sup>-6</sup> to 10<sup>-12</sup> at the cost of significant latency, complexity and overhead [51].

Figure 1.4: Contribution of the Thesis

Since DFE equalization is based on prior decisions, any decision error directly impacts equalization performance. In simple words, wrong 'decisions' add ISI instead of correcting ISI, and that can cause additional decision errors. This is known as 'error propagation' and is one of the main challenges of using DFE. This work focuses on detecting and correcting the DFE error propagation. Prior efforts to mitigate DFE error have been based on forward error correction (FEC), which also causes overhead and latency. Instead, in this work we detect and correct the error based on the constructive use of ISI. To the best of our knowledge, this is the first reported DFE error detection and correction as part of the data detection process that does not use FEC and achieves much lower latency and complexity.

The concept, architecture and measurement results of this work have been submitted for publication in *IEEE Open Journal of Circuits and Systems*, and the manuscript is currently under review. The manuscript was invited to this journal: "Low Latency Burst Error Detection and Correction in Decision Feedback Equalization," Shovon Dey, Aurangozeb, Masum Hossain, IEEE Open Journal of Circuits and Systems, 2020. I was responsible for the analysis of the correction algorithm, designing the digital error detection and correction unit, and writing the manuscript. Dr. Masum Hossain was the supervisor of the project; he carried out the data collection and comparison generation and was actively involved in manuscript writing.

Nowadays, the need for multi-Gbps wireline transceivers is greater than ever. While working in the multi-Gbps range, the power and area consumption of the TX driver must be reasonable. Providing more channels, and thus lowering the per-channel data rate, can be one way to achieve a higher aggregate data rate. But the power and area consumed increase in proportion to the number of channels. Designing a single channel with a higher data rate is often more cost-effective. With a higher data rate, however, the driver power becomes an issue, as the driver needs to drive more samples per second. To address this concern, the second contribution of the thesis implements transition encoded signaling in a wireline communication system. The transition scheme takes advantage of sparse data, so that the energy consumption falls as the transition density falls. A manuscript on this link architecture is now being prepared and will be submitted to a journal once it is completed; the journal has not been decided yet. I was responsible for designing the link architecture and for data collection, and I am currently participating in manuscript writing. Dr. Masum Hossain was the supervisor of this project; he provided guidance while the architecture was designed and is currently involved in manuscript writing.

For ADC-based receivers, the ADC must work fast enough to keep up with the data rate. While working fast, the power consumption of the ADC should also stay in a reasonable range. Successive Approximation Register (SAR) ADCs have shown better performance when power is considered. However, the conversion time a SAR ADC needs to convert a sample is often quite long. Thus, in an ADC-based receiver, there is a trade-off between power and speed, which is a problem when targeting higher wireline communication bandwidth. To address these concerns, the third contribution of this dissertation introduces a concurrent binary search and a novel digital interpolation technique. To the best of our knowledge, this is the first time that the digital interpolation technique has been used in Analog-to-Digital Converters. This ADC architecture was reviewed for publication in *IEEE Solid-State Circuits Letters* and a minor revision was recommended. The revision has been submitted and is now being considered for publication: "A 4GS/s Digitally interpolated 8-bit Concurrent Binary Search ADC," Shovon Dey, Aurangozeb, Rajiv Shukla, Masum Hossain, IEEE Solid-State Circuits Letters. I was responsible for designing the ADC architecture and for assisting in data collection and manuscript writing. Aurangozeb assisted me in designing the ADC architecture and played the lead role in data collection. Dr. Masum Hossain was the supervisor of the project, and most of the manuscript writing was done by him.

#### **1.4. Organization of the Thesis**

This thesis includes three works focused on enhancing information efficiency in wireline systems. The first work is on error detection and correction for Decision Feedback Equalization. The second one is on designing an information-efficient SerDes (Serializer Deserializer). The last work focuses on digitally enhancing the resolution of a SAR (Successive Approximation Register) ADC.

Chapter 2 describes a low-latency burst error detection and correction technique for a Decision Feedback Equalizer. Implemented in 65-nm CMOS technology, this architecture uses the uncorrected pre-cursor ISI present in the Decision Feedback Equalizer to detect error bursts and corrects them by tracing back. This error detection and correction technique is implemented digitally, which does not add any latency to the encoder and decoder. The implemented prototype in 65-nm technology operates at 16 Gbps and can compensate for 32 dB of channel loss. This chapter describes how to detect bit sequences and equalize them using a digital architecture.

Chapter 3 describes a 16-Gbps low-power transition-encoded wireline transceiver in 65-nm technology. The power consumption of traditional transceivers usually does not depend on the data: they consume an equal amount of power regardless of what is being transmitted. This chapter explains a new transition encoding technique that makes the power consumption scale with the number of transitions present in the bitstream, so that sparse data costs less energy to transmit. A complete transceiver, including the transition encoded transmitter and receiver, is described in full detail in this chapter. The implemented prototype in 65-nm technology achieved a 16-Gbps data rate with a BER of less than 10<sup>-12</sup>.
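To make the link between sparsity and transition count concrete, the sketch below uses a generic NRZI-style rule: a 1 toggles the line level and a 0 holds it. This rule is an assumption made for illustration; the encoder actually used in Chapter 3 may differ, and the function names and bit patterns are hypothetical.

```python
def transition_encode(bits, initial_level=0):
    """NRZI-style sketch (assumed rule): a 1 toggles the line level, a 0 holds it."""
    level = initial_level
    levels = []
    for b in bits:
        if b:
            level ^= 1
        levels.append(level)
    return levels


def transition_decode(levels, initial_level=0):
    """Recover the bits by comparing each line level with the previous one."""
    prev = initial_level
    bits = []
    for lv in levels:
        bits.append(lv ^ prev)
        prev = lv
    return bits


def count_transitions(levels, initial_level=0):
    """Count line-level changes, a rough proxy for dynamic driver energy."""
    prev = initial_level
    total = 0
    for lv in levels:
        if lv != prev:
            total += 1
        prev = lv
    return total


# Hypothetical bit patterns: one with 90% zeros, one dense.
sparse_bits = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
dense_bits = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]

sparse_transitions = count_transitions(transition_encode(sparse_bits))
dense_transitions = count_transitions(transition_encode(dense_bits))
```

Under this assumed rule the number of line transitions equals the number of 1s in the data, so a sparse stream toggles the driver far less often than a dense one.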

Chapter 4 describes a 4-way time-interleaved binary search ADC architecture with a novel digital interpolation technique in 65-nm CMOS technology. Successive Approximation Register (SAR) ADCs are well known for their low power requirement, but their main drawback is the conversion time needed to digitize a sample. Flash ADCs are the fastest, but their power and area grow exponentially with resolution, which makes them unrealistic for high-resolution designs. A concurrent binary search ADC, designed along with a novel digital interpolation technique, is described in full detail in this chapter. The implemented prototype in 65-nm CMOS technology achieved 4-GS/s speed with an 8-bit resolution.
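The conversion-time drawback of a SAR ADC can be seen in a minimal behavioral sketch of a conventional binary search, in which an N-bit conversion needs N sequential comparisons. This models only the textbook algorithm, not the concurrent binary search or the interpolation proposed in Chapter 4; the ideal comparator, the 1.0 V reference, and the function name are assumptions.

```python
def sar_convert(vin, vref=1.0, n_bits=8):
    """Conventional SAR binary search: one comparison per bit, MSB first."""
    code = 0
    for bit in range(n_bits - 1, -1, -1):
        trial = code | (1 << bit)                  # tentatively set this bit
        threshold = trial * vref / (1 << n_bits)   # ideal DAC output for the trial code
        if vin >= threshold:                       # ideal, noiseless comparator
            code = trial                           # keep the bit
    return code


# Hypothetical input voltages, chosen only for illustration.
mid_scale_code = sar_convert(0.5)
low_code = sar_convert(0.3)
```

Each loop iteration stands for one comparator decision followed by a DAC settling step, so the eight iterations of an 8-bit conversion set a lower bound on the conversion time per sample.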

Chapter 5 summarizes the contributions of this thesis towards achieving more information efficiency in wireline communication and proposes ideas for future work.

# Chapter 2.

# Low Latency Burst Error Detection and Correction in Decision Feedback Equalization

#### **2.1. Introduction**

Frequency-dependent channel loss has been a major challenge to achieve high data rates. Despite using low-loss dielectric, we often need to compensate for 30+ dB loss at 10+ Gb/s for longer reach channels. Therefore, equalization has become a critical component of SerDes. There are two possible equalization options: linear equalization in the form of feedforward equalization (FFE), CTLE, and non-linear equalization in the form of decision feedback equalization (DFE). The linear equalizer is often preferred because of its simplicity but its performance is often limited due to cross talk and noise amplification while compensating for high-frequency losses. Compared to that, DFE corrects ISI without amplifying noise and therefore, more effective for long-reach channels with higher loss. However, DFE can correct only post cursor ISI. The link needs to equalize pre-cursor ISI through the transmitter side or receiver side FFE. Given that the maximum transmitter side voltage swing is limited by the supply, a high-frequency boost is provided by reducing the low-frequency swing. In other words, equalization comes at the cost of a SNR penalty. This reduction in SNR causes a significant increase in the 'error probability'. Since DFE equalization is based on the prior decisions, any decision errors directly impact equalization performance. In simple words, wrong 'decisions' add ISI instead of correcting ISI and that problem



(a)



(b)

Figure 2.1. Two-tap decision feedback equalizer implementation (a) direct feedback structure and (b) loop-unrolled or speculative structure

can cause additional decision errors. This is known as 'error propagation' and is one of the main challenges of using DFE. This work focuses on detecting and correcting DFE error propagation. Prior efforts to mitigate DFE errors were based on forward error correction (FEC), which adds overhead and latency. Instead, in this work, we detect and correct the error based on



Figure 2.2: Error propagation mechanism in loop unrolled DFE. Four comparator references are shown with blue lines. Selected comparators during each bit decision are highlighted using red arrows. Detected bits and error bits are marked in red.

the constructive use of ISI. To the best of our knowledge, this is the first reported DFE error detection and correction as part of the data detection process that does not use FEC and achieves much lower latency and complexity. In Section 2.2 we will review the error propagation mechanism in DFE. In addition to the intuitive explanation, we will provide a Markov model that also provides insight into potential error detection and correction. In Section 2.3 we will first

describe the challenge associated with pre-cursor ISI equalization. We will introduce the architecture and circuit implementation of the error detection and correction technique in Section 2.4. The implemented prototype in 65-nm CMOS and the experimental validation of the concepts will be presented in Section 2.5, followed by a conclusion and a comparison to state-of-the-art solutions in Section 2.6.

#### **2.2. DFE Error Propagation**


Conceptual and practical implementations of DFE are shown in Figure 2.1. For simplicity, we start with the full-rate 2-tap direct feedback implementation shown in Figure 2.1(a). In general, the  $n^{th}$  received signal, y[n], at the receiver input can be written as:

$$y[n] = \sum_{m=-p}^{k} h[m]x[n-m]$$
 2.1

Here, h[m] is the symbol-spaced channel response and x[n-m] are the transmitted symbols that contribute to the  $n^{th}$  sample. Note that we are considering p pre-cursors and k post-cursors. The direct feedback structure uses the previous decisions to subtract the post-cursor ISI at the summing node.
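As a minimal sketch of Equation 2.1, the sum can be evaluated directly for a short symbol stream. The tap values and the helper name `received_sample` below are hypothetical and chosen only for illustration; symbols outside the stream are assumed to be zero.

```python
def received_sample(h, x, n):
    """Evaluate y[n] = sum_m h[m] * x[n - m] for a symbol-spaced response.

    h maps the tap index m (negative = pre-cursor) to its weight, and x is
    the list of transmitted symbols. Symbols outside the list count as 0.
    """
    total = 0.0
    for m, tap in h.items():
        idx = n - m
        if 0 <= idx < len(x):
            total += tap * x[idx]
    return total


# Hypothetical taps: one pre-cursor, the main cursor, and two post-cursors.
h = {-1: 0.1, 0: 0.9, 1: 0.4, 2: 0.2}
x = [+1, -1, +1, +1, -1]
y = [received_sample(h, x, n) for n in range(len(x))]
```

For n = 2 the sample combines the main cursor with one future symbol (pre-cursor) and two past symbols (post-cursors).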

$$S[n] = y[n] - \underbrace{I[n]}_{\text{DFE correction}} + \underbrace{N[n]}_{\text{noise}}$$

$$S[n] = h[0]x[n] + \underbrace{\sum_{m=1}^{k} (x[n-m] - D[n-m])h[m]}_{\text{ISI cancellation}} + \underbrace{\sum_{m=-p}^{-1} x[n-m]h[m]}_{\text{Uncorrected ISI}} + \underbrace{N[n]}_{\text{noise}}$$
2.2

Here, D[n] are the decisions made after ISI subtraction. Therefore, if the prior decisions are correct, i.e., D[n-m] = x[n-m], the post-cursor ISI is perfectly canceled. But an erroneous previous decision doubles the ISI seen by the sampler. For the simple binary (+1, -1) case with a 2-tap DFE, the slicer margin reduction due to successive wrong decisions is:

$$S[n] = h[0] - \underbrace{h[1] - h[2]}_{\text{DFE Error}} - \underbrace{h[-1]}_{\text{uncorrected pre}} + \underbrace{N[n]}_{noise}$$
2.3
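The effect of wrong prior decisions on the summing node can be illustrated with a small noise-free sketch that follows the structure of Equation 2.2, with the main cursor, the post-cursor residual, and the pre-cursor term summed separately. The tap values and the function name `summing_node` are assumptions made for this example only.

```python
def summing_node(h, x, d, n, noise=0.0):
    """Summing-node value S[n] after DFE feedback based on decisions d.

    h: dict of tap index -> weight (negative index = pre-cursor).
    x: transmitted symbols (+1/-1); d: receiver decisions for those symbols.
    """
    main = h[0] * x[n]
    # Residual post-cursor ISI: zero when d[n-m] == x[n-m], doubled otherwise.
    post = sum((x[n - m] - d[n - m]) * h[m] for m in h if m > 0 and n - m >= 0)
    # Pre-cursor ISI is not touched by the DFE.
    pre = sum(x[n - m] * h[m] for m in h if m < 0 and n - m < len(x))
    return main + post + pre + noise


h = {-1: 0.1, 0: 0.9, 1: 0.4, 2: 0.2}   # hypothetical taps
x = [-1, -1, +1, +1]                     # transmitted symbols
s_correct = summing_node(h, x, d=list(x), n=2)        # both prior decisions right
s_wrong = summing_node(h, x, d=[+1, +1, +1, +1], n=2)  # both prior decisions wrong
```

With these assumed taps, the two wrong decisions push the summing node for a transmitted +1 below zero, so the next decision would also be wrong, which is the error propagation discussed in this section.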



Figure 2.3: Markov model of error propagation in DFE, with the traceback path and corrected sequence detection based on the relative conditional probability distribution function

As expected, with wrong decisions the margin degrades severely. In particular, if the tap magnitudes are higher, the probability of an error burst increases exponentially [1]. In practice, the direct feedback structure creates timing constraints: the decision and subtraction must happen within 1 UI, which is hard to meet at 10+ Gb/s. Therefore, the loop-unrolled version shown in Figure 2.1(b) is more pragmatic [2]. Here, we pre-compute the current decision assuming each possible value of the previous decisions, and then select only one of these results. Although the DFE error propagation mechanism remains the same, it manifests itself in comparator selection, as shown in Figure 2.2. Here, we plot the signal at the slicer input for a 4-tap (1 pre, main, and 2 post cursors) channel. Note that, to account for uncorrected post cursors, the slicer threshold levels are shifted. From these four comparators (C0 to C3) we select only one based on the previous 2-bit decisions. Therefore, once a wrong decision is made, it results in a wrong comparator selection and consecutive wrong decisions. Eventually, the comparators recover from the 'error burst' when the received sample is such that the decisions are the same regardless of the comparator selection.
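The comparator selection described above can be sketched in a few lines. This is a simplified behavioral model under stated assumptions, not the implemented circuit: it ignores the pre-cursor and noise, uses hypothetical tap values, and the function name `loop_unrolled_dfe` is introduced here for illustration.

```python
from itertools import product


def loop_unrolled_dfe(samples, h1, h2, init=(+1, +1)):
    """Speculative 2-tap DFE: four comparators, one selected per bit.

    Each comparator threshold assumes one combination of the two previous
    decisions (d1 = D[n-1], d2 = D[n-2]); the actual history picks the output.
    """
    thresholds = {(d1, d2): d1 * h1 + d2 * h2
                  for d1, d2 in product((+1, -1), repeat=2)}
    d1, d2 = init
    decisions = []
    for s in samples:
        # All four comparisons are evaluated speculatively ...
        speculative = {hist: (+1 if s > th else -1)
                       for hist, th in thresholds.items()}
        # ... and the two previous decisions select one of them.
        d = speculative[(d1, d2)]
        decisions.append(d)
        d1, d2 = d, d1
    return decisions


h0, h1, h2 = 0.9, 0.4, 0.2            # hypothetical main and post-cursor taps
x = [+1, -1, -1, +1, +1, -1, +1]      # transmitted bits
past = [+1, +1] + x                   # two assumed symbols before the stream
samples = [h0 * past[i + 2] + h1 * past[i + 1] + h2 * past[i]
           for i in range(len(x))]
decoded = loop_unrolled_dfe(samples, h1, h2)
```

Because the selection depends on the decision history, a single wrong entry in that history selects the wrong comparator for the following bits, which is how the burst in Figure 2.2 arises.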

The DFE error propagation described in Figure 2.2 can also be captured in a mathematical model for further insight. For that, we adopt a Markov model-based approach to represent the probability of error, similar to the one described in [3]. For simplicity, we consider a 3-tap channel [0.9 0.4 0.2] with NRZ signaling {+1, -1}. At the  $k^{th}$  sample, if the DFE detected bit is d(k) and the transmitted bit is x(k), then the amount of error caused by the DFE at the  $k^{th}$  sample is expressed as  $D(k)$:

$$D(k) = d(k) - x(k)$$
2.4

The states of the Markov DFE error propagation model consist of two values, the errors in the two previous decisions  $\langle D_{k-1} D_{k-2} \rangle$ . Since each error can take one of three values (-2, 0, or +2), there are a total of 9 states, which are marked in Figure 2.3.
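The state space can be enumerated with a short sketch. The ordering produced here is arbitrary and is not meant to match the state numbering used in Figure 2.3; the names `ERROR_VALUES` and `error_value` are introduced only for this example.

```python
from itertools import product

# D(k) = d(k) - x(k) for symbols in {+1, -1} can only be -2, 0, or +2.
ERROR_VALUES = (-2, 0, +2)

# Each state is the pair <D(k-1), D(k-2)>.
states = list(product(ERROR_VALUES, repeat=2))


def error_value(detected, transmitted):
    """Error caused by the DFE for one symbol, as in Equation 2.4."""
    return detected - transmitted
```

The error-free state is `(0, 0)`; every other state has at least one erroneous decision in the two-symbol history.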

Figure 2.3 also shows how an error that occurs at the  $k^{th}$  sample can potentially propagate. The state transition probabilities are calculated using the Gaussian Q-function. For example, the probability of going from state 8 to state 6 is much less than the probability of going from state 8 to state 9, as depicted in the pdfs,  $P_{8,6} = \frac{1}{2}Q(\frac{-h(0)-2h(1)}{\sigma}) << P_{8,9} = \frac{1}{2}Q(\frac{-h(0)+2h(1)}{\sigma})$  for this given channel. All other state transition probabilities were calculated similarly. Here,  $\sigma$  is the standard deviation of the additive white Gaussian noise (AWGN) used to model random circuit noise and crosstalk noise. The equations for the state transition probabilities are given below.

$$P_{8,6} = \frac{1}{2} Q(\frac{-h(0) - 2h(1)}{\sigma}) << P_{8,9} = \frac{1}{2} Q(\frac{-h(0) + 2h(1)}{\sigma})$$
 2.5

$$P_{9,2} = \frac{1}{2} Q\left(\frac{-h(0) + 2h(1) - 2h(2)}{\sigma}\right) >> P_{9,5} = \frac{1}{2} Q\left(\frac{-h(0) + 2h(1) + 2h(2)}{\sigma}\right)$$
2.6

$$P_{3,2} = \frac{1}{2} Q(\frac{h(0) - 2h(1)}{\sigma}) >> P_{3,5} = \frac{1}{2} Q(\frac{h(0) + 2h(1)}{\sigma})$$
 2.7

$$P_{2,9} = \frac{1}{2} Q(\frac{h(0) + 2h(1) + 2h(2)}{\sigma}) >> P_{2,6} = \frac{1}{2} Q(\frac{h(0) + 2h(1) - 2h(2)}{\sigma})$$
 2.8

As shown in the example in Figure 2.2, the burst error often takes the form "...+2,-2,+2...", which implies that the probability of an error burst increases when the transmitted bits are negatively correlated, that is, when they are likely to take the form "...+1,-1,+1...". This DFE error model can be further extended to incorporate the pre-cursor. For the sake of brevity, here we only summarize its impact. The probability of an error increases in proportion to the amplitude of the pre-cursor, as the DFE cannot correct it. If the pre-cursor is equalized from the transmitter side, the resulting amplitude reduction of the main cursor also increases the probability of error.

In this work, we further extend this model to formulate an approach for error detection and correction. This approach is based on the observation explained above: error propagation happens where the correlation between consecutive symbols is negative. For a given channel, such regions can be easily defined based on the channel tap coefficients. For example, in a 4-



Figure 2.4. SNR penalty in pre-cursor equalization using a peak-power-constrained 8-tap transmit FIR (3 pre-taps, main, and 4 post-taps)

tap channel, samples between the +h(0)+h(-1)-h(1)-h(2) and -h(0)-h(-1)+h(1)+h(2) boundaries are negatively correlated and are prone to error propagation. But this information can be advantageous too: if the next symbol is known to be correct, we can use this negative correlation to both detect and correct the error. This constructive use of negative correlation has previously provided opportunistic error correction in [5]. Here we extend it to correct the complete burst, which results in a much larger improvement in BER. The correction mechanism is described in Figure 2.3. We start with the last symbol, where the error burst ends or, in other words, where the 1<sup>st</sup> error-free detection is made. In this example, that would be the  $(k+4)^{th}$  sample. Given that this symbol is correctly detected, we compare two conditional probabilities,  $\Pr(k+3)\big|_{(k+4)=+1} = +1$  and  $\Pr(k+3)\big|_{(k+4)=+1} = -1$ . These two probabilities are shown by the shaded blue and red regions in the PDFs. The relative difference



between these two probabilities is then used to correct the error at the  $(k+3)^{th}$  sample. This approach is continued until the 1<sup>st</sup> symbol in the error burst is reached, and this backpropagation is known as traceback. While this technique looks attractive from a theoretical standpoint as an effective way to correct burst errors, several implementation challenges must be addressed to make it feasible. First, we need to determine the beginning and end of the burst by detecting the symbols that are error-free. Second, computing and storing the pdfs can be complicated, if not infeasible. In this work, we use high confidence decisions to define the start and end of the error burst and to initiate traceback. In addition, instead of computing and comparing pdfs, we use a sequence-based approach that can be implemented with much lower power and complexity. In Section 2.4 we will discuss the hardware implementation of the error correction technique described above.

Figure 2.5. Source degeneration equalizer and its frequency and time domain performance that demonstrates under-equalized pre-cursor and over-equalized post-cursor in time domain single bit response and resultant eye diagram.

#### **2.3. Precursor Equalization Challenges**

Equalization of the pre-cursor ISI is one of the major challenges in high-speed SerDes. There are three main approaches to pre-cursor equalization in the existing literature. The first and most common approach is to use a finite impulse response (FIR) filter at the transmitter side to reduce the pre-cursor ISI. But due to the maximum swing constraint (also known as the peak-power constraint) in the transmitter, the pulse amplitude is reduced by the pre- and post-cursor tap weights. In other words, the Tx FIR improves the signal-to-pre-cursor-ISI ratio at the cost of the signal-to-noise ratio, and as a result the voltage margin does not improve significantly (Figure 2.4). The second approach is to enable pre-cursor equalization at the receiver side. Unlike the Tx FIR, receiver equalization is not peak-power constrained; therefore, the pre-cursor can be compensated without sacrificing SNR. Since no backchannel is needed, adaptation is significantly less complicated. Finally, shifting the equalization to the receiver side avoids jitter amplification [6]. The continuous-time linear equalizer (CTLE) is often considered to be a low-power alternative to the Tx FIR, as described in [7]. Intuitively, they both provide a high-frequency boost with a similar

magnitude response. Therefore, they are widely assumed to be equivalent. Although CTLE



Figure 2.6: Proposed sequence DFE with error detection and correction technique. The lines on the transient response are associated with the 4-bit sequences. The decoding process is described sequentially as likely sequence generation and selection based on previous decisions. The figure also shows the error detection and correction based on the next bit (B+1) and finally the initial DFE decisions along with corrected bits.

structures have evolved over the years, fundamentally they are all based on peaked amplifiers that provide relatively higher gain as frequency increases. The most popular choice is source degeneration (Figure 2.5), where the differential amplifier's effective transconductance is a function of frequency, which translates to one zero and two poles. For low-power applications, passive equalizers are often preferred. At low frequencies, the signal is attenuated by the resistor divider, whereas at high frequencies the capacitor bypasses the resistor and the entire signal appears at the receiver input. Similar to source degeneration, the passive equalizer also introduces a zero and two poles in the transfer function. Note that, in both cases, the zeros are in the left half-plane. The consequence of this limitation in the CTLE implementation is not evident in the magnitude response. The transient response reveals that, although we are able to flatten the magnitude response, the pre-cursor remains under-equalized and the post-cursor is over-equalized. Note that increasing the boost will only further increase the undershoot. From the SNR point of view, undershoot is sub-optimal since we are enhancing the proportional noise without decreasing ISI. To summarize, CTLE is not an effective way to equalize pre-cursor ISI. An alternative is to adjust the sampling point of the data eye: rather than sampling at the peak of the single-bit response (SBR), the sampling point is shifted slightly to the left so that both the main and pre-cursor values are reduced [8]. However, the SBR is relatively flat at the peak compared to the rising edge. As a result, the loss in the main cursor is smaller than the reduction in the pre-cursor. Therefore, similar to the Tx FIR, this technique also degrades the SNR while improving the signal-to-pre-cursor ratio. But the tradeoff is graceful and often results in an equal or better voltage margin compared to the Tx FIR [8].
The two main limitations are (a) the technique is effective only for small pre-cursors, where the slope difference between the main cursor and the pre-cursor is significant, and (b) adjusting the sampling point causes the post-cursor ISI to increase. Therefore, the sampler must be followed by decision feedback equalization (DFE) to equalize the enhanced post-cursor ISI. In fact, without DFE this method is ineffective, and even with DFE the higher tap values increase the probability of error propagation.
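The one-zero, two-pole CTLE response mentioned in this section can be sketched numerically. The corner frequencies and DC gain below are hypothetical and do not come from the implemented design; the sketch only shows the high-frequency boost in the magnitude response and does not capture the time-domain pre-cursor and post-cursor asymmetry discussed above.

```python
import math


def ctle_gain_db(f_hz, dc_gain, f_zero, f_pole1, f_pole2):
    """Magnitude in dB of H(s) = A * (1 + s/wz) / ((1 + s/wp1) * (1 + s/wp2))."""
    s = 1j * 2 * math.pi * f_hz
    wz, wp1, wp2 = (2 * math.pi * f for f in (f_zero, f_pole1, f_pole2))
    h = dc_gain * (1 + s / wz) / ((1 + s / wp1) * (1 + s / wp2))
    return 20 * math.log10(abs(h))


# Hypothetical corner frequencies for a 16 Gb/s link (8 GHz Nyquist).
params = dict(dc_gain=0.5, f_zero=1e9, f_pole1=8e9, f_pole2=12e9)
gain_dc = ctle_gain_db(0.0, **params)
gain_nyquist = ctle_gain_db(8e9, **params)
boost_db = gain_nyquist - gain_dc
```

Lowering `dc_gain` or `f_zero` in this model increases the boost, which corresponds to the trade-off between low-frequency swing and high-frequency gain described in the text.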

#### **2.4. Proposed DFE Error Detection and Correction**

The proposed error detection and correction method is based on an analog-to-sequence converter, where every received sample is converted to a time sequence of bits. The length of the sequence is set by the number of taps in the channel response. For a 4-tap (1 pre, main, and 2 post cursors) channel, each sample can be mapped to a 4-bit sequence,  $B_{+1}B_0 B_{-1}B_{-2}$  [9]. Here,  $B_0$  is the current bit,  $B_{+1}$  is the next bit, and  $B_{-1}$ ,  $B_{-2}$  are the two previous bits. Therefore, the receiver needs to select the most likely sequence from the 16 possible sequences. From a flash ADC point of view, N-bit detection requires  $2^{N}-1$  comparators. However, 15 comparators would consume significant power. Fortunately, the correlation between the samples can be utilized to reduce the number of comparators. For example, the n-th and (n-1)-th samples can be expressed as the following sequences:

$$Seq[n] = B_{n+1}B_nB_{n-1}B_{n-2}$$
$$Seq[n-1] = B_nB_{n-1}B_{n-2}B_{n-3} \quad 2.9$$

Note that there are 3 bits that overlap between these two sequences that can be utilized for both correct sequence detection and error detection and correction. First, based on the received sample we select 4 likely sequences from 16 possible sequences. This selection is done by comparing the received sample to 8 reference levels. These 8 reference levels are defined as follows:

$$V_{TH} = \pm h_{+1} \pm h_{+2} \pm h_{-1}$$
 2.10

These 8 comparator outputs are used to select 4 likely sequences that are differentiable based on the two previous decisions. Therefore, by correlating these 4 selected sequences with the previous decisions, we can select the 'most likely' sequence. This selection process is very similar to DFE and is similarly subject to error propagation. The same example case is shown in Figure 2.6 to explain the sequence detection in the event of a single error. A single error, in this case, translates to the erroneous selection of sequences for consecutive UIs, until a correct decision is eventually made and the error burst ends.
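The eight reference levels of Equation 2.10 and the three-bit overlap of Equation 2.9 can be sketched as follows. The argument names mirror the subscripts used in Equation 2.10, the tap values are hypothetical, and both helper functions are introduced here only for illustration.

```python
from itertools import product


def reference_levels(h_p1, h_p2, h_m1):
    """All eight levels +/-h_{+1} +/-h_{+2} +/-h_{-1}, sorted in ascending order."""
    return sorted(s1 * h_p1 + s2 * h_p2 + s3 * h_m1
                  for s1, s2, s3 in product((+1, -1), repeat=3))


def overlap_consistent(seq_n, seq_prev):
    """Check the 3-bit overlap between consecutive sequences.

    seq_n    = (B[n+1], B[n],   B[n-1], B[n-2])
    seq_prev = (B[n],   B[n-1], B[n-2], B[n-3])
    """
    return tuple(seq_n[1:]) == tuple(seq_prev[:3])


levels = reference_levels(h_p1=0.4, h_p2=0.2, h_m1=0.1)   # hypothetical taps
ok = overlap_consistent((+1, -1, +1, +1), (-1, +1, +1, -1))
bad = overlap_consistent((+1, -1, +1, +1), (+1, +1, +1, -1))
```

A failed overlap check between two detected sequences is one way to flag that at least one of them was selected in error.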

A major difficulty in DFE error propagation is the detection of the error burst. Although each decision has a different probability of error, the existing DFE does not recognize that. To resolve this, we associate a 'confidence' with each bit decision. For simplicity, there are two confidence levels, 'high confidence' and 'low confidence', which are determined by comparing the signal to two threshold levels. Signals exceeding the top threshold are likely to be '+1', meaning that the error probability is less than  $10^{-18}$ . Similarly, signals below the bottom threshold can be considered an error-free '-1'. Finally, signals that lie between these two thresholds are considered 'low confidence': these decisions can be right or wrong and should therefore be considered for error detection and correction. Given that the sequences contain both the future and current bits (B<sub>+1</sub>B<sub>0</sub>), we can use that information to detect errors as well as correct them. The probability density functions of B<sub>0</sub> are shown on the left side of Figure 2.6. Here, the red pdfs are the probabilities of B<sub>0</sub> being 1 and the blue pdfs are the probabilities of B<sub>0</sub> being -1. Conditional pdfs are also illustrated by solid and dashed lines. Solid lines show the probability of B<sub>0</sub> when B<sub>+1</sub> is 1 and dashed lines show the probability of B<sub>0</sub> when B<sub>+1</sub> is -1.

We also have two other thresholds, th1 and th2, which do not need any extra comparisons at the hardware level. In our implementation, the 8 comparisons are made in two cycles. If the signal is below the +1+1-1-1 level, it is also below the +1+1-1+1 level. So, instead of comparing the signal with the +1+1-1+1 level, we compare it with threshold th1. A similar explanation can be given for th2. A signal named 'Threshold' takes the value 1 when the received signal is above th1 or below th2.

Figure 2.7. Receiver architecture based on a two-tap sequence DFE followed by burst error correction using the Store-Compare-Select (SCS) unit
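The confidence classification and the 'Threshold' flag described above can be written as two small functions. This is only a behavioral sketch: the numeric threshold values are hypothetical, and the function names are introduced for this example.

```python
def classify_confidence(sample, top, bottom):
    """Return (confidence, bit); bit is None when the DFE decision is needed."""
    if sample > top:
        return ("high", +1)
    if sample < bottom:
        return ("high", -1)
    return ("low", None)


def threshold_flag(sample, th1, th2):
    """Return 1 when the sample lies above th1 or below th2, otherwise 0."""
    return 1 if (sample > th1 or sample < th2) else 0


# Hypothetical levels, chosen only to exercise the three regions.
TOP, BOTTOM = 1.0, -1.0
TH1, TH2 = 0.5, -0.5
results = [(classify_confidence(s, TOP, BOTTOM), threshold_flag(s, TH1, TH2))
           for s in (1.2, 0.7, 0.1, -0.8, -1.3)]
```

Samples in the middle region are low confidence with the flag cleared, which marks them as candidates for correction using the alternate sequence.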

Note that when the signal is outside the negative correlation region, we cannot correct errors using alternate bit sequences. For that case, we use the value of the Threshold signal. If it is one, we correct the present bit using the next or previous 'high confidence' decision. Such bits must be positively correlated with their neighboring 'high confidence' decision, as they are outside the negative correlation region. The thresholds th1 and th2 are placed such that they have an error probability of less than  $10^{-12}$ . For example, +1-1+1+1 is the sequence with the maximum constellation value for which the present bit is -1. The threshold th1 is placed at a value such that, even with noise, the probability of the +1-1+1+1 signal constellation crossing that value is less than  $10^{-12}$ . The error detection and correction process is better illustrated with the example in Figure 2.6. The bits between two high confidence decisions are a potential error burst, and therefore we check them for errors. Error detection starts at the end of the burst, where the next bit is high confidence. By comparing  $B_0(n+1)$  with  $B_{+1}(n)$  we can detect an error in the detection of  $B_0(n)$ . But to correct that error we need two 'most likely' sequences with opposite values of  $B_{+1}(n)$ . Therefore, after selecting the most likely sequence using DFE, we also generate an alternate sequence. In the next UI, when a high confidence decision is available, we use that decision to compare and select the correct one. A similar approach was developed previously in [5] for PAM-4 signaling, but it is limited to single error corrections only. A high confidence decision can only correct the immediately previous decision. Therefore, in a DFE error propagation scenario, the rest of the burst remains uncorrected. The proposed approach addresses this issue: after each high confidence decision is reached, the corrections are propagated backward until the previous high confidence decision is reached.
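The backward propagation described above can be sketched as a simple loop over the buffered burst. This is a behavioral illustration under several assumptions: each UI is represented by a primary and an alternate `(B_plus1, B_0)` pair with opposite `B_plus1` values, how the alternate pair is generated is not modeled, and the function name `traceback` and the example data are hypothetical.

```python
def traceback(primary, alternate, trusted_bit):
    """Walk backward through a low-confidence burst and select B_0 per UI.

    primary[i] and alternate[i] are (B_plus1, B_0) pairs for UI i with
    opposite B_plus1 values. trusted_bit is the high-confidence decision
    for the UI immediately after the burst.
    """
    corrected = [None] * len(primary)
    next_bit = trusted_bit
    for i in range(len(primary) - 1, -1, -1):
        # Keep whichever candidate predicted the (now known) next bit.
        chosen = primary[i] if primary[i][0] == next_bit else alternate[i]
        corrected[i] = chosen[1]
        next_bit = chosen[1]   # the selected bit becomes the new reference
    return corrected


# Hypothetical burst: transmitted bits are +1, -1, +1, followed by a trusted +1.
primary = [(+1, -1), (-1, +1), (-1, -1)]     # DFE output, wrong in every UI
alternate = [(-1, +1), (+1, -1), (+1, +1)]   # alternate sequence per UI
fixed = traceback(primary, alternate, trusted_bit=+1)
```

In this example the primary decisions differ from the transmitted bits by the alternating pattern -2, +2, -2, and the backward pass recovers all three bits from the single trusted decision that follows the burst.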

#### **2.5. Implementation and Measurement**



(a)

| Block            | Power consumption |
|------------------|-------------------|
| AFE              | 14 mW             |
| Sampler          | 3 mW              |
| Comparator       | 18 mW             |
| Divider          | 6 mW              |
| Pulse gen.       | 3 mW              |
| Clock dist.      | 6 mW              |
| Sequence gen.    | 3 mW              |
| Sequence sel.    | 5 mW              |
| Error correction | 16 mW             |
| Total            | 74 mW @ 16 Gb/s   |

(b)

Figure 2.8: Implemented prototype in 65-nm CMOS with power break down

[Simulated waveform, 651 ns to 657 ns. Signals: CB<sub>0</sub>(n) (output), CLK, B<sub>0</sub>(n) (input), Flag, Enable, load, B<sub>+1</sub>(n+1). Annotations: with Load = 1, CB<sub>0</sub>(n)(output) = B<sub>0</sub>(n)(input); with Load = 0, CB<sub>0</sub>(n)(output) = B<sub>0</sub>(n)(stored) and the previous output value is retained; with Enable = 1, CB<sub>0</sub>(n)(output) = $\overline{B_0(n)}$(stored), as B<sub>+1</sub>(n+1) xor B<sub>+1</sub>(n)(stored) = 1.]

Figure 2.9: Operation of a SCS unit



Figure 2.10: Synthesized Layout of the Digital Error Detection and Correction Unit

The implemented receiver architecture is shown in Figure 2.7. Given that the transmit swing is supply-limited, we prefer the bulk of the equalization to take place on the receiver side. The received signal first passes through a continuous-time linear equalizer (CTLE) to limit the channel's response to within four taps: one pre-cursor, main, and two post-cursor taps. Based on the tap values, an analog-to-sequence converter (ASC) converts each received sample to a 4-bit sequence,  $B_{+1}B_0B_{-1}B_{-2}$ . To reduce the hardware penalty, the ASC is implemented with only 4 comparators, similar to the conventional 2-tap loop-unrolled DFE. However, these 4 comparators are used for 8 comparisons. After the 1st comparison, their own outputs are used to update the capacitive DAC for one additional comparison, similar to a successive approximation ADC. To

accommodate this extra comparison, we adopt a quarter rate architecture that allows 3-UI of the evaluation period. Note that there are two additional comparators for high confidence decisions. In the event that high confidence decisions are available, the DFE and its corresponding logic functionality are disabled to save power.

#### A. Digital Burst Error Detection and Correction Unit

My contribution towards the implementation of this work was designing the digital burst error detection and correction unit. Burst error correction requires us to buffer the entire data length between two high confidence decisions. Although the worst-case burst length can be very long, in this implementation we have designed the hardware to handle a maximum of 32 UI. The error detector and corrector design is based on cascaded store-compare-select (SCS) units. There are a total of 32 SCS units in the designed prototype. Each SCS unit individually works at 500 MHz in a time-interleaved fashion, which aggregates to a 16 Gbps speed. At the beginning of the burst, a unit loads the data based on the 'load' signal. After that, load remains 'low' and the FFs hold the data by recirculating it until the next high confidence flag goes high, which indicates the end of the burst. An Enable signal is then generated, and the high flag value directly propagates the decision to the output and also triggers the comparison for the next unit. By comparing with the two possible sequences, the incorrect bit is corrected through the 'selection' signal, and the corrected bit propagates further to the next decision. To further illustrate the operation of an SCS unit, a simulated waveform is shown in Figure 2.9. At first, when the load signal is high, the input data is stored. Then, when the load signal is zero, the stored data is recirculated. At the end, when the Enable signal is high, a comparison with the next bit happens and the output is updated according to the result of the comparison. The synthesized layout of the Digital Error Detection and Correction Unit is shown in Figure 2.10. The unit occupies an active area of  $150\,\mu m \times 50\,\mu m$ .
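The load, hold, and compare-select phases of one SCS unit can be approximated with a small behavioral model. This is a sketch and not the synthesized RTL: the class name, the port names, and the exact compare rule (invert the stored bit when the stored next-bit prediction disagrees with the trusted next bit) are assumptions inferred from the description above. Bits are represented as 0/1.

```python
class SCSUnit:
    """Behavioral model of one store-compare-select unit (bits are 0/1)."""

    def __init__(self):
        self.b0 = 0        # stored current-bit decision
        self.b_plus1 = 0   # stored prediction of the next bit
        self.out = 0

    def clock(self, load, enable, b0_in=0, b_plus1_in=0, next_bit=0):
        if load:                       # store: capture the incoming decision
            self.b0, self.b_plus1 = b0_in, b_plus1_in
            self.out = self.b0
        elif enable:                   # compare and select
            mismatch = self.b_plus1 ^ next_bit
            self.out = self.b0 ^ mismatch   # invert the stored bit on mismatch
        else:                          # hold: recirculate the stored value
            self.out = self.b0
        return self.out


# Interleaving check: 32 units at 500 MHz each cover the 16 Gb/s stream.
NUM_UNITS = 32
UNIT_RATE_HZ = 500e6
aggregate_rate = NUM_UNITS * UNIT_RATE_HZ

unit = SCSUnit()
after_load = unit.clock(load=1, enable=0, b0_in=0, b_plus1_in=1)
after_hold = unit.clock(load=0, enable=0)
after_enable = unit.clock(load=0, enable=1, next_bit=0)
```

In this trace the unit stores a 0, holds it, and then outputs the inverted value once the trusted next bit contradicts the stored prediction.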

#### **B.** Implemented Prototype

The implemented 65-nm prototype is shown in Figure 2.8. The quarter-rate receiver architecture works at 16 Gb/s while consuming 58 mW, including clocking. With the error detection and correction unit, the receiver power increases to 74 mW. The implemented prototype can compensate for 34 dB of loss with a 0.15 UI timing margin and less than 5 pJ/bit power consumption (Figure 2.11). Although the detected 4-bit sequences are used for diagnostic purposes, as shown in Figure 2.11, the BER is measured using the current bit. The error correction capability was evaluated in two ways. First, a single error event can be created by injecting an error bit in the detection process, which we find to be a useful way to characterize burst error behavior. The second approach is to reduce the transmit signal amplitude to observe the performance of the receiver for different SNR values. We have explored both approaches in this work, and the results are provided in Figure 2.12 and Figure 2.13. Note that Reed-Solomon based forward error correction (FEC) is becoming more popular as a way to achieve an acceptably low BER in lossy channels. Therefore, we wanted to compare the performance of the proposed error correction technique with an existing FEC based on the RS (528, 514) code. This particular version of FEC is capable of correcting up to 7 errors in a 528-symbol word. As discussed in Section 2.4, the proposed error correction capability makes constructive use of pre-cursor ISI. Therefore, in a relatively low ISI channel where the pre-cursor ISI is 1/3<sup>rd</sup> or less of the main cursor, the gain is only 2 dB. Here, the SNR is sufficiently large to keep the error count within a word low. Although there is a visible difference in raw BER, in such cases the post-FEC BERs are identical (Figure 2.12): because the number of errors within a word is less than 7, the design can reduce the BER to better than 10<sup>-12</sup>, which is the limit of our measurement equipment.
But as we move to moderate and high loss channels, the benefit of error correction becomes more visible. As the pre-cursor magnitude reaches 50% of the



(a)



Figure 2.11. (a) BER bathtub curve of the receiver. (b) Eye diagram at the AFE output @ 16 Gb/s and after the sequence decoder @ 4 Gb/s.



Figure 2.12. Measured SBR at CTLE output and BER performance as a function of signal to noise ratio. Conv. and proposed raw BER are measured and these error sequences are post processed to generate post FEC performance with Reed Solomon encoding (RS 528, 514)

performance of FEC providing more than 3 dB gain. It is also interesting to note that as the precursor



(a)



Figure 2.13. Measured SBR at CTLE output and BER performance as a function of signal to noise ratio. Conv. and proposed raw BER are measured for moderate to high loss channels in (a) and (b) respectively. The measured raw error sequences are post processed to generate post FEC performance with Reed Solomon encoding (RS 528, 514).

exceeds 66% of the main, FEC is less effective. This is mainly because of burst errors, when the number of errors in a word exceeds the correction capability of the given FEC. This is where the proposed error correction becomes very effective. We can achieve 6 dB gain without FEC, and with FEC the gain exceeds 7.5 dB. Here, not only do we get the benefit of a reduced number of errors, we also get the benefit of spreading out the errors, which makes RS(528,514) much more effective at detecting and correcting errors.
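Two figures quoted in this section can be cross-checked with simple arithmetic: the correction capability of RS(528, 514) and the energy per bit implied by the measured power at 16 Gb/s. The helper names below are introduced only for this check.

```python
def rs_correctable_symbols(n, k):
    """Symbol errors a Reed-Solomon (n, k) code can correct: t = (n - k) / 2."""
    return (n - k) // 2


def energy_per_bit_pj(power_mw, rate_gbps):
    """Energy per bit in pJ; mW divided by Gb/s gives pJ/bit directly."""
    return power_mw / rate_gbps


t = rs_correctable_symbols(528, 514)
e_without_ec = energy_per_bit_pj(58, 16)   # receiver without error correction
e_with_ec = energy_per_bit_pj(74, 16)      # receiver with error correction
```

Both energy values stay below the 5 pJ/bit figure stated above, and the code corrects up to 7 symbol errors per word, which is why bursts longer than that defeat the FEC on its own.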

#### **2.6. Conclusion**

For high-loss channels, DFE is an effective way to equalize post-cursor ISI. However, equalizing the

|                                 | This work                        | [9]                                 | [2]                | [10]                                      |
|---------------------------------|----------------------------------|-------------------------------------|--------------------|-------------------------------------------|
| Technology                      | 65 nm                            | 90 nm                               | IBM 22 nm          | 45 nm SOI CMOS                            |
| Data Rate                       | 16 Gb/s                          | 16 Gb/s                             | 16 Gb/s            | 16 Gb/s                                   |
| Channel Loss                    | 33 dB at 8 GHz                   | 22 dB at 8 GHz                      | 27 dB at 5 GHz     | 40'', 32 dB at 8 GHz                      |
| Equalizer                       | 2-tap Tx FIR + CTLE + 2-tap DFE  | 1<sup>st</sup> tap FFE + 3-tap DFE  | CTLE + 8-tap DFE   | 3-tap FFE (Tx) + CTLE + 5-12 tap DFE      |
| DFE Burst Error Correction (EC) | Yes                              | No                                  | No                 | No                                        |
| Power Consumption               | 58 mW w/o EC, 74 mW w/ EC        | 69 mW                               | 59.7 mW            | 400 mW (Tx + Rx + CDR)                    |
| Area                            | 600 um x 600 um                  | 227 um x 276 um                     |                    | 850 um x 320 um                           |
| BER                             | < 10<sup>-12</sup>               | < 10<sup>-12</sup>                  | < 10<sup>-12</sup> | < 10<sup>-15</sup>                        |

Table 2.1: Performance Summary and Comparison

pre-cursor ISI often degrades the SNR, and at relatively low SNR the DFE causes error propagation. This work makes constructive use of the pre-cursor to detect and correct burst errors, enabling higher loss compensation. By addressing burst errors, the proposed solution makes room for simple binary BCH encoding instead of a higher-latency FEC code. The lower latency (less than 20 UI) of the built-in error corrector makes our design an attractive solution for different chip-to-chip links. Even with the error correction features, this solution achieves better energy efficiency than the state-of-the-art solutions, as shown in Table 2.1.

# Chapter 3.

# Low Power 16-Gbps High-speed Wireline Transceiver in 65-nm CMOS Technology

## 3.1 Background

In this section, a brief description of the current-mode (CML) driver and the voltage-mode driver is given. Figure 3.1 shows a CML driver with single-ended termination. When a switch is on, half of the current is drawn from the TX termination and the other half is drawn from the RX termination. So, only half of the current translates to the RX signal swing. Another important



Figure 3.1: CML Driver

thing to notice here is that the polarity of the current does not change at the RX side.

This limits the signaling efficiency of the CML driver. Figure 3.2 shows a conventional voltage-mode NRZ transmitter driver with series termination. Here, when a switch is on, all the current flows through the TX termination resistance to the RX termination resistance. Another advantage of this architecture over the CML architecture is that the current through the RX termination resistance changes polarity. So, this architecture becomes 4 times more efficient in terms of signaling efficiency. The following chart compares the different architectures.

| Driver/Termination    | Current Level           | RX Voltage Swing (peak-to-peak) |
|-----------------------|-------------------------|---------------------------------|
| Current-Mode/Parallel | (RX Voltage Swing)/R    | I*R                             |
| Current-Mode/Series   | (RX Voltage Swing)/R    | I*R                             |
| Voltage-Mode/Parallel | (RX Voltage Swing)/(2R) | $V_{Supply}$                    |
| Voltage-Mode/Series   | (RX Voltage Swing)/(4R) | $V_{Supply}$                    |

So, for the same RX voltage swing, the current drawn by a voltage-mode driver with series termination is 4 times lower than the current drawn by a current-mode driver with parallel termination.
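As a rough numerical illustration of the chart, the sketch below evaluates the "Current Level" column for each driver type. The 500 mV swing and the 50 Ω termination are assumed example values, and the dictionary and function names are hypothetical.

```python
# Divisor applied to R in the "Current Level" column of the chart:
# current = (RX voltage swing) / (divisor * R)
CURRENT_DIVISOR = {
    "Current-Mode/Parallel": 1,
    "Current-Mode/Series": 1,
    "Voltage-Mode/Parallel": 2,
    "Voltage-Mode/Series": 4,
}

def driver_current(rx_swing_v: float, r_ohm: float, driver: str) -> float:
    """Signaling current (A) needed for a given peak-to-peak RX swing."""
    return rx_swing_v / (CURRENT_DIVISOR[driver] * r_ohm)

if __name__ == "__main__":
    swing_v, r_ohm = 0.5, 50.0  # assumed example values, not from the thesis
    for name in CURRENT_DIVISOR:
        current_ma = 1e3 * driver_current(swing_v, r_ohm, name)
        print(f"{name:22s} {current_ma:5.2f} mA")
```

With these assumed values the current-mode drivers need 10 mA while the voltage-mode driver with series termination needs 2.5 mA, which is the factor of 4 stated above.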

### **3.2 Introduction**

Energy efficiency is one of the most important aspects of designing a transceiver. Nowadays, chip-to-chip serial links are working at multi-Gbps speeds. In this scenario, the power dissipation of



Figure 3.2: Conventional Voltage mode NRZ transmitter

these links has become more important than ever due to their high speed. Traditionally, high-speed transceivers are information agnostic: meaning regardless of the value of the digital bits, transmitting and receiving requires equal energy. In differential signaling, transmitting '1' or '0' simply reverses the direction of the current in the termination resistor but the current drawn from the supply remains the same. Given this scenario, the only plausible way to reduce power is to be more efficient in the use of signaling current. To that end, voltage mode transmitters are 4x more efficient compared to current mode transmitters by transmitting all the current from the supply to

the receiver [13-15]. Even with voltage-mode transmitters, however, we cannot reduce the power consumption of these transceivers to a significant extent. A conventional differential voltage-mode transmitter driver is shown in Figure 3.2. Based on the bits to be transmitted, the switches are turned on or off. For transmitting a 1, the switch D is turned on and the driver draws current from the supply, which flows through the load impedance. For transmitting a 0, the switch  $\overline{D}$  is turned on and the transmitter drives the current in the opposite direction. The important thing to notice is that the transmitter draws current constantly, irrespective of what it is transmitting. Therefore, the power dissipation remains constant for conventional voltage-mode transmitters. In this work, we use the relation between two consecutive bits to reduce power consumption, resulting in an energy-efficient transmitter that exploits the sparsity in the transmitted bits. The rest of the chapter is organized as follows. In Section 3.3, transition encoded signaling and the proposed transmitter are discussed. In Section 3.4, the proposed receiver and decoder architecture are explained. Section 3.5 contains the measurement results and a comparison of this architecture with similar works.

#### **3.3 Proposed Transmitter Architecture**

To increase the power efficiency of the transmitter, we employ transition encoding. In a data stream, the data may be sparse, meaning that consecutive bits frequently have the same value. To take advantage of sparsity, we map two consecutive bits, D(n)D(n-1), to three signal levels. Here, the '10' and '01' bit pairs are represented by the +VTX and -VTX signal levels. Note that both '11' and



Figure 3.3: Proposed transition encoded signaling transmitter

'00' translate to zero differential output by shorting the differential lines. During this time, both signal lines are isolated from supply and ground, so the driver does not consume any signaling current. However, the transmitter-side termination remains active to avoid additional reflection. There are several advantages to this transition-encoded signaling. First, we can reduce the signaling power in proportion to the information content. In a random data stream, we have 50% transition density, and therefore this signaling improves the efficiency by 2X. But when transmitting a synaptic weight matrix with 80% null entries, we can improve signaling energy efficiency by 5X. This benefit will further increase with techniques such as pruning. Second, the simple encoding technique shown in Figure 3.3 does not add any significant latency, but it improves the spectral efficiency of the modulation. Specifically, it sufficiently removes the DC and low-frequency content without increasing the bandwidth requirements. As a result, equalization becomes simpler and we can have an AC-coupled interface with less than 1 pF capacitance. Such a reduction in capacitance allows on-chip integration and thus removes the requirement for an on-board or on-package capacitor and the associated signal integrity challenges. Although this signaling improves the

information efficiency, it does not need to know the content of the data. Therefore, this signaling is compatible with conventional SerDes IP standards. A comparison between the proposed transmitter and a conventional voltage-mode transmitter is illustrated in Figure 3.4. It is evident from the simulation waveforms that the proposed method is much more power-efficient than the conventional one, as it only draws current from the supply when there is a transition in the input, whereas in the conventional transmitter the current drawn from the supply is almost constant. One important thing to consider is that all physical connections to the supply node form parasitic inductance. Because we drive the transmitter at a high data rate and the current drawn from the supply node changes with the data, the  $L \frac{di}{dt}$  voltage drop can



Figure 3.4: Simulation waveforms of conventional and proposed voltage mode transmitter

become a point of concern. The resulting supply voltage ripple can become significant and may cause instability in signal amplitude and frequency. To solve this problem, we have attached enough capacitance to the supply node, which provides extra charge when needed and reduces the ripple voltage of the supply node. In effect, this capacitor acts as a voltage regulator for that supply node. A reasonable value for the inductance of the physical connection to the supply node is around 200-300 pH. Figure 3.5 shows the ripple on the supply node as the capacitance attached to it increases. As we can see from the graph, the ripple voltage caused by the  $L \frac{di}{dt}$  voltage drop can be significantly reduced with the proper amount of capacitance. Enough decoupling capacitance was added to the supply on the chip, which reduced the need for off-chip capacitance. In Figure 3.4, random bits were used to observe the power efficiency of the proposed transition-encoded transmitter. The



Figure 3.5: VDD Ripple Voltage Vs. Supply node Capacitance

amount of current drawn from the supply is also included. The average power consumption for randomly generated transmitted bits is reduced by almost 50% compared to the conventional voltage-mode transmitter.
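The three-level mapping described in this section can be sketched behaviorally as follows. This is a minimal model, not the transmitter circuit: the levels +1, 0 and -1 stand for +VTX, zero differential output and -VTX, the initial previous bit is assumed to be 0, and the function names are hypothetical.

```python
def transition_encode(bits, prev=0):
    """Map each pair D(n)D(n-1) to a level: '10' -> +1, '01' -> -1, '11'/'00' -> 0."""
    levels = []
    for bit in bits:
        levels.append(bit - prev)  # non-zero only when the bit value changes
        prev = bit
    return levels

def signaling_activity(levels):
    """Fraction of symbols that draw signaling current (non-zero levels)."""
    return sum(1 for level in levels if level != 0) / len(levels)

bits = [0, 1, 1, 0, 0, 0, 1, 1]
levels = transition_encode(bits)
activity = signaling_activity(levels)
print(levels)
print(activity)
# Relative to a driver that draws current for every bit, the signaling
# current scales with the activity, so the gain is roughly 1 / activity.
print(1 / activity)
```

Under this model a stream with 50% transition density gives a gain of 2 and a stream with 20% transitions gives a gain of 5, matching the figures quoted in the text.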



Figure 3.6: Common gate amplifier and its small signal equivalent model

### **3.4 Receiver and Decoder Architecture**

Receiver side termination is important to avoid any signal reflection. This receiver provides active termination in the form of a common gate amplifier. The input impedance of the common gate amplifier is  $1/g_m$  which works as the termination impedance in this case. Here,  $g_m$  is the small-



Figure 3.7: Overall Receiver and Decoder Architecture

signal transconductance of the common gate amplifier. If we consider the frequency response of a common gate amplifier, we can see that the transfer function of the amplifier is

$$H(s) = \frac{g_m R_D}{\left(1 + \frac{C_S}{g_m}s\right)(1 + R_D C_D s)}$$
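As a short worked step, the low-frequency gain and the two pole frequencies can be read directly off the factored denominator of this transfer function. The symbols $\omega_{p1}$ and $\omega_{p2}$ are introduced here only for illustration:

$$H(0) = g_m R_D, \qquad \omega_{p1} = \frac{g_m}{C_S}, \qquad \omega_{p2} = \frac{1}{R_D C_D}$$

The input-side pole $\omega_{p1}$ is the one that scales with $g_m$, so a larger transconductance pushes it to a higher frequency.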

Figure 3.6 shows the small-signal equivalent circuit of a common gate amplifier. The dominant pole of this amplifier depends on  $g_m$ . As we are working with a very high data rate (16 Gbps), we need to boost the  $g_m$  at the higher frequencies. To do so, we have introduced feed-forward



Figure 3.8: Source-Degenerated Differential Pair

capacitors which relax the amount of transconductance required at the higher frequencies. Figure 3.7 shows the overall receiver and decoder architecture. After providing active termination, we need buffering and equalization before decoding the transmitted bits. The common gate amplifier

is followed by a source-degenerated differential amplifier stage that provides both equalization and buffering for the comparator bank. The source-degenerated differential amplifier works as a continuous-time linear equalizer (CTLE), as shown in Figure 3.8. This amplifier introduces a zero in the system. This zero depends on the values of the degeneration resistance and capacitance ($\omega_z = 1/R_D C_D$), which can be chosen to provide the peaking gain at our desired frequency. This CTLE also reduces high-frequency noise and provides enough signal strength to drive the comparator bank. As discussed, the transmitted signals are mapped into three levels. Therefore, the receiver must decode the bits from these three levels. The decoder is designed very similarly to a speculative DFE (Decision Feedback Equalizer). In a DFE, the current decision is based on previous decisions: the feedback path provides the previous decision, which in turn is used to detect the current bit. The mid-level in the RX eye can represent either 11 or 00 in the form of D(n)D(n-1). So, when we detect this level, the previously detected bit must arrive through a feedback path to detect the current bit. We compare the received signal with two threshold levels, $+\alpha$ and $-\alpha$, as shown in the RX eye. These two comparator outputs are then fed into a logic block which selects the current bit through a 2x1 multiplexer. An important thing to notice is that the strong-arm comparator decision time in TSMC 65-nm CMOS for 200 mV was found to be around 60 ps in simulation. As we are trying to achieve a much higher data rate, we made four blocks of this decoder architecture which work in a time-interleaved fashion. Each of them works at a 4-GHz clock, which makes it possible to achieve a 16-Gbps data rate without stressing the comparator decision. This also relaxes the timing requirement of the comparator bank and ensures correct decisions.
Each of the blocks provided its output to the next stage as feedback. The critical path was the feedback path, which had to carry the output of one stage to the next stage within the 25ps time window while working at a speed of 16 Gbps. Much attention was given

while doing the layout of the circuit on these feedback paths so that the routing of the paths is as short as possible and symmetrical. Post-layout simulation and physical testing confirmed error-free operation at 16 Gbps.
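The decoding rule described above (two thresholds, with the previous decision resolving the middle level) can be sketched as a serial behavioral model. It does not model the four-way time interleaving or the comparator timing; the sample values, the threshold `alpha` and the function name are assumptions chosen for illustration.

```python
def transition_decode(samples, alpha, prev=0):
    """Recover bits from three-level samples using thresholds +alpha and -alpha."""
    bits = []
    for v in samples:
        above = v > alpha    # comparator against +alpha: pair '10', so D(n) = 1
        below = v < -alpha   # comparator against -alpha: pair '01', so D(n) = 0
        if above:
            bit = 1
        elif below:
            bit = 0
        else:
            bit = prev       # mid level ('11' or '00'): reuse the fed-back bit
        bits.append(bit)
        prev = bit
    return bits

# Assumed noisy samples for the level sequence [0, +1, 0, -1, 0, 0, +1, 0].
samples = [0.00, 0.30, 0.02, -0.28, 0.01, -0.03, 0.31, 0.00]
print(transition_decode(samples, alpha=0.15))
```

Because each mid-level decision depends on the previous bit, a single wrong decision persists until the next transition, which is why the feedback path is treated as the critical path.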

## **3.5 Implementation and Measurement**

The prototype implemented in TSMC 65nm CMOS is shown in Figure 3.9. The chip contains a link and a MAC accelerator. My contribution towards implementing the chip was designing and implementing the link. The MAC accelerator used the link for its operation. This chapter only



Figure 3.9: Implemented prototype in TSMC 65-nm CMOS

includes the description and performance of the link. Note that the lack of transition density in sparse data makes timing recovery challenging. To address that, we adopt a clock forwarded solution where a half-rate clock is forwarded from the transmitter side that is then divided to generate a quadrature clock that is used to interpolate between orthogonal phases. Each lane's



Figure 3.10: Equalizer Output Eye

skew is compensated individually and, at the same time, synchronous jitter is tracked out without a continuous tracking loop such as a CDR. This high-speed transceiver occupies an active area of



Figure 3.11: Link BER Bathtub Curve

|                                               | This work                                                   | [13]                                                   | [14]                                                   | [15]                                                   |
|-----------------------------------------------|-------------------------------------------------------------|--------------------------------------------------------|--------------------------------------------------------|--------------------------------------------------------|
| Technology                                    | 65 nm                                                       | 16 nm FinFET                                           | 65 nm                                                  | 90 nm                                                  |
| Signalling                                    | Differential Voltage Mode                                   | Single Ended Voltage Mode                              | Differential Voltage Mode                              | Differential Current Mode                              |
| Data Rate (Gbps)                              | 16 Gb/s                                                     | 25 Gb/s                                                | 16 Gb/s                                                | 10 Gb/s                                                |
| Equalization                                  | Transition Encoding + CTLE                                  | Tx EQ + CTLE                                           | Dicode Encoding, Sequence decoding                     | N.A.                                                   |
| AC Coupling                                   | Yes; no extra encoding or off-chip component needed         | No; requires encoding or large off-chip capacitor      | No; requires encoding or large off-chip capacitor      | No; requires encoding or large off-chip capacitor      |
| Signalling Efficiency (for 500 mV Rx swing)   | 0.1875 pJ/bit (Random Data); 0.0375 pJ/bit (90% Sparse)     | N.A.                                                   | 0.375 pJ/bit                                           | 2.4 pJ/bit                                             |
| Link Energy Efficiency                        | 1.125 pJ/bit                                                | 1.17 pJ/bit                                            | 2.56 pJ/bit                                            | 10 pJ/bit                                              |
| Channel Loss                                  | 20 dB                                                       | 8 dB                                                   | 24.2 dB                                                | N.A.                                                   |
| Clocking                                      | Clock Forwarded                                             | Clock Forwarded                                        | N.A.                                                   | Clock Forwarded                                        |
| BER                                           | <10<sup>-12</sup>                                           | <10<sup>-15</sup>                                      | <10<sup>-12</sup>                                      | <10<sup>-12</sup>                                      |
Table 3.1: Link Performance Summary and Comparison

0.6 mm<sup>2</sup>. This chip was tested using the pattern generator circuits as well as a high-speed digital interface to inject data into the transmitter. The equalizer output eye is shown in Figure 3.10. Taking advantage of the transition encoding, the link achieves the best signaling efficiency for a 500-mV Rx swing. For a 500-mV Rx swing, the link achieves a BER (Bit Error Rate) of less than 10<sup>-12</sup>. The received signal after equalization provides a voltage margin of almost 75 mV and a timing margin of 35 ps. This voltage margin helps the comparator bank resolve faster and significantly reduces the chance of a comparator operating in the metastable region. The timing margin makes it easier for the orthogonal phases in the Rx to sample the data. This transceiver's operating frequency can be changed easily by changing the VCO frequency. The

maximum data rate of 16 Gbps was achieved with a 20-dB lossy channel. The receiver consumed a total of 14 mW while working at 16 Gbps, and the link efficiency is 1.125 pJ/bit at 16 Gbps. The link BER bathtub curve with a 20-dB lossy channel is shown in Figure 3.11. The link performance was compared with similar published works, and the key performance parameters are given in Table 3.1.
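The energy figures quoted here follow from dividing power by data rate, which the sketch below makes explicit. The last helper assumes that signaling energy is strictly proportional to the fraction of symbols with a transition, which is one reading of the two signaling-efficiency entries in Table 3.1; all function names are hypothetical.

```python
def energy_per_bit_pj(power_mw: float, data_rate_gbps: float) -> float:
    """1 mW / (1 Gb/s) = 1e-3 J/s / 1e9 bit/s = 1 pJ/bit."""
    return power_mw / data_rate_gbps

def implied_power_mw(energy_pj_per_bit: float, data_rate_gbps: float) -> float:
    """Inverse relation: total power implied by an energy-per-bit figure."""
    return energy_pj_per_bit * data_rate_gbps

def scaled_signaling_energy_pj(random_data_pj: float, activity: float) -> float:
    """Scale the random-data figure (50% activity) to another activity level.

    Assumption: signaling energy is proportional to the transition fraction.
    """
    return random_data_pj / 0.5 * activity

print(energy_per_bit_pj(14.0, 16.0))             # receiver share in pJ/bit
print(implied_power_mw(1.125, 16.0))             # link power implied by 1.125 pJ/bit
print(scaled_signaling_energy_pj(0.1875, 0.10))  # 90% sparse data, 10% activity
```

Under the proportionality assumption, the 0.1875 pJ/bit random-data figure scales to about 0.0375 pJ/bit at 10% activity, consistent with the 90% sparse entry in the table.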

# Chapter 4.

# A 4-GS/s Digitally Interpolated 8-Bit Concurrent Binary Search ADC

## 4.1. Background

A brief description of Successive Approximation (SAR) ADC will be given in this section. SAR ADC digitizes the analog input by a binary search algorithm. In a SAR ADC, there is only one comparator. This comparator does *N* number of comparisons to get *N*-bit resolution. So, for an *N*-



Figure 4.1: 3-Bit Binary Search ADC

bit SAR ADC, we need N cycles for one conversion. The advantage of the SAR ADC is that it requires only N comparisons for N-bit resolution, compared to the flash ADC where  $2^N$  comparisons are needed for N-bit resolution. But the conversion time is the trade-off here: in a SAR implementation, the conversion time increases in proportion to the resolution of the ADC. Figure 4.1 shows a 3-bit binary search space [60]. In a SAR ADC, the reference of the comparator is updated after each comparison depending on the comparator output.
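The binary search described above can be sketched as an ideal behavioral model. It assumes a unipolar input range from 0 to `vref` and a noiseless, offset-free comparator; the function name and the example values are illustrative only.

```python
def sar_convert(vin: float, vref: float, n_bits: int):
    """Ideal SAR conversion: returns (output code, number of comparisons)."""
    code = 0
    comparisons = 0
    for bit in range(n_bits - 1, -1, -1):          # MSB first
        trial = code | (1 << bit)                  # tentatively set this bit
        reference = trial * vref / (1 << n_bits)   # reference updated each cycle
        comparisons += 1
        if vin >= reference:                       # single comparator decision
            code = trial                           # keep the bit
    return code, comparisons

# 3-bit example: three sequential comparisons instead of 2**3 for a flash ADC.
print(sar_convert(0.30, 1.0, 3))
```

Each loop iteration corresponds to one conversion cycle, so the number of cycles grows linearly with the resolution while the number of comparisons stays at N.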

#### 4.2. Introduction

Adoption of digital equalization in wireline links has renewed the research interest in high-speed (~GHz), medium-resolution Analog-to-Digital Converters (ADC) [16]. Traditionally Flash ADCs were the most popular for such applications. However, multilevel signaling mandates 6-bit or higher resolution, where the flash converter's power consumption becomes a concern. This is mainly because of the exponential increase in the number of comparisons with increasing resolution. For an N-bit ADC, a straightforward implementation requires a  $2^N$  number of comparisons. In contrast, the Successive Approximation Register (SAR) ADC shown in Figure 4.2 requires only N comparisons for N-bit resolution. In addition to the reduced number of comparisons, CDAC-based implementation further improves energy efficiency. Obviously, these comparisons happen sequentially based on the binary search algorithm and, therefore, we need N-1 additional cycles for conversion (Figure 4.2(b)). In reality, mitigating metastability and offset of the comparator requires additional cycles. First, in the absolute worst-case scenario, the comparator may be required to resolve metastability twice in the conversion cycle and that may lead to 2 LSB error. Fortunately, this error can be reduced by adding redundancy, where we use one additional conversion cycle [17]. Second, higher resolution requirement mandates offset



Figure 4.2: (a) Time-interleaved SAR ADC architecture with CDAC (b) An example of transient operation of the Asynchronous SAR



Figure 4.3: Comparator power as a function of ADC resolution in 65 nm CMOS to achieve a decision time less than 125 ps. Here dynamic range of the ADC is assumed to 600mVpp and

correction. To address this, an additional cycle is used for offset detection [17]. Third, metastability by itself causes a longer decision time. In wireline applications, AFE linearity limits the ADC input signal swing [16]. For example, in 65-nm CMOS with a 1.2 V supply, the signal swing should be less than 600 mV for 8-bit resolution. That sets the resolution and metastability requirement to less than 2 mV. In addition to the increased power consumption, as shown in Figure 4.3, this also reduces the conversion rate. To address the conversion rate concern, several techniques are used, including asynchronous clocking and multi-bit per conversion cycle solutions. But fundamentally we are limited by the number of conversion cycles. For example, an 8-bit SAR ADC requires 10 conversion cycles, and that limits single SAR speed to 750 MS/s in 65-nm CMOS.

SAR speed only improved to 875 MHz in 16-nm finFET. Eventually, this reduction in throughput is compensated by interleaving multiple SAR ADCs which adds significant power and complexity.

This work is motivated to overcome the speed challenge of SAR ADC that is fundamentally related to conventional binary search. One way to improve conversion speed is concurrent binary search (CBS), similar to parallel processing. Here, two SAR ADCs concurrently converge to two outputs that are digitally interpolated to improve the resolution. Since interpolation is done after the SAR conversion cycles are completed, it improves resolution without sacrificing the conversion rate. In addition, the metastability and offset requirements are relaxed by 2x. Overall the proposed SAR architecture achieves 8-bit resolution using only 6 conversion cycles which accommodate both redundancy and mission mode offset correction. Compared to a traditional SAR implementation, which would require 10 cycles, this is a 40% improvement in conversion speed.

#### 4.3. SAR-Based Concurrent Binary Search ADC

Reducing the search space can improve the conversion rate of the SAR ADC. For example, if the search space for the SAR can be reduced to 1/8, we can save 3 (i.e. $\log_2 8$) conversion cycles. This motivates the Flash-SAR hybrid approach in the literature [18]. But comparing against 8 references (Ref1 to Ref8) and processing the logic to update the corresponding references is power- and time-consuming, which hinders conversion speed improvement. More importantly, any comparator error can lead to the selection of the wrong search space for the SAR. To avoid these concerns, the proposed single-channel ADC is implemented with 4 comparators associated with 4 CDACs, as shown in Figure 4.4(a). The full scale of the ADC is coarsely divided by 8 references, but they are grouped: Ref1,3,5,7 and Ref2,4,6,8 are generated from two separate reference ladders to create overlaps between regions. Overlapping between regions enables both redundancy and



Figure 4.4: Proposed concurrent binary search ADC (a) single-channel ADC with interpolation logic

interpolation as shown in Figure 4.4(b). Comparisons with 8 references are done in two cycles: in



Figure 4.4: (b) Reference generation and blue and red lines are showing potential references in each cycle of the conversion for two SAR ADCs

the 1st cycle of the operation, we compare with Ref3,4,5,6 and, depending on their decisions, the following references are selected similar to the SAR algorithm. For example, the 3rd comparator compares the input with Ref4 in the 1st cycle, and depending on the outcome the following comparison happens with Ref2 or Ref6. Since the comparator output can be directly used for reference updates, it improves reference settling and conversion speed compared to a flash implementation. However, the thermometric output of the 4 comparators are processed where

latency is not critical: the two comparators furthest away from the signal are then retired to save energy and offset correction. The two remaining adjacent comparators that encompass the input signal then initiate the concurrent binary search. An example conversion process is shown in Figure 4.5 where we plot the references of the comparators throughout the conversion cycles and corresponding generated outputs of two concurrent SARs. The 1st two cycles of the conversion generate 4-bit outputs including 1-bit redundancy. By design, one of the SAR ADCs starts from Ref1,3,5,7 and another from Ref2,4,6,8 and their generated 8-bit outputs are codeBot<0:7> and codeTop<0:7> respectively. These codes are then interpolated to generate a 9-bit output including 1-bit redundancy.



#### A. Interpolation Logic

Figure 4.5: An example of the reference update in each cycle of the concurrent search and generated digital code via interpolation

For interpolation to work, the references are offset by 1 LSB such that their outputs are also shifted from each other by the same amount. Interpolation is done based on the polarity of the residue error and relative location of the two SAR ADC outputs. For example, if the top ADC output has +ve residue error and bottom ADC output has -ve then we can either subtract ½ LSB from the bottom ADC output or add ½ LSB with the top ADC output to generate the interpolated output. Similarly, other possibilities are also shown in Figure 4.4(a) that is implemented through simple sequential digital logic. Note that when the polarity of the residue error is the same for both top and bottom ADCs, by the construction of the reference, we ensure they also converge to the same code to avoid min/max operation. To minimize the interpolation logic complexity, we simply generate a 9-bit code where the addition or subtraction operation is reflected through the selection



Figure 4.6: An example of the error event during conversion

of 1 or 0s respectively, as the 9th bit. In the example case shown in Figure 4.5, for an input signal of 69.65 the top ADC generated output is 70 with -ve residue error, and the bottom ADC output is 69 with +ve residue error. Using the method described above the interpolated output is 69.5 that reduces the residue error to +0.15 only.
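The worked example above can be reproduced with a small behavioral sketch. It assumes that the residue sign is defined as the sign of (input minus code) and that opposite polarities mean the input lies between the two codes, so the midpoint is taken; the function name is hypothetical and this is not the gate-level interpolation logic.

```python
def interpolate(code_top: int, code_bot: int,
                residue_top_sign: int, residue_bot_sign: int) -> float:
    """Interpolated output in LSB units from two concurrent SAR results.

    residue_*_sign is +1 or -1, taken as sign(input - code) (assumed convention).
    """
    if residue_top_sign != residue_bot_sign:
        # Opposite polarities: the input sits between the two codes,
        # so move half an LSB from either code toward the other.
        return (code_top + code_bot) / 2
    # Same polarity: by construction of the references both ADCs are
    # expected to converge to the same code.
    if code_top != code_bot:
        raise ValueError("same residue polarity but different codes")
    return float(code_top)

vin = 69.65
out = interpolate(code_top=70, code_bot=69, residue_top_sign=-1, residue_bot_sign=+1)
print(out)                  # interpolated output in LSB
print(round(vin - out, 2))  # remaining residue error in LSB
```

For the 69.65 input this gives 69.5 and a remaining residue of 0.15 LSB, matching the example in the text.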

#### B. Redundancy

During the conversion cycle, the comparator can make a wrong decision due to CAP DAC settling error or comparator metastability. In the binary search algorithm, such a wrong decision leads to the preselection of the wrong search space where the input signal does not belong. If the remaining decisions are correct, the residue error can be reduced but may still be more than 1 LSB without



Figure 4.7: An example of the ADC operation where comparator 1 and 2 retires after 1st cycle and comparator 3 and 4 continues CBS.

redundancy. However, the proposed implementation provides tolerance to a single comparison error without requiring an additional conversion cycle. First, by creating an overlapping sub-range, we ensure that a single wrong decision in the search algorithm does not lead to exclusion of the search space where the input signal belongs. For example, if the input sample is at the border region between 010 and 100, regardless of which region is selected, the input sample will be covered by the selection of the 011 region. Second, the interpolation also provides tolerance to comparator error. An example case is shown in Figure 4.6. In the 5th conversion cycle, the input becomes very close to the reference. Assuming this comparison outcome is erroneous, this ADC will converge with 2 LSB error. However, this error does not impact the interpolated outcome: it is still 69.5 with the same residue error of +0.15.

#### C. Power and Area Consideration

Improving the conversion rate often comes at the cost of power and area penalty. For example, in an *N*-bit SAR ADC, using a  $2^{M}$  number of comparators in every cycle can reduce the conversion time to *N/M* cycles. But the total number of comparisons increases to  $N*2^{M}/M$ . For example, reducing conversion time by a factor of 2 doubles the number of comparisons and associated power. In reality, processing the  $2^{M}$  comparators adds complexity and latency that further limits the achievable improvement in conversion speed. The hybrid approach proposed in this work provides a better trade-off between power and conversion rate. In a traditional 1 bit/cycle SAR, 8 bits require 10 conversion cycles including redundancy and offset correction [17]. Compared to that the proposed architecture requires only 6 conversion cycles and the total number of comparisons is 14 (4 comparisons in 1st cycle and 2 comparisons in every cycle for the next 5 cycles). But thanks to digital interpolation, the metastability requirement is relaxed, and therefore,
the power consumption of the comparators is reduced. Although the ADC decodes 9 bits, the effective nominal resolution is 8 bits with 1-bit redundancy. The same is true for other existing SAR ADCs [14]. Therefore, we compare the area and power of the proposed ADC to a 9-bit ADC including 1-bit redundancy. In terms of area, the four 6-bit CDACs together occupy an area comparable to a single 9-bit CDAC, but the switching energy of the four 6-bit CDACs is significantly lower than that of a 9-bit CDAC. The unit (LSB) capacitance value is 3.5 fF. The main area and power penalty comes from the two additional comparators and the additional interpolation logic, which is relatively simple and will greatly benefit from technology scaling.
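The cycle and comparison counts used in this trade-off can be checked with a few lines of arithmetic. The sketch evaluates the multi-bit expression exactly as written in the text (N/M cycles and N*2^M/M comparisons) and the counts quoted for the proposed architecture; the function names are hypothetical.

```python
def multibit_sar_cost(n_bits: int, m_bits_per_cycle: int):
    """Cycles and comparisons for an M-bit/cycle SAR, using the text's expressions."""
    cycles = n_bits // m_bits_per_cycle
    comparisons = n_bits * 2 ** m_bits_per_cycle // m_bits_per_cycle
    return cycles, comparisons

def proposed_cbs_cost(first_cycle_comparisons=4, later_cycles=5, comparisons_per_later_cycle=2):
    """Cycles and comparisons for the proposed concurrent binary search."""
    cycles = 1 + later_cycles
    comparisons = first_cycle_comparisons + later_cycles * comparisons_per_later_cycle
    return cycles, comparisons

# 8-bit, 2 bits per cycle: half the cycles, twice the N = 8 comparisons of a 1 bit/cycle SAR.
print(multibit_sar_cost(8, 2))
# Proposed architecture: 6 cycles and 14 comparisons.
print(proposed_cbs_cost())
# Cycle reduction relative to the 10-cycle traditional SAR quoted in the text.
print((10 - proposed_cbs_cost()[0]) / 10)
```

The last value is 0.4, the 40% reduction in conversion cycles quoted in the introduction of this chapter.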

To summarize, the proposed architecture makes efficient use of 4 comparators to speed up the conversion rate and subranges the overlapping regions for concurrent binary search. The



Figure 4.8. Implemented quad-channel ADC in 65nm CMOS. Detail of the single channel is also shown.

interpolation improves resolution through simple digital processing and at the same time enables redundancy. Single SAR simulation is shown in Figure 4.7, where the asynchronous clocks of Comp 1 and Comp 2 are gated while Comp 3 and Comp 4 continue SAR operation to generate 8-bit outputs including 1-bit redundancy.

### 4.4. Implementation and Measurement Results

The 4-way time-interleaved ADC prototype is implemented in TSMC 65-nm CMOS GP technology and achieves an aggregate speed of 4 GS/s while consuming 28 mW (Figure 4.8). Each channel occupies 120 µm × 160 µm, of which the digital part (standard SAR logic and asynchronous clock generation) takes a significant share; the overhead of the proposed interpolation logic is only 10% of that. Each SAR ADC can operate at up to 1 GS/s over different corners.



Figure 4.9. Code density-based reference offset correction

Therefore, the implemented prototype uses 4-way time interleaving to achieve the aggregate 4-GS/s speed (Figure 4.7). A 2-GHz DCD-corrected clock is divided to generate quadrature sampling phases, which are retimed by the input clock to keep the sampling clock jitter low.
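The interleaving arithmetic can be illustrated with a small timing sketch. It assumes ideal, skew-free phases: four channels, each sampling at 1 GS/s, offset from one another by 250 ps (one half-period of the 2-GHz clock). The names below are hypothetical and are not taken from the design.

```python
N_CHANNELS = 4
CHANNEL_PERIOD_PS = 1000                          # 1 GS/s per channel
PHASE_STEP_PS = CHANNEL_PERIOD_PS // N_CHANNELS   # 250 ps between adjacent phases


def channel_sample_times(channel, n_samples):
    """Sampling instants (in ps) of one sub-ADC clocked by its own phase."""
    return [channel * PHASE_STEP_PS + k * CHANNEL_PERIOD_PS for k in range(n_samples)]


def interleaved_sample_times(n_samples_per_channel):
    """Merge the sampling instants of all channels into one time-ordered stream."""
    times = []
    for channel in range(N_CHANNELS):
        times.extend(channel_sample_times(channel, n_samples_per_channel))
    return sorted(times)


times = interleaved_sample_times(3)
steps = {later - earlier for earlier, later in zip(times, times[1:])}
aggregate_rate_gsps = 1000 / PHASE_STEP_PS        # samples per ns

print("spacing between consecutive samples (ps):", steps)
print("aggregate rate (GS/s):", aggregate_rate_gsps)
```

Any timing skew between the phases would show up as more than one value in `steps`, which is why the sampling phases are retimed.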

## A. Reference Calibration

The main challenge in this architecture is the generation of offset references. Two resistive ladders are used for reference generation, with an RDAC inserted at the top and at the bottom of the resistive ladder to create the offset. Ref1 to Ref8 are generated by tapping intermediate nodes of the ladder. These nodes are heavily loaded with capacitance (>2 pF) to reduce glitching during reference selection. Note that the mux output takes a significantly long time to settle within 0.5 LSB. Fortunately, by taking advantage of the redundancy and the smaller switching step, the settling time can be improved. Consequently, the switch resistance in the mux is kept sufficiently low to achieve a settling time of less than 80 ps, as shown in Figure 4.7. Reference generation consumes approximately 2 mW, but when amortized over more time-interleaved channels it would not significantly impact the power consumption.

Figure 4.10. Measured SNDR and SFDR of a single-channel SAR ADC as a function of input frequency.

Figure 4.11. Measured output spectrum of the 4-GS/s ADC at the Nyquist frequency.

Figure 4.12. Measured output spectrum of the 4-GS/s ADC at 10-MHz input frequency.

Figure 4.13. Measured INL and DNL profile of the ADC prototype.

| | [18] | [19] | [20] | [21] | This Work (4xTI) | This Work (Single Channel) |
|---|---|---|---|---|---|---|
| Speed (GS/s) | 0.7 | 1.2/1.3 | 4 | 5 | 4 | 1 |
| Resolution (bits) | 8 | 8 | 8 | 7 | 8 | 8 |
| Architecture | Single Channel | Single Channel | 4xTI | 8xTI | 4xTI | Single Channel |
| ADC Type | Flash-SAR | Pipelined-SAR | Pipelined | SAR | Concurrent Binary Search + Interpolation | Concurrent Binary Search + Interpolation |
| Redundancy | Yes | Yes | No | Yes | Yes | Yes |
| Comparator Offset Correction | Offline Offset Correction | Offline Offset Correction | Offline Offset Correction | Offline Offset Correction | Mission Mode Offset Correction | Mission Mode Offset Correction |
| Technology | 65nm | 65nm | 65nm | 55nm | 65nm | 65nm |
| Supply Voltage (V) | 1.2 | 1.25/1.3 | 1.2/1.4 | 1.2/1.5 | 1.2 | 1.2 |
| SNDR (dB) | 41.6 @ Nyq. Freq. | 43.7 @ Nyq. Freq.<br>45.1 @ 10 MHz | 44.4 @ Nyq. Freq. | 35.9 @ Nyq. Freq.<br>42.7 @ 10 MHz | 39.56 @ Nyq. Freq.<br>48.47 @ 10 MHz | 40.53 @ Nyq. Freq.<br>48.73 @ 10 MHz |
| SFDR (dB) | 47 @ Nyq. Freq. | 58.1 @ Nyq. Freq.<br>57.2 @ 10 MHz | 55 @ Nyq. Freq. | 45 @ Nyq. Freq.<br>65 @ 10 MHz | 48.21 @ Nyq. Freq.<br>69.5 @ 10 MHz | 51.21 @ Nyq. Freq.<br>69.65 @ 10 MHz |
| Power (mW) | 5.96 | 5/5.8 | 120 | 38 | Sampler and interleaver: 11<br>Digital logic: 14.2<br>Clock and clock divider: 2.56<br>Total: 27.76 | Sampler and interleaver: 2.79<br>Digital logic: 3.2<br>Clock and clock divider: 0.55<br>Total: 6.54 |
| Area (mm<sup>2</sup>) | 0.033 | 0.013 | 1.35 | 0.69 | 0.019 (per channel) | 0.019 |
| FoM @ Nyq. Freq. (fJ/step) | 86.7 | 35 | 219 | 150 | 89.32 | 75.31 |
| FoM @ 10 MHz (fJ/step) | - | 28.35 | - | 69 | 32.05 | 29.30 |

Table 4.1: Performance Summary of the ADC

However, to achieve a precise reference offset, a code density-based calibration approach is adopted. A bin counter counts the number of hits in each of the $2^N$ possible codes. The count difference between two adjacent bins is indicative of the reference offset error. Note that this count difference is also influenced by individual comparator offset and input excitation. For a sinusoidal input, the boundaries will have higher count values than the middle part. Therefore, the adjacent bin count differences are averaged across the dynamic range to identify the systematic offset, which is a more accurate representation of the reference offset. Once the calibration is completed, bin counts become more uniform, as shown in Figure 4.9, and INL and DNL reduce to less than 1 LSB and 0.7228 LSB, respectively. Note that resolution is reduced at the top and bottom of the dynamic range, where there is no overlap to interpolate. As a result, instead of 256 codes, we have 224 codes, which translates to ~7.8-bit resolution ($\log_2 224 \approx 7.81$). Despite that, interpolation improves the SNDR of the single-channel ADC by 4 to 6 dB up to the Nyquist frequency (Figure 4.10). The reduction of SNDR and SFDR at the Nyquist rate relative to the low-frequency input is indicative of the timing noise sensitivity of the S/H.

While such a reduction is common [16], it can be improved by increasing the slew rate at the cost of clock buffer power. The FFT of the quad-channel ADC output also demonstrates the improvement of the quantization noise floor (Figure 4.11).
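The code density-based reference calibration described in this section can be sketched in a few lines. This is a simplified behavioural model under stated assumptions, not the implemented circuit: main references sit on an ideal grid, offset references are placed half a step away plus an unknown error, the stimulus is a uniform ramp rather than a sinusoid, and the offset is estimated from the normalized count difference of adjacent bins averaged over the range. All names are hypothetical, and `offset_error` is expressed in units of the spacing between main references.

```python
import bisect


def build_thresholds(n_pairs, offset_error):
    """Interleave main references (k) with offset references (k + 0.5 + offset_error)."""
    thresholds = []
    for k in range(n_pairs):
        thresholds.append(float(k))                # main reference
        thresholds.append(k + 0.5 + offset_error)  # offset reference
    return thresholds


def code_histogram(thresholds, samples):
    """Bin counter: number of hits in each output code."""
    counts = [0] * len(thresholds)
    for x in samples:
        code = bisect.bisect_right(thresholds, x) - 1
        if 0 <= code < len(counts):
            counts[code] += 1
    return counts


def estimate_offset_error(counts):
    """Average the normalized adjacent-bin count difference across the range."""
    ratios = []
    for i in range(0, len(counts) - 1, 2):
        pair_total = counts[i] + counts[i + 1]
        if pair_total:
            ratios.append((counts[i] - counts[i + 1]) / pair_total)
    return 0.5 * sum(ratios) / len(ratios)


N_PAIRS = 16
N_SAMPLES = 160_000
ramp = [N_PAIRS * (i + 0.5) / N_SAMPLES for i in range(N_SAMPLES)]

true_error = 0.1  # assumed reference offset error for the demonstration
before = code_histogram(build_thresholds(N_PAIRS, true_error), ramp)
estimate = estimate_offset_error(before)

# Apply the correction and re-measure the code density.
after = code_histogram(build_thresholds(N_PAIRS, true_error - estimate), ramp)
residual = estimate_offset_error(after)

print("estimated offset error:", round(estimate, 4))
print("bin count spread before:", max(before) - min(before))
print("bin count spread after:", max(after) - min(after))
```

Normalizing each pair by its own total makes the estimate less sensitive to a non-uniform input density, which is the role played by averaging across the dynamic range for a sinusoidal input.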

### 4.5. Conclusion

The proposed ADC is compared to other Nyquist-rate ADCs in similar technologies and with similar resolution. Compared to other single-channel SAR architectures such as [18], this solution achieves a higher conversion rate. Hybrid solutions such as [19] can achieve 1-GS/s speed but require a combination of pipelining and SAR adding additional complexity, resampling, and interleaving after the pipeline stage. By creating overlapping search regions, we can accommodate redundancy and relax metastability requirements in addition to increasing the conversion rate. Such features, although not reflected in FoM, are critical for wireline applications. The lack of interpolation at the boundary of the dynamic range limits the SNDR for sinusoidal input. We expect the performance to be better for a wireline application where the input signal remains mostly in the mid-range.
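The FoM entries for this work in Table 4.1 can be cross-checked from the reported power, SNDR, and sampling rate. The sketch below assumes the commonly used Walden figure of merit, $P / (2^{\mathrm{ENOB}} \cdot f_s)$ with $\mathrm{ENOB} = (\mathrm{SNDR} - 1.76)/6.02$; the text does not state which definition was used, so the formula and the function names are assumptions.

```python
def enob(sndr_db):
    """Effective number of bits from SNDR in dB."""
    return (sndr_db - 1.76) / 6.02


def walden_fom_fj(power_mw, sndr_db, fs_gsps):
    """Walden figure of merit in fJ per conversion step."""
    joules_per_step = (power_mw * 1e-3) / (2 ** enob(sndr_db) * fs_gsps * 1e9)
    return joules_per_step * 1e15


# (power in mW, SNDR in dB, sampling rate in GS/s, FoM reported in Table 4.1)
this_work = {
    "4xTI @ Nyquist": (27.76, 39.56, 4.0, 89.32),
    "4xTI @ 10 MHz": (27.76, 48.47, 4.0, 32.05),
    "single channel @ Nyquist": (6.54, 40.53, 1.0, 75.31),
    "single channel @ 10 MHz": (6.54, 48.73, 1.0, 29.30),
}

computed = {}
for label, (power_mw, sndr_db, fs_gsps, reported) in this_work.items():
    computed[label] = walden_fom_fj(power_mw, sndr_db, fs_gsps)
    print(f"{label}: computed {computed[label]:.2f} fJ/step, reported {reported} fJ/step")
```

Under this assumption the computed values agree closely with the table, which suggests that the reported FoM is based on total power and measured SNDR.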

## Chapter 5.

# **Concluding Remarks**

The increase in the functionality of a single chip due to the exponential scaling of transistors demands greater bandwidth in the chip-to-chip communication system. To stay within the power budget, more efficient signaling is needed. At the same time, the speed of the chip-to-chip communication should improve to make the increased functionality of the chips realistic.

The channel is the bottleneck for achieving high chip-to-chip bandwidth because of its frequency-dependent loss. Along with signal attenuation at high frequencies, the channel adds Inter-Symbol Interference (ISI) and noise to the transmitted signal. Figure 5.1 shows the frequency responses as well as the impulse responses of different channels.



Figure 5.1: Channel Performance Impact [26]

As we can see from the figure, the performance of the channels deteriorates significantly at higher frequencies. So, we need strong equalization for error-free detection of data.

In this thesis, the main focus was on digitally enhancing the information efficiency of several blocks of high-speed communication systems. First, a low-latency burst error detection method was proposed for Decision Feedback Equalization (DFE), a sophisticated equalization technique. DFE has many advantages, such as the absence of noise and crosstalk amplification and its non-linearity, while its most significant issue is the probability of error propagation. A thorough analysis was conducted to find a correlation between two adjacent symbols, and this correlation was used to detect and correct burst errors for NRZ signaling without the area and power overhead of an FEC (Forward Error Correction) encoder. The concept was implemented in TSMC 65-nm technology. The work described in Chapter 2 has been submitted for publication to *IEEE Open Journal of Circuits and Systems* and is currently under review.

A complete 16-Gbps low-power SerDes transceiver was proposed in Chapter 3. This transceiver achieved 1.125-pJ/bit efficiency at 16 Gbps by using transition encoding and a CTLE (Continuous Time Linear Equalizer) for equalization. The idea and concept of this work were used in a MAC accelerator, which achieved 0.0375-pJ/bit efficiency while transmitting 90% sparse data.

A 4-GS/s 8-bit concurrent binary search ADC was proposed in Chapter 4, which can be utilized in ADC-DSP-based communication systems. ADC-based receivers are now becoming very popular because they allow strong digital equalization, which compensates for the significant channel loss at higher frequencies. An analysis was carried out to reduce the comparator's decision time, and a novel digital interpolation was used to obtain additional information without stretching the conversion time. This ADC work has been revised for publication in *IEEE Solid-State Circuits Letters* and is now being considered for publication.

### **5.1 Future Work**

The focus of this thesis was entirely on implementing digital techniques that improve the information efficiency of high-speed wireline communication systems. In Chapter 2, an analysis was conducted on the correlation between symbols to increase the performance of DFE for NRZ signals. As the world is now focusing heavily on Pulse Amplitude Modulated (PAM) signals, a similar analysis could be conducted to increase the performance of DFE for PAM4/PAM8 signals. The digital interpolation technique described in Chapter 4 could also be used to implement other circuit blocks such as a DAC (Digital-to-Analog Converter).

Present-day scaled technologies such as 28-nm FDSOI and 16-nm FinFET provide better power efficiency and bandwidth. The work in Chapter 3 and Chapter 4 could be repeated in these technologies, which should lead to better power efficiency and speed.

# **Bibliography**

[1] R. Narasimha, N. Warke and N. Shanbhag, "Impact of DFE Error Propagation on FEC-Based High-Speed I/O Links," *GLOBECOM 2009 - 2009 IEEE Global Telecommunications Conference*, Honolulu, HI, 2009, pp. 1-6.

[2] P. A. Francese et al., "A 16Gb/s 3.7mW/Gb/s 8-tap DFE receiver and baud rate CDR with 30kppm tracking bandwidth," *2013 IEEE Asian Solid-State Circuits Conference (A-SSCC)*, Singapore, 2013, pp. 33-36.

[3] Ming Yang, Shayan Shahramian, Hossein Shakiba, Henry Wong, Peter Krotnev, Anthony Chan Carusone, "A Statistical Modeling Approach for FEC-Encoded High-Speed Wireline Links", in *DesignCon*, 2020

[4] H. Sugita, K. Sunaga, K. Yamaguchi and M. Mizuno, "A 16Gb/s 1st-Tap FFE and 3-Tap DFE in 90nm CMOS," *2010 IEEE International Solid-State Circuits Conference - (ISSCC)*, San Francisco, CA, 2010, pp. 162-163.

[5] Aurangozeb, M. Mohammad and M. Hossain, "Analog to Sequence Converter-Based PAM-4 Receiver With Built-In Error Correction," in *IEEE Journal of Solid-State Circuits*, vol. 53, no. 10, pp. 2864-2877, Oct. 2018.

[6] Ieee802.org, 2020. [Online]. Available: http://www.ieee802.org/3/ck/public/18\_11/gopalakrishnan\_3ck\_01a\_1118.pdf.

[7] T. Toifl et al., "A 2.6 mW/Gbps 12.5 Gbps RX With 8-Tap Switched-Capacitor DFE in 32 nm CMOS," in *IEEE Journal of Solid-State Circuits*, vol. 47, no. 4, pp. 897-910, April 2012.

[8] J. Ren et al., "Precursor ISI Reduction in High-Speed I/O," 2007 *IEEE Symposium on VLSI Circuits*, Kyoto, 2007, pp. 134-135.

[9] A. D. Hossain, Aurangozeb, M. Mohammad and M. Hossain, "A 35 mW 10 Gb/s ADC-DSP less direct digital sequence detector and equalizer in 65nm CMOS," *2016 IEEE Symposium on VLSI Circuits (VLSI-Circuits)*, Honolulu, HI, 2016, pp. 1-2.

[10] G. Gangasani et al., "A 16-Gb/s Backplane Transceiver With 12-Tap Current Integrating DFE and Dynamic Adaptation of Voltage Offset and Timing Drifts in 45-nm SOI CMOS Technology", *IEEE Journal of Solid-State Circuits*, vol. 47, no. 8, pp. 1828-1841.

[11] Mike Steinberger, "Link Modelling and the Challenges and Limitations for IBIS AMI for 56+Gb/s in Matching Circuit Performance," *ISSCC 2019 Forum*.

[12] X. Zhou et al., "Addressing Sparsity in Deep Neural Networks," *IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst.*, vol. 38, no. 10, pp. 1858–1871.

[13] J. M. Wilson et al., "A 1.17pJ/b 25Gb/s/pin ground-referenced single-ended serial link for off- and on-package communication in 16nm CMOS using a process- and temperature-adaptive voltage regulator," *2018 IEEE International Solid - State Circuits Conference - (ISSCC)*, San Francisco, CA, 2018, pp. 276-278.

[14] Y. Chun and T. Anand, "A 13.6-16Gb/s Wireline Transceiver with Dicode Encoding and Sequence Detection Decoding for Equalizing 24.2dB with 2.56pJ/bit in 65nm CMOS," *2019 IEEE Custom Integrated Circuits Conference (CICC)*, Austin, TX, USA, 2019, pp. 1-4.

[15] E. Prete, D. Scheideler, and A. Sanders, "A 100mW 9.6Gb/s Transceiver in 90nm CMOS for Next-Generation Memory Interfaces," in *2006 IEEE International Solid State Circuits Conference - Digest of Technical Papers*, Feb. 2006, pp. 253–262.

[16] P. Upadhyaya et al., "A Fully Adaptive 19–58-Gb/s PAM-4 and 9.5–29-Gb/s NRZ Wireline Transceiver With Configurable ADC in 16-nm FinFET," in *IEEE Journal of Solid-State Circuits*, vol. 54, no. 1, pp. 18-28, Jan. 2019.

[17] L. Kull et al., "A 3.1mW 8b 1.2GS/s single-channel asynchronous SAR ADC with alternate comparators for enhanced speed in 32nm digital SOI CMOS," *2013 IEEE International Solid-State Circuits Conference Digest of Technical Papers*, San Francisco, CA, 2013, pp. 468-469.

[18] D. G. Muratore et al., "An 8-bit 0.7-GS/s single channel flash-SAR ADC in 65-nm CMOS technology," *ESSCIRC Conference 2016: 42nd European Solid-State Circuits Conference*, Lausanne, 2016, pp. 421-424.

[19] H. Huang, L. Du and Y. Chiu, "A 1.2-GS/s 8-bit Two-Step SAR ADC in 65-nm CMOS With Passive Residue Transfer," in *IEEE Journal of Solid-State Circuits*, vol. 52, no. 6, pp. 1551-1562, June 2017.

[20] H. Wei, P. Zhang, B. D. Sahoo and B. Razavi, "An 8 Bit 4 GS/s 120 mW CMOS ADC," in *IEEE Journal of Solid-State Circuits*, vol. 49, no. 8, pp. 1751-1761, Aug. 2014.

[21] Y. Chung, C. Hu and C. Chang, "A 38-mW 7-bit 5-GS/s Time-Interleaved SAR ADC with Background Skew Calibration," *2018 IEEE Asian Solid-State Circuits Conference (A-SSCC)*, Tainan, 2018, pp. 243-246.

[22] J. Yang, T. L. Naing and R. W. Brodersen, "A 1 GS/s 6 Bit 6.7 mW Successive Approximation ADC Using Asynchronous Processing," in *IEEE Journal of Solid-State Circuits*, vol. 45, no. 8, pp. 1469-1478, Aug. 2010.

[23] H. Hong et al., "A Decision-Error-Tolerant 45 nm CMOS 7b 1 GS/s Nonbinary 2b/Cycle SAR ADC," in *IEEE Journal of Solid-State Circuits*, vol. 50, no. 2, pp. 543-555, Feb. 2015.

[24] B. Razavi, "Design Considerations for Interleaved ADCs," in *IEEE Journal of Solid-State Circuits*, vol. 48, no. 8, pp. 1806-1817, Aug. 2013.

[25] C. Liu, S. Chang, G. Huang and Y. Lin, "A 10-bit 50-MS/s SAR ADC With a Monotonic Capacitor Switching Procedure," in *IEEE Journal of Solid-State Circuits*, vol. 45, no. 4, pp. 731-740, April 2010.

[26] Palermo, S., 2020. [online] Ece.tamu.edu. Available at: http://www.ece.tamu.edu/~spalermo/ecen689/lecture7\_ee689\_channel\_transient.pdf [Accessed 29 May 2020].

[27] Michael Kanellos, "152,000 Smart Devices Every Minute In 2025: IDC Outlines the Future of Smart Things," Forbes.com, March 3, 2016. [Online].

[28] Ramune Nagisetty,"The Path to A Chiplet Ecosystem" ODSA Workshop at Intel: June 10th,2019 [Online]

[29] "An Elegant Interconnect for a More Civilized Age" in Intel webpage [Online]

[30] "DARPA Announces Next Phase of Electronics Resurgence Initiative" in DARPA webpage[Online]. Available: https://www.darpa.mil/news-events/2018-11-01a.

[31] Aurangozeb, A. D. Hossain, C. Ni, Q. Sharar and M. Hossain, "Time-Domain Arithmetic Logic Unit with Built-In Interconnect," in *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 25, no. 10, pp. 2828-2841, Oct. 2017.

[32] M. Hossain et al., "A Fast-Lock, Jitter Filtering All-Digital DLL Based Burst-Mode Memory Interface," in *IEEE Journal of Solid-State Circuits*, vol. 49, no. 4, pp. 1048-1062, April 2014.

[33] M. Hossain et al., "A 400MHz – 1.6GHz fast lock, jitter filtering ADDLL based burst mode memory interface," *2013 Symposium on VLSI Circuits*, Kyoto, 2013, pp. C244-C245.

[34] B. Zimmer et al., "A 0.11 pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-based Deep Neural Network Accelerator with Ground-Reference Signaling in 16nm," *2019 Symposium on VLSI Circuits*, Kyoto, Japan, 2019, pp. C300-C301

[35] A. D. Hossain, Aurangozeb, M. Mohammad and M. Hossain, "A 35 mW 10 Gb/s ADC-DSP less direct digital sequence detector and equalizer in 65nm CMOS," *2016 IEEE Symposium on VLSI Circuits (VLSI-Circuits)*, Honolulu, HI, 2016, pp. 1-2.

[36] Aurangozeb, M. Mohammad and M. Hossain, "Analog to Sequence Converter-Based PAM-4 Receiver with Built-In Error Correction," in *IEEE Journal of Solid-State Circuits*, vol. 53, no. 10, pp. 2864-2877, Oct. 2018 (Invited Paper)

[37] Aurangozeb, A. D. Hossain, M. Mohammad and M. Hossain, "Channel-Adaptive ADC and TDC for 28 Gb/s PAM-4 Digital Receiver," in *IEEE Journal of Solid-State Circuits*, vol. 53, no. 3, pp. 772-788, March 2018.

[38] M. Hossain, Aurangozeb, A. K. M. Delwar Hossain and M. Mohammad, "A 82 mW 28 Gb/s PAM-4 digital sequence decoder with built-in error correction in 28nm FDSOI," *2017 IEEE Asian Solid-State Circuits Conference (A-SSCC)*, Seoul, 2017, pp. 85-88.

[39] Aurangozeb, C. Dick, M. Mohammad and M. Hossain, "A 32Gb/s 2.9pJ/b Transceiver for Sequence-Coded PAM-4 Signalling with 4-to-6dB SNR Gain in 28nm FDSOI CMOS," *2019 IEEE International Solid- State Circuits Conference - (ISSCC)*, San Francisco, CA, USA, 2019, pp. 480-482.

[40] Y. Ren, D. Perron, F. Aurangozeb, Z. Jiang, M. Hossain and V. Van, "A Continuously Tunable Silicon Double-Microring Filter With Precise Temperature Tracking," in *IEEE Photonics Journal*, vol. 10, no. 6, pp. 1-10, Dec. 2018, Art no. 6602310.

[41] Y. Ren, D. Perron, F. Aurangozeb, Z. Jiang, M. Hossain and V. Van, "Broadband-Tunable Cascaded Vernier Silicon Photonic Microring Filter with Temperature Tracking," *2019 Optical Fiber Communications Conference and Exhibition (OFC)*, San Diego, CA, USA, 2019, pp. 1-3.

[42] A. K. M. Delwar Hossain, Aurangozeb and M. Hossain, "Burst mode optical receiver with 10 ns lock time based on concurrent DC offset and timing recovery technique," in *IEEE/OSA Journal of Optical Communications and Networking*, vol. 10, no. 2, pp. 65-78, Feb. 2018.

[43] Y. Ren, D. Perron, F. Aurangozeb, Z. Jiang, M. Hossain and V. Van, "A Continuously Tunable SOI Microring Filter with Temperature Tracking," *2018 IEEE Photonics Conference* (IPC), Reston, VA, 2018, pp. 1-2.

[44] U. Singh et al., "A 780mW 4×28Gb/s transceiver for 100GbE gearbox PHY in 40nm CMOS," *2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC)*, San Francisco, CA, 2014, pp. 40-41.

[45] O. Agazzi et al., "A 90nm CMOS DSP MLSD Transceiver with Integrated AFE for Electronic Dispersion Compensation of Multi-mode Optical Fibers at 10Gb/s," *2008 IEEE International Solid-State Circuits Conference - Digest of Technical Papers*, San Francisco, CA, 2008

[46] M. Pisati et al., "A 243-mW 1.25–56-Gb/s Continuous Range PAM-4 42.5-dB IL ADC/DAC-Based Transceiver in 7-nm FinFET," in *IEEE Journal of Solid-State Circuits*, vol. 55, no. 1, pp. 6-18, Jan. 2020.

[47] A. Shokrollahi et al., "10.1 A pin-efficient 20.83Gb/s/wire 0.94pJ/bit forwarded clock CNRZ5-coded SerDes up to 12mm for MCM packages in 28nm CMOS," *2016 IEEE International Solid-State Circuits Conference (ISSCC)*, San Francisco, CA, 2016, pp. 182-183.

[48] Jacob Aron, "IBM unveils its first commercial quantum computer," NewScientist.com, Jan 8, 2019. [Online]. Available: https://www.newscientist.com/article/2189909-ibm-unveils-its-first-commercial-quantum-computer/

[49] U.S. Patent. US9614538B1 "Analog-to-digital conversion based on signal prediction" MasumHossain, Maruf H. Mohammad, Granted on April 4, 2017

[50] P. Upadhyaya et al., "A Fully Adaptive 19–58-Gb/s PAM-4 and 9.5–29-Gb/s NRZ Wireline Transceiver With Configurable ADC in 16-nm FinFET," in *IEEE Journal of Solid-State Circuits*, vol. 54, no. 1, pp. 18-28, Jan. 2019.

[51] M. LaCroix et al., "A 60Gb/s PAM-4 ADC-DSP Transceiver in 7nm CMOS with SNR-Based Adaptive Power Scaling Achieving 6.9pJ/b at 32dB Loss," *2019 IEEE International Solid- State Circuits Conference - (ISSCC)*, San Francisco, CA, USA, 2019, pp. 114-116

[52] K. Tan et al., "A 112-GB/S PAM4 Transmitter in 16NM FinFET," 2018 IEEE Symposium on VLSI Circuits, Honolulu, HI, 2018, pp. 45-46

[53] U.S. Patent. US20160217872A1 "Adaptive analog-to-digital conversion based on signal prediction" Masum Hossain, Maruf H. Mohammad, Granted on September 27, 2016

[54] U.S. Patent. US9378843 "Collaborative analog-to-digital and time-to-delay conversion based on signal prediction" Masum Hossain, Maruf H. Mohammad, Granted on June 28, 2016

[55] B. Patra et al., "Cryo-CMOS Circuits and Systems for Quantum Computing Applications," in *IEEE Journal of Solid-State Circuits*, vol. 53, no. 1, pp. 309-321, Jan. 2018.

[56] João Marques Lima, " Data Centres Of The World Will Consume 1/5 Of Earth's Power By 2025," data-economy.com, Dec 12, 2017. [Online].

[57] W. T. Beyene, "The design of continuous-time linear equalizers using model order reduction techniques," *2008 IEEE-EPEP Electrical Performance of Electronic Packaging*, San Jose, CA, 2008, pp. 187-190.

[58] D. Kim, W. Choi, A. Elkholy, J. Kenney and P. K. Hanumolu, "A 15Gb/s 1.9pJ/bit sub-baudrate digital CDR," *2018 IEEE Custom Integrated Circuits Conference (CICC)*, San Diego, CA, 2018, pp. 1-4.

[59] W. Redman-White et al., "A Robust High Speed Serial PHY Architecture With Feed-Forward Correction Clock and Data Recovery," in *IEEE Journal of Solid-State Circuits*, vol. 44, no. 7, pp. 1914-1926, July 2009.

[60] A. K. Walter and Analog Devices Inc, *Data Conversion Handbook*, 1 edition. Amsterdam;Boston: Newnes, 2004.