95% Leakage-Reduced FPGA using Zigzag Power-gating, Dual-\(V_{TH}/V_{DD}\) and Micro-\(V_{DD}\)-Hopping

Canh Q. Tran, Hiroshi Kawaguchi and Takayasu Sakurai

Institute of Industrial Science and Center for Collaborative Research, University of Tokyo
4-6-1 Komaba, Meguro-ku, Tokyo, 153-8505 Japan
{canh, kawapy, tsakurai}@iis.u-tokyo.ac.jp

Abstract A low-power FPGA architecture is proposed based on a fine-grained \(V_{DD}\) control scheme called micro-\(V_{DD}\)-hopping. Four Configurable Logic Blocks (CLBs) are grouped into one block in which \(V_{DD}\) is shared. In the \(V_{DD}\)-hopping scheme, \(V_{DD}\) of each block is varied between the higher \(V_{DD}\) \((V_{DDH})\) and the lower \(V_{DD}\) \((V_{DDL})\) spatially and temporally to achieve lower power while keeping performance undegraded. A level shifter that has less contention is proposed. The FPGA also incorporates a zigzag power-gating scheme, with special care taken to cope with the sneak leakage path problem. The proposed FPGA is fabricated in 0.35µm CMOS technology together with a conventional fixed-\(V_{DD}\) FPGA. Measurement shows that the dynamic power can be reduced by 86% when the required speed is half of the highest speed. Simulation using 90nm CMOS technology shows that a leakage power reduction of 95% can be achieved when the proposed method is used. Area overhead of the proposed FPGA is 2%.

I. INTRODUCTION

In the 90nm era and beyond, the number of transistors on a chip surmounts 1 billion, and the development cost and time have been increasing rapidly. One solution to this problem is to use a configurable LSI such as an FPGA (field programmable gate array). An FPGA is attractive because of its inherently low non-recurring engineering (NRE) cost and short time-to-market [1]. Since an FPGA uses more transistors per function than an SoC (system-on-a-chip) in order to achieve programmability, the power consumption of an FPGA, especially the leakage power, is larger than that of the SoC. These days, one CLB (configurable logic block) shows a dynamic power of 2.3µW/MHz [2]. Since an FPGA chip can have \(10^5\) CLBs in a 90-nm CMOS technology, it would consume a dynamic power of 40W when it operates at 200 MHz, and almost the same amount of leakage power would be added to the dynamic power.
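As a quick check of this estimate, the arithmetic can be written out. This is only a back-of-the-envelope sketch: it assumes that all \(10^5\) CLBs switch at the quoted per-CLB figure, which the text does not state explicitly.

\[
P_{dyn} \approx 2.3\,\mu\mathrm{W/MHz} \times 200\,\mathrm{MHz} \times 10^{5} = 46\,\mathrm{W}
\]

This is of the same order as the roughly 40W quoted in the text, and a comparable leakage contribution would about double the total.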

Until recently, most studies on FPGAs have focused on area and performance, and little work has been carried out on reducing FPGA power. On the other hand, there have been extensive studies on low dynamic and leakage power design techniques for SoCs, such as \(V_{DD}\) hopping [3], SCCMOS (super cutoff CMOS) [4] and MTCMOS (multi-threshold CMOS) [5]. The long-range trend in low-power design is to apply adaptive control of \(V_{DD}/V_{TH}\) in time and space at finer granularity. In this paper, an integrated low-power architecture is proposed and implemented for an FPGA, which fully utilizes the fine-grain assignment of \(V_{DD}/V_{TH}\) in time and space.

In \(V_{DD}\) hopping, the supply voltage is changed dynamically according to the required speed. It has been demonstrated that, using this method, power consumption can be reduced by more than 75% if the average required speed is half of the maximum speed. The \(V_{DD}\) hopping method was applied at the chip level, but in order to reduce the power further, it is necessary to apply the method at the block level. In this paper, a micro-\(V_{DD}\)-hopping scheme is proposed and manufactured for an FPGA.
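The size of such a saving can be illustrated with a first-order estimate. The sketch below assumes the textbook relation that dynamic power scales as \(C V_{DD}^2 f\); this model, the function name and the neglect of leakage, level shifters and interconnect are assumptions made for illustration and are not taken from the paper. The voltages are the ones used later in the measurement section (\(V_{DDH}\) = 3.3V, \(V_{DDL}\) = 1.8V).

```python
# First-order estimate of dynamic power under V_DD hopping.
# Assumption (not from the paper): dynamic power scales as C * V^2 * f,
# and leakage, level shifters and interconnect effects are ignored.


def relative_dynamic_power(v_low: float, v_high: float, freq_ratio: float) -> float:
    """Return P(v_low, freq_ratio * f) / P(v_high, f) for the C*V^2*f model."""
    return (v_low / v_high) ** 2 * freq_ratio


# Voltages from the measurement section: V_DDH = 3.3 V, V_DDL = 1.8 V.
# The block runs at half speed (f/2) when it is supplied from V_DDL.
ratio = relative_dynamic_power(v_low=1.8, v_high=3.3, freq_ratio=0.5)
reduction = 1.0 - ratio
print(f"relative power: {ratio:.3f}, reduction: {reduction:.1%}")
# prints "relative power: 0.149, reduction: 85.1%"
```

Under these assumptions the simple model lands close to the 86% measured reduction reported in the results, although the agreement should not be over-interpreted.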

The SCCMOS and MTCMOS can effectively cut off leakage current in a standby mode, but they suffer from a long wake-up time, which is the time needed to recover from a standby mode to an active mode. The zigzag power-gating scheme [6-7] can reduce the wake-up time to less than 1/5 of a clock cycle. Thus zigzag CMOS can be used as a substitute for clock gating when clock gating loses its merit in a leakage-dominant era; in other words, the zigzag CMOS can be used even in an active mode. In this paper, the zigzag CMOS scheme is successfully applied to an FPGA for the first time to reduce leakage power. The main difficulty in applying the zigzag CMOS to the FPGA is a sneak path problem [8], whose countermeasure is also proposed in this paper.

II. ARCHITECTURE

Fig.1 shows the architecture of the proposed FPGA. Four CLBs are clustered into one \(V_{DD}\) island where the same \(V_{DD}\) is used. \(V_{DD}\) of the block of 4 CLBs is either high \(V_{DD}\) \((V_{DDH})\) or low \(V_{DD}\) \((V_{DDL})\). Only two levels of \(V_{DD}\) are used because testing and characterization remain feasible if the number of levels is confined to two. One more merit of confining the number of levels to two is that switching between the levels is quick. If there are more than two levels, more \(V_{DD}\) grids and \(V_{DD}\) switches are required and the overhead becomes unacceptable. \(V_{DDH}\) is applied to the island when high performance is required, and the supply is changed to \(V_{DDL}\) if the block operates at a lower speed. The size of the block, which is four CLBs in this architecture, has been optimized using simulations with a number of benchmark circuits. If the number of CLBs in a block decreases, finer control is possible but the area and delay overhead increase. Here, four is selected to keep the chip area and delay overhead below 5%. One CLB includes 4 BLEs (basic logic elements), 5 inputs and 3 outputs. One BLE consists of one LUT (look-up table), one D-FF and one 2-1 MUX. This configuration is chosen because it is one of the best configurations for delay, area and logic utilization [1].

Since $V_{DD}$ of a block can be $V_{DDH}$ while the interconnect uses $V_{DDL}$, the level shifter is one of the keys in this architecture. Thus, an improved circuit design for the level shifter is described in detail in the next section.

In the proposed FPGA, if the supply voltage of the clustered block is $V_{DDH}$, the clock frequency of the block is $f$. When the supply voltage is $V_{DDL}$, the clock frequency of the block should be reduced to $f/2$. Since the FPGA has a Manhattan layout, an H-tree clock system can be easily implemented as shown in Fig.2(a). To cope with the skew problems between $f$ and $f/2$, $f/2$ is generated from $f$ at the clustered block as in Fig.2(b). If $f/2$ is generated without constraint, two different phases of $f/2$ can occur, as shown in Fig.2(b), and the system suffers from a synchronization problem among the different phases. To prevent this issue, a RESET signal is asserted at power-up so that only one phase of $f/2$ is generated. The timing chart of the clock frequency and the supply voltage of one block is shown in Fig.3. When the block does not need high speed, the clock frequency is first reduced to $f/2$ and then the voltage is pulled down to $V_{DDL}$. When the block requires high speed, the supply voltage is first pulled up to $V_{DDH}$ and then the clock frequency is changed to $f$.
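The ordering of the two transitions can be sketched as a small state model. The class and method names below are hypothetical, and the invariant that the full-speed clock never coincides with the low supply is an interpretation of the timing chart rather than a statement from the paper.

```python
# Sketch of the transition ordering for one clustered block.
# ClusteredBlock, go_slow and go_fast are hypothetical names for illustration.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClusteredBlock:
    vdd: str = "VDDH"   # supply level of the block: "VDDH" or "VDDL"
    clock: str = "f"    # clock of the block: "f" or "f/2"
    log: List[str] = field(default_factory=list)

    def _check(self) -> None:
        # Assumed invariant: the clock f is never used at the low supply.
        assert not (self.vdd == "VDDL" and self.clock == "f")

    def go_slow(self) -> None:
        """Enter low-power operation: halve the clock first, then lower V_DD."""
        self.clock = "f/2"
        self.log.append("clock -> f/2")
        self._check()
        self.vdd = "VDDL"
        self.log.append("vdd -> VDDL")
        self._check()

    def go_fast(self) -> None:
        """Return to full speed: raise V_DD first, then restore the clock."""
        self.vdd = "VDDH"
        self.log.append("vdd -> VDDH")
        self._check()
        self.clock = "f"
        self.log.append("clock -> f")
        self._check()


block = ClusteredBlock()
block.go_slow()
block.go_fast()
print(block.log)
# prints ['clock -> f/2', 'vdd -> VDDL', 'vdd -> VDDH', 'clock -> f']
```

Reversing either sequence would trip the check, which is one way to see why the clock is changed before the voltage on the way down and after it on the way up.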

III. CIRCUIT DESIGN

Power Gating in CLB

There are a lot of SRAM cells in an FPGA, but the SRAM cells and their output buffers do not have to operate at high speed, since they only drive local nodes statically. Thus, high $V_{TH}$ ($V_{THH}$) can be used for the SRAM cells, and leakage current is not an issue there. On the other hand, logic blocks consume much leakage power because they use low $V_{TH}$ to enhance the speed. The leakage current in the logic blocks can be mitigated by applying a power-gating technique such as zigzag CMOS.

![Fig.1 Proposed FPGA](image1)

![Fig.2(a) Clock distribution. (b) $f/2$ is generated from $f$ at the clustered block.](image2)

![Fig.3 Timing chart of the clock frequency and the supply voltage.](image3)

![Fig.4 Zigzag CMOS scheme.](image4)

![Fig.5 Sneak leakage problem.](image5)

![Fig.6 Schematic of the proposed LUT. The black-painted NOR gates are used with high threshold voltage.](image6)

Fig.4 shows the schematics of NAND's and INV's that use the ZSCCMOS scheme. The voltage of the virtual $V_{DD}$ and $V_{SS}$ lines are denoted as $V_{DDV}$ and $V_{SSV}$, respectively. In a standby mode, they are neither at $V_{DD}$ nor at $V_{SS}$, but stay between $V_{DD}$ and $V_{SS}$. Thus the wake-up time is shorter than the other power-gating schemes. The overdrive voltage, $V_{OD}$, in the figure can be zero. Even if the $V_{OD}$ is zero, the leakage can be suppressed by an order of magnitude because of the off-off stacking structure incorporated in the scheme.
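To illustrate the idea behind the zigzag connection, the sketch below assigns the gated rail for each stage of an inverter chain from its known standby output. It follows the general idea of zigzag power gating as it is commonly described, not the exact circuit of Fig.4. The assumption is that the transistor network that is off in standby is the one tied to a virtual rail, so that each output stays actively driven from a real rail. The function name and the string labels are hypothetical.

```python
# Zigzag rail assignment for an inverter chain with a known standby input.
# Assumption: the network that is OFF in standby is connected to a virtual
# rail, while the ON network keeps the output driven from a real rail.
from typing import List, Tuple


def zigzag_rails(standby_input: int, stages: int) -> List[Tuple[int, str]]:
    """Return (standby_output, gated_rail) for each inverter stage."""
    assignment = []
    value = standby_input
    for _ in range(stages):
        value = 1 - value  # output of this inverter in standby
        if value == 1:
            # PMOS is on and holds the output from the real V_DD;
            # the off NMOS network goes to the virtual V_SS line.
            gated = "V_SSV"
        else:
            # NMOS is on and holds the output from the real V_SS;
            # the off PMOS network goes to the virtual V_DD line.
            gated = "V_DDV"
        assignment.append((value, gated))
    return assignment


print(zigzag_rails(standby_input=0, stages=4))
# prints [(1, 'V_SSV'), (0, 'V_DDV'), (1, 'V_SSV'), (0, 'V_DDV')]
```

The alternation between the two virtual rails along the chain is what gives the scheme its name, and it only works when the standby logic state of every node is known in advance.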

For CMOS gates such as INV or NAND, straightforward application of zigzag CMOS is fine, but at the interface between CMOS gates and transmission gates shown in Fig.5, applying the zigzag CMOS directly causes the sneak leakage problem. In an FPGA, multiplexers built from transmission gates are used not only in the logic blocks but also in the switch blocks. Therefore, at the interface of CMOS gates and transmission gates, special care has been taken to avoid the sneak leakage problem. For example, in a look-up table (LUT), small NOR gates that use high $V_{TH}$ are added to the SRAM cell outputs to set all input logic levels to ‘L’. Thus the sneak leakage current disappears, as shown in Fig.6.

Fig.7 shows the zigzag power-gated and sneak-leakage-path suppressed CLB. At the outputs of the CLB there are keepers to maintain the states of CLB outputs when the CLB is cut off. These keepers can be made with high $V_{TH}$ transistors of minimum size to reduce leakage current because they only have to keep the states of the CLB outputs and do not need to operate fast.

Interconnect

Since the inter-block interconnects in this scheme use $V_{DDL}$, only NMOS transistors are needed for the switch blocks and connection blocks. Therefore, the switch blocks and connection blocks become simple and the capacitance of the interconnect is reduced. These lead to smaller area and lower power consumption compared with the case where the interconnect uses $V_{DDH}$.

The reduced swing is effective in reducing the power, but it gives rise to a signal integrity issue. In normal SoC designs, if some interconnects use low-swing signals and the CMOS logic uses high-swing signals, a high-swing aggressor can induce noise on a low-swing victim whose amplitude exceeds the logic threshold. This is inevitable because, in SoC designs, high-swing signals and low-swing signals are laid out in a totally intermingled way. On the other hand, in a structured LSI like an FPGA, interconnects are laid out in an orderly fashion and low-swing inter-block interconnects can be bundled together. Thus, the signal integrity issue among inter-block interconnects can be solved. The only remaining source of the signal integrity issue is the coupling between inter-block interconnects and intra-block interconnects.

To solve this issue, a ground line is inserted between intra-block lines and inter-block interconnects, which can be done because the FPGA has an ordered structure.

Level Shifter Design

When the micro-$V_{DD}$-hopping technique is applied as shown in Fig.1, each block can be operated under either $V_{DDH}$ or $V_{DDL}$. Thus, voltage level shifters are needed between the blocks and the interconnects in order to avoid static current at the receiver side. The conventional level shifter shown in Fig.8(a) has a large delay because it suffers from contention between the pull-down and the pull-up transistors. The contention increases both delay and power consumption (because of the large crowbar current). Fig.8(b) shows the proposed level shifter, namely a Bypassing Enabled Level Shifter (BELS). Two PMOS and three NMOS transistors are added to the conventional shifter. The BELS has two operation modes: “SHIFT” mode and “NON-SHIFT” mode. When the output voltage $V_{DD}$ is $V_{DDH}$, the BELS is in the “SHIFT” mode. In the “SHIFT” mode, setting the Bypass signal to $V_{DDL}$ reduces the contention at node B, so the logic value of node B is established faster. Therefore, the delay is less than in the case where the Bypass signal is set to 0V. When $V_{DD}$ is $V_{DDL}$, the shifting function is not required and the BELS is switched to the “NON-SHIFT” mode. In the “NON-SHIFT” mode, signal EN is set to $V_{DDH}$ to cut off the unused transistors from the power supply lines. Because the Bypass signal is set to $V_{DDH}$ in the “NON-SHIFT” mode, MN5 can pass the signal without threshold voltage loss (assuming that $V_{DDH} > V_{DDL} + V_{TH}$).
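The control settings described for the two modes can be summarized in a short sketch. The function names are hypothetical, the EN setting in the “SHIFT” mode is left unspecified because the text does not give it, and the 0.6V threshold in the example is an illustrative value, not a figure from the paper.

```python
# Control settings of the BELS as described in the text.
# None marks a setting that the text does not specify.
from typing import Dict, Optional


def bels_controls(block_vdd: str) -> Dict[str, Optional[str]]:
    """Map the block supply level to the BELS mode and control signals."""
    if block_vdd == "VDDH":
        # Shifting is needed; Bypass at V_DDL reduces contention at node B.
        return {"mode": "SHIFT", "Bypass": "VDDL", "EN": None}
    if block_vdd == "VDDL":
        # No shifting needed; EN cuts off the shifter, MN5 bypasses the signal.
        return {"mode": "NON-SHIFT", "Bypass": "VDDH", "EN": "VDDH"}
    raise ValueError(f"unknown supply level: {block_vdd}")


def bypass_is_lossless(vddh: float, vddl: float, vth: float) -> bool:
    """Check the stated condition for MN5 to pass a full V_DDL swing."""
    return vddh > vddl + vth


print(bels_controls("VDDL"))
print(bypass_is_lossless(vddh=3.3, vddl=1.8, vth=0.6))  # vth is illustrative
# second line prints True
```

With the measured supply pair of 3.3V and 1.8V the condition holds with margin for any realistic threshold, whereas it would fail if $V_{DDL}$ were raised close to $V_{DDH}$.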

IV. SIMULATION AND MEASUREMENT RESULTS

Both the proposed and the conventional FPGA are manufactured using 0.35µm CMOS technology with a nominal supply voltage of 3.3V to demonstrate how much the proposed approach can reduce power consumption. Fig.11 shows the fabricated chip microphotograph. Area overhead of the proposed FPGA is 2%. Fig.9 shows the measured power and delay of the proposed and the conventional FPGA when an 8-bit ripple carry adder is implemented. $V_{DD}$ of the conventional FPGA is kept at 3.3V. For the proposed FPGA, the supply voltage $V_{DD}$ of the clustered block is either $V_{DDH}$ or $V_{DDL}$. In the measurement, $V_{DDL}$ is varied from 1.8V to 2.5V, while $V_{DDH}$ is kept at 3.3V. At $V_{DDL}$=1.8V, power consumption of the proposed FPGA is reduced by 86% compared with the conventional FPGA. Therefore, if the speed required is half of the maximum achievable speed, the power consumption can be reduced by 86%. Even when the clustered block supply voltage $V_{DD}$ is $V_{DDH}$ and has the same value as the nominal supply voltage of 3.3V, the power consumption of the proposed FPGA is smaller than that of the conventional one by 10%, and the delay overhead is only 3%. This is because the interconnect power of the proposed FPGA is reduced by the reduced signal swing on inter-block interconnects.

To compare the leakage power of the proposed FPGA with that of the conventional FPGA, simulations using 90nm CMOS technology with dual \( V_{DD} \) are carried out. The circuit used in the simulations has the same structure as the circuit fabricated in 0.35\( \mu \)m technology; only the technology parameters are changed. The nominal supply voltage is 1V. Fig.10 shows the simulated leakage power of the conventional and proposed FPGA. When the supply voltage of the clustered block \( V_{DD} \) is \( V_{DDL} \), the leakage power of the proposed FPGA can be reduced to 58%. The leakage power is strongly dependent on \( V_{DDL} \): lowering the supply voltage weakens the Drain Induced Barrier Lowering (DIBL) effect, which reduces the subthreshold leakage current. Therefore, the leakage power is reduced when the supply voltage \( V_{DDL} \) is reduced. If the zigzag power-gating scheme is also used, the leakage power of the proposed FPGA can be reduced further, by 95% compared to that of the conventional FPGA.

The wake-up time of the power-gating circuit is 620ps. If the clock frequency of the FPGA is 500MHz (a clock period of 2ns), the clustered block can be forced into an active mode within one clock cycle.

Fig.9  Measured power and delay of the proposed FPGA when \( V_{DDH} \) is fixed at 3.3V and \( V_{DDL} \) is changed from 1.8V to 2.5V. Power and delay of the conventional FPGA are also shown.

Fig.10  Leakage power of the proposed FPGA when \( V_{DD} = V_{DDL} \) and the zigzag power gating is adopted. Leakage power of the conventional FPGA is also shown.

Fig.11  Chip microphotograph. S and C stand for switch block and connection block, respectively.

**SUMMARY**

A low-power FPGA based on fine-grain control is proposed. In the proposed FPGA, fine-grain control of \( V_{DD} \), clock and power gating are integrated to provide a low-power solution for an FPGA. As to the micro-\( V_{DD} \) hopping, when the required speed of the clustered block is half of the maximum speed, power consumption of the proposed FPGA can be reduced by 86% compared to that of the conventional FPGA. Novel level shifter designs are also proposed to make the micro-\( V_{DD} \) hopping more effective. Simulation using 90nm CMOS technology shows that leakage power of the proposed FPGA can be reduced by 95%. The proposed clustered block can be forced into standby mode within one clock. Therefore, the proposed method is an effective method to reduce leakage power in the leakage-dominant era.

**ACKNOWLEDGEMENT**

Valuable discussions with Mr. K. Mashiko, A. Hashiguchi, Y. Ueda, M. Nomura and H. Yamamoto from Semiconductor Technology Academic Research Center (STARC) and M. Takamiya are appreciated. The chip fabrication is supported by VLSI Design and Education Center (VDEC), the University of Tokyo, in collaboration with Rohm Corp. and Dai Nippon Printing Corp.

**REFERENCES**