PAPER Special Section on Advanced Technologies in Digital LSIs and Memories

# A 10T Non-precharge Two-Port SRAM Reducing Readout Power for Video Processing

Hiroki NOGUCHI<sup>†a)</sup>, Nonmember, Yusuke IGUCHI<sup>†</sup>, Hidehiro FUJIWARA<sup>†</sup>, Student Members, Shunsuke OKUMURA<sup>†</sup>, Nonmember, Yasuhiro MORITA<sup>†</sup>, Student Member, Koji NII<sup>†,††</sup>, Hiroshi KAWAGUCHI<sup>†</sup>, Nonmembers, and Masahiko YOSHIMOTO<sup>†</sup>, Member

SUMMARY We propose a low-power non-precharge-type two-port SRAM for video processing that exploits statistical similarity in images. To minimize the charge/discharge power on a read bitline, the proposed memory cell (MC) has ten transistors (10T), comprising the conventional 6T MC, a readout inverter, and a transmission gate for a read port. In addition, to incorporate three wordlines, we propose a shared wordline structure, with which the vertical size of the 10T MC is fitted to that of the conventional 8T MC. Since the readout inverter fully charges/discharges a read bitline, there is no precharge circuit on the read bitline. Thus, power is not consumed by precharging, but only when a readout datum changes. This feature suits video processing since image data have spatial correlation and similar data are read out in consecutive cycles. In addition to the power reduction, the prechargeless structure shortens the cycle time by 38% compared with the conventional SRAM, because it does not require a precharge period. This, in turn, means that the proposed SRAM can operate at a lower voltage, which achieves further power reduction. Compared to the conventional 8T SRAM, the proposed SRAM reduces the charge/discharge probability on the bitlines to 19% (an 81% saving). In measurements, we confirmed that the proposed 64-kb video memory in a 90-nm process achieves an 85% power saving on the read bitline when used as an H.264 reconstructed-image memory. The area overhead is 14.4%.

key words: 8T SRAM cell, 10T SRAM cell, low-power SRAM, nonprecharge SRAM, two-port SRAM, video processing

## 1. Introduction

As the ITRS Roadmap predicts, memory area is growing and will occupy 90% of an SoC's area by 2013 [1]. This trend also applies to real-time video SoCs. An H.264 encoder for high-definition television requires at least a 500-kb memory as a search-window buffer, which consumes 40% of its total power [2]. As process technology is scaled down, a large-capacity SRAM will be adopted as a frame buffer and/or a reconstructed-image memory on a video chip, and might dissipate a still larger portion of the power. To save power in real-time video applications, we report a low-power two-port SRAM in this paper.

A two-port SRAM is suitable for real-time video processing since it can perform one read and one write simultaneously in a clock cycle [2]–[5]. In the conventional eight-transistor (8T) two-port memory cell (MC) depicted in Fig. 1, two nMOS transistors (N5 and N6) for a read wordline (RWL) and a local read bitline (LRBL) are added to a single-port 6T MC, which makes the cell free of static noise margin (SNM) degradation in a read operation [6]. Meanwhile, a precharge circuit must be implemented on the LRBL so that the two nMOS transistors can sink the bitline charge to the ground.

DOI: 10.1093/ietele/e91-c.4.543

Fig. 1 A schematic of the conventional 8T precharge-type two-port memory cell.

In addition to the precharge circuit, we have to prepare a bitline keeper on the LRBL in the conventional two-port SRAM. Many MCs connected to the LRBL draw bitline leakage even if they are not selected as a readout bit. Even when a selected MC does not discharge the LRBL ("1" readout), the LRBL voltage would be decreased by the bitline leakage if there were no bitline keeper. The bitline keeper compensates for this bitline leakage and maintains the voltage level on the LRBL during a "1" readout [7]. Otherwise, we cannot distinguish a readout current from the bitline leakage, which results in a readout malfunction.

As process technology advances, the supply voltage and the threshold voltage of transistors decrease. Since a low threshold voltage increases the bitline leakage, we have to upsize the bitline keeper and thus pay an area overhead. A large bitline keeper also has a negative influence on the readout time. To make matters worse, the delay overhead becomes larger as the supply voltage decreases.

Figure 2 depicts simplified operation waveforms in read cycles in the conventional 8T precharge-type SRAM. A precharge scheme is adopted, and an LRBL must be precharged to the supply voltage by the start of a clock cycle. Therefore, charge/discharge power is consumed on the LRBL when "0" is read out. In contrast, no power is consumed when "1" is read out because the LRBL stays at the supply-voltage level and does not have to be precharged.

Manuscript received August 18, 2007.

Manuscript revised November 14, 2007.

<sup>†</sup>The authors are with Kobe University, Kobe-shi, 657-8501 Japan.

<sup>††</sup>The author is with Renesas Technology Corporation, Itami-shi, 664-0005 Japan.

a) E-mail: h-nog@cs28.cs.kobe-u.ac.jp

In our prior study, which examined saving the charge/discharge power on a read bitline, a majority logic circuit and data-bit reordering were used to write as many "1"s as possible [10] (hereafter, we call the prior SRAM "MJ SRAM" in this paper). The MC structure in the MJ SRAM is the same as that of the conventional 8T SRAM, although the read and write circuits differ. Input data comprising eight pixels are reordered into digit groups (from the most-significant-bit group to the least-significant-bit group), and then a flag bit is appended to each group. If the number of "0"s in a group is more than that of "1"s, the group data are inverted by the majority logic circuit. Thereby, we can maximize the number of "1"s in the input data. The inversion information ("1" means inversion) is stored in the additional flag bit. In a read cycle, the group data are inverted if the flag bit is true, and then they are put back in the original order so that we can read out the original data. This mechanism reduces the power of the read bitline because it statistically increases the probability that "1" is read, in which case no power is dissipated.
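The majority-logic encoding described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the circuit of [10]: the function names `mj_encode` and `mj_decode` are hypothetical, a tie between "0"s and "1"s is assumed to leave the group unchanged, and each digit group is assumed to hold one bit from each of the eight pixels.

```python
def mj_encode(pixels, bits=8):
    """Reorder pixels into digit groups (MSB group first) and apply majority logic.

    Returns a list of (group_bits, flag) pairs; flag == 1 means the group was inverted.
    """
    groups = []
    for bit in range(bits - 1, -1, -1):
        group = [(p >> bit) & 1 for p in pixels]
        zeros = group.count(0)
        ones = len(group) - zeros
        flag = 1 if zeros > ones else 0  # assumption: a tie is stored without inversion
        if flag:
            group = [b ^ 1 for b in group]
        groups.append((group, flag))
    return groups


def mj_decode(groups, bits=8):
    """Undo the inversion indicated by each flag and restore the original pixel order."""
    pixel_count = len(groups[0][0])
    pixels = [0] * pixel_count
    for index, (group, flag) in enumerate(groups):
        bit = bits - 1 - index
        for i, stored in enumerate(group):
            pixels[i] |= (stored ^ flag) << bit
    return pixels


# Hypothetical input: eight neighboring luma values.
example_pixels = [16, 17, 18, 19, 20, 21, 22, 23]
encoded = mj_encode(example_pixels)
decoded = mj_decode(encoded)
```

For this input, decoding recovers the original pixels, and every stored group holds at least as many "1"s as "0"s, which is the property that lowers the read-bitline power in a precharge-type SRAM.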

For further power reduction, we propose a novel non-precharge-type SRAM in this paper [11]. The proposed SRAM reduces the bitline power both when consecutive "0"s are read out and when consecutive "1"s are read out, since no precharge circuit exists on the bitlines. The charge/discharge power is consumed only when a readout datum changes. In contrast, in a conventional SRAM, a consecutive-"0" readout requires a large bitline power. In addition to the power reduction with the consecutive readout, the proposed SRAM operates in a shorter cycle time since a precharge period is not required. Moreover, we can remove the bitline keeper, which improves operation in a low-voltage region. In comparison to the MJ SRAM, our proposed SRAM eliminates the flag bit that causes a power overhead.

The remainder of this paper is organized as follows. Section 2 introduces the proposed 10T non-precharge SRAM. In Sect. 3, we compare the conventional MC layout and the proposed MC layout that has a novel shared wordline structure. Section 4 explains the reduction of the number of charge/discharge times in simulation. Section 5 presents a description of the proposed SRAM's design, particularly how to design a hierarchical bitline structure. In Sect. 6, we verify a 64-kb SRAM test chip in a 90-nm process technology. Section 7 summarizes the main findings of this study.

## 2. The Proposed 10T Memory Cell

## 2.1 Circuit

Figure 3 shows a schematic of the proposed 10T nonprecharge two-port MC. Two pMOS transistors are added to the conventional 8T two-port MC, which results in the combination of the conventional 6T single-port MC, an inverter, and a transmission gate. The additional signal (RWL\_N) is an RWL inversion signal; it controls the appended pMOS transistor at the transmission gate. The additional pMOS transistor (P4) increases an LRBL capacitance, compared to the conventional 8T two-port SRAM. While the RWL and RWL\_N are asserted and the transmission gate is on, a stored node is connected to an LRBL through the inverter. It is not necessary to prepare a precharge circuit since the inverter can independently charge/discharge the LRBL. There is no precharge circuit on either differential write bitline (WBL and WBL\_N) because they are dedicated for a write port.

Figure 4 illustrates operation waveforms in the proposed 10T non-precharge SRAM. Because a non-precharge scheme is used, the charge/discharge power on the LRBL is consumed only when the LRBL level changes. Consequently, no power is dissipated on the LRBL if an upcoming datum is the same as the previous one.



Fig. 2 Waveforms in the conventional 8T precharge-type two-port SRAM in read cycles.

Fig. 3 A schematic of the proposed 10T non-precharge-type two-port memory cell.

Fig. 4 Waveforms in the proposed 10T non-precharge-type two-port SRAM in read cycles.

The proposed SRAM theoretically reduces the power on the LRBL to half that of a conventional 8T SRAM in a read operation if the readout data are random and the bitline capacitance is equal. In the proposed non-precharge SRAM, the transition probability in a sequence of random data is 50%, so the expected number of charge/discharge events per read cycle is 0.5. In the conventional SRAM, a charge and discharge pair takes place whenever "0" is read out, which happens with a probability of 50%, so the expected number of charge/discharge events per read cycle is one. The LRBL power in the read operation is thereby reduced to about 50% in our proposed SRAM.
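The expected values quoted above can be checked by exhaustive enumeration. The following Python sketch counts bitline events for every 8-bit readout sequence; the function names are hypothetical, each "0" readout in the precharge-type SRAM is counted as two events (one discharge and one precharge), and the LRBL of the non-precharge SRAM is assumed to hold "1" before the first read.

```python
from itertools import product


def conventional_events(bits):
    """Precharge-type LRBL: every "0" readout costs one discharge plus one precharge."""
    return 2 * sum(1 for b in bits if b == 0)


def nonprecharge_events(bits, initial=1):
    """Non-precharge LRBL: one charge or discharge only when the datum changes."""
    events = 0
    state = initial  # assumed LRBL level before the first read
    for b in bits:
        if b != state:
            events += 1
            state = b
    return events


READS = 8
sequences = list(product((0, 1), repeat=READS))

# Average number of charge/discharge events per read over all equally likely sequences.
avg_conventional = sum(conventional_events(s) for s in sequences) / len(sequences) / READS
avg_nonprecharge = sum(nonprecharge_events(s) for s in sequences) / len(sequences) / READS

print(avg_conventional, avg_nonprecharge)
```

Under these assumptions, the averages are one event per read for the precharge-type LRBL and 0.5 for the non-precharge LRBL. A run of consecutive "0"s shows the largest gap: the precharge-type LRBL toggles in every cycle, whereas the non-precharge LRBL changes only once.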

## 3. Shared Wordline Structure

Figure 5 illustrates the layout patterns of the conventional and proposed MCs in a 90-nm process technology. The transistor sizes are also shown in the figure.

Figure 5(a) portrays the conventional MC layout. The schematic is shown in Fig. 1. The cell area is  $3.15 \times 0.76 \,\mu\text{m}^2$ , which is the smallest size of the three. Because this memory cell frees an SNM, the driver transistors' (N1, N2) width can be minimized; then the load transistors' (P1, P2) length can be enlarged in order to extend write margin. Therefore, the operation margin is sufficient at the nominal supply voltage of 1.0 V.

Figure 5(b) shows the proposed 10T MC layout. The schematic is shown in Fig. 3. The cell area is  $3.70 \times 0.935 \,\mu\text{m}^2$ . The height of this MC ( $0.935 \,\mu\text{m}$ ) is higher than the conventional one ( $0.76 \,\mu\text{m}$ ) because the proposed MC requires three wordlines: WWL, RWL, and RWL\_N. The minimum metal pitch to align these three wordlines is larger than a transistor pitch. Therefore the height of this proposed MC is restricted by the metal lines. The coupling noise between the wordlines would be larger than that in the conventional one, because the metal pitch has to be minimized for a small MC area.

We propose a shared WL structure to shrink the area of the proposed MC. Figure 6 illustrates the shared WL structure. A pair of top and bottom MCs shares an RWL and its complementary signal, RWL\_N. In return, two LRBLs have to be interconnected in each column, as shown in the figure. For instance, when RWL0 and RWL\_N0 are asserted, the MCs in Row 0 and Row 1 become active. The data stored in Row 0 are read out to the LRBL0 group (LRBL0\_0, ..., LRBL0\_n-1), and the data stored in Row 1 are read out to the LRBL1 group (LRBL1\_0, ..., LRBL1\_n-1). Thus, additional drivers are prepared to choose which data are read out to the global read bitlines (GRBLs); the GRBL driver selects either the LRBL0 group or the LRBL1 group using the selector signal.

**Fig. 5** Layouts of (a) the conventional MC, (b) the proposed MC without shared WL structure, and (c) the proposed MC with shared WL structure.

Figure 5(c) shows the layout of an MC pair with the shared WL structure. The cell area is $3.955 \times 0.76 \,\mu\text{m}^2$. By introducing the shared WL structure, the proposed 10T MC can be designed with the same height as the conventional MC, because the RWL and the RWL\_N are shared by each MC pair. Hence, the numbers of RWLs and RWL\_Ns are reduced to half that of the WWLs, which reduces the MC area overhead of the proposed 10T SRAM. Although, as shown in Fig. 5(b), the MC height is restricted by the metal wordlines, the MC height with the shared wordline structure is restricted by the transistor pitch as depicted in Fig. 5(c). Therefore, the metal pitch of the wordlines in Fig. 5(c) is relaxed, and the coupling noise between the wordlines can be reduced [12].
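The cell dimensions quoted for Fig. 5 allow a quick comparison of the three layouts. The Python sketch below only multiplies the stated widths and heights; the dictionary keys and function names are hypothetical labels introduced for illustration.

```python
# Cell dimensions (width, height) in micrometers, as quoted in the text for Fig. 5.
CELLS = {
    "conventional 8T": (3.15, 0.76),
    "10T without shared WL": (3.70, 0.935),
    "10T with shared WL": (3.955, 0.76),
}


def cell_area(name):
    """Return the cell area in square micrometers."""
    width, height = CELLS[name]
    return width * height


def overhead_vs_8t(name):
    """Return the relative cell-area increase over the conventional 8T MC."""
    return cell_area(name) / cell_area("conventional 8T") - 1.0


for name in CELLS:
    print(f"{name}: {cell_area(name):.3f} um^2, overhead {overhead_vs_8t(name):+.1%}")
```

With these numbers, the shared WL structure lowers the cell-level overhead from roughly 45% to roughly 26%. This cell-level figure differs from the 14.4% overhead reported for the whole 64-kb SRAM in Sect. 5.2, which also includes the peripheral circuits.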

In the proposed MC, all transistors are aligned on two lines, so this MC layout improves lithographic quality and is better for manufacturability than the MC without the shared wordline structure. In addition, there is no bent polysilicon pattern in the proposed MC, which can potentially reduce variations in the transistors' finished dimensions. In the MC pair, since each bitline is shielded by a VDD line or a GND line, it is tolerant of coupling noise [12].

## 4. Reducing the Number of Charge/Discharge Times

## 4.1 Application to Video Images

In the proposed SRAM, the charge/discharge power consumed on the LRBLs is proportional to the number of times that a datum flips (i.e., the number of transitions: "0" to "1" and "1" to "0") along the time axis. Therefore, like the MJ SRAM, the proposed SRAM can be exploited for video processing, because adjacent pixels in a video image are strongly correlated with one another.

In the H.264 codec, the YUV format is adopted for pixel data. An example is shown in Fig. 7. One pixel comprises an 8-bit luma (Y signal) and a 4-bit chroma (U and V signals). In this paper, only luma data are considered. The most significant bits (MSBs) in consecutive data tend to be lopsided to either "0" or "1" with high probability, while the least significant bits (LSBs) are nearly random. In other words, the correlation becomes stronger in a more significant bit, which is well exploited in the MJ SRAM.
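The difference between MSBs and LSBs can be made concrete by counting transitions per bit position in a sequence of luma samples. The Python sketch below is an illustration only: `bitplane_transitions` is a hypothetical helper, and the input is a synthetic ramp of slowly varying values rather than real video data.

```python
def bitplane_transitions(samples, bits=8):
    """Count "0"/"1" flips per bit position between consecutive samples.

    Index 0 of the returned list is the LSB; index bits - 1 is the MSB.
    """
    counts = [0] * bits
    for previous, current in zip(samples, samples[1:]):
        changed = previous ^ current  # bits that differ between neighbors
        for position in range(bits):
            counts[position] += (changed >> position) & 1
    return counts


# Synthetic, slowly varying luma values (an assumption standing in for adjacent pixels).
ramp = list(range(96, 112))
counts = bitplane_transitions(ramp)
print(counts)
```

For this ramp, the LSB flips at every step while the upper four bits never change, so in a non-precharge scheme the LRBLs carrying the more significant bits would rarely be charged or discharged.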

As described in Sect. 2, the power reduction on the LRBLs is theoretically expected due to the non-precharge scheme, even if input data are random. Moreover, further power reduction is promising since image data are lopsided to "0"s or "1"s with higher probability in a more significant digit. We exploit these characteristics in the proposed SRAM to reduce the LRBL power, as the MJ SRAM does.

**Fig. 7** An example of H.264 image data and its mapping onto eight LRBLs.

**Table 1** Simulation conditions in the H.264 encoder.

| Item         | Condition                |
|--------------|--------------------------|
| Profile      | Main profile             |
| Frame rate   | 30 fps                   |
| Bit rate     | 7.5 Mbps                 |
| Search range | $\pm 128 \times \pm 128$ |
| Symbol mode  | CABAC                    |
| JM version   | 9.8                      |

## 4.2 Optimization of Block Size

In this section, we discuss the optimum data mapping that utilizes the spatial correlation in an image. In a video image, the correlations among local pixels are expected to differ between the vertical and lateral directions. It is important to determine the block size mapped onto an LRBL since the scan path affects how effectively the spatial correlation is used and thus the power. Assuming an H.264 encoder, we ran a simulation under the conditions shown in Table 1 to fix the block size. In the simulation, statistical analyses were carried out on the original images and reconstructed images, extracted from ten high-definition test sequences shown in Fig. 8: "Bronze with Credit," "Building along the Canal," "Church," "Intersections," "Japanese Room," "European Market," "Yachting," "Street Car," "Whale Show," and "Yacht Harbor." The original image is encoded; then its reconstructed image is generated in a local decoding loop and is used for motion estimation and motion compensation. The encoding process is depicted in Fig. 9.

Figure 7 illustrates an example of the block size and its scan path. We set the number of pixels in a block to 256, because the search range is $\pm 128 \times \pm 128$ in the H.264 encoder and a burst access over 256 pixels is possible if a full-search algorithm is considered. Hence, in the simulation, a pixel block ($W \times H$ pixels) has 256 pixels. The scan path from the first pixel to the W-th pixel is mapped onto eight LRBLs.

**Fig. 8** HD video sequences. Each sequence comprises 100 frames and $1920 \times 1024$ pixels.

Fig. 9 H.264 encoding process.

Figure 10 compares the transition possibilities (the normalized numbers of charge/discharge times) on an LRBL between the conventional 8T SRAM, MJ SRAM, and proposed 10T SRAM when the block size is changed. The values are averages over the ten sequences. In both cases of the original image and the reconstructed image, the block size of $256 \times 1$ pixels is optimum in terms of power reduction. The graph indicates that the proposed 10T SRAM saves 73% of the dynamic power on an LRBL compared to the conventional 8T SRAM when the original image is read out.
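How the block shape influences the number of bit transitions can be illustrated with a small sketch. The Python code below is not the simulation behind Fig. 10: the raster scan inside each block, the helper names, and the tiny synthetic frame (constant along each row, different between rows) are all assumptions chosen only to show the mechanism.

```python
def block_scan(frame, width, height):
    """Yield pixels block by block, in raster order inside each width x height block.

    Assumes the frame dimensions are multiples of the block dimensions.
    """
    rows, cols = len(frame), len(frame[0])
    for block_y in range(0, rows, height):
        for block_x in range(0, cols, width):
            for y in range(block_y, block_y + height):
                for x in range(block_x, block_x + width):
                    yield frame[y][x]


def count_bit_transitions(pixels, bits=8):
    """Count bit flips between consecutive pixels, summed over all bit positions."""
    mask = (1 << bits) - 1
    total = 0
    previous = None
    for pixel in pixels:
        if previous is not None:
            total += bin((pixel ^ previous) & mask).count("1")
        previous = pixel
    return total


# Synthetic 4 x 4 frame: every row is flat, and rows differ from each other.
toy_frame = [[16 * y] * 4 for y in range(4)]

for width, height in ((4, 1), (2, 2), (1, 4)):
    flips = count_bit_transitions(block_scan(toy_frame, width, height))
    print(f"{width} x {height} block: {flips} bit transitions")
```

On this synthetic frame, the wide and flat block gives the fewest transitions because the scan follows the direction in which the pixel values do not change. Which shape is best for real video depends on the image statistics, which is what the simulation in this section evaluates.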

The maximum power saving is achieved when a reconstructed image, which has a stronger correlation than the original image, is considered. The saving factor is extended to 81% compared to the conventional 8T SRAM, which indicates that the statistical characteristic of the reconstructed image is well exploited. It can be said that the proposed non-precharge SRAM is suitable for real-time video codecs such as MPEG2, MPEG4, and H.264 that require a large-capacity reconstructed-image memory.



**Fig. 10** Transition possibilities (the normalized numbers of charge/ discharge times) on an LRBL between the conventional 8T, MJ, and proposed 10T SRAMs when a block size is changed.

## 5. Design in 90-nm Process Technology

## 5.1 Delay Model of Read-Bitline RC Trees

In our proposed 10T SRAM, due to the additional pMOS transistor, MCs can fully charge/discharge each LRBL. However, the sizes of MC transistors are too small to charge/discharge each long RBL quickly. Therefore, in our design, we adjust the length of the RBLs which are charged/discharged by MCs in the hierarchical read-bitline structure (double-bitline structure: the LRBLs and GRBLs). The hierarchical read-bitline structure is effective to avoid a speed overhead of a single-bitline scheme, which is applicable to the 10T SRAM [6]. In our proposed shared WL structure, when an address is asserted, the numbers of the active MCs are different in write and read operations. This is because, only in the read operation, the wordlines are shared. The hierarchical read-bitline structure also solves this addressing problem.

We model the BL structure to minimize a propagation delay from the LRBLs to the GRBLs. Elmore delays are obtainable node-by-node on the bitline: all resistances and all capacitances from the input node to the output node. Figure 11 shows a  $\pi$ -type RC model of the SRAM read port when the total number of bits on each GRBL is set to 512 and each GRBL is divided into LRBLs by a factor N (N is a natural number) [8]. The respective widths of the LRBL and GRBL using the metal-1 and metal-2 lines are set to 0.14  $\mu$ m.

In Fig. 11, M, $C_L$, $C_G$, $R_L$, $R_G$, $R_{MC}$, and $R_D$ respectively represent the number of bits on each LRBL, the capacitance of the LRBL per bit, the capacitance of the GRBL per 512/N bits, the resistance of the LRBL per bit, the resistance of the GRBL per 512/N bits, the output resistance of each MC to the LRBL, and the output resistance of each GRBL driver.

Fig. 11 A $\pi$-type RC model of the SRAM read port.

**Table 2** Values of M, $C_L$, $C_G$, $R_L$, $R_G$, $R_{MC}$, and $R_D$, as obtained by the ASPLA 90-nm process parameters.

| Parameter           | Conventional 8T SRAM  | Proposed 10T SRAM with shared WL structure |
|---------------------|-----------------------|--------------------------------------------|
| M [bits]            | $512 / N$             | $256 / N$                                  |
| $C_L$ [fF]          | 0.408                 | 0.748                                      |
| $C_G$ [fF]          | $0.0546 \times M + 2$ | $0.1248 \times M + 2$                      |
| $R_L$ [$\Omega$]    | 0.0789                | 0.1580                                     |
| $R_G$ [$\Omega$]    | $0.0789 \times M$     | $0.1580 \times M$                          |
| $R_{MC}$ [$\Omega$] | $1.2 \times 10^4$     | $1.5 \times 10^4$                          |
| $R_D$ [$\Omega$]    | $2.5 \times 10^3$     | $2.5 \times 10^3$                          |

Table 2 summarizes the values of M, $C_L$, $C_G$, $R_L$, $R_G$, $R_{MC}$, and $R_D$, obtained by the ASPLA 90-nm process parameters. The parameter $C_L$ is the sum of the wiring capacitance and the drain capacitance in an MC. Note that the drain capacitance depends on the MC topology. For the conventional 8T MC in Fig. 1, the drain capacitance corresponds to that of N6 only. In the same way, for the proposed 10T MC in Fig. 3, the drain capacitance is the sum of those of N6 and P4, so $C_L$ in the proposed MC is larger than that in the conventional 8T MC. The parameter $C_G$ is the sum of the wiring capacitance of the GRBLs and the drain capacitance of the GRBL driver. In this model, the drain capacitance of the GRBL driver circuits equals 2 fF. $R_{MC}$ and $R_D$ are obtained by analyzing the characteristics of the transistors connected to the LRBL and the GRBL, respectively.

When the total number of bits on each GRBL and each LRBL are set to 512 and M, respectively, and each GRBL is divided into LRBLs by a factor, N, the Elmore delay $\tau_{\text{Elmore}}(M, N)$ is expressed as follows [9]:

$$
\begin{aligned}
\tau_{\text{Elmore}}(M, N)
&= R_{MC} (M \cdot C_L + N \cdot C_G) + R_L \sum_{k=0}^{M-1} \{k \cdot C_L + N \cdot C_G\} + R_D \cdot N \cdot C_G + R_G C_G \sum_{k=0}^{N-1} k \\
&= R_{MC} (M \cdot C_L + N \cdot C_G) + R_L \left\{ \frac{M (M-1)}{2} \cdot C_L + M \cdot N \cdot C_G \right\} + R_D \cdot N \cdot C_G + R_G \cdot \frac{N (N-1)}{2} \cdot C_G \\
&= \frac{1}{2} R_L C_L \cdot M^2 + \left( R_{MC} C_L - \frac{1}{2} R_L C_L \right) \cdot M + \frac{1}{2} R_G C_G \cdot N^2 + \left( R_{MC} C_G + R_D C_G - \frac{1}{2} R_G C_G \right) \cdot N + R_L C_G \cdot M \cdot N
\end{aligned}
\tag{1}
$$

**Fig. 12** Elmore delays by numeric calculation using ASPLA 90-nm process parameters. (a) The conventional SRAM and (b) proposed SRAM with the shared WL structure.

The values in Table 2 are substituted into Eq. (1), and $\tau_{\text{Elmore}}(M, N)$ is obtained by numerical calculation. Figure 12 shows $\tau_{\text{Elmore}}(M, N)$. When the total number of bits on each GRBL is set to 512, the optimum *N* is 8 in both the conventional and the proposed SRAMs.
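The numerical evaluation of Eq. (1) can be reproduced with a short Python sketch. It uses the Table 2 values directly and rests on a few assumptions that are not stated in the text: M is tied to N as listed in Table 2 (512/N or 256/N), N is restricted to powers of two so that M stays an integer, and the function and parameter names are hypothetical. Resistances are in ohms and capacitances in femtofarads, so the result is in femtoseconds.

```python
def elmore_delay_fs(n, m_numerator, c_l, c_g_slope, c_g_offset, r_l, r_g_slope, r_mc, r_d):
    """Evaluate the last form of Eq. (1) for a division factor n (result in fs)."""
    m = m_numerator // n              # bits on each LRBL, per Table 2
    c_g = c_g_slope * m + c_g_offset  # GRBL capacitance per segment [fF]
    r_g = r_g_slope * m               # GRBL resistance per segment [ohm]
    return (
        0.5 * r_l * c_l * m ** 2
        + (r_mc * c_l - 0.5 * r_l * c_l) * m
        + 0.5 * r_g * c_g * n ** 2
        + (r_mc * c_g + r_d * c_g - 0.5 * r_g * c_g) * n
        + r_l * c_g * m * n
    )


# Table 2 values.
CONVENTIONAL = dict(m_numerator=512, c_l=0.408, c_g_slope=0.0546, c_g_offset=2.0,
                    r_l=0.0789, r_g_slope=0.0789, r_mc=1.2e4, r_d=2.5e3)
PROPOSED = dict(m_numerator=256, c_l=0.748, c_g_slope=0.1248, c_g_offset=2.0,
                r_l=0.1580, r_g_slope=0.1580, r_mc=1.5e4, r_d=2.5e3)


def best_division(params):
    """Return the power-of-two division factor with the smallest Elmore delay."""
    candidates = [2 ** k for k in range(10) if 2 ** k <= params["m_numerator"]]
    return min(candidates, key=lambda n: elmore_delay_fs(n, **params))


for label, params in (("conventional", CONVENTIONAL), ("proposed", PROPOSED)):
    n = best_division(params)
    print(f"{label}: N = {n}, delay = {elmore_delay_fs(n, **params) * 1e-6:.3f} ns")
```

Under these assumptions, the minimum falls at N = 8 for both parameter sets, in agreement with the optimum stated above. The delay is dominated by the terms containing $R_{MC}$ and $R_D$, so the wire resistances $R_L$ and $R_G$ have little effect on the location of the minimum.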

## 5.2 Chip Overview

Figure 13 shows a block diagram of the proposed SRAM. A hierarchical read-bitline structure, already discussed in the previous subsection, is applied. A GRBL driver drives a GRBL with a block selector signal from the X decoders.

Figure 14 shows a chip micrograph of the proposed non-precharge 64-kb SRAM in a 90-nm process technology. The MC area, which comprises 10 transistors, is  $3.96 \times 0.76 \,\mu\text{m}^2$ . An MC block is 64 words by 64 bits, into which two 256-pixel blocks can be put.

Figure 15 shows operation waveforms of the proposed non-precharge SRAM when "0" and "1" are read out. After a block selector signal is asserted, a GRBL is discharged/charged as Dataout. The access times at the "0" and "1" readouts are 0.93 ns and 1.16 ns, respectively. The "0" readout is faster than the "1" readout because the nMOS transistors in the GRBL driver and the read circuit are stronger than the pMOS ones. Figures 15(a) and 15(b) demonstrate that the proposed SRAM shortens the cycle time to 1.16 ns because of the precharge-less structure. Since the proposed SRAM does not require a precharge period, this access time corresponds to an 862-MHz (= 1/1.16 ns) operation.

Fig. 13 A block diagram of a memory cell block in the proposed SRAM.

**Fig. 14** A chip micrograph of the proposed 10T SRAM and the conventional 8T SRAM in a 90-nm process technology. The total memory size of each SRAM is 64 kb.

An area comparison of a conventional SRAM, an MJ SRAM, and the proposed SRAM is shown in Fig. 16. In Fig. 16, the SRAM areas include three parts: the MCs part, the read and write circuits part, and the others part. The MCs part represents all of the MC array area, and the additional flag bits in MJ SRAM. The read and write circuits part contains the write drivers, precharge circuits, sense amplifiers, and the GRBL drivers in the proposed SRAM. The others part contains the address decoders, word line drivers, flipflops for input, data bus, and timing control circuits. The area overhead in the proposed SRAM is 14.4% because two pMOS transistors are added to the conventional 8T MC. However, the read and write circuits are smaller than the conventional SRAM by 1% because of elimination of the precharge and bitline keeper circuits.



Fig. 15 Operation waveforms of the proposed non-precharge SRAM when (a) "0" and (b) "1" are read out in a 90-nm process technology (CC corner, Temp. = $25^{\circ}$C).



**Fig. 16** Area comparison of 64-kb SRAMs in a 90-nm process technology.

## 6. Simulation and Measurement Results

## 6.1 Operating Frequency and Supply Voltage

As described above, there is no precharge period in the proposed SRAM, which shortens the cycle time compared to precharge-type SRAMs and thus allows a higher operating frequency. In addition, the readout speed is essentially improved by eliminating the bitline keeper, as described in Sect. 1. Furthermore, using the shared WL structure, we can set the number of bits on each LRBL to half that of the conventional SRAM. Despite the additional pMOS transistors that increase the LRBL capacitance, the proposed SRAM can operate faster than the conventional one. Figure 17 shows the frequency dependence on supply voltage in simulations. At a supply voltage of 1 V, the proposed non-precharge SRAM improves the operating frequency by 315 MHz (65% faster) compared with the conventional precharge SRAM. In other words, the proposed SRAM can run at a lower supply voltage than the others at the same operating frequency. In the conventional SRAM and MJ SRAM, the bitline keepers hinder low-voltage operation as mentioned in Sect. 1. In contrast, the proposed SRAM works at a lower voltage, which achieves much lower power since dynamic power is proportional to the square of the supply voltage. At an operating frequency of 300 MHz, the proposed SRAM operates properly at 0.69 V, whereas the MJ SRAM does not operate below 0.85 V.

**Fig. 17** Operating frequencies versus supply voltage in a 90-nm process technology. The dotted lines show the simulation results, and the solid lines show the measurement results.

Figure 17 also shows the measured frequency dependence on supply voltage. The measured operating frequency is limited to 120 MHz by the LSI tester. According to the measured results, at an operating frequency of 120 MHz, the proposed SRAM operates properly at 0.77 V, whereas the conventional SRAM does not work below 0.85 V.
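Because dynamic power is proportional to the square of the supply voltage, the voltage pairs quoted in this subsection translate into a first-order power factor. The Python sketch below computes only this voltage-scaling factor; it assumes equal switched capacitance, frequency, and switching activity on both sides, and the function name is hypothetical.

```python
def dynamic_power_ratio(v_low, v_ref):
    """Ratio of dynamic power at v_low to that at v_ref, assuming P ~ C * V^2 * f."""
    return (v_low / v_ref) ** 2


# Simulation at 300 MHz: proposed SRAM at 0.69 V versus MJ SRAM at 0.85 V.
simulated_ratio = dynamic_power_ratio(0.69, 0.85)

# Measurement at 120 MHz: proposed SRAM at 0.77 V versus conventional SRAM at 0.85 V.
measured_ratio = dynamic_power_ratio(0.77, 0.85)

print(f"300 MHz (simulation): {simulated_ratio:.3f}")
print(f"120 MHz (measurement): {measured_ratio:.3f}")
```

Under this simple model, voltage scaling alone accounts for a reduction of roughly one third at 300 MHz and a little under one fifth at 120 MHz. The savings from the reduced number of bitline transitions come on top of this factor.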

## 6.2 Power

In the proposed SRAM, there is no power overhead in a write operation because the 6T structure at the write port is identical to that of the conventional one. On the other hand, in a read operation, the additional pMOS transistor, P4 in Fig. 3, increases the LRBL capacitance by 83%. However, the shared WL structure reduces the number of bits on each LRBL to half that of the conventional SRAM. Therefore, the LRBL capacitance causes no speed overhead. Furthermore, note that the number of charge/discharge times is halved compared to the conventional case. Thereby, the readout power is theoretically reduced in the proposed SRAM even if data are random.

Fig. 18 Measured readout power at 120 MHz when reading the original images and reconstructed images.

**Fig. 19** (a) Readout power versus operating frequencies in a 90-nm process technology. (b) Magnified view.
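The argument above can be turned into a rough first-order estimate of the LRBL power for random data. The Python sketch below multiplies the per-bit capacitance ratio from Table 2, the halved number of bits per LRBL, and the halved number of charge/discharge events. It is an illustration only: the function name is hypothetical, the supply voltage is assumed equal on both sides unless a ratio is given, and GRBL and peripheral power are ignored.

```python
C_L_CONVENTIONAL_FF = 0.408  # LRBL capacitance per bit, conventional 8T (Table 2)
C_L_PROPOSED_FF = 0.748      # LRBL capacitance per bit, proposed 10T (Table 2)


def relative_lrbl_power(cap_per_bit_ratio, bits_ratio, event_ratio, voltage_ratio=1.0):
    """First-order LRBL power relative to the conventional SRAM (P ~ C * V^2 * events)."""
    return cap_per_bit_ratio * bits_ratio * event_ratio * voltage_ratio ** 2


cap_per_bit_ratio = C_L_PROPOSED_FF / C_L_CONVENTIONAL_FF

# Random data: half the bits per LRBL and half the charge/discharge events.
random_data_estimate = relative_lrbl_power(cap_per_bit_ratio, bits_ratio=0.5, event_ratio=0.5)

print(f"per-bit capacitance ratio: {cap_per_bit_ratio:.3f}")
print(f"relative LRBL power for random data: {random_data_estimate:.3f}")
```

The per-bit ratio matches the 83% increase quoted above, and halving the number of bits keeps the capacitance of each LRBL slightly below the conventional value. Combined with the halved event count, the estimate suggests that the LRBL power for random data is somewhat less than half that of the conventional SRAM at the same supply voltage.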

Figure 18 compares the measured readout powers when we vary the content stored in the SRAMs. For video memory, power reduction in a read operation is important since readout is performed more frequently than write-in. The supply voltages are set to 0.85 V and 0.77 V in the conventional SRAM and the proposed SRAM, respectively, based on Fig. 17. In the conventional 64-kb SRAM, the readout power is measured as 0.764 mW on average over the 10 video sequences mentioned in Sect. 4.2, at a supply voltage of 0.85 V and a frequency of 120 MHz. Our proposed SRAM saves 85% of the total readout power at the lower supply voltage when it is utilized as a reconstructed-image buffer. Its power dissipation is 14.6% of that of the conventional SRAM on average.

Figure 19 shows a comparison of the readout power in the conventional SRAM and the proposed SRAM when the supply voltage is changed according to the operation frequencies. The proposed SRAM reduces the readout power by 65% compared with the conventional SRAM at the 120-MHz operation in the measurement if random data are considered. That savings factor increases to 79% compared to the conventional SRAM if the memory content is an H.264 original image. In a reconstructed image, we can maximize the power improvement, where we can save 85% of the readout power.

## 7. Conclusion

We have proposed a two-port non-precharge SRAM comprising 10 transistors. This SRAM is suitable for real-time video images that have statistical similarity. In simulation, the proposed SRAM can operate at a 65% higher frequency than the conventional 8T SRAM since it has no precharge period. The area overhead is 14.4% in a 90-nm process technology. The measurement demonstrated that the proposed SRAM saves 85% of the readout power when it is used as an H.264 reconstructed-image memory.

## Acknowledgments

This work was supported by Renesas Technology Corporation.

The VLSI chip in this work was fabricated through the chip fabrication program of the VLSI Design and Education Center (VDEC), the University of Tokyo, in collaboration with STARC, Fujitsu Limited, Matsushita Electric Industrial Company Limited, NEC Electronics Corporation, Renesas Technology Corporation, and Toshiba Corporation.

The authors appreciate Dr. K. Kobayashi and Dr. A. Tsuchiya with Kyoto University and Kyoto VDEC Sub-Center for their supporting measurements of the test chips.

## References

- [1] International Technology Roadmap for Semiconductors 2005, http://www.itrs.net/Common/2005ITRS/Home2005.htm
- [2] J. Miyakoshi, Y. Murachi, K. Hamano, T. Matsuno, M. Miyama, and M. Yoshimoto, "A low-power systolic array architecture for blockmatching motion estimation," IEICE Trans. Electron., vol.E88-C, no.4, pp.559–569, April 2005.
- [3] Y. Murachi, K. Hamano, T. Matsuno, J. Miyakoshi, M. Miyama, and M. Yoshimoto, "A 95 mW MPEG2 MP@HL motion estimation processor core for portable high-resolution video application," IEICE Trans. Fundamentals, vol.E88-A, no.12, pp.3492–3499, Dec. 2005.
- [4] S. Ishiwata, T. Yamakage, Y. Tsuboi, T. Shimazawa, T. Kitazawa, S. Michinaka, K. Yahagi, A. Oue, T. Kodama, N. Matsumoto, T. Kamei, M. Saito, T. Miyamori, G. Ootomo, and M. Matsui, "A single-chip MPEG-2 codec based on customizable media embedded processor," IEEE J. Solid-State Circuits, vol.38, no.3, pp.530–540, March 2003.

- [5] Y-W. Huang, T-C. Chen, C-H. Tsai, C-Y. Chen, T-W. Chen, C-S. Chen, C-F. Shen, S-Y. Ma, T-C. Wang, B-Y. Hsieh, H-C. Fang, and L-G. Chen, "A 1.3TOPS H.264/AVC single-chip encoder for HDTV applications," IEEE Int. Solid-State Circuits Conf., pp.128– 129, Feb. 2005.
- [6] K. Takeda, Y. Hagihara, Y. Aimoto, M. Nomura, Y. Nakazawa, T. Ishii, and H. Kobatake, "A read-static-noise-margin-free SRAM cell for low-Vdd and high-speed applications," IEEE J. Solid-State Circuits, vol.41, no.1, pp.113–121, Jan. 2006.
- [7] R.K. Krishnamurthy, A. Alvandpour, G. Balamurugan, N.R. Shanbhag, K. Soumyanath, and S.Y. Borkar, "A 130-nm 6-GHz 256 × 32 bit leakage-tolerant register file," IEEE J. Solid-State Circuits, vol.37, no.5, pp.624–632, May 2002.
- [8] C.-Y. Wu and M.-C. Shiau, "Delay models and speed improvement techniques for RC tree interconnections among small-geometry CMOS inverters," IEEE J. Solid-State Circuits, vol.25, no.5, pp.1247–1256, Oct. 1990.
- [9] W.C. Elmore, "The transient response of damped linear networks with particular regard to wideband amplifiers," J. Appl. Phys., vol.19, pp.55–63, Jan. 1948.
- [10] H. Fujiwara, K. Nii, J. Miyakoshi, Y. Murachi, Y. Morita, H. Kawaguchi, and M. Yoshimoto, "A two-port SRAM for real-time video processor saving 53% of bitline power with majority logic and data-bit reordering," ACM/IEEE Int. Symp. on Low Power Electronics and Design, pp.61–66, Oct. 2006.
- [11] H. Noguchi, Y. Iguchi, H. Fujiwara, Y. Morita, K. Nii, H. Kawaguchi, and M. Yoshimoto, "A 10T non-precharge two-port SRAM for 74% power reduction in video processing," Proc. IEEE Computer Society Annual Symposium on VLSI 2007, pp.107–112, Porto Alegre, Brazil, May 2007.
- [12] R. Arunachalam, E. Acar, and S.R. Nassif, "Optimal shielding/spacing metrics for low power design," Proc. IEEE Computer Society Annual Symposium on VLSI 2003, pp.167–172, Tampa, FL, USA, Feb. 2003.



**Hiroki Noguchi** received a B.E. degree in Computer and Systems Engineering from Kobe University, Hyogo, Japan, in 2006, where he is currently working in the M.E. course. His current research concerns low-power SRAM designs for multimedia and ubiquitous media digital LSIs. He is a student member of IEEE.



**Yusuke Iguchi** received a B.E. degree in Computer and Systems Engineering from Kobe University, Hyogo, Japan, in 2007. He is currently working in the M.E. course at that university. His current research concerns low-power techniques in digital LSIs and memories.



**Hidehiro Fujiwara** received B.E. and M.E. degrees in Computer and Systems Engineering from Kobe University, Hyogo, Japan, in 2005 and 2006, respectively, where he is currently working in the doctoral course. His current research is related to high-performance and low-power SRAM designs. He is a student member of IEEE.

**Shunsuke Okumura** was born on Aug. 17, 1984. He is currently working in the B.E. course at Kobe University. His current research is high-performance, low-power SRAM designs.



**Hiroshi Kawaguchi** received B.E. and M.E. degrees in Electronic Engineering from Chiba University, Chiba, Japan, in 1991 and 1993, respectively. He received a Ph.D. degree in Engineering from the University of Tokyo, Tokyo, Japan, in 2006. He joined Konami Corp., Kobe, Japan, in 1993, where he developed arcade entertainment systems. He moved to the Institute of Industrial Science, the University of Tokyo, as a Technical Associate in 1996, and was appointed as a Research Associate in 2003. In 2005, he moved to the Department of Computer and Systems Engineering, Kobe University, Kobe, Japan, as a Research Associate. Since 2007, he has been an Associate Professor at the Department of Computer Science and Systems Engineering, Kobe University. He is also a Collaborative Researcher with the Institute of Industrial Science, the University of Tokyo. His current research interests include low-power VLSI design, hardware design for wireless sensor networks, and recognition processors. Dr. Kawaguchi received the IEEE ISSCC 2004 Takuo Sugano Outstanding Paper Award and the IEEE Kansai Section 2006 Gold Award. He has served as a Program Committee Member for the IEEE Symposium on Low-Power and High-Speed Chips (COOL Chips), and as a Guest Associate Editor of IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences. He is a member of IEEE and ACM.



**Yasuhiro Morita** received an M.E. degree in Electronics and Computer Science from Kanazawa University, Ishikawa, Japan, in 2005. He is currently working in the doctoral course at Kobe University, Hyogo, Japan. His current research interests include high-performance and low-power multimedia VLSI designs. He is a student member of IEEE.



**Koji Nii** received B.E. and M.E. degrees in Electrical Engineering from Tokushima University, Tokushima, Japan, in 1988 and 1990, respectively. In 1990, he joined the ASIC Design Engineering Center, Mitsubishi Electric Power Products Inc., Itami, Japan. In 2003, he began work at Renesas Technology Corp. He currently works on the research and development of 45-nm embedded SRAM in the Design Technology Div., Renesas Technology Corp. In addition, he is a doctoral student at Kobe University, Hyogo, Japan. He is a member of the IEEE Solid-State Circuits Society and IEEE Electron Devices Society.



**Masahiko Yoshimoto** received a B.S. degree in Electronic Engineering from the Nagoya Institute of Technology, Nagoya, Japan, in 1975 and an M.S. degree in Electronic Engineering from Nagoya University, Nagoya, Japan, in 1977. He received a Ph.D. degree in Electrical Engineering from Nagoya University, Nagoya, Japan, in 1998. He joined the LSI Laboratory, Mitsubishi Electric Power Products Inc., Itami, Japan, in April 1977. During 1978–1983 he was engaged in the design of NMOS and CMOS static RAM, including a 64 K full CMOS RAM with the world's first divided-word-line structure. From 1984, he was involved in research and development of multimedia ULSI systems for digital broadcasting and digital communication systems based on MPEG2 and MPEG4 Codec LSI core technology. In 2000, he became a Professor of the Dept. of Electrical and Electronic Systems Engineering at Kanazawa University, Japan. Since 2004, he has been a Professor of the Dept. of Computer and Systems Engineering at Kobe University, Japan. His current activities specifically emphasize research and development of multimedia and ubiquitous media VLSI systems including an ultra-low-power image compression processor and a low-power wireless interface circuit. He holds 70 registered patents. He served on the Program Committee of the IEEE International Solid-State Circuits Conference during 1991–1993. In addition, he served as a Guest Editor for special issues on Low-Power System LSI, IP, and Related Technologies of IEICE Transactions in 2004. He received R&D100 awards from R&D Magazine in 1990 and 1996 for the development of the DISP and of a real-time MPEG2 video encoder chipset, respectively.