## PAPER Special Section on VLSI Design and CAD Algorithms

# A 95 mW MPEG2 MP@HL Motion Estimation Processor Core for Portable High-Resolution Video Application

Yuichiro MURACHI<sup>†a)</sup>, Koji HAMANO<sup>†</sup>, Tetsuro MATSUNO<sup>†</sup>, Junichi MIYAKOSHI<sup>††</sup>, *Student Members*, Masayuki MIYAMA<sup>†</sup>, *and* Masahiko YOSHIMOTO<sup>††</sup>, *Members* 

**SUMMARY** This paper describes a 95 mW MPEG2 MP@HL motion estimation processor core for portable and high-resolution video applications such as that in an HD camcorder. It features a novel hierarchical algorithm and a low-power ring-connected systolic array architecture. It supports frame/field and bi-directional prediction with half-pel precision for 1920 × 1080@30 fps resolution video. The search range is  $\pm 128 \times \pm 64$ pixels. The ME core integrates 2.25 M transistors in 3.1 mm × 3.1 mm using 0.18-micron technology.

key words: low power, motion estimation, MPEG2, HDTV, IP

#### 1. Introduction

Digital video applications are expanding to include high-definition television (HDTV) resolution. HDTV-resolution monitors are becoming more widely used in the home. As one aspect of this trend, portable HDTV systems will continue to gain in popularity. Figure 1 shows a block diagram of an MPEG2 encoder. An MPEG2 MP@HL encoder requires about 1000 GOPS workload for HDTV-resolution video, assuming that the conventional sub-sampling method [1] is used to perform motion estimation (ME). In this case, the encoder's power consumption is greater than 2400 mW, which is prohibitively high for portable products. Motion estimation accounts for more than 90% of the encoder's computational complexity. A highly efficient ME processor core is therefore required to realize a portable video application.

This paper describes the first sub-100 mW motion estimation core IP (MEH) that realizes MPEG2 MP@HL B-picture processing for portable and high-resolution video applications such as an HD camcorder. The IP core has two major features. The first is a novel hierarchical search algorithm that employs a one-dimensional diamond search for the upper layer image and a narrow full search for the lower layer image, allowing a remarkable workload reduction to 7% of that in the sub-sampling method, assuming a $\pm 128 \times \pm 64$ pixel maximum search range. The second is a newly introduced low-power systolic array architecture that requires few memory access cycles and few computation cycles in the full search, and uses about 35% of the power necessary for the conventional systolic array.

Fig. 1 A block diagram of MPEG2 encoder.

#### 2. VLSI Algorithm Design

#### 2.1 Hierarchical Search

Portable and high-resolution video applications necessitate decreased computational power without degrading picture quality. The gradient descent search (GDS) method can reduce computational power efficiently, but picture quality is thereby sacrificed [2]. In contrast, fine search methods like full search (FS) provide high-quality images using an enormous computational load. To satisfy the above two requirements simultaneously, a hierarchical search method, consisting of coarse search and fine search, has been utilized [3]–[6].

The proposed hierarchical search method of this work is illustrated in Fig. 2. First, a motion vector is searched coarsely in a $\pm 128 \times \pm 64$ pixel search area in the upper layer image, using a one-dimensional diamond search (1D-DS), as described later. The vector detected there gives the start position of the succeeding FS, which covers a $\pm 8 \times \pm 8$ pixel search area in the lower layer image with integer-pel accuracy; the start position is located at the center of that search area. Finally, we evaluate the eight half-pel positions surrounding the integer vector obtained by FS. Results show that computing power can be decreased while maintaining a wide search range. In addition, the memory size can be reduced because the image of the wide search range is reduced by a 1/4 sub-sampling technique.
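The fine-search stage can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`sad`, `full_search`, `upper_to_lower`) are hypothetical, and the factor of 2 used to map an upper-layer vector to lower-layer pixels is an assumption based on the 1/4 sub-sampling (half resolution in each direction).

```python
def sad(cur, ref, bx, by, dx, dy, n=16):
    """SAD between the n x n block of `cur` at (bx, by) and the block of
    `ref` displaced by the candidate vector (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def full_search(cur, ref, bx, by, start=(0, 0), radius=8, n=16):
    """Test every integer vector within +/-radius of `start`.

    Returns (sad, vector) for the best candidate found.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(start[1] - radius, start[1] + radius + 1):
        for dx in range(start[0] - radius, start[0] + radius + 1):
            x0, y0 = bx + dx, by + dy
            # Skip candidates whose block would leave the reference picture.
            if x0 < 0 or y0 < 0 or x0 + n > width or y0 + n > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best


def upper_to_lower(coarse_mv, scale=2):
    """Map an upper-layer vector to lower-layer pixels (scale is an assumption)."""
    return (coarse_mv[0] * scale, coarse_mv[1] * scale)
```

With `radius=8` the search visits at most 17 × 17 = 289 integer candidates per MB, regardless of how wide the coarse search range is.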

Manuscript received March 15, 2005.

Manuscript revised June 16, 2005.

Final manuscript received July 26, 2005.

<sup>†</sup>The authors are with the Faculty of Engineering, Kanazawa University, Kanazawa-shi, 920-8667 Japan.

<sup>††</sup>The authors are with the Faculty of Engineering, Kobe University, Kobe-shi, 657-8501 Japan.

a) E-mail: murachi@mics.ee.t.kanazawa-u.ac.jp

DOI: 10.1093/ietfec/e88-a.12.3492







#### 2.2 One-Dimensional Diamond Search

As the first coarse search, we propose a one-dimensional diamond search (1D-DS). The 1D-DS shown in Fig. 3 is based on a gradient method. As the first step, an initial vector is chosen among the four candidate vectors. In the second step, the search direction is determined by calculating the sum of absolute differences (SAD) for the four points surrounding the point indicated by the initial vector. In the third step, a one-dimensional (1-D) search is executed in the direction toward the point that has the smallest SAD among them. The SADs are computed at several points in the search direction. The point with the minimum distortion in the 1-D search is a temporary solution. Subsequently, a new search direction is determined at the temporary solution and a new 1-D search begins. The second and third steps are iterated several times until the minimum point is found. The computational complexity of the 1D-DS is extremely low because it adaptively evaluates only the selected points.
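The three steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not give the number of points evaluated per 1-D search or the exact stopping rule, so `steps`, `max_iter`, and the "stop when no neighbor improves" criterion are hypothetical choices, as are the names `one_d_diamond_search`, `cost`, and `in_range`.

```python
DIRECTIONS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def one_d_diamond_search(cost, candidates, in_range, steps=4, max_iter=32):
    """Gradient-style search over motion vectors.

    cost(v)     -> SAD of vector v (supplied by the caller)
    candidates  -> initial candidate vectors (step 1)
    in_range(v) -> True if v lies inside the search range
    """
    # Step 1: pick the best of the initial candidate vectors.
    best = min(candidates, key=cost)
    best_cost = cost(best)
    for _ in range(max_iter):
        # Step 2: evaluate the four surrounding points to choose a direction.
        neighbours = [(best[0] + dx, best[1] + dy) for dx, dy in DIRECTIONS]
        neighbours = [p for p in neighbours if in_range(p)]
        if not neighbours:
            break
        nearest = min(neighbours, key=cost)
        if cost(nearest) >= best_cost:
            break  # no neighbour improves: treat the current point as the minimum
        direction = (nearest[0] - best[0], nearest[1] - best[1])
        # Step 3: 1-D search along that direction; keep the best point on the line.
        line = [(best[0] + k * direction[0], best[1] + k * direction[1])
                for k in range(1, steps + 1)]
        line = [p for p in line if in_range(p)]
        best = min(line, key=cost)
        best_cost = cost(best)
    return best, best_cost
```

Because each iteration evaluates only four neighbors plus a handful of points on one line, the number of SAD evaluations grows with the length of the descent path rather than with the area of the search range.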

#### 2.3 Shared Memory Method

The proposed algorithm also introduces a shared memory method. In this method, the search range is adjusted according to the frame distance. In the case of M = 3 and N = 15, the frame distance at the P-picture prediction is 3, so the search range must remain wide. The frame distances at the B-picture are 1 and 2, so the search ranges should be narrow. Therefore, a search window memory that stores image data of the search range for the P-picture also stores image data of the two search ranges for the B-picture. Introducing this method reduces the memory size and the amount of transferred data for the ME core.
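The frame distances quoted above follow directly from the GOP structure. The short Python sketch below lists them for one GOP; it uses a simplified display-order model that starts at the I-picture, and the function name `frame_distances` is hypothetical. It does not model how the search range itself is scaled, which the text does not specify.

```python
def frame_distances(m=3, n=15):
    """For each picture of one GOP (display order, starting at the I-picture),
    return its type and the distance(s) to its reference picture(s).

    m: spacing between anchor (I/P) pictures, n: GOP length.
    """
    pictures = []
    for i in range(n):
        if i == 0:
            pictures.append(("I", ()))              # intra: no reference
        elif i % m == 0:
            pictures.append(("P", (m,)))            # previous anchor is m frames back
        else:
            forward = i % m                         # distance to the previous anchor
            backward = m - forward                  # distance to the next anchor
            pictures.append(("B", (forward, backward)))
    return pictures
```

For M = 3 every B-picture sees distances 1 and 2, while every P-picture sees distance 3, which is why the P-picture search window is the widest of the three.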

#### 2.4 Simulation Results

The novel hierarchical search and the conventional algorithms built into MPEG2 software encoders are simulated to analyze the computational complexity, picture quality, memory size, and the amount of transferred data.

Simulation conditions are summarized as:

- Software encoder: MPEG2 software encoder/decoder ver.1.2
- Resolution: 1920 × 1080i
- Frame rate: 30 fps
- Target Bitrate: 15 Mbps bitrate controlled
- Sample picture (Fig. 4)
  - Buildings along the Canal
  - Church
  - Harbor Scene
  - Intersection
  - Street Car
  - Whale Show
  - Yacht Harbor
- Number of frames: 150
- Search range: H:-128,+127.5 / V:-64,+63.5
- Bidirectional prediction
- Frame/Field prediction
- Half-pel precision

Table 1 lists the average PSNR of the reconstructed picture generated by the simulation for each sequence. The picture quality of the proposed algorithm is higher than those of the sub-sampling method and the GDS, without exception. Compared to the FS, the proposed method degrades the picture quality of the Church sequence the most; however, the degradation is only 0.26 dB. On the other hand, the proposed method improves the picture quality of the Intersection, Whale and Yacht sequences: quality is better than that achieved by FS. Figures 5 and 6 show the average PSNR per frame of the Church and Yacht sequences. A noticeable degradation in a specific frame does not appear in the Church sequence. The picture quality of the Yacht sequence is stably high with the proposed method. The proposed method can provide high picture quality for each frame compared to the other fast methods.

**Table 1** PSNR of reconstructed picture [dB].

| Sequence | FS    | Sub-sampling | GDS   | Proposed |
|----------|-------|--------------|-------|----------|
| canal    | 33.65 | 32.22        | 33.35 | 33.58    |
| church   | 31.90 | 30.60        | 31.29 | 31.64    |
| harbor   | 30.36 | 29.35        | 29.81 | 30.31    |
| inters   | 35.66 | 35.32        | 35.65 | 35.67    |
| stcar    | 34.81 | 34.33        | 34.55 | 34.59    |
| whale    | 27.72 | 27.21        | 27.51 | 27.75    |
| yacht    | 36.43 | 35.60        | 36.28 | 36.48    |
| average  | 32.93 | 32.09        | 32.63 | 32.86    |
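For reference, PSNR figures such as those in Table 1 are conventionally computed from the mean squared error between the original and reconstructed pictures. The sketch below assumes 8-bit samples (peak value 255); the text does not say whether the reported values cover luminance only, so the function simply takes whatever sample arrays it is given.

```python
import math


def psnr(original, reconstructed, peak=255):
    """Peak signal-to-noise ratio in dB between two equally sized pictures,
    each given as a list of rows of integer samples."""
    squared_error = 0
    count = 0
    for row_o, row_r in zip(original, reconstructed):
        for o, r in zip(row_o, row_r):
            squared_error += (o - r) ** 2
            count += 1
    if squared_error == 0:
        return math.inf  # identical pictures
    mse = squared_error / count
    return 10 * math.log10(peak * peak / mse)
```

On this logarithmic scale, the 0.26 dB loss on the Church sequence corresponds to roughly a 6% increase in mean squared error.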

Figure 7 shows a comparison of computational complexity and picture quality among several ME algorithms. The computational complexity of the proposed algorithm (1D-DS+FS) is reduced to 7% of that of the conventional sub-sampling algorithm. It requires only a 66 GOPS workload. Regarding the average of seven sequences, the picture quality of the algorithm exceeds that of sub-sampling by 0.77 dB and that of the GDS method by 0.23 dB. Introduction of the FS algorithm for the lower layer image improves the picture quality. Figure 8 shows the amount of transferred data and memory size of each method. By virtue of the two features described above, the hierarchical search and the shared memory method, the amount of transferred data is reduced by 50% and the memory size by 85%. Introduction of the shared memory method does not degrade the picture quality.

Fig. 8 Amount of transferred data and memory size.

#### 2.5 Analysis of Simulation Results

Simulation results are analyzed to investigate the reasons for high picture quality of the proposed method. Analysis parameters are summarized as:

- A. Right Answer: The number of macro blocks (MBs) having the minimum SAD compared to the number of all MBs - percentage
- B. Predicted PSNR: PSNR of the predicted picture
- C. Intra MB: The number of intra MBs compared to the number of all MBs - percentage
- D. MV code: The amount of motion vector (MV) code compared to the amount of the total code – percentage
- E. DCT code: The amount of DCT coefficient code compared to the amount of total code - percentage
- F. Q Scale: Average of the Q scale
- G. Reconstructed PSNR: PSNR of the reconstructed picture


**Table 2** Analysis of the FS.

|   | Parameter               | canal  | church | harbor | inters | stcar  | whale  | yacht  | average |
|---|-------------------------|--------|--------|--------|--------|--------|--------|--------|---------|
| A | Right Answer [%]        | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0%  |
| B | Predicted PSNR [dB]     | 32.44  | 31.17  | 29.09  | 34.23  | 33.38  | 26.13  | 35.19  | 31.66   |
| C | Intra MB [%]            | 0.2%   | 0.0%   | 0.9%   | 0.1%   | 0.8%   | 3.5%   | 0.1%   | 0.8%    |
| D | Code: MV [%]            | 27.1%  | 23.8%  | 30.2%  | 21.5%  | 28.4%  | 44.6%  | 28.6%  | 29.2%   |
| E | Code: DCT [%]           | 44.6%  | 48.9%  | 41.9%  | 52.4%  | 44.0%  | 28.7%  | 44.7%  | 43.6%   |
|   | Code: Other [%]         | 28.4%  | 27.3%  | 27.9%  | 26.1%  | 27.6%  | 26.7%  | 26.7%  | 27.2%   |
| F | Q Scale                 | 24.02  | 27.56  | 45.55  | 16.20  | 21.62  | 106.01 | 17.34  | 36.90   |
| G | Reconstructed PSNR [dB] | 33.65  | 31.90  | 30.36  | 35.66  | 34.81  | 27.72  | 36.43  | 32.93   |

**Table 3** Analysis of the proposed method.

|   | Parameter               | canal | church | harbor | inters | stcar | whale | yacht | average |
|---|-------------------------|-------|--------|--------|--------|-------|-------|-------|---------|
| A | Right Answer [%]        | 79.3% | 92.1%  | 61.9%  | 88.8%  | 59.2% | 44.9% | 76.1% | 71.7%   |
| B | Predicted PSNR [dB]     | 32.07 | 30.66  | 28.70  | 33.60  | 31.27 | 25.09 | 34.99 | 30.91   |
| C | Intra MB [%]            | 0.5%  | 0.2%   | 1.3%   | 0.7%   | 3.5%  | 7.0%  | 0.3%  | 1.9%    |
| D | Code: MV [%]            | 24.9% | 23.6%  | 25.7%  | 19.0%  | 22.7% | 30.9% | 24.8% | 24.5%   |
| E | Code: DCT [%]           | 46.6% | 49.1%  | 45.9%  | 54.9%  | 49.3% | 40.4% | 48.5% | 47.8%   |
|   | Code: Other [%]         | 28.5% | 27.2%  | 28.5%  | 26.1%  | 28.0% | 28.7% | 26.8% | 27.7%   |
| F | Q Scale                 | 21.29 | 25.76  | 33.61  | 15.14  | 19.13 | 54.37 | 14.51 | 26.26   |
| G | Reconstructed PSNR [dB] | 33.58 | 31.64  | 30.31  | 35.67  | 34.59 | 27.75 | 36.48 | 32.86   |

**Table 4** Difference between FS and the proposed method.

|   | Parameter               | canal  | church | harbor | inters | stcar  | whale  | yacht  | average |
|---|-------------------------|--------|--------|--------|--------|--------|--------|--------|---------|
| A | Right Answer [%]        | -20.7% | -7.9%  | -38.1% | -11.2% | -40.8% | -55.1% | -23.9% | -28.3%  |
| B | Predicted PSNR [dB]     | -0.37  | -0.51  | -0.39  | -0.63  | -2.12  | -1.04  | -0.20  | -0.75   |
| C | Intra MB [%]            | 0.3%   | 0.1%   | 0.4%   | 0.6%   | 2.7%   | 3.5%   | 0.2%   | 1.1%    |
| D | Code: MV [%]            | -2.2%  | -0.2%  | -4.6%  | -2.6%  | -5.7%  | -13.6% | -3.9%  | -4.7%   |
| E | Code: DCT [%]           | 2.0%   | 0.3%   | 3.9%   | 2.5%   | 5.3%   | 11.7%  | 3.8%   | 4.2%    |
|   | Code: Other [%]         | 0.1%   | -0.1%  | 0.6%   | 0.0%   | 0.4%   | 1.9%   | 0.1%   | 0.4%    |
| F | Q Scale                 | -2.72  | -1.80  | -11.94 | -1.06  | -2.49  | -51.64 | -2.83  | -10.64  |
| G | Reconstructed PSNR [dB] | -0.07  | -0.26  | -0.05  | 0.02   | -0.22  | 0.03   | 0.05   | -0.07   |

Tables 2 and 3 respectively show parameters of the FS and the proposed method. Table 4 shows the difference between these two algorithms. Compared to FS, the proposed method lowers the right answer percentage (A) by 28.3 points and the predicted PSNR (B) by 0.75 dB. The lower accuracy of motion estimation raises the intra-MB percentage (C) by 1.1 points, which in turn increases the amount of DCT code (E). On the other hand, the proposed method yields a smaller amount of MV code (D), because the search starts from an initial vector chosen among the vectors of adjacent MBs: the share of MV code (D) is 4.7 points below that of FS. The rate control maintains the total amount of MV code (D) and DCT code (E). It increases the amount of DCT code (E) to 47.8%, which is greater than that of FS by 4.2 points. Consequently, the proposed method can reduce the Q scale (F) by 10.64 compared with the use of FS.

The above-mentioned results indicate the following advantages of the proposed method. The proposed method decreases the predicted picture quality. The difference between the original picture and the predicted picture increases. The increment of intra-MB percentage increases the amount of DCT code. However, the proposed method yields small MV difference and decreases the amount of MV code. That decrement of MV code is greater than the increment of DCT code. Therefore, a smaller Q scale is allowed; it compensates for the decrease in the predicted picture quality. The proposed method provides high reconstructed picture quality.
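The average column of Table 4 can be reproduced from the average columns of Tables 2 and 3. The following Python lines are only an arithmetic check; the dictionary keys reuse the parameter labels A to G from the tables.

```python
# Average columns of Table 2 (FS) and Table 3 (proposed method).
fs = {"A": 100.0, "B": 31.66, "C": 0.8, "D": 29.2, "E": 43.6, "F": 36.90, "G": 32.93}
proposed = {"A": 71.7, "B": 30.91, "C": 1.9, "D": 24.5, "E": 47.8, "F": 26.26, "G": 32.86}

# Proposed minus FS, rounded to the precision used in Table 4.
diff = {key: round(proposed[key] - fs[key], 2) for key in fs}

# Net change in the combined share of MV and DCT code (percentage points).
net_mv_dct = round(diff["D"] + diff["E"], 2)
```

The combined share of MV and DCT code changes by only about half a percentage point, while the Q scale drops by 10.64 and the reconstructed PSNR by just 0.07 dB.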

#### 3. VLSI Architecture Design

#### 3.1 Multi-Processor Configuration

Figure 9 illustrates a block diagram of the MEH. The MEH includes three types of ME processors: MEH2, MEH1 and MEHH. It performs 1D-DS on the upper layer image with MEH2, FS on the lower layer image with MEH1, and the half-pel search with MEHH. Figure 10 shows a timing chart of MEH. It indicates that MEH2, MEH1, and MEHH can operate concurrently. The MEH2 comprises a 16-way SIMD data path; 3-port (2 read-ports, 1 write-port) SRAMs are used as the search window buffer (SW2) (8 bit/word) and the template buffer (TB2) (64 bit/word). The MEH1 introduces a newly developed systolic array architecture [7] constructed using 256 ring-connected processor elements (PEs) and shift registers (SRs) to execute the FS algorithm. The MEH1 shares the search window buffer (SW1) (128 bit/word) and the template buffer (TB1) (64 bit/word) with the half-pel processing unit (HPPU). The HPPU has a 128-way SIMD data path.

Table 5 shows the amount of data transmission of the MEH and the other modules. The memory bandwidth available to the MEH is 13 Gbps, assuming that the memory bus operates at 108 MHz with 128 bit width. The total amount of MEH1 transmission is 3.5 Gbps and that of MEH2 is 1.5 Gbps. In addition, the total amount of transmission of other modules (DCT, IDCT, etc.) is 3.3 Gbps. Therefore, the total of 8.23 Gbps fits within the bandwidth of the memory I/F, and the bandwidth required by the MEH is fully secured.

Fig. 9 A block diagram of MEH.

**Fig. 10** A timing chart of MEH.

**Table 5** The amount of transmission of MEH.

| Module        | Buffer | Per buffer [Gbps] | Per processor [Gbps] | MEH total [Gbps] |
|---------------|--------|-------------------|----------------------|------------------|
| MEH2          | SW2    | 1.37              | 1.49                 | 4.98             |
| MEH2          | TB2    | 0.12              |                      |                  |
| MEH1 & MEHH   | SW1    | 2.99              | 3.48                 |                  |
| MEH1 & MEHH   | TB1    | 0.50              |                      |                  |
| Other modules |        | 3.26              |                      |                  |
| Total         |        | 8.23              |                      |                  |
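The bandwidth figures can be checked with a few lines of Python. The variable names are illustrative only; the traffic values are the per-processor and other-module entries of Table 5.

```python
BUS_CLOCK_HZ = 108e6     # memory bus clock
BUS_WIDTH_BITS = 128     # memory bus width

# Peak bus bandwidth in Gbps (the text rounds this to 13 Gbps).
bus_gbps = BUS_CLOCK_HZ * BUS_WIDTH_BITS / 1e9

# Traffic per group of modules in Gbps, taken from Table 5.
traffic_gbps = {"MEH2": 1.49, "MEH1 & MEHH": 3.48, "other modules": 3.26}
total_gbps = round(sum(traffic_gbps.values()), 2)

# Fraction of the peak bandwidth that the listed traffic occupies.
utilization = total_gbps / bus_gbps
```

Under these numbers the listed traffic occupies a little under 60% of the peak bus bandwidth.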

#### 3.2 MEH2 Processor for the 1D-DS Algorithm

MEH2 contains a processing unit (PU2), a template block buffer (TB2), and the search window buffer (SW2). The PU2 and the SW2 are connected by the cross path to sort pixel data in order. Figure 11 illustrates a block diagram of the PU2. PU2 has a 16-way SIMD data path. The PU2 contains 16 PEs and the adder tree. The PE calculates the absolute difference between a pixel in the search window and a pixel in the template block during each cycle. The adder tree sums up 16 absolute differences with the Wallace-tree and accumulates the summations with the accumulator. The calculation consumes 4 cycles per MB. The PU2 can calculate SADs of 8 pixels $\times$ 2 rows in a 1/4 sub-sampled MB per cycle. The SW2 consists of 8 SRAMs which have 2 read-ports. The same column pixels are stored in the same SRAM. The memory configuration and the data mapping scheme allow PU2 to load the 2 rows of data in one cycle, and thereby maintain 16-pixel throughput.

Fig. 11 A block diagram of PU2.
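A behavioral Python model makes the 4-cycle figure concrete. It is a sketch, not a description of the circuit: it assumes the 1/4 sub-sampled MB is an 8 × 8 block (64 pixels), which matches 16 pixels per cycle over 4 cycles, and the function name `pu2_sad` is hypothetical.

```python
def pu2_sad(template, window):
    """Model the PU2 data path on one 8 x 8 sub-sampled MB.

    template, window: lists of 8 rows of 8 pixel values.
    Returns (sad, cycles).
    """
    accumulator = 0
    cycles = 0
    for row in range(0, 8, 2):
        # One cycle: two rows of 8 pixels are fed to the 16 PEs.
        t_pixels = template[row] + template[row + 1]
        w_pixels = window[row] + window[row + 1]
        abs_diffs = [abs(t - w) for t, w in zip(t_pixels, w_pixels)]  # 16 PEs
        accumulator += sum(abs_diffs)  # adder tree followed by the accumulator
        cycles += 1
    return accumulator, cycles
```

Reading two rows per cycle is what the 2-read-port SRAM arrangement of SW2 provides in the hardware.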

#### 3.3 MEH1 Processor for FS Algorithm

Even with a narrow range ($\pm 8 \times \pm 8$), the full search at HDTV resolution still requires a large workload. For that reason, it has been very hard to realize a low-power VLSI circuit. The important breakthrough of this paper is a novel architecture to implement the FS algorithm for the lower layer with sufficiently low power consumption. The systolic array is illustrated in Fig. 12. It consists of $N \times N$ PEs (PE-array), $N \times N$ SRs (SR-array), and an adder tree. In this case, N is equal to 16, so there are 256 PEs and 256 SRs. Sixteen pixels (128 bits) in the search window and 16 pixels (128 bits) in the template block are loaded into the PE-array from SW1 and TB1, respectively. The SR-array receives 16 pixels from the SW1. The key feature is that the PEs and SRs in one row are connected with a ring-buffer scheme; thereby, the PE-SR-array can shift pixels toward the right or left.

The data flow is shown in Fig. 13. The full search is executed in three phases. The first is an initializing phase (Init-phase). Pixels of the template block and the search window are loaded into all PEs and SRs during this phase. The second is a calculation phase (Calc-phase). The SADs are calculated by the PE-array in this phase. The pixels that were loaded into PEs and SRs in the previous phase are shifted to the right or left. The third one is an input phase (Input-phase). During this phase, the subsequent pixels are transferred from SW1 to the PE-array and SR-array. The full search can be executed by iterating the Calc-phase and the Input-phase in sequence after the Init-phase. The Calc-phase and the Input-phase are iterated until reaching a boundary of the search area. The subsequent pixels are loaded by vertical shift operations at that time (Idle-phase). The systolic array receives 16 pixels from SW1 during each Input-phase. It is necessary to swap the more-significant 128 bits and the less-significant 128 bits. The cross path is eliminated by constructing the SW1 from 3-port SRAMs with 2 read-ports. The pixels can be swapped by controlling the SRAM addresses.

**Fig. 12** A block diagram of a systolic array (*N*=4).

Fig. 13 Systolic array data flow.

The systolic array architecture has the following advantages.

- Reduction of computing cycles.
- Reduction of memory access cycles.
- Adaptability for field/frame mode.

The benefits pertain because the architecture allows various sum schemes in the adder tree for absolute difference values. The last benefit allows the concurrent operation of HPPU with no extra cache. HPPU can access SW1 and the TB1 during the Calc-phase, in which the SW1 and the TB1 are not accessed by the systolic array.
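The effect of the ring connection can be illustrated with a one-row toy model in Python. This is a deliberately simplified sketch, not the circuit of [7]: it assumes the template pixels stay fixed in the PEs while the 2N search-window pixels held by the PEs and SRs of one row rotate by one position per step. The function name `row_sads` is hypothetical.

```python
from collections import deque


def row_sads(template_row, window_row):
    """SADs of one template row against every horizontal position of a
    search-window row that is twice as long, using only ring shifts.

    template_row: N pixels held in the PEs.
    window_row:   2N pixels; the first N start in the PEs, the rest in the SRs.
    """
    n = len(template_row)
    ring = deque(window_row)
    sads = []
    for _ in range(n + 1):
        in_pes = list(ring)[:n]  # window pixels currently sitting in the PEs
        sads.append(sum(abs(t - w) for t, w in zip(template_row, in_pes)))
        ring.rotate(-1)          # shift the ring left by one pixel
    return sads
```

With N = 16 the 32 ring entries yield 17 horizontal positions, the width of a ±8 range, without any further memory access between positions in this model.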

#### 3.4 MEHH Processor for Half-Pel Precision

Figure 14 illustrates a block diagram of the HPPU, which is the processing unit used in MEHH. The HPPU for half-pel precision has a 128-way SIMD data path. The HPPU contains a half-pel blender, 128 PEs, and 8 adder trees. The SW1 and the HPPU are connected by the cross path to select 18 pixels from among 32 pixels. Eighteen pixels (144 bits) in the search window and 16 pixels (128 bits) in the template block are loaded into the half-pel blender. The half-pel blender consists of 54 registers to hold 18 pixels $\times$ 3 rows, and 128 two-tap filters to generate half-pel precision pixels. Three cycles are required to load pixels into registers before starting the half-pel generation. The half-pel blender can generate 128 half-pel precision pixels corresponding to each row of 8 MBs surrounding an integer vector per cycle. The 128 PEs calculate absolute differences between the 128 pixels generated by the half-pel blender and 16 pixels in the template block during each cycle. An adder tree sums up 16 absolute differences with the Wallace-tree and accumulates the summations with the accumulator. Eight adder trees are required because calculations for 8 MBs are executed concurrently. The frame prediction, the even field prediction and the odd field prediction are serially executed for frame/field mode. Pixels on even or odd rows in the search window and template block are loaded into the MEHH, to generate half-pel precision pixels for each field prediction. At the frame prediction, SADs of 8 MBs with half-pel precision surrounding an integer vector consume 16 cycles for calculation and 2 cycles for initialization. At the field prediction, SADs of 8 MBs with half-pel precision surrounding an integer vector consume 8 cycles for calculation and 2 cycles for initialization. The HPPU can execute even and odd field predictions in 20 cycles because each field calculation requires 10 cycles.

Fig. 15 Chip photomicrograph.
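The half-pel refinement can be sketched in Python as follows. The interpolation uses the rounded average of the nearest integer pixels, as is usual for MPEG-2 style half-pel prediction; how the two-tap filters of the half-pel blender treat diagonal positions is not stated in the text, so that part is an assumption. The names `sample` and `half_pel_search` are hypothetical, and the eight candidates are evaluated one after another here rather than concurrently as in the HPPU.

```python
def sample(ref, hx, hy):
    """Value of `ref` at half-pel coordinates (hx, hy); integer pixels sit
    at even coordinates. Rounded average of the 1, 2 or 4 nearest pixels."""
    x0, y0 = hx >> 1, hy >> 1
    x1, y1 = x0 + (hx & 1), y0 + (hy & 1)
    return (ref[y0][x0] + ref[y0][x1] + ref[y1][x0] + ref[y1][x1] + 2) >> 2


def half_pel_search(cur, ref, bx, by, mv, n=16):
    """Evaluate the 8 half-pel positions around the integer vector `mv`.

    Returns (sad, vector) with the vector in half-pel units. The displaced
    block must lie at least one pixel inside the reference picture.
    """
    best = None
    for oy in (-1, 0, 1):
        for ox in (-1, 0, 1):
            if ox == 0 and oy == 0:
                continue  # the integer position was already evaluated by FS
            hx0 = 2 * (bx + mv[0]) + ox
            hy0 = 2 * (by + mv[1]) + oy
            cost = sum(
                abs(cur[by + j][bx + i] - sample(ref, hx0 + 2 * i, hy0 + 2 * j))
                for j in range(n)
                for i in range(n)
            )
            if best is None or cost < best[0]:
                best = (cost, (2 * mv[0] + ox, 2 * mv[1] + oy))
    return best
```

In terms of the cycle counts given above, frame prediction takes 16 + 2 = 18 cycles and the two field predictions take 2 × (8 + 2) = 20 cycles.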

#### 4. VLSI Implementation Design

In order to realize fast operation with low voltage and low power consumption, the memory part (SRAMs) is designed with full custom method. The logic part of the MEH is designed with HDL, logic synthesis and place & route tools. Post placement optimization is employed to improve the timing performance of the MEH.

Table 6 shows the MEH specification. The MEH was designed and fabricated with 0.18-micron CMOS technology. The MEH integrates 192 I/O signals and 2.25 million transistors in 3.1 mm $\times$ 3.1 mm. It operates at 108 MHz with a 1.0 V supply voltage. Figure 15 shows a chip photomicrograph of the MEH.

The power consumption is shown in Fig. 16. The power consumption of the MEH is 95 mW@1.0 V, which is 3.8% of that of VLSI1@1.8 V with the sub-sampling method and 41% of that of VLSI2@1.0 V with the GDS technique. With a 1.8 V supply voltage, which is commonly used with 0.18-micron CMOS technology, the power consumption of the MEH is estimated to be 308 mW. Thus, the power consumption is reduced to 13% of that of VLSI1 by using the proposed architecture, and it is further reduced to 3.8% by using the lower supply voltage. The figure also indicates that introducing the proposed systolic array architecture for the FS in the lower layer allows 70% lower power consumption of the MEH compared to the most basic one-dimensional systolic array [8] (1D-SA0). In addition, the power consumption is reduced to 55% of that with the one-dimensional systolic array using a template-pixel mapping method with broadcasting of search window data [9] (1D-SA1).
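The 308 mW estimate is consistent with the usual quadratic dependence of dynamic CMOS power on supply voltage. The text does not state how the estimate was obtained, so the sketch below is an assumption: it scales power with the square of the voltage at an unchanged clock frequency and ignores leakage. The function name `scale_dynamic_power` is hypothetical.

```python
def scale_dynamic_power(power_mw, v_from, v_to):
    """Scale dynamic power with the square of the supply voltage
    (same clock frequency and switched capacitance assumed)."""
    return power_mw * (v_to / v_from) ** 2


# Measured 95 mW at 1.0 V, scaled to a 1.8 V supply.
power_at_1v8_mw = scale_dynamic_power(95, 1.0, 1.8)
```

Under this assumption the voltage reduction from 1.8 V to 1.0 V alone accounts for a factor of 3.24 in power.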

#### 5. Conclusion

This paper presents a 95 mW MPEG2 MP@HL motion estimator that can execute frame/field and bidirectional prediction with half-pel precision using  $1920 \times 1080@30$  fps resolution video. The search range is up to  $\pm 128 \times \pm 64$ . The MEH is expected to be applicable to SOCs with MPEG codec function for energy-aware portable video applications.

#### Acknowledgments

The authors would like to thank the Semiconductor Technology Academic Research Center (STARC) for allowing development of the LSI. The VLSI chip in this study was fabricated in the chip fabrication program of the VLSI Design and Education Center (VDEC) and MOSIS.

#### References

- [1] T. Matsumura, S. Kumaki, H. Segawa, K. Ishihara, A. Hanami, Y. Matsuura, S. Scotzniovsky, H. Tanaka, A. Yamada, S. Murayama, T. Wada, H. Ohira, T. Shimada, K. Asano, T. Yoshida, M. Yoshimoto, K. Tsuchihashi, and Y. Horiba, "A single-chip MPEG2 422@ML video, audio, and system encoder with a 162 MHz media-processor and dual motion estimation cores," IEICE Trans. Electron., vol.E84-C, no.1, pp.202-211, Jan. 2001.
- [2] M. Miyama, O. Tooyama, N. Takamatsu, T. Kodake, K. Nakamura, A. Kato, J. Miyakoshi, K. Imamura, H. Hashimoto, S. Komatsu, M. Yagi, M. Morimoto, K. Taki, and M. Yoshimoto, "An ultra low power motion estimation processor for MPEG2 HDTV resolution video," IEICE Trans. Electron., vol.E86-C, no.4, pp.561–569, April 2003.

- [3] A. Harada, S. Hattori, T. Kasezawa, H. Sato, T. Matsumura, S. Kumaki, K. Ishihara, H. Segawa, A. Hanami, Y. Matsuura, K. Asano, T. Yoshida, M. Yoshimoto, and T. Murakami, "An architectural study of an MPEG-2 422P@HL encoder chip set," IEICE Trans. Fundamentals, vol.E83-A, no.8, pp.1614–1623, Aug. 2000.
- [4] M. Ikeda, T. Kondo, K. Nitta, K. Suguri, T. Yoshitome, T. Minami, H. Iwasaki, K. Ochiai, J. Naganuma, M. Endo, Y. Tashiro, H. Watanabe, N. Kobayashi, T. Okubo, T. Ogura, and R. Kasai, "SuperENC: MPEG-2 video encoder chip," IEEE Micro, vol.19, no.4, pp.56–65, July–Aug. 1999.
- [5] F.S. Rovati, D. Pau, E. Piccinelli, L. Pezzoni, and J.-M. Bard, "An innovative, high quality and search window independent motion estimation algorithm and architecture for MPEG-2 encoding," IEEE Trans. Consum. Electron., vol.46, no.3, pp.697–705, Aug. 2000.
- [6] T. Onoye, G. Fujita, M. Takatsu, I. Shirakawa, and N. Yamai, "Single chip implementation of motion estimator dedicated to MPEG2 MP@HL," IEICE Trans. Fundamentals, vol.E79-A, no.8, pp.1210–1216, Aug. 1996.
- [7] J. Miyakoshi, Y. Murachi, K. Hamano, T. Matsuno, M. Miyama, and M. Yoshimoto, "A low-power systolic array architecture for block-matching motion estimation," IEICE Trans. Electron., vol.E88-C, no.4, pp.559–569, April 2005.
- [8] S. Uramoto, A. Takabatake, M. Suzuki, H. Sakurai, and M. Yoshimoto, "A half-pel precision motion estimation processor for NTSC-resolution video," IEICE Trans. Electron., vol.E77-C, no.12, pp.1937–1943, Dec. 1994.
- [9] T. Minami, T. Kondo, K. Suguri, and R. Kasai, "A proposal of a one-dimensional array architecture for the full-search block matching algorithm," IEICE Trans. Inf. & Syst. (Japanese Edition), vol.J78-D-I, no.12, pp.913–925, Dec. 1995.

**Yuichiro Murachi** was born on November 1, 1980. He received a B.S. degree from Kanazawa University in 2003. He received an M.S. degree from Kanazawa University in 2005. He is currently enrolled in the Doctoral course in Kobe University. His research interests are VLSI systems and implementation of multimedia communication systems.

**Koji Hamano** was born on November 5, 1981. He received a B.E. degree in Electrical and Electronic Engineering from Kanazawa University in 2004. He is currently a Master's student at Kanazawa University. His research interests are low-power VLSI systems.

**Tetsuro Matsuno** received a B.E. degree in Electrical and Electronic Engineering from Kanazawa University in 2004. He is currently a Master's student at Kanazawa University. His interests include low-power VLSI systems.

**Junichi Miyakoshi** was born in 1980. He received a B.S. degree from Kanazawa University in 2002. He received an M.S. degree from Kanazawa University in 2004. He is currently enrolled in the Doctoral course in Kobe University. His research focus is low-power VLSI techniques for image processing.

**Masayuki Miyama** was born on March 26, 1966. He received a B.S. degree in Computer Science from the University of Tsukuba in 1988. He joined PFU Ltd. in 1988. He received an M.S. degree in Computer Science from the Japan Advanced Institute of Science and Technology in 1995. He joined Innotech Co. in 1996. He received the Ph.D. degree in electrical engineering and computer science from Kanazawa University in 2004. He is a research assistant at the Department of Electrical and Electronic Engineering at Kanazawa University. His present research focus is low-power design techniques for multimedia VLSI.



**Masahiko Yoshimoto** received a B.S. degree in electronic engineering from Nagoya Institute of Technology, Nagoya, Japan, in 1975, and an M.S. degree in electronic engineering from Nagoya University, Nagoya, Japan, in 1977. He received a Ph.D. degree in Electrical Engineering from Nagoya University, Nagoya, Japan in 1998. He joined the LSI Laboratory, Mitsubishi Electric Corp., Itami, Japan, in April 1977. From 1978 to 1983 he was engaged in the design of NMOS and CMOS static RAM including a 64 K full CMOS RAM with the world's first divided-word-line structure. From 1984, he was involved in research and development of multimedia ULSI systems for digital broadcasting and digital communication systems based on MPEG2 and MPEG4 Codec LSI core technology. Since 2000, he has been a Professor of the Dept. of Electrical and Electronic Systems Engineering at Kanazawa University, Japan. Since 2004, he has been a Professor of the Dept. of Computer and Systems Engineering at Kobe University, Japan. His current activity is focused on research and development of multimedia and ubiquitous media VLSI systems including an ultra-low-power image compression processor and a low power wireless interface circuit. He holds 70 registered patents. He served on the Program Committee of the IEEE International Solid State Circuit Conference from 1991 to 1993. In addition, he has served as a Guest Editor for special issues on Low-Power System LSI, IP, and Related Technologies of IEICE Transactions in 2004. He received the R&D100 awards from R&D Magazine for development of the DISP and development of a realtime MPEG2 video encoder chipset in 1990 and 1996, respectively.