A low-power hardware decoding solution for JPEG images
To support real-time data processing in low-power applications, this paper proposes a parallel, fully pipelined JPEG decoder implementation with a clock management mechanism.
China is currently preparing to build out the Internet of Things. This drives the development of sensor technology, and the massive volume of data generated by digital image sensors strains the storage capacity, transmission bandwidth, and power budget of real-time communication systems. In fields such as medical and remote-sensing image communication, where the restored image must be of high quality, demand is increasingly urgent for image encoders/decoders that combine low power consumption, good compression/decompression performance, and real-time processing. The JPEG still-image compression/decompression standard offers excellent compression performance, needs little memory, and has relatively low complexity, which makes it well suited to hardware implementation.
1 JPEG decoding algorithm
JPEG (Joint Photographic Experts Group) is a widely used still-image compression standard. JPEG compression is lossy: it exploits the characteristics of the human visual system and combines quantization with lossless entropy coding to remove both visually redundant information and the redundancy of the data itself. The JPEG decoder consists of Huffman decoding, inverse quantization (IQ), and the inverse discrete cosine transform (IDCT). In JPEG, an image is decoded block by block. The entire image is divided into 8×8 data blocks (referred to here as MCUs), each of which corresponds to an 8×8 pixel array of the original image. Block rows are coded from top to bottom, and the blocks within each row from left to right [1].
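The block ordering and the inverse quantization stage can be illustrated with a minimal Python sketch. The function names `block_origins` and `dequantize` are hypothetical and chosen only for this illustration; the sketch assumes the image dimensions are multiples of 8 and that coefficients and quantization table entries are given in the same order.

```python
def block_origins(width, height, block=8):
    """Yield the (x, y) pixel origin of each block in decoding order:
    block rows from top to bottom, blocks within a row from left to right."""
    for y in range(0, height, block):
        for x in range(0, width, block):
            yield (x, y)


def dequantize(coeffs, qtable):
    """Inverse quantization: scale each quantized coefficient by the
    matching quantization table entry."""
    return [c * q for c, q in zip(coeffs, qtable)]


if __name__ == "__main__":
    # A 24x16 image contains two rows of three 8x8 blocks.
    print(list(block_origins(24, 16)))
    print(dequantize([3, -1, 0], [16, 11, 10]))
```

In a hardware decoder the same ordering is typically produced by a pair of block counters rather than a software loop.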
2 Parallel Huffman decoder
Huffman coding produces variable-length codewords. If the decoder is implemented serially, the number of cycles needed to decode one codeword varies with its length, which makes a serial design inefficient for real-time systems. In addition, if the data is corrupted by noise during transmission, the entire data set becomes useless. This paper proposes the following solution to these two problems. Figure 1 shows the main components and algorithm flow of Huffman decoding.
Algorithm flow: 32 bits of compressed image data are read from the input, the input stream is analyzed to determine the code length, the input data is shifted by that length, and new data is appended from the input. The codeword is translated back into the original data through the Huffman table, and the sign bits embedded in the data stream are extracted. After a series of division and subtraction operations, the frequency-domain data as it was before encoding is recovered, combined with the previously extracted sign bits, and sent to the output buffer.
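The paper does not spell out the arithmetic used to recover the signed coefficient. As a point of reference only, the sketch below shows the sign-recovery convention of baseline JPEG, in which the Huffman symbol gives a bit count `size` and the following `size` raw bits encode the signed value. The function name `extend` follows the name of that procedure in the JPEG standard; treating it as the operation performed at this step is an assumption.

```python
def extend(value, size):
    """Map `size` raw bits to a signed coefficient (baseline JPEG convention).

    A leading 0 bit marks a negative number, a leading 1 bit a positive one.
    This is a reference sketch, not the paper's circuit.
    """
    if size == 0:
        return 0
    if value < (1 << (size - 1)):        # leading bit is 0: negative value
        return value - (1 << size) + 1
    return value                         # leading bit is 1: positive value


if __name__ == "__main__":
    for raw in (0b000, 0b010, 0b101, 0b111):
        print(format(raw, "03b"), extend(raw, 3))
```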
The algorithm used in this paper exploits the structure of the Huffman table and eliminates multiplication from the decoding algorithm, so the code length is determined in a single cycle. The code table entries are sorted by code length in ascending order, and entries with the same code length are sorted by codeword value in ascending order. For each table, the decoding results (DR) corresponding to the codewords are stored in ROM in this order. This simplifies table lookup and minimizes the ROM required, which helps meet the low-power requirement. The lookup-table address generator derives a base address from the code length supplied by the "length matching" module. It then takes that many consecutive bits from the input data as an offset address, and the sum of the two addresses is the address at which the DR is stored [2].
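The base-plus-offset lookup described above resembles canonical Huffman decoding, which can be modeled in software as follows. This is a sketch under stated assumptions: the names `build_tables` and `decode_one` are hypothetical, the code lengths are tried one after another in a loop (the hardware compares all lengths in parallel within one cycle), and the offset is computed as the codeword minus the first codeword of its length, which may differ in detail from the paper's key-bit scheme.

```python
def build_tables(counts):
    """counts[i] is the number of codewords of length i + 1.

    Returns, per code length, the first codeword of that length and the
    ROM index (base address) of its decoding result.
    """
    first_code, first_index = {}, {}
    code = index = 0
    for length, n in enumerate(counts, start=1):
        first_code[length] = code
        first_index[length] = index
        code = (code + n) << 1           # canonical code assignment
        index += n
    return first_code, first_index


def decode_one(bits, counts, symbols, tables):
    """Decode one codeword from a list of bits; return (symbol, code length)."""
    first_code, first_index = tables
    code = 0
    for length, n in enumerate(counts, start=1):
        code = (code << 1) | bits[length - 1]
        offset = code - first_code[length]
        if 0 <= offset < n:              # "length matching" succeeded
            return symbols[first_index[length] + offset], length
    raise ValueError("no matching codeword")


if __name__ == "__main__":
    counts = [0, 2, 1]                   # codewords: 00, 01, 100
    symbols = ["a", "b", "c"]            # decoding results in ROM order
    tables = build_tables(counts)
    print(decode_one([0, 1, 1], counts, symbols, tables))
    print(decode_one([1, 0, 0], counts, symbols, tables))
```

Because the decoding results are stored in sorted order, the only per-length state is a first codeword and a base address, which is why the ROM can stay small.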
Because the key bits are the trailing bits of the codeword, the input data is shifted according to the code length so that the last key bit lands at the nth bit, and the shifter outputs only the few bits from the nth bit downward. Such a circuit needs only one barrel shift register, controlled solely by the code length. In addition, an address correction string consisting of a run of 0s followed by a run of 1s is generated for each table, with as many 1s as there are key bits. This part of the circuit is logically simple and occupies little area. ANDing the address correction string with the output of the barrel shift register yields the correct offset address. Since the Huffman table needs at most 9 bits and the code length is at most 19 bits, this paper designs a barrel shift register with a 19-bit input and a 9-bit output. The improved circuit occupies about 50% of the area of the original design.
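The shift-and-mask step can be modeled with a few bit operations. The sketch below makes several assumptions that the text does not state: the codeword is left-aligned in the 19-bit input window, the last key bit is moved to bit 0 of the output, and the function name `offset_address` and its parameters are hypothetical.

```python
WINDOW_BITS = 19   # barrel shifter input width given in the text
OFFSET_BITS = 9    # barrel shifter output width given in the text


def offset_address(window, code_len, key_bits):
    """Extract the offset address from a left-aligned 19-bit input window.

    The shift amount depends only on the code length; the address correction
    string (0s followed by `key_bits` 1s) clears everything but the key bits.
    """
    shifted = (window >> (WINDOW_BITS - code_len)) & ((1 << OFFSET_BITS) - 1)
    correction = (1 << key_bits) - 1
    return shifted & correction


if __name__ == "__main__":
    # 4-bit codeword 1101 followed by 15 bits of later stream data.
    window = (0b1101 << 15) | 0b101010101010101
    print(offset_address(window, code_len=4, key_bits=2))
```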
3 IDCT processor
Figure 2 shows the overall block diagram of the inverse discrete cosine transform (IDCT) circuit and the block diagram of its 2D IDCT unit. The DCT coefficients pass through the inverse quantization and inverse scanning circuits and are written to the IDCT input buffer. The global control circuit controls the input to the 2D IDCT unit, sends the transformed data to the output buffer, and asserts the Ready signal to notify the motion compensation unit that the IDCT data can be read. The 2D IDCT unit performs two 1D IDCT operations. The row-wise 1D IDCT is performed first. The intermediate result is then transposed and buffered in the transposition memory, and the column-wise 1D IDCT is applied to obtain the final IDCT result [3].
The IDCT design uses a zero-value judgment logic circuit, a gated clock, a parallel pipeline, and other techniques, which greatly reduce the power consumption of the whole circuit while still meeting the processing speed and accuracy requirements.
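The row-column decomposition can be checked with a small floating-point model. This is a reference sketch in Python, not the paper's fixed-point datapath; the function names `idct_1d`, `transpose`, and `idct_2d` are hypothetical, and the orthonormal scaling used here is equivalent to the usual JPEG 2D IDCT definition.

```python
import math

N = 8


def idct_1d(coeffs):
    """Direct 8-point 1D IDCT with orthonormal scaling."""
    out = []
    for x in range(N):
        total = 0.0
        for u, c in enumerate(coeffs):
            scale = math.sqrt(1 / N) if u == 0 else math.sqrt(2 / N)
            total += scale * c * math.cos((2 * x + 1) * u * math.pi / (2 * N))
        out.append(total)
    return out


def transpose(block):
    """Model of the transposition memory: swap rows and columns."""
    return [list(col) for col in zip(*block)]


def idct_2d(block):
    """2D IDCT as a row pass, a transpose, and a column pass."""
    rows = [idct_1d(r) for r in block]             # first 1D pass (rows)
    cols = [idct_1d(c) for c in transpose(rows)]   # second 1D pass (columns)
    return transpose(cols)                         # restore row-major order


if __name__ == "__main__":
    dc_only = [[0.0] * N for _ in range(N)]
    dc_only[0][0] = 8.0
    print(idct_2d(dc_only)[0])
```

A block that contains only a DC coefficient decodes to a flat block, which makes a convenient sanity check for either pass.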
3.1 Zero-value judgment logic circuit
Throughout image decoding, about 90% of the DCT coefficients in each 8×8 data block are zero, and performing the IDCT on these zero values is pointless. This design therefore adds zero-value judgment logic to eliminate unnecessary multiplications. The zero-value judgment logic circuit consists of an 8×8 accumulator array, a zero-value judgment logic module, and a selector (MUX). When the zero-value logic module finds that the operands are not all zero, the enable signal goes high, the operands are loaded into the registers, and the multiplication is performed. If the operands are all zero, the array is disabled and 0 is output directly through the MUX. The zero-value judgment logic effectively reduces power consumption, and because the circuit is simple, its area and delay overhead are almost negligible.
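A behavioral model makes the bypass easy to see. In this sketch the names `zero_gated` and `process_block` are hypothetical, the gating is applied per row of operands (the granularity used in the actual circuit is not stated), and `transform` stands in for the multiplier array.

```python
def zero_gated(transform, operands):
    """Return (result, enabled).

    If every operand is zero, the transform is bypassed and zeros are output
    directly, as the MUX path does; otherwise the transform is enabled.
    """
    if all(v == 0 for v in operands):
        return [0] * len(operands), False
    return transform(operands), True


def process_block(block, transform):
    """Apply the gated transform to each row and count the enabled rows."""
    outputs, active = [], 0
    for row in block:
        out, enabled = zero_gated(transform, row)
        outputs.append(out)
        active += enabled
    return outputs, active


if __name__ == "__main__":
    block = [[5, 0, 0, 0, 0, 0, 0, 0]] + [[0] * 8 for _ in range(7)]
    doubled, active_rows = process_block(block, lambda r: [2 * v for v in r])
    print(active_rows, "of", len(block), "rows needed the multiplier array")
```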
3.2 Latch-based gated clock
Controlling a circuit's input clock allows part of the circuit to run at a lower frequency or to stop entirely, which reduces the power consumption of the whole circuit. The 2D DCT/IDCT circuit consists mainly of three parts: the 1D DCT/IDCT unit, the transposition memory, and the input/output processing unit.
The transposition memory is updated only at the end of each 1D DCT/IDCT pass, and the input/output processing unit operates only while data is being input or output. Gating the input clocks of these parts so that they are stopped most of the time therefore reduces power consumption effectively. The design results show that system power consumption is reduced by 13% at the cost of only a 2% increase in area.
A latch-based gated clock provides this function. Its advantages are that it needs no data selector, occupies little area, reduces the capacitance on the clock network, and reduces the internal power consumption of the gated registers. The latch-based gated clock circuit and its timing are shown in Figure 3.
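The behavior of a latch-based clock gate can be simulated sample by sample. The sketch below models the conventional arrangement, a latch that is transparent while the clock is low followed by an AND gate; whether Figure 3 uses exactly this arrangement is an assumption, and the function name `gated_clock` is hypothetical.

```python
def gated_clock(clk_samples, en_samples):
    """Simulate a latch-based clock gate over sampled clock and enable values.

    The latch follows the enable only while the clock is low, so a change of
    the enable during the high phase cannot produce a glitch on the output.
    """
    latched = 0
    out = []
    for clk, en in zip(clk_samples, en_samples):
        if clk == 0:
            latched = en                 # latch is transparent while clk is low
        out.append(clk & latched)        # AND gate produces the gated clock
    return out


if __name__ == "__main__":
    clk = [0, 1, 0, 1, 0, 1, 0, 1]
    en = [1, 0, 0, 0, 1, 1, 0, 1]
    print(gated_clock(clk, en))
```

In the example the enable falls during the first high phase and rises during the last one; neither change reaches the output until the next low phase.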
3.3 Parallel pipeline
This design replaces the floating-point multiplication unit of the fast IDCT algorithm with addition and shift operations, and uses a highly parallel, pipelined VLSI architecture to speed up data processing, so that data is processed in less than one-fifth of the time taken by a serial structure. The clock frequency can therefore be reduced to about 1/5 of that of the serial structure, which lowers the power consumption of the system. For example, two 16×8 multipliers compute the high-order and low-order portions in parallel, producing a high-order partial product and a low-order partial product that are then combined by a shift and an addition. The circuit achieves overlap in time, resource reuse, and resource sharing, which increases the parallelism of the system and improves the speed and efficiency of the multiplication circuit.
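Both ideas in this section, shift-and-add multiplication by a constant and a product assembled from two partial products, can be checked with integer arithmetic. The sketch below assumes unsigned operands and a 16×16-bit product whose second operand is split into two bytes; the constant 181 (approximately 256·cos(π/4)) and the function names are illustrative choices, not values taken from the paper.

```python
def mul_split(a, b):
    """16x16 unsigned multiply built from two 16x8 partial products."""
    b_hi = (b >> 8) & 0xFF
    b_lo = b & 0xFF
    p_hi = a * b_hi                      # high-order partial product
    p_lo = a * b_lo                      # low-order partial product
    return (p_hi << 8) + p_lo            # shift, then add


def mul_by_181(x):
    """Multiply by 181 = 128 + 32 + 16 + 4 + 1 using shifts and adds only."""
    return (x << 7) + (x << 5) + (x << 4) + (x << 2) + x


if __name__ == "__main__":
    print(mul_split(12345, 54321) == 12345 * 54321)
    print(mul_by_181(7) == 7 * 181)
```

In hardware the two partial products can be computed in the same cycle by separate multipliers, and the final shift is just wiring.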
4 Simulation and synthesis results
This paper uses a 1920×1080 JPEG image as the test input. The ModelSim waveform from the RTL-level simulation is shown in Figure 4, where JPEG_DATA is the code stream data and OutR, OutG, and OutB are the decoded outputs [4]. The decoding core module is synthesized at a frequency of 100 MHz [5], and the results are shown in Table 1.
Unlike earlier work that decodes JPEG in software, this paper implements JPEG decoding in hardware, improves the hardware structure, and reduces the energy consumed by hardware decoding through several easy-to-apply methods. Verification with EDA tools shows that the design fully meets the requirements for hardware decoding of JPEG images.