Design of Speech Recognition Chip Structure Based on C Language Design Flow Optimization


Demand for voice control applications is predicted to grow sharply, driven mainly by the telephone market, where phones will increasingly be controlled by voice commands. Other applications include toys, handheld devices such as calculators, voice-controlled security systems, home appliances and in-vehicle equipment (stereo, windows, climate control, lights and navigation). This article discusses the considerations behind the architecture of a speech recognition chip from the standpoint of reusability and optimized chip area; the approach also lends itself to developing a family of other speech recognition chips.



Singapore's Columns was an early entrant in portable voice-controlled products. One of them was a voice-controlled European currency converter that converts between the euro and other European currencies. The design requirements of the Euro Converter were: 1. low power, with a battery life of at least 1 year; 2. low price, with a retail price of no more than 9 US dollars; 3. high flexibility, with accurate recognition and synthesis of speaker-dependent speech in multiple languages; 4. a reusable voice control core.

This article describes the entire process of developing the Euro Converter ASIC with Frontier Design's design tools. Implementing complex DSP algorithms in an ASIC is usually very demanding, but Frontier's architectural synthesis tool, A|RT Designer, makes it possible to reach an optimized RTL description quickly and to choose freely among alternative architectures to optimize the application.

With a C-based design flow, new features can be added and the hardware optimized during the architecture design phase, which can reduce the silicon area by 50%. Because the hardware can be prototyped quickly from C, the performance of the design can be pushed further to meet users' strict product specifications.

Algorithm

The effectiveness of the Euro Converter depends largely on how well voice commands are matched against the stored database and on the ability to execute them. Developing an algorithm that meets the requirements of the final product is critical to the success of the design: nobody accepts a voice-controlled device that cannot recognize commands consistently, so the algorithm must achieve better than 98% recognition accuracy throughout. The challenges therefore include detecting and removing background noise, distinguishing real command words from other sounds (breathing, faint static and microphone noise), determining where a command word starts and ends, and comparing the input sound with the stored voiceprint database for command word recognition (Figure 1).

The following computationally intensive DSP algorithms address these problems: 1. the Mel-frequency cepstral coefficient (MFCC) algorithm, which consists of a fast Fourier transform (FFT) spectrum, Mel scaling and a logarithm; 2. the inverse discrete cosine transform (iDCT); 3. a continuous noise level estimation procedure for background and speech noise that uses multiple estimation and selection algorithms; 4. an accurate command word boundary detection algorithm based on detailed analysis of the sound level during and near the active period of the command word; 5. dynamic time warping (DTW), which compares sequences of vectors of unequal length and duration.
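As an illustration of the last item, the following is a minimal sketch of textbook dynamic time warping in C. It is not the Euro Converter code: the feature dimension DIM, the integer feature type, the city-block frame distance and the function names are all assumptions made for this example.

```c
#include <limits.h>
#include <stdlib.h>

#define DIM 2  /* hypothetical feature vector size, not from the article */

/* City-block distance between two feature vectors of DIM integers. */
static long frame_dist(const int *a, const int *b)
{
    long sum = 0;
    for (int k = 0; k < DIM; k++)
        sum += labs((long)a[k] - (long)b[k]);
    return sum;
}

static long min3(long a, long b, long c)
{
    long m = a < b ? a : b;
    return m < c ? m : c;
}

/* Textbook DTW: cost of the cheapest monotonic alignment between
 * sequence x (n frames) and sequence y (m frames). Frames are stored
 * back to back, DIM integers each.
 * Returns -1 on empty input or allocation failure. */
long dtw_distance(const int *x, int n, const int *y, int m)
{
    const long INF = LONG_MAX / 2;   /* "unreachable" marker */
    if (n <= 0 || m <= 0)
        return -1;

    long *prev = malloc((size_t)(m + 1) * sizeof *prev);
    long *curr = malloc((size_t)(m + 1) * sizeof *curr);
    if (!prev || !curr) {
        free(prev);
        free(curr);
        return -1;
    }

    /* Row 0: only the empty-versus-empty alignment costs nothing. */
    prev[0] = 0;
    for (int j = 1; j <= m; j++)
        prev[j] = INF;

    for (int i = 1; i <= n; i++) {
        curr[0] = INF;
        for (int j = 1; j <= m; j++) {
            long d = frame_dist(x + (size_t)(i - 1) * DIM,
                                y + (size_t)(j - 1) * DIM);
            /* Extend the cheapest of: step in x, step in y, step in both. */
            curr[j] = d + min3(prev[j], curr[j - 1], prev[j - 1]);
        }
        long *tmp = prev;
        prev = curr;
        curr = tmp;
    }

    long result = prev[m];
    free(prev);
    free(curr);
    return result;
}
```

A spoken command stretched in time (a frame repeated) aligns with its template at zero cost, which is the property that makes DTW suitable for comparing utterances of different duration.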

The algorithm was written in floating-point C. Floating-point C code compiles and simulates quickly enough that parameters can be tuned and the algorithm's performance verified. The C code also had to run on a conventional PC so that the speech recognition and synthesis algorithms could be tested in real-world environments. The final speech recognition algorithm was tested on a 450 MHz Pentium and achieved 99% recognition accuracy on the company's internal voice recording library.

Floating point algorithm to fixed point algorithm

A chip implementation requires converting the floating-point algorithm to fixed point while preserving dynamic range and precision, so that no value exceeds its range after conversion. An undersized range for fixed-point operands can make an operand wrap around (for example, (max+1) yields (min)) or clip severely, producing bit errors. Fixed-point precision is equally important, especially in iterative signal processing. With too little precision, errors propagate and accumulate through the iterations, and the signal may gradually degenerate into white noise, which is catastrophic for a voice control product.
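The difference between wraparound and clipping can be shown with a small C sketch. The 16-bit width and the function names add_wrap and add_sat are assumptions chosen for illustration; they are not taken from the design.

```c
#include <stdint.h>

/* 16-bit addition that wraps around on overflow, as a plain
 * two's-complement adder would: (max + 1) yields (min). */
int16_t add_wrap(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX)
        sum -= 65536;
    else if (sum < INT16_MIN)
        sum += 65536;
    return (int16_t)sum;
}

/* 16-bit addition that saturates (clips) at the range limits. */
int16_t add_sat(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX)
        return INT16_MAX;
    if (sum < INT16_MIN)
        return INT16_MIN;
    return (int16_t)sum;
}
```

With these definitions, add_wrap(32767, 1) gives -32768, a full-scale sign flip, while add_sat(32767, 1) stays at 32767. Saturation distorts the signal less, but both are errors that a correctly sized dynamic range avoids.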

The Frontier tools include a C++ class library, the A|RT library, for analyzing the fixed-point behavior of C code. The library supports multiple fixed-point data types, provides bit-true modeling of overflow behaviors such as saturation and wraparound, and provides quantization models such as truncation and rounding. The original 32-bit floating-point speech recognition algorithm accepts input data sampled at 8 kHz, uses a typical signal word width of 32 bits and needs several kilobytes of memory. The output of a typical voice user interface amounts to a few bytes per second.
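The two quantization models mentioned above can be sketched in plain C as follows. This does not use the A|RT library, whose interface is not shown in this article; the function names are hypothetical, and the sketch assumes a shift count below 31 and inputs far enough from the 32-bit limits that adding the rounding offset cannot overflow.

```c
#include <stdint.h>

/* Drop the s least significant bits by truncation, i.e. rounding
 * toward minus infinity, which is what discarding bits does in
 * two's-complement hardware. Assumes s < 31. */
int32_t quant_trunc(int32_t v, unsigned s)
{
    int32_t step = (int32_t)1 << s;
    int32_t q = v / step;        /* C division rounds toward zero */
    if (v % step < 0)
        q -= 1;                  /* correct negative values downward */
    return q;
}

/* Drop the s least significant bits with rounding to nearest:
 * add half a step first, then truncate. */
int32_t quant_round(int32_t v, unsigned s)
{
    if (s == 0)
        return v;
    return quant_trunc(v + ((int32_t)1 << (s - 1)), s);
}
```

Removing two fractional bits from the value 7 (1.75 when read with two fractional bits) gives 1 with truncation and 2 with rounding. Truncation is cheaper in hardware but introduces a systematic negative bias that can accumulate in iterative code.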

Code merge to achieve the final product

Analysis showed that the global data types and arrays need only 16 bits (1 sign bit, 10 dynamic bits, 5 precision bits) to preserve the accuracy of the algorithm without adding noise. The highly iterative FFT subroutine, however, needs 1 sign bit, 8 dynamic bits and 7 precision bits. Normally such an analysis would lead to a global word width of 19 bits, wide enough for the largest dynamic and precision requirements of any operation. Because the A|RT library allows the word format to be changed locally, the global data type is defined with 1 sign bit, 10 dynamic bits and 5 precision bits, while the FFT MAC result is assigned 1 sign bit, 8 dynamic bits and 7 precision bits, so the word width of the design (including the buses) stays at 16 bits. This saves a lot of silicon area.
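A minimal C sketch of moving a value between the two 16-bit formats is given below. The names Q10.5 and Q8.7 and both functions are labels introduced here for illustration; saturation on the way into the narrower range is an assumption, not something the article specifies.

```c
#include <stdint.h>

/* Q10.5: 1 sign bit, 10 dynamic (integer) bits, 5 precision bits.
 * Q8.7 : 1 sign bit,  8 dynamic (integer) bits, 7 precision bits.
 * Both occupy one 16-bit word. */

/* Gain two precision bits, lose two dynamic bits; values outside
 * the Q8.7 range are saturated (an assumption for this sketch). */
int16_t q10_5_to_q8_7(int16_t v)
{
    int32_t wide = (int32_t)v * 4;
    if (wide > INT16_MAX)
        return INT16_MAX;
    if (wide < INT16_MIN)
        return INT16_MIN;
    return (int16_t)wide;
}

/* Gain two dynamic bits, lose two precision bits by truncation
 * toward minus infinity. */
int16_t q8_7_to_q10_5(int16_t v)
{
    int32_t q = v / 4;
    if (v % 4 < 0)
        q -= 1;
    return (int16_t)q;
}
```

For example, 1.5 is stored as 48 in Q10.5 and as 192 in Q8.7, so the conversion is a shift by two bits. A value of 300.0 (9600 in Q10.5) does not fit in Q8.7, which is why the range analysis of each subroutine matters before formats are mixed.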

Once the conversion to fixed-point C is complete, the code can be compiled with a conventional C++ compiler and run on a PC (or on an HP or Sun workstation). The bit-true definition of all signals guarantees a correct mapping to hardware and direct interfacing with other digital design tools such as HDL compilers and simulators. Combining the fixed-point recognition code with the C code of the Euro Converter application yields the complete executable code for the final product.

System design considerations

To meet the cost target, a single-chip SoC is the only viable solution. The SoC must integrate the following resources onto a chip of no more than 25,000 gates: 1. the speech recognition and synthesis (SRS) core; 2. the SRS program and the Euro Converter code (up to 30 KB); 3. speech synthesis samples (up to 30 KB); 4. RAM for storing voiceprints and intermediate results (up to 30 KB); 5. AD/DA converters; 6. a microphone interface; 7. a speaker interface.

Power consumption is also an important consideration: battery life should be at least one and a half years. To meet this demanding requirement, the system needs a power-saving mode, voiceprint storage in RAM, a processor with a low clock frequency and a high-efficiency audio amplifier.

SRS processor structure

Given the processing requirements and the low-power constraint, choosing the target clock frequency was the first priority. Based on initial estimates of power consumption and processing load, a clock frequency of 2 to 4 MHz was judged sufficient. 3.579 MHz was chosen because it is the color subcarrier frequency of the NTSC video system, so quartz crystals for it are inexpensive.

The algorithm needs to detect and remove background noise. To move from the 450 MHz clock of the Pentium to a 3.5 MHz clock while keeping the gate count of the chip below 25,000, the SRS needs a dedicated architecture.

Designing a dedicated processor is normally time-consuming and laborious, because the algorithm has to be rewritten in an HDL to reach the best solution. The A|RT Designer tool instead synthesizes a controller-based architecture directly from the high-level C algorithm. Design engineers analyze and optimize the result and then convert it to Verilog or VHDL code.

The design engineer uses the A|RT Designer tool to synthesize the appropriate structure for the speech recognition algorithm, followed by the RTL description. The tool allocates the necessary data path resources (multipliers, adders, ALUs, I/O, RAM, ROM, etc.), allocates arithmetic operations to these resources, and schedules the operations. A controller, microcode (to control resource allocation and scheduling), registers, multiplexers, and buses are automatically generated.

The key requirement in mapping the SRS algorithm to hardware is to run the full SRS code at the 3.5 MHz target clock frequency without exceeding the 25,000-gate limit. Using A|RT Designer's "load view", the design engineers identified several multi-cycle operations that were performance bottlenecks. Selecting a bottleneck in the view shows the corresponding C code, so the engineer can find the cause and try alternative solutions.

The most obvious bottleneck was the FFT-heavy computation in the Mel operation, which took 80% of the real-time processing cycles. Adding a two-stage adder and an address calculation unit (ACU) cut the FFT to 10% of its original cycle count. The extra hardware costs only 4,000 gates, which is within the hardware budget. Even with this improvement, the total cycle count was still too high for a 3.5 MHz clock.

Further analysis showed that the logarithm calculation could be improved. When the C algorithm runs on a RISC DSP (NSC CR16B), this operation takes about 1,000 cycles, roughly 15% of the real-time computing budget. Adding a dedicated application-specific unit (ASU) reduces the function to 3 cycles at a cost of only 200 gates. With these architectural changes the minimum required clock frequency is 1.5 MHz, less than half the target frequency.
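The article does not describe how the ASU computes the logarithm. Purely as an illustration of why a logarithm can be cheap in dedicated hardware, the sketch below uses a common approximation: the position of the leading one bit gives the integer part of log2, and the bits just below it give a linear estimate of the fraction. The function name, the 8-bit fraction and the sentinel for zero are assumptions.

```c
#include <stdint.h>

/* Approximate log2(x) for x > 0, returned with 8 fractional bits
 * (result / 256.0 is the estimate). The integer part is the index
 * of the most significant set bit; the fraction uses the linear
 * approximation log2(1 + f) ~ f on the next 8 bits.
 * Returns INT32_MIN for x == 0, where the logarithm is undefined. */
int32_t log2_q8(uint32_t x)
{
    if (x == 0)
        return INT32_MIN;

    int msb = 31;
    while (((x >> msb) & 1u) == 0)
        msb--;

    uint32_t frac;
    if (msb >= 8)
        frac = (x >> (msb - 8)) & 0xFFu;   /* bits below the leading one */
    else
        frac = (x << (8 - msb)) & 0xFFu;   /* pad short values with zeros */

    return (int32_t)msb * 256 + (int32_t)frac;
}
```

In hardware the loop becomes a priority encoder and the shift a barrel shifter, which is the kind of structure that fits in a few cycles and a few hundred gates. The linear fraction is coarse: for x = 3 it yields 1.5 rather than about 1.585, so a real design would need to check that such an error is acceptable or add a small correction table.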

The gate count and power of the speech recognition core can be optimized further by reducing the number of register flip-flops. Flip-flops are expensive (10 gates each) and consume a lot of power. A|RT Designer's "life-time view" shows how many cycles each variable stays alive and how often it is used. Moving rarely used but long-lived variables to RAM reduces the total number of registers, and with it the required silicon area and power. This measure saved 50% of the register gates while leaving ample headroom in the cycle budget.

RAM compression implementation

From the start of the design it was clear that 30 KB of RAM would be very tight. According to the SRS C code, each voiceprint (about 1 second of speech) occupies roughly 1 to 2 KB, so a vocabulary of 30 commands leaves little room in the SRAM for intermediate results. Because 30 KB of RAM already occupies a considerable silicon area, the silicon budget does not allow any more RAM (Figure 2).



With the entire chip fabricated in a standard 0.35 μm CMOS process, the only solution to the RAM space problem is some form of speech compression.

Voiceprint data can be compressed in two ways: losslessly or lossily. Several lossless compression methods are available as standard C source code. Using sampled voiceprint data as a reference, the best lossless algorithm achieved 30% compression. Lossy compression removes a further 20% without noticeably degrading recognition quality. The lossy stage is fully scalable, giving a variable compression ratio that depends on the actual voiceprint length or vocabulary size. The resulting C algorithm totals 500 lines and compresses voiceprints by 50%. The next step is to integrate the voice compression and speech recognition IP blocks.

The 500 lines of code are then simply merged with the 10,000 lines of SRS code as a new subroutine that is called whenever a voiceprint is stored in or read from RAM. The routine is computationally heavy, however: an initial estimate puts it at about 1.5 million clock cycles, comparable to the SRS processing itself. Fortunately, the nearly 2.5 MHz of spare effective clock frequency absorbs this load without further optimization. The compression scheme reduces the RAM requirement to 20-25 KB, leaving at least 5 KB of intermediate result memory for the processor.

Speaker interface implementation

A single-cell power management bias network, a digital-to-analog converter (DAC) and an analog amplifier would require a large chip area. Implementing a pulse width modulation (PWM) speaker driver directly in C avoids this problem.

How does the speaker sound? The C code can be converted directly to VHDL with the company's A|RT Builder C-to-HDL conversion tool, then synthesized with Exemplar's Leonardo Spectrum and mapped to a Xilinx Virtex FPGA. On the Xilinx FPGA board, the speaker connects directly to two digital outputs, and the sound can be heard as soon as the board is switched on.
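The idea of driving a speaker from two digital outputs can be sketched in C as below. This is not the Euro Converter driver: the 256-tick period, the 8-bit sample format, the complementary second output and all names are assumptions made for illustration.

```c
#include <stdint.h>

#define PWM_PERIOD 256u  /* hypothetical: clock ticks per PWM period */

/* Number of ticks per period during which the positive output is
 * high, for a signed 8-bit audio sample (-128..127 maps to 0..255). */
unsigned pwm_duty(int8_t sample)
{
    return (unsigned)(sample + 128);
}

/* Level (0 or 1) of the positive output at a given clock tick. */
int pwm_out_p(int8_t sample, unsigned tick)
{
    return (tick % PWM_PERIOD) < pwm_duty(sample);
}

/* Level of the negative output; assumed here to be the complement,
 * so the speaker sees the difference between the two pins. */
int pwm_out_n(int8_t sample, unsigned tick)
{
    return !pwm_out_p(sample, tick);
}
```

A sample of 0 produces a 50% duty cycle, so the average voltage across the speaker is zero; louder samples shift the duty cycle up or down. The speaker's own inertia and inductance act as the low-pass filter that a DAC and analog amplifier would otherwise provide.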

RTL description generation

Once the engineers are satisfied with the performance and architecture of the speech recognition SoC, the A|RT Designer tool automatically generates an RTL VHDL description for the final silicon. The tool generates the RTL code together with the microcode for the controller and the RAM, ROM and data path functions. In addition, A|RT Designer automatically generates testbenches at each stage of the design flow, so the simulation of the original floating-point algorithm can be compared with the simulations of the fixed-point C and HDL versions. The VHDL simulation corresponds strictly to the original floating point C code, which means that the SoC has the same precision as the floating point algorithm.

Final structure

All the functions required for the SRS ASIC are integrated on a single chip (Figure 2). In addition, all the IP developed for this SoC can be reused. The SRS algorithm is currently being applied to a DECT telephone speech recognizer based on the CR16B RISC core. The data compression function can also be reused to enhance a dedicated variable-bit-rate ADPCM audio compression codec (VADPCM). VADPCM can in turn be used in SRS cores, and the PWM algorithm and solution deliver high-quality audio output without analog components. The SRS implementation itself can be modified for the next generation of products.
