High Performance DSP Architectures
Published on Aug 15, 2016
Digital Signal Processing is carried out by mathematical operations. Digital Signal Processors are microprocessors specifically designed to handle Digital Signal Processing tasks. These devices have seen tremendous growth in the last decade, finding use in everything from cellular telephones to advanced scientific instruments. In fact, hardware engineers use "DSP" to mean Digital Signal Processor, just as algorithm developers use "DSP" to mean Digital Signal Processing.
DSP has become a key component in many consumer, communications, medical, and industrial products. These products use a variety of hardware approaches to implement DSP, ranging from the use of off-the-shelf microprocessors to field-programmable gate arrays (FPGAs) to custom integrated circuits (ICs).
Programmable "DSP processors," a class of microprocessors optimized for DSP, are a popular solution for several reasons. In comparison to fixed-function solutions, they have the advantage of potentially being reprogrammed in the field, allowing product upgrades or fixes. They are often more cost-effective than custom hardware, particularly for low-volume applications, where the development cost of ICs may be prohibitive. DSP processors often have an advantage in terms of speed, cost, and energy efficiency.
DSP Algorithms Mould DSP Architectures
From the outset, DSP algorithms have moulded DSP processor architectures. For nearly every feature found in a DSP processor, there are associated DSP algorithms whose computation is in some way eased by inclusion of this feature. Therefore, perhaps the best way to understand the evolution of DSP architectures is to examine typical DSP algorithms and identify how their computational requirements have influenced the architectures of DSP processors.
The FIR filter is mathematically expressed as a vector of input data, along with a vector of filter coefficients. For each "tap" of the filter, a data sample is multiplied by a filter coefficient, with the result added to a running sum for all of the taps . Hence, the main component of the FIR filter algorithm is a dot product: multiply and add, multiply and add.
These operations are not unique to the FIR filter algorithm; in fact, multiplication is one of the most common operations performed in signal processing convolution, IIR filtering, and Fourier transforms also all involve heavy use of multiply-accumulate operations. Originally, microprocessors implemented multiplications by a series of shift and add operations, each of which consumed one or more clock cycles. As might be expected, faster multiplication hardware yields faster performance in many DSP algorithms, and for this reason all modern DSP processors include at least one dedicated single- cycle multiplier or combined multiply-accumulate (MAC) unit.
Multiple Execution Units
DSP applications typically have very high computational requirements in comparison to other types of computing tasks, since they often must execute DSP algorithms in real time on lengthy segments of signals sampled at 10-100 KHz or higher. Hence, DSP processors often include several independent execution units that are capable of operating in parallel for example, in addition to the MAC unit, they typically contain an arithmetic- logic unit (ALU) and a shifter.
Efficient Memory Accesses
Executing a MAC in every clock cycle requires more than just a single-cycle MAC unit. It also requires the ability to fetch the MAC instruction, a data sample, and a filter coefficient from memory in a single cycle. To address the need for increased memory bandwidth, early DSP processors developed different memory architectures that could support multiple memory accesses per cycle. Often, instructions were stored in the memory bank, while data was stored in another. With this arrangement, the processor could fetch an instruction and a data operand in parallel in every cycle.
Since many DSP algorithms consume two data operands per instruction, a further optimization commonly used is to include a small bank of RAM near the processor core that is used as an instruction cache. When a small group of instructions is executed repeatedly, the cache is loaded with those instructions, freeing the instruction bus to be used for data fetches instead of instruction fetches thus enabling the processor to execute a MAC in a single cycle. High memory bandwidth requirements are often further supported via dedicated hardware for calculating memory addresses.
These address generation units operate in parallel with the DSP processor's main execution units, enabling it to access data at new locations in memory without pausing to calculate the new address. Memory accesses in DSP algorithms tend to exhibit very predictable patterns; for example, for each sample in an FIR filter, the filter coefficients are accessed sequentially from start to finish for each sample, then accesses start over from the beginning of the coefficient vector when processing the next input sample.