8. Implementation Aspects Flashcards
What is a real-time system?
Consider a processing scheme where the sampling rate at the input and at the output is the same. Such a processing unit is called a real-time system if it fulfills the timeliness property. This means that the system is sufficiently fast to ensure that always fs^(out) valid output samples y[k] are produced per second, whatever signal is applied to the input and whatever operation mode the processing scheme is in.
(p. 229)
What is a batch processing system?
Real-time systems contrast with batch processing systems, i.e. with systems that process a finite set of (prerecorded) data samples (typically stored in the memory of a microprocessor or on hard disk) in an off-line way. Batch processing systems are usually not dictated by time constraints: the user simply has to wait till all the samples have been processed.
(p. 230)
Comment on the correctness of the code.
When porting a signal processing solution onto dedicated hardware, it does not suffice to simply debug the code until all the warnings and (allocation) errors have disappeared: a thorough validation of the correctness of the code is mandatory. This is typically done by processing representative, random test signals both with the original (reference) code in the higher programming language and with the code written in the lower programming language. Only if the difference is small, i.e. around machine precision, can correctness generally be assumed.
(p. 230)
Make sure you can calculate the number of arithmetic operations per sample and the number of arithmetic operations per second that are needed to implement an algorithmic scheme (for example the FIR filtering operation of equation 8.1).
Observe that the number of (arithmetic) operations required to implement an (N + 1)-tap FIR filter amounts to N + 1 multiplications (assuming that all bn ≠ 0 and bn ≠ ±1) and N additions (assuming that all terms bnx[k − n] ≠ 0) per output sample y[k]. As, typically, both a multiplication and an addition of two real-valued numbers can be performed in one machine cycle, the number of (arithmetic) operations equals 2N + 1 op. per output sample. If fs is the rate at which the filter input x[k], and hence the output y[k], is sampled, the required number of (arithmetic) operations per second is given by (2N + 1)fs ops.
(p. 231)
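As a quick illustration of this cost estimate, a small Python sketch; the values of N and fs below are assumed example values, not figures from the text:

```python
# Minimal sketch of the cost estimate for the (N + 1)-tap FIR filter of
# equation 8.1; N and fs are assumed example values.
N = 99                              # filter order, i.e. N + 1 = 100 taps
fs = 48_000                         # sampling rate in Hz

mults_per_sample = N + 1            # one multiplication per tap (all bn != 0, +-1)
adds_per_sample = N                 # N additions to sum the N + 1 products
ops_per_sample = mults_per_sample + adds_per_sample     # = 2N + 1
ops_per_second = ops_per_sample * fs                    # = (2N + 1) * fs

print(ops_per_sample)               # 199 op. per output sample
print(ops_per_second)               # 9552000, i.e. roughly 9.6 Mops
```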
What are MIPS and MOPS?
Processing power is typically expressed in MIPS (Millions of Instructions per Second) or MOPS (Millions of Operations per Second), and can be found in the data sheet of the processor.
Keep in mind, though, that as each chip manufacturer uses its own definition of MIPS and MOPS, these numbers need to be interpreted with care.
(p. 231)
What is the disadvantage of cost estimates?
Recall that a cost estimate that merely counts the number of multiplications and additions is only a rough approximation of, and an underestimate for, the total execution time. Nevertheless, it is a useful instrument to highlight the dependence of the total cost on each of the algorithm parameters and to compare and benchmark different algorithmic schemes.
(p. 232)
Make sure you can calculate the required amount of memory that is needed to implement an algorithmic scheme in real time (for example the FIR filtering operation of equation 8.1).
For example, it is easily verified that for a real-time implementation of the FIR filter of equation 8.1 in total 2N + 1 data words need to be stored: N + 1 memory locations are required for the filter coefficients bn (assuming that all bn ≠ 0 and bn ≠ ±1) and N locations for delayed input samples, i.e. for x[k−1], . . . , x[k−N]. Depending on the specific implementation, a few more storage registers (memory cells) are needed to store additional intermediate variables and the current input x[k] and current output y[k]. The amount of memory required for the program code depends on the software encoding and the programming style.
(p. 232)
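A corresponding sketch of the data memory count (N is again an assumed example value):

```python
# Minimal sketch of the data word count for a real-time FIR filter (equation 8.1).
N = 99
coefficient_words = N + 1           # b0 ... bN
delay_line_words = N                # x[k-1] ... x[k-N]
total_words = coefficient_words + delay_line_words      # = 2N + 1 data words
print(total_words)                  # 199 (plus a few registers for x[k], y[k], temporaries)
```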
Which parts does the input/output delay of a real-time implementation scheme consist of?
- a delay due to the A/D and the D/A converter
- a latency caused by the operating system
- a delay inserted by the block processing
- possibly, an algorithmic delay introduced (on purpose) inside the signal processing scheme
(p. 232 - 234)
What is block processing?
What is sample-based processing?
Block processing is a popular data handling technique used inside many signal processing schemes. The idea is to transfer data and process this data in frames of P samples rather than on a sample-by-sample basis. Sample-based processing is in fact a special case of block processing with P = 1.
(p. 232 - 233)
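A minimal Python sketch of this frame-based data flow; the frame length P and the placeholder routine process_frame are assumptions for illustration, with P = 1 reducing to sample-based processing:

```python
import numpy as np

# Minimal sketch of block processing: data is transferred and processed in
# frames of P samples rather than sample by sample.
P = 64                              # frame length (assumed example value)

def process_frame(frame):
    return 2.0 * frame              # placeholder for the actual per-frame processing

x = np.random.randn(10 * P)         # test input whose length is a multiple of P
y = np.empty_like(x)
for start in range(0, len(x), P):   # one iteration per frame of P samples
    y[start:start + P] = process_frame(x[start:start + P])
```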
How long is the delay introduced by a block processing system with frame length P?
Make sure you understand figure 8.1.
Unfortunately, block processing inevitably introduces a delay, which is at least as large as the block length, also called the frame length P.
(p. 234)
What is in trade-off with the processing delay?
Typically, there is a trade-off between processing delay on the one hand, and computational complexity and/or algorithm performance on the other hand.
(p. 234)
Make sure that you understand figures 8.2 and 8.3, and that you can give the corresponding difference equation, transfer function, filter order, impulse response, frequency response and the poles and zeros.
You may furthermore be asked to discuss the stability and to calculate the number of arithmetic operations, the expected block processing delay and the amount of memory required for a real-time implementation.
(p. 235 - 237)
What is a multiply-accumulate operation? What is it used for?
Many DSP processors provide hardware-encoded so-called multiply-accumulate (MAC) instructions to efficiently perform such a combined multiplication and addition.
Hence, if MAC instructions can be used, the number of operations per second needed for the FIR filtering operation of equation 8.1 reduces to (N + 1)fs ops.
(p. 235 - 236)
What is a circular buffer?
What is it used for?
If a linear buffer is used, the first cell contains the current input x[k] and the last cell stores the least recent input sample x[k − N]. A disadvantage of this approach is that each time a new input sample arrives the entire delay line has to be updated (mass shift of data). This explains why often a circular buffer is preferred, where at each sample instant only the oldest sample x[k − N], which is not needed any further, is replaced with the newest input sample x[k + 1] and all other data in memory are left unchanged.
(p. 236)
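A minimal Python sketch of such a circular delay line for the FIR filter of equation 8.1; the coefficients and the helper name fir_step are assumed for illustration only:

```python
import numpy as np

# FIR delay line with a circular buffer: each new sample overwrites the oldest
# one, so no mass shift of data is needed.
b = np.array([0.25, 0.5, 0.25])          # filter coefficients b0 ... bN (example values)
N = len(b) - 1
buf = np.zeros(N + 1)                    # holds x[k], x[k-1], ..., x[k-N]
pos = 0                                  # index of the cell holding the newest sample

def fir_step(x_k):
    """Insert one input sample and return one output sample y[k]."""
    global pos
    pos = (pos + 1) % (N + 1)            # advance: the oldest cell becomes the newest
    buf[pos] = x_k                       # overwrite x[k-N] with the new input
    # y[k] = sum_n b[n] * x[k-n], reading the buffer backwards from 'pos'
    return sum(b[n] * buf[(pos - n) % (N + 1)] for n in range(N + 1))

y = [fir_step(x) for x in [1.0, 0.0, 0.0, 0.0]]   # impulse -> y == [0.25, 0.5, 0.25, 0.0]
```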
When are fast convolution techniques needed?
Be aware that FIR filtering operations often require a large number of calculations, and that hence cost-efficient implementation schemes based on fast convolution techniques are called for.
(p. 237)
What is a fast convolution scheme?
A fast convolution scheme is a block processing approach that implements an FIR filtering operation in the frequency domain by relying on the convolution theorem.
(p. 237)
Explain in words why only the last M −N circular convolution results are meaningful, and hence, why the inequality P ≤ M − N must hold.
It is in fact typical for a linear convolution operation that each time the index n of filter coefficient w[n] is increased by 1 (see equation 8.11), the data signal index k − n is decremented by 1. Observe from equations 8.6–8.10 that this is the case for ỹ^(r)[N] up to ỹ^(r)[M − 1], but not for ỹ^(r)[0] up to ỹ^(r)[N − 1], where a sudden increment of the data signal index by M − 1 occurs owing to the periodic repetition of the signal. Hence, although ỹ^(r)[0] up to ỹ^(r)[N − 1] are valid circular convolution results, they do not correspond to a linear convolution operation. They are therefore useless for the intended application, which is linear FIR filtering. In other words, the circular convolution w̃ ⊛ x̃^(r) results in a sequence ỹ^(r) of length M, of which only the last M − N entries are meaningful and correspond to a real FIR filtering operation.
During each iteration through the different processing steps P new data samples are input into the scheme, and hence P validly filtered data samples have to be delivered at the output. Only in this way does the input data rate match the output data rate, so that real-time operation can be ensured. As, of the M samples that equation 8.5 produces, only M − N are useful, i.e. correspond to a valid linear convolution, the mentioned inequality P ≤ M − N follows.
(p. 240)
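A quick numerical sanity check of this statement (the sizes are assumed example values), comparing an M-point circular convolution against the corresponding linear convolution:

```python
import numpy as np

# The M-point circular convolution of a length-(N + 1) filter w with a length-M
# block x reproduces the linear convolution only in its last M - N entries.
N, M = 3, 16
rng = np.random.default_rng(0)
w = rng.standard_normal(N + 1)
x = rng.standard_normal(M)

w_padded = np.r_[w, np.zeros(M - N - 1)]                 # zero-pad w to length M
y_circ = np.fft.ifft(np.fft.fft(w_padded) * np.fft.fft(x)).real
y_lin = np.convolve(w, x)                                # linear convolution

print(np.allclose(y_circ[N:], y_lin[N:M]))               # True: entries N ... M-1 are valid
print(np.allclose(y_circ[:N], y_lin[:N]))                # False: wrap-around corrupts the first N
```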
The 6 algorithmic steps that are shown on page 240 implement the overlap-save fast convolution scheme, and hence perform an FIR filtering operation in the frequency domain. Make sure you understand each of the 6 steps.
How many calculations are needed?
What are the memory requirements?
Give a practical application where the scheme can be used.
?
(p. 240 - 241)
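For reference, a minimal Python sketch of the overlap-save idea (the filter spectrum is computed once, each block reuses the last N input samples and keeps only the last M − N circular convolution results); the sizes and the helper name overlap_save are assumptions of this sketch, not the course's reference code:

```python
import numpy as np

def overlap_save(w, x, M=256):
    """Sketch of overlap-save FIR filtering; assumes len(x) is a multiple of P = M - N."""
    N = len(w) - 1
    P = M - N                                    # new input / valid output samples per block
    W = np.fft.fft(w, M)                         # filter spectrum, computed once
    buf = np.zeros(M)                            # last N old samples followed by P new ones
    y = np.zeros(len(x))
    for start in range(0, len(x), P):
        buf = np.concatenate([buf[-N:], x[start:start + P]])   # "save" the last N samples
        y_block = np.fft.ifft(W * np.fft.fft(buf)).real        # circular convolution
        y[start:start + P] = y_block[N:]         # keep only the last M - N (valid) results
    return y

# Check against a direct time-domain FIR filtering (example values)
rng = np.random.default_rng(1)
w = rng.standard_normal(32)                      # N = 31, so P = 256 - 31 = 225
x = rng.standard_normal(225 * 10)                # input length chosen as a multiple of P
assert np.allclose(overlap_save(w, x), np.convolve(w, x)[:len(x)])
```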
What is special about calculating the circular convolution in the frequency domain when the w and x are real valued?
Recall that only ⌊M/2⌋ + 1 out of the M products W̃[m] · X̃^(r)[m] need to be computed in case w and x are real-valued.
(p. 241)
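This is exactly what np.fft.rfft exploits; a two-line check with an assumed example length M:

```python
import numpy as np

# For real-valued input the spectrum is conjugate-symmetric, so only the first
# M//2 + 1 bins need to be computed and multiplied.
M = 16
x = np.random.randn(M)
print(len(np.fft.fft(x)))    # 16 complex bins
print(len(np.fft.rfft(x)))   # 9 = M//2 + 1 bins; the rest follow by symmetry
```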
Make sure you can make a cost estimate for the overlap-save fast convolution scheme and compare it with a standard FIR filtering operation. You are not supposed to prove or memorize the cost formulas for the M-point real-input FFT (2M log2 M − 4M + 6) and the real-output IFFT (2M log2 M) that you find on page 242. These formulas come from a book and were merely used to generate figure 8.5. They correspond to one particular implementation scheme for the real-input FFT and the real-output inverse FFT. There exist many other realizations with a different (yet comparable) computational complexity.
At the exam you will either be given the cost formulas (2M log2 M − 4M + 6 and 2M log2 M) or you are supposed to give a more simplified complexity estimate yourself. Recall from section 4.3 that a simplified yet realistic complexity estimate, both for the M-point (real-input) FFT and the M-point (real-output) IFFT, is M log2 M.
(p. 242)
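As an illustration of such a cost comparison, a small Python calculation using the simplified M log2 M estimate per (I)FFT; the filter length, FFT size and the count of roughly 6 real operations per complex spectral product are assumptions of this sketch, not figures from the text:

```python
from math import log2

# Rough cost comparison: direct (N+1)-tap FIR versus overlap-save with M-point FFTs.
# Per block of P = M - N outputs: one FFT + one IFFT (~ M log2 M each, filter FFT
# precomputed) plus about 6*(M//2 + 1) real operations for the complex products.
N = 1023                        # long filter: N + 1 = 1024 taps (assumed example)
M = 4096                        # FFT size (assumed example)
P = M - N                       # valid output samples per block

direct_per_sample = 2 * N + 1
fast_per_block = 2 * M * log2(M) + 6 * (M // 2 + 1)
fast_per_sample = fast_per_block / P

print(direct_per_sample)        # 2047 op. per output sample (time-domain FIR)
print(round(fast_per_sample))   # about 36 op. per output sample (overlap-save)
```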
What is the price to pay for the considerable cost reduction that is obtained with fast convolution?
Unfortunately, there is no such thing as a free lunch. Despite the remarkably low implementation cost for large N̄, the biggest disadvantage of the fast convolution approach is the increased processing delay compared to the standard implementation of figure 8.2.
In fact, as the fast convolution scheme is a block processing approach with frame length P, a delay of (up to) 2P samples is introduced (see section 8.1.5). For long filters w, the block processing delay 2P ≈ 2N̄ may be too large to qualify for a real-life application. Recall that the standard implementation of figure 8.2 typically introduces a block processing delay of 1 sample only.
(p. 243)
Comment on the different realization structures of IIR filters.
Keep in mind that there exist different realization structures for IIR filters. They all produce the same filter output y[k] (when disregarding possible round-off errors and finite-precision effects), but differ in terms of computational complexity, memory requirements, block processing delay and numerical stability (robustness to finite word length effects).
(p. 244)
Make sure that if you are given a block diagram of an IIR realization structure such as figures 8.6–8.11, you can understand and interpret the scheme, you recognize the type of realization structure (direct form I, direct form II, cascade or parallel) and you can give the corresponding difference equation, transfer function, filter order, frequency response and the poles and zeros.
Furthermore, you may be asked to discuss the stability and to calculate the number of arithmetic operations, the expected block processing delay and the amount of memory that is required for a real-time implementation. You are not supposed to draw block diagrams yourself.
(p. 244 - 249)
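To make such block diagrams concrete, a minimal Python sketch of a single direct form II second-order section (biquad) with assumed example coefficients; cascading such sections is one way to realize a higher-order IIR filter (a practical implementation could use scipy.signal.sosfilt instead):

```python
import numpy as np

# Direct form II biquad. Difference equations:
#   w[k] = x[k] - a1*w[k-1] - a2*w[k-2]        (recursive / feedback part)
#   y[k] = b0*w[k] + b1*w[k-1] + b2*w[k-2]     (non-recursive / feedforward part)
def biquad_df2(x, b0, b1, b2, a1, a2):
    w1 = w2 = 0.0                        # only two delay states are needed
    y = np.empty(len(x))
    for k, xk in enumerate(x):
        w0 = xk - a1 * w1 - a2 * w2
        y[k] = b0 * w0 + b1 * w1 + b2 * w2
        w2, w1 = w1, w0                  # shift the delay line
    return y

# Example coefficient values (hypothetical)
y = biquad_df2(np.random.randn(8), b0=0.2, b1=0.4, b2=0.2, a1=-0.5, a2=0.25)
```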
Why do finite word length effects show up in digital signal processing?
Finite word length effects tend to show up in digital signal processing schemes such as digital filters that are implemented on finite-precision hardware. They occur because data samples and algorithm parameters (e.g. filter coefficients) are represented by a finite number of bits and calculations are done using (fixed-point) finite-precision arithmetic.
(p. 248)
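As a small illustration (with assumed example values), rounding IIR denominator coefficients to a B-bit fixed-point grid already moves the realized poles:

```python
import numpy as np

# Coefficient quantization as one finite word length effect: the quantized
# denominator no longer has its poles exactly where they were designed.
B = 8                                        # number of fractional bits (assumed)
a = np.array([1.0, -1.8, 0.81])              # denominator with a double pole at z = 0.9
a_q = np.round(a * 2**B) / 2**B              # round to B fractional bits
print(np.roots(a))                           # designed poles: 0.9, 0.9
print(np.roots(a_q))                         # quantized poles have moved
```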