2.5 Finite wordlength effects
FIR filters can be realized in either hardware or software. Regardless of which realization is used, a problem known as the finite wordlength effect exists in both cases. One of the objectives when designing filters is to lessen the finite wordlength effects as much as possible while still satisfying the initial requirements (the filter specifications).
In a software filter implementation, either fixed-point or floating-point arithmetic can be used. Both number representations have advantages as well as disadvantages.
The fixed-point representation is used for storing coefficients and samples in memory. The most commonly used fixed-point format reserves one bit for the sign of the number, i.e. 0 denotes a positive and 1 a negative number, while the remaining bits represent its value. This format is mostly used to represent numbers in the range -1 to +1. Numbers represented in the fixed-point format are equidistantly quantized with the quantization step 1/2^(N-1), where N is the number of bits used for storing the value. Since one bit is the sign bit, N-1 bits are available for value quantization. The maximum error that may occur during quantization is half the quantization step, that is 1/2^N. Note that accuracy increases as the number of bits increases. Table 2-5-1 shows the quantization steps and the maximum errors made due to the quantization process in the fixed-point representation.
| BIT NUMBER | RANGE OF NUMBERS | QUANTIZATION STEP | MAX. QUANTIZATION ERROR | NUMBER OF EXACT DECIMAL DIGITS |
|---|---|---|---|---|
| 4 | (-1, +1) | 0.125 | 0.0625 | 1 |
| 8 | (-1, +1) | 0.0078125 | 0.00390625 | 2 |
| 16 | (-1, +1) | 3.0517578125×10^-5 | 1.52587890625×10^-5 | 4 |
| 32 | (-1, +1) | 4.6566128730774×10^-10 | 2.3283064365387×10^-10 | 9 |
| 64 | (-1, +1) | 1.0842021724855×10^-19 | 5.4210108624275×10^-20 | 19 |

Table 2-5-1. Quantization of numbers represented in the fixed-point format
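The quantization rule described above is easy to sketch in a few lines of Python (a minimal illustration; the helper name `quantize_fixed` is ours, not from the text):

```python
# Quantize a value to N-bit fixed-point: 1 sign bit, N-1 bits for the value,
# giving a quantization step of 1/2^(N-1) over the range -1 to +1.
def quantize_fixed(x, n_bits):
    step = 2.0 ** -(n_bits - 1)
    return round(x / step) * step

# Reproduce a few rows of Table 2-5-1: step and maximum error (= step/2).
for n in (4, 8, 16):
    step = 2.0 ** -(n - 1)
    print(n, step, step / 2)
```

Rounding to the nearest step, rather than truncating, keeps the maximum error at half a quantization step.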
An advantage of this representation is that the quantization errors have zero mean, so they do not accumulate through operations performed on fixed-point numbers. A disadvantage is the lower accuracy of coefficient representation. The difference between the actual value and the quantized value, i.e. the quantization error, becomes smaller as the quantization step decreases; with a sufficiently small step, the effects of the quantization error are negligible.
Floating-point arithmetic stores values with better accuracy thanks to the dynamic range it is based on. Floating-point representations cover a much wider range of numbers and enable an appropriate number of digits to be stored faithfully. The value normally consists of three parts. The first part is, as in the fixed-point format, a single bit known as the sign bit. The second part is the mantissa M, which is the fractional part of the number, and the third part is the exponent E, which can be either positive or negative. A number in the floating-point format thus reads:

x = (-1)^S · M · 2^E

where S is the sign bit, M is the mantissa and E is the exponent.
As can be seen, the sign bit together with the mantissa forms a fixed-point number. The third part, the exponent, gives the floating-point representation its dynamic range, which enables both extremely large and extremely small numbers to be stored with appropriate accuracy; such numbers cannot be represented in the fixed-point format. Table 2-5-2 below provides basic information on the floating-point representation for two different wordlengths.
| BIT NUMBER | MANTISSA SIZE | EXPONENT SIZE | RANGE | NUMBER OF EXACT DECIMAL DIGITS |
|---|---|---|---|---|
| 16 | 7 | 8 | 2.3×10^-38 .. 3.4×10^38 | 2 |
| 32 | 23 | 8 | 1.4×10^-45 .. 3.4×10^38 | 6-7 |

Table 2-5-2. Quantization of numbers represented in the floating-point format
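The 32-bit row of Table 2-5-2 can be checked directly in Python by round-tripping a double through IEEE-754 single precision (a sketch: Python's standard `struct` module gives the 23-bit-mantissa format of the second row; the 16-bit format of the first row is not a built-in Python type):

```python
import struct

# Round a double to the nearest IEEE-754 single-precision (32-bit) float.
def quantize_float32(x):
    return struct.unpack('<f', struct.pack('<f', x))[0]

x = 0.4
xq = quantize_float32(x)
print(xq, abs(x - xq))   # error on the order of 10^-9, i.e. 6-7 exact digits
```

Note that the quantization error depends on the magnitude of the value being stored, which is exactly the point made below.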
It is not possible to specify a single quantization step for the floating-point representation, as it depends on the exponent. The exponent varies in such a way that the quantization step is as small as possible for the given magnitude. In this representation, attention should instead be paid to the number of digits that are stored without error.
Floating-point arithmetic is well suited to coefficient representation; the errors made are considerably smaller than in fixed-point arithmetic. Its disadvantages are a more complex implementation and errors that do not average out to zero. The problem is most obvious when an operation is performed on two values of which one is much smaller than the other.
Example
FIR filter coefficients:
{0.151365, 0.400000, 0.151365}
The coefficients need to be represented as 16-bit numbers in the fixed-point and floating-point formats. If we suppose that the numbers range between -1 and +1, the maximum quantization error amounts to 1/2^16 = 0.0000152587890625. After quantization, the filter coefficients have the following values:
{0.1513671875, 0.399993896484375, 0.1513671875}
The absolute quantization errors are:
{0.0000021875, 0.000006103515625, 0.0000021875}
If the filter coefficients are represented in the floating-point format, no single quantization step can be specified. In this case, the coefficients take the following values:
{0.151364997029305, 0.400000005960464, 0.151364997029305}
The absolute quantization errors produced while representing the coefficients as 16-bit numbers in the floating-point format are:
{0.000000002970695, 0.000000005960464, 0.000000002970695}
As can be seen, the coefficient error is smaller in the floating-point representation.
Floating-point arithmetic can also be emulated in terms of fixed-point arithmetic. For this reason, fixed-point arithmetic is more often implemented in digital signal processors.
The finite wordlength effect manifests as a deviation of the FIR filter characteristic. If the deviated characteristic still meets the filter specifications, the finite wordlength effects are negligible. Because of the greater error in coefficient representation, the finite wordlength effects are more prominent in fixed-point arithmetic.
These effects are more prominent for IIR filters, because of their feedback, than for FIR filters. In addition, coefficient quantization can cause an IIR filter to become unstable, whereas it cannot affect FIR filters that way.
FIR filters keep their linear phase characteristic after quantization. The reason is that the coefficients of a linear-phase FIR filter are symmetric, so the corresponding pairs of coefficients are quantized to the same value, and the symmetry of the impulse response remains unchanged.
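The symmetry argument is easy to verify: quantization is a deterministic function of the coefficient value, so equal coefficients quantize to equal values. A small sketch, using a hypothetical 16-bit fixed-point rounding helper of our own:

```python
def quantize_fixed(x, n_bits=16):
    # round to the nearest multiple of the step 1/2^(N-1)
    step = 2.0 ** -(n_bits - 1)
    return round(x / step) * step

b = [0.151365, 0.4, 0.151365]            # symmetric (linear-phase) coefficients
bq = [quantize_fixed(x) for x in b]
print(bq == bq[::-1])                     # True: symmetry survives quantization
```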
After all that has been said, it is easy to see that the finite word length used for representing coefficients and the samples being processed causes problems such as:

- coefficient quantization errors;
- sample quantization errors (quantization noise); and
- overflow errors.
2.5.1 Coefficient Quantization
Coefficient quantization results in the FIR filter changing its transfer function. The positions of the FIR filter zeros change, whereas the positions of its poles remain unchanged, as they are all located at z = 0 and quantization has no effect on them. The conclusion is that quantization of FIR filter coefficients cannot cause the filter to become unstable, as can happen with IIR filters.
Even though there is no danger of FIR filter destabilization, the transfer function may deviate to such an extent that it no longer meets the specifications, which means that the resulting filter is not suitable for the intended application.
The FIR filter coefficient quantization errors cause the stopband attenuation to decrease. If it drops below the limit defined by the specifications, the resulting filter is useless.
Transfer function changes occurring due to coefficient quantization are more pronounced for high-order filters. The reason is that the spacing between the zeros of the transfer function gets smaller as the filter order increases, so even slight changes of zero positions affect the FIR filter frequency response.
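The effect on the frequency response can be observed directly by evaluating H(e^jw) = sum of b_k·e^(-jwk) before and after quantization. A sketch using the 3-tap coefficients from the earlier example, deliberately quantized to a coarse 8 bits so the deviation is visible (helper names are ours):

```python
import cmath

def freq_response(b, w):
    # H(e^{jw}) = sum_k b[k] * e^{-jwk}
    return sum(bk * cmath.exp(-1j * w * k) for k, bk in enumerate(b))

def quantize_fixed(x, n_bits):
    step = 2.0 ** -(n_bits - 1)
    return round(x / step) * step

b  = [0.151365, 0.4, 0.151365]
bq = [quantize_fixed(x, 8) for x in b]    # coarse 8-bit quantization

# The magnitude at w = pi (deep in the stopband of this low-pass example)
# shifts after quantization.
print(abs(freq_response(b, cmath.pi)), abs(freq_response(bq, cmath.pi)))
```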
2.5.2 Sample Quantization
Another problem caused by the finite word length is the quantization of samples performed at the multiplier output (after filtering). The filtering process can be represented as a sum of multiplications of filter coefficients by the signal samples appearing at the filter input. Figure 2-5-1 illustrates the block diagram of input signal filtering and quantization of the result.
Figure 2-5-1. Signal samples filtering
Multiplication of two numbers, each N bits in length, gives a product which is 2N bits in length. These extra N bits are not needed, so the product has to be truncated or rounded off to N bits, producing truncation or round-off errors. The latter is preferred in practice because, with rounding, the mean value of the quantization error (quantization noise) is equal to 0.
In most cases, hardware used for FIR filter realization is designed so that after each individual multiplication a partial sum is accumulated in a register which is 2N bits in length. The result is quantized to N bits only once the filtering process ends, so quantization noise is introduced only once and is thus drastically reduced.
Quantization noise depends on the number of bits N; it is reduced as the number of bits used for sample and coefficient representation increases. Both the filter realization and the positions of the poles affect the quantization noise power. As all FIR filter poles are located at z = 0, the effect of the filter realization on the quantization noise is almost negligible.
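The two accumulation strategies can be contrasted in a short sketch. Python doubles stand in for the wide 2N-bit accumulator here, so this only illustrates the idea, not bit-exact hardware behavior; the helper names are ours:

```python
def quantize_fixed(x, n_bits=16):
    step = 2.0 ** -(n_bits - 1)
    return round(x / step) * step

def fir_round_each(b, x, n_bits=16):
    # Naive approach: round every product to N bits before accumulating,
    # injecting quantization noise at every tap.
    acc = 0.0
    for bk, xk in zip(b, x):
        acc += quantize_fixed(bk * xk, n_bits)
    return quantize_fixed(acc, n_bits)

def fir_wide_accumulator(b, x, n_bits=16):
    # Hardware-style approach: keep full-precision products in a wide
    # accumulator and quantize only the final sum, once.
    acc = sum(bk * xk for bk, xk in zip(b, x))
    return quantize_fixed(acc, n_bits)
```

With the wide accumulator, the output differs from the exact sum by at most one final rounding step, instead of one per tap.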
2.5.3 Overflow
Overflow happens when some intermediate result exceeds the range of numbers that can be represented with the given wordlength. In fixed-point arithmetic, coefficient and sample values are represented in the range -1 to +1. Even though both the FIR filter input and output samples lie within this range, there is a possibility that an overflow occurs at some point when the products are added together, i.e. an intermediate result is greater than +1 or less than -1.
Example:
Assume that input samples need to be filtered using a second-order filter.
Such a filter has three coefficients: {0.7, 0.8, 0.7}.
Input samples are: { ..., 0.9, 0.7, 0.1, ...}
By analyzing the steps of the input-sample filtering process, shown in Table 2-5-3 below, it is easy to understand how an overflow occurs in the second step: the partial sum exceeds +1.
| FILTER COEFFICIENTS | INPUT SAMPLE | INTERMEDIATE RESULT |
|---|---|---|
| 0.7 | 0.9 | 0.63 |
| 0.8 | 0.7 | 0.63 + 0.56 = 1.19 |
| 0.7 | 0.1 | 1.19 + 0.07 = 1.26 |

Table 2-5-3. Overflow
As the range of values defined by the fixed-point representation is between -1 and +1, the results of the filtering process will be as shown in Table 2-5-4.
| FILTER COEFFICIENTS | INPUT SAMPLE | INTERMEDIATE RESULT |
|---|---|---|
| 0.7 | 0.9 | 0.63 |
| 0.8 | 0.7 | 0.63 + 0.56 - 2 = -0.81 |
| 0.7 | 0.1 | -0.81 + 0.07 = -0.74 |

Table 2-5-4. Overflow effects
As mentioned, an overflow occurs in the second step. Instead of the desired value +1.19, the result is an undesirable negative value, -0.81. The difference of 2 between these two values is explained in Figure 2-5-2 below.
Figure 2-5-2. Overflow effect
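The two's-complement wraparound behind this difference of 2 can be mimicked with a small helper that folds any value back into the range -1 to +1 (a sketch; real hardware wraps the bit pattern, here we simply add or subtract 2):

```python
def wrap(x):
    # Fold x back into [-1, +1), mimicking two's-complement overflow
    # of a fixed-point accumulator.
    while x >= 1.0:
        x -= 2.0
    while x < -1.0:
        x += 2.0
    return x

acc = 0.0
for product in (0.63, 0.56, 0.07):   # the products from Table 2-5-3
    acc = wrap(acc + product)
print(round(acc, 2))                  # -0.74 instead of the desired 1.26
```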
However, if some intermediate result exceeds the range of the representation, it does not necessarily cause an overflow in the final result. As long as the final result fits within the wordlength, i.e. its absolute value is less than 1, overflows in partial results do not matter. This situation is illustrated in the following example.
Example:
The second-order filter has three coefficients: {0.7, 0.8, 0.7}
Input samples are: { ..., 0.9, 0.7, -0.5, ... }
The desired intermediate results are given in Table 2-5-5.
| FILTER COEFFICIENTS | INPUT SAMPLE | INTERMEDIATE RESULT |
|---|---|---|
| 0.7 | 0.9 | 0.63 |
| 0.8 | 0.7 | 0.63 + 0.56 = 1.19 |
| 0.7 | -0.5 | 1.19 - 0.35 = 0.84 |

Table 2-5-5. Desired intermediate results
As can be seen, some intermediate results exceed the given range and two overflows occur; refer to Table 2-5-6 below.
| FILTER COEFFICIENTS | INPUT SAMPLE | INTERMEDIATE RESULT |
|---|---|---|
| 0.7 | 0.9 | 0.63 |
| 0.8 | 0.7 | 0.63 + 0.56 - 2 = -0.81 |
| 0.7 | -0.5 | -0.81 - 0.35 + 2 = 0.84 |

Table 2-5-6. Obtained intermediate results
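The cancellation of the two overflows can be reproduced with a small wraparound helper (a sketch mimicking two's-complement overflow by adding or subtracting 2):

```python
def wrap(x):
    # Fold x back into [-1, +1), mimicking two's-complement overflow.
    while x >= 1.0:
        x -= 2.0
    while x < -1.0:
        x += 2.0
    return x

acc = 0.0
for product in (0.63, 0.56, -0.35):   # the products from Table 2-5-5
    acc = wrap(acc + product)
print(round(acc, 2))                   # 0.84: the two overflows annul each other
```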
So, in spite of the fact that two overflows have occurred, the final result remains unchanged. The reason for this is the nature of these two overflows: the first decremented the final result by 2, whereas the second incremented it by 2, so the overflow effect is annulled. The first is called a positive overflow, the latter a negative overflow.
Note:
If the number of positive overflows is equal to the number of negative overflows, the final result will not be changed, i.e. the overflow effect is annulled.
Overflow causes rapid oscillations in the output signal, which in turn cause high-frequency components to appear in the output spectrum. There are several ways to lessen the overflow effects; the two most commonly used are scaling and saturation.
It is possible to scale the FIR filter coefficients to avoid overflow. A necessary and sufficient condition on the FIR filter coefficients in this case is given by the following expression:

|b0| + |b1| + ... + |b(N-1)| ≤ 1

where bk are the FIR filter coefficients and N is the number of filter coefficients.
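The scaling condition is straightforward to check and enforce in code (a sketch; dividing by the coefficient magnitude sum is one possible scaling choice, not one prescribed by the text, and the helper name is ours):

```python
def scaled_for_no_overflow(b):
    # Condition against overflow: sum of |b_k| <= 1 guarantees that no
    # intermediate sum can leave [-1, +1] for input samples in [-1, +1].
    s = sum(abs(bk) for bk in b)
    return [bk / s for bk in b] if s > 1.0 else list(b)

b = [0.7, 0.8, 0.7]                    # magnitude sum 2.2 -> overflow possible
bs = scaled_for_no_overflow(b)
print(bs, sum(abs(c) for c in bs))     # scaled set with magnitude sum 1
```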
If, for any reason, it is not possible to apply scaling, then the overflow effects can be lessened to some extent via saturation. Figure 2-5-3 illustrates the saturation characteristic.
Figure 2-5-3. Saturation characteristic
When the saturation characteristic is used to prevent overflow, the intermediate result does not change its sign. For this reason, the oscillations in the output signal are less rapid and the undesirable high-frequency components are attenuated.
Let’s see what happens if we apply the saturation characteristic to the previous example:
Example
Again, input samples need to be filtered using a second-order filter.
Such a filter has three coefficients: {0.7, 0.8, 0.7}
Input samples are: { ..., 0.9, 0.7, 0.1, ... }
The desired intermediate results are shown in Table 2-5-7 below.
| FILTER COEFFICIENTS | INPUT SAMPLE | INTERMEDIATE RESULT |
|---|---|---|
| 0.7 | 0.9 | 0.63 |
| 0.8 | 0.7 | 0.63 + 0.56 = 1.19 |
| 0.7 | 0.1 | 1.19 + 0.07 = 1.26 |

Table 2-5-7. Desired intermediate results
As the range of values defined by the fixed-point representation is between -1 and +1, and the saturation characteristic is used as well, the intermediate results are as shown in Table 2-5-8.
| FILTER COEFFICIENTS | INPUT SAMPLE | INTERMEDIATE RESULT |
|---|---|---|
| 0.7 | 0.9 | 0.63 |
| 0.8 | 0.7 | 0.63 + 0.56 = 1 |
| 0.7 | 0.1 | 1 + 0.07 = 1 |

Table 2-5-8. Intermediate results with the saturation characteristic
The resulting sum is not correct, but the error is far smaller than when there is no saturation:

Without saturation: Δ = 1.26 - (-0.74) = 2.00
With saturation: Δ = 1.26 - 1 = 0.26
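The saturating accumulation in Table 2-5-8 can be sketched with a clamping helper (a minimal illustration; the function name is ours):

```python
def saturate(x):
    # Clamp to the representable fixed-point range [-1, +1].
    return max(-1.0, min(1.0, x))

acc = 0.0
for product in (0.63, 0.56, 0.07):   # the products from Table 2-5-7
    acc = saturate(acc + product)
print(acc)                            # 1.0: an error of only 0.26 vs. 1.26
```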
As seen from the example above, the saturation characteristic lessens the overflow effect and attenuates the undesirable components in the output spectrum.