A Fixed-Point Introduction by Example

Christopher FeltonApril 25, 201117 comments

Introduction

The finite-word representation of fractional numbers is known as fixed-point.  Fixed-point is an interpretation of a 2's compliment number usually signed but not limited to sign representation.  It extends our finite-word length from a finite set of integers to a finite set of rational real numbers [1].  A fixed-point representation of a number consists of integer and fractional components.  The bit length is defined as:

$$
X_{Nbits} = X_{Integer Nbits} + X_{Fraction Nbits} + 1
$$


fp1

IWL is the integer word length, FWL is the fractional word length, and WL is the word length. For 2's compliment fixed-point representation $ WL = 1 + IWL + FWL $.  With this representation the range of the number is $ \left[ -2^{IWL}, 2^{IWL} \right) $ and a step size (resolution) of $ 2^{-FWL} $ [2].  The fixed-point number is defined by its format wl, iwl, fwl or its properties range, resolution, and bias.

As described, a fixed-point number is defined by its range and resolution, instead of the number of bits.  When designing a fixed-point system it is logical to use range and resolution requirements and algorithmically resolve the number of bits when it comes time to implement the design.  In other words, start with range and resolution when designing / specifying and move to the number of bits required when implementing.

Typically, a short hand is used to represent the format.  The Q-format is common when discussing fixed-point processors.  I will not use the Q-format because it is not as flexible and can be confusing with the notation used in older fixed-point processor documents [3].  Here I will use an explicit notation, W=(wl,iwl,fwl). This notation defines the number of integer bits, fraction bits, and the total bit length.  Example, W=(4,0,3) is a 4-bit number with three fractional bits and a sign bit (the sign bit is implicit).  All fixed-point numbers in this post will be 2's complement representation (sign bit is always present).

Why Use Fixed-Point

Why use fixed-point representation? Why not simply normalize all values to an integer range and deal with integers.  Depending on your point of view this might be obvious or not.  I have found many designers prefer to normalize all values to integers but this produces very unreadable code and documentation.  Fixed-point is normally used for convenience (similar to the use of $ \exp{j\omega} $ ).  Example, if I am looking at source code and I see:

c0 = fixbv(0.0032, min=-1, max=1, res=2**-15)
c1 = fixbv(-0.012, min=-1, max=1, res=2**-15)

I can easily relate this back to the coefficients of a filter but if I see the integer use only:

c0 = intbv(0x0069, min=-2**15, max=2**15)
c1 = intbv(-0x0189, min=-2**15, max=2**15)

additional information (documentation) is required.  The integer only is difficult to understand what the values are intended to represent.  In addition to the representation benefits, tools to handled rounding and overflow are common with fixed-point types.

Now that we have given the basic definition of a fixed-point binary word and given a couple reasons why fixed-point might be used. Let's take a look at some fractional values converted to a fixed-point type.  Table 1 is an example of
fixed-point representations.

Table 1: Fixed-Point Examples

Binary Hex Integer Floating Point Fraction Fixed-Point Fraction Actual
0100000000000000 4000 16384 0.50000000 0.50000000 1/2
0010000000000000 2000 8192 0.25000000 0.25000000 1/4
0001000000000000 1000 4096 0.12500000 0.12500000 1/8
0000100000000000 0800 2048 0.06250000 0.06250000 1/16
0000010000000000 0400 1024 0.03125000 0.03125000 1/32
0000001000000000 0200 512 0.01562500 0.01562500 1/64
0000000100000000 0100 256 0.00781250 0.00781250 1/128
0000000010000000 0080 128 0.00390625 0.00390625 1/256
0000000001000000 0040 64 0.00195312 0.00195312 1/512
0000000000100000 0020 32 0.00097656 0.00097656 1/1024
0000000000010000 0010 16 0.00048828 0.00048828 1/2048
0000000000001000 0008 8 0.00024414 0.00024414 1/4096
0000000000000100 0004 4 0.00012207 0.00012207 1/8192
0000000000000010 0002 2 0.00006104 0.00006104 1/16384
0000000000000001 0001 1 0.00003052 0.00003052 1/32768
0010101010101011 2AAB 10923 0.33333000 0.33334351 0.33333
0101101001111111 5A7F 23167 0.70700000 0.70700073 0.707
0000000000001010 000A 10 0.0003141592 0.00030518 0.0003141592
0000000000000011 0003 3 0.000086476908 0.00009155 0.000086476908

    b7    b6    b5    b4    b3     b2    b1    b0  
     S     F     F     F     F      F     F     F
     0     1     0     1     1      0     0     1
   +-1    1/2   1/4   1/8   1/16   1/32  1/64  1/128
  Value = + 1/2 + 1/8 + 1/16 + 1/128 which equals
        = + 0.5 + 0.125 + 0.0625 + 0.0078125
        = 0.6953125

The previous two examples illustrate how to represent a fractional value using fixed-point nomenclature. In general the value of the binary word is the following if the integer and fractional parts are treated as separate bit-vectors (i.e. don't share a bit index):

$$
{\displaystyle\sum_{k=0}^{IWL-1} b_{I}[k]2^k} + {\displaystyle\sum_{k=0}^{FWL-1}b_{F}[k]2^{-(k+1)}}
$$

In the above equations, it assumes the bit magnitude increases with the bit index.  If visualized the fractional word would be flipped relative to the depiction above. 

Multiplication and Addition Examples

The following are fixed-point examples for multiplication and addition.  Fixed-point subtraction can be calculated in a similar manner to a 2's complement subtraction (addition with a negative). The difference being the "point" bookkeeping required which is the same as addition.  For addition and subtraction the points need to be aligned before the operation (bookkeeping).  The "point" bookkeeping is usually calculated at design time and not run time (I say usually because there is the idea of block fixed-point but that is beyond the scope of this post).

Multiplication

Fixed-point multiplication is the same as 2's compliment multiplication but requires the position of the "point" to be determined after the multiplication to interpret the correct result.  The determination of the "point's" position is a design task.  The actual implementation does not know (or care) where the "point" is located.  This is true because the fixed-point multiplication is exactly the same as a 2's complemented multiplication, no special hardware is required.

The following is an example and illustrates $ 6.5625 (W=(8,3,4)) * 4.25 (W=(8,5,2)) $.

0110.1001  == 6.5625
000100.01  == 4.25
          01101001  
        x 00010001  
      ------------
          01101001
         00000000 
        00000000  
       00000000  
      01101001    
     00000000      
    00000000
   00000000 
 --------------------
  x000011011111001   ==  0000011011.111001  ==  27.890625

The number of bits required for the product (result) is the multiplicand's WL + the multiplier's WL ($WL_{multiplicand} + WL_{multiplier}$).  It is typical for the results of multiplication (and addition) to be resized and the number of bits reduced.  In fixed-point this makes intuitive sense because the smaller fractional bits are discarded and the value is rounded based on the discarded bits.  Additional hardware is commonly included to perform the rounding task and reduce the number of bits to a reasonable data-path bus width.

Addition

Addition is a little more complicated because the points need to be aligned before performing the addition.  Using the same numbers from the multiplication problem:

0110.1001  == 6.5625
000100.01  == 4.25
            0110.1001  
        + 000100.01  
        -------------
          001010.1101  ==  10.8125

When adding (subtracting) two numbers an additional bit is required for the result.  When adding more than two numbers all of the same $WL$ width, the number of bits required for the result is $WL + log_2(N)$ where $N$ is the number of elements being summed.

Conclusion

Fixed-point representation is convienent and useful when dealing with signal processing implementations.  This post is a basic introduction to fixed-point numbers.  For a more comprehensive coverage of the subject see the references for more information.

[1]: Yates, Randy, "Fixed-Point Arithmetic: An Introduction"

[2]: K.B. Cullen, G.C.M. Silvestre, and .J. Hurley, "Simulation Tools for Fixed Point DSP Algorithms and Architectures"

[3]: Texas Instruments Example

[4]: Ercegovac, Milos Lang, Thomas, "Digital Arithmetic"

[5]: Kuo, Lee, Tian, "Real-Time Digital Signal Processing

log:

01-Jul-2015: Fixed typo in the fixed-point binary word equation.
09-Apr-2013: Updated to use tuple format and use mathjax.
04-Apr-2011: Initial post.


Previous post by Christopher Felton:
   USB-FPGA : Introduction
Next post by Christopher Felton:
   MyHDL FPGA Tutorial I (LED Strobe)


Comments:

[ - ]
Comment by eryksunApril 25, 2011
The Qm.n notation is also common, where m is the number of 2s complement integer bits, excepting the sign bit, and n is the number of fractional bits. Thus Q15.16 is a 32-bit fixedpoint number with a range from -2^15 up to 2^15 - 2^-16.
[ - ]
Comment by cfeltonApril 26, 2011
@ eryksun Thanks for the comment. It is good to point out the Q-format but I think the Q-format has a couple pitfalls and I will try and capture the issues. The Q-format has been discussed on the comp.dsp newsgroup and I was convinced that there are better alternatives to define a fixed-point number. First, the Q-format is frequently used but not that well defined in textbooks. In your case you used Qm.n where m is the integer number of bits. But many texts omit the m and only use Qn, i.e. only define the number of fraction bits. A common example in textbooks is Q15*Q15 == Q30. Q30 is actually a 32-bit word. In your Qm.n version this would be Q1.30? I have seen this cause confusion because someone has been trained the Q30 way and is not sure what Q1.30 should be. I have found it easier to use a different (non Q-format) than try to unteach different versions of the Q-format. Second, with the Q-format it can be a little akward to define "point" past the bounds of the integer word length and fractional. There are 3 cases, (from the SystemC LRM) WL < IWL, 0 <= IWL <= WL, and IWL < 0. SystemC handles these cases by defining WL and IWL. VHDL fixed-point package handles this be defining positive and negative index. A fixed-point value can be defined as sfixed (23 downto 16); or sfixed(-7, -16). These are less common examples but still valid and illustrates flexibility of fixed-point.
[ - ]
Comment by kazApril 25, 2011
As an FPGA engineer focussed on dsp functions , I am used to signed 2's complement and always viewed the values on buses as integers. In fact I never needed to think of them as fractions but let truncations do the job. With today's tools I don't have much option but to accept changing my view to fractional notation. It is neat after all and also fits modelling in software nicely away from the scaling effect of integer perspective. Finally just a trivia; the range for negative numbers is one extra deep, e.g. 8 bits have the range -128 ~ +127. Symmetric saturation is commonly used to avoid dc bias by forcing values to swing across -127 ~ +127 Truncation is nowadays more involved and includes non rounding or dc biased rounding or dc unbiassed rounding. kadhiem Ayob
[ - ]
Comment by cfeltonApril 26, 2011
@ kaz Yes, I have worked with many Engineers that were comfortable only using integers. In most cases they were limited to just implementation. They would be given a DSP design and normalize the design to integers. For someone beginning with non floating-point implementations I think it is beneficial for then to use fixed-point and focus on the range and resolution when designing the DSP functions. The bit-widths will fall out from the range and resolution design. This can take some iteration though, when designing the signal processing system blocks to the requirements using range and resolution you might find that there are not enough resources to meet the design. At that point the design is revisited and appropriate trade-offs are made. Thank you for pointing out there are many items to consider. In my experience the DC unbiased rounding is not an issue because the design is unbiased. The "extra" value in the negative range is unused. But as you point out, if someone is simply truncating you need to think about how to truncate. Simply truncating to the max and min can have unwanted effects. Thanks again for the points.
[ - ]
Comment by taksMay 9, 2011
Hey Chris, you're blogging now! Anyway, I just wanted to point out that I agree with your comments regarding Q format. I have always seen it as Q15 or Q30 (or similar) and rarely with the Qm.n. I have, however, implemented the latter just being careful to point out to my VHDL guy how the decimals propagate (I don't personally do VHDL/Verilog.) Mark (CO Springs)
[ - ]
Comment by cfeltonMay 10, 2011
Hey Mark, Thanks for the comments. Hope all is well!
[ - ]
Comment by cfeltonMay 17, 2011
This link also has a couple comments on the Q-notation and some issues with it, fx notation Vissim uses, fx notation, which is, (integer word-length, word-length). Because some earlier uses of "Q" notation had the hidden (implicit) word-lengths. Notation that specifies the word-length are more useful.
[ - ]
Comment by jselvaJuly 29, 2011
What is the complexity of floating-point relative to fixed-point? Do you have any reference on this? Thanks.
[ - ]
Comment by cfeltonAugust 2, 2011
@jselva Are you referring to hardware implementation or microcontroller/cpu implementations? In digital design (ASIC/FPGA) floating-point is considered much more complex. It requires more hardware and control to implement floating-point vs. fixed-point. On a processor, if a floating-point unit is available its probably less complex, from a coding point of view, to use floating-point.
[ - ]
Comment by jselvaAugust 3, 2011
@cfelton I am referring to ASIC/FPGA and to the cost in terms of surface or power consumption. Floating-point is more complex than fixed-point from a generic point of view, though I do not see that much difference in complexity. Perhaps the main problem with floating-point is that the design is difficult. Jesus.
[ - ]
Comment by cfeltonAugust 4, 2011
@jselva A floating-point operation will be greater in both physical area and power, usually considered quite a bit more. But I don't have any quantitative values for a comparison of a single precision floating-point versus 32bit integer. I have never designed a floating-point circuit but when I have seen designs for a floating-point vs. integer multiplication the floating-point is much more complicated. I am a little confused by your statement "floating-point is more *complex* than fixed-point ... though I do not see than much difference in *complexity*" Integer operations are fairly straight-forward when compared to floating-point, here we are taking about implementation of the operations (versus design/using). When implementing floating-point there is much more that needs to be handled, you have to do the operations on the mantissa and exponent and take care when they modify each other. The mantissa is constantly being scaled based on the modifications to the exponent etc.
[ - ]
Comment by jselvaAugust 5, 2011
@cfelton I see now what you mean. Thanks for your comments!
[ - ]
Comment by dsppMay 24, 2013
HI!
it is a very interesting article. i have some fixed point value from a DSP. they are conerted in 16 bit hexa. for exemple:(surprised:x"FFED, 0xFFE7,0x 001A 0x002D ). is there any easy trick to convert them to double value?

[ - ]
Comment by harishmMay 19, 2015
wrong in your equation "?k=0IWL?1bI[k]2k+?k=0FWL?1bF[k]2?k "

for fractional part k starts from 1, i.e (?k=-1FWL?1bF[k]2?k )
[ - ]
Comment by cfeltonMay 19, 2015
@harishm,

Thanks, yes you correct the equation is incorrect. Simply change `k` to start at 1 doesn't completely fix the issue. Need to define how the "fraction" word bits are indexed, there still would be a 0 indexed bit. The equation will be updated ASAP.
[ - ]
Comment by briskerOctober 7, 2015
hey hi,
nice article...really helpful.
Can somebody please tell how to convert very small value like for eg: 0.000000058323780584287955 to fixed point???

thank you
[ - ]
Comment by baid90July 14, 2016
This is en example code from book Digital Signal Processing Using MATLAB 3rd Edition-Slicer

function [y,L,B] = QCoeff(x,N)
% [y,L,B] = QCoeff(x,N)
% Coefficient Quantization using N=1+L+B bit Representation
% with Rounding operation
% y: quantized array (same dim as x)
% L: number of integer bits
% B: number of fractional bits
% x: a scalar, vector, or matrix
% N: total number of bits
xm = abs(x);
L = max(max(0,fix(log2(xm(:)+eps)+1))); % Integer bits
if (L > N)
errmsg = [’ *** N must be at least ’,num2str(L),’ ***’]; error(errmsg);
end
B = N-L; % Fractional bits
y = xm./(2^L); y = round(y.*(2^N)); % Rounding to N bits
y = sign(x).*y*(2^(-B)); % L+B+1 bit representation

To post reply to a comment, click on the 'reply' button attached to each comment. To post a new comment (not a reply to a comment) check out the 'Write a Comment' tab at the top of the comments.

Registering will allow you to participate to the forums on ALL the related sites and give you access to all pdf downloads.

Sign up
or Sign in