Understanding Floating-Point Values

Floating-point types are often used in programs to hold real numbers. However, there is a lot of misunderstanding about how they are implemented on a digital system, resulting in code that does not work as expected.

If you work out the fraction 4/3 by hand in decimal form, you will (hopefully) arrive at the result 1.33333333333 (recurring). A pocket calculator gives the same result, displayed to as many digits as the device's display allows. So, it can come as quite a shock to discover that assigning this same fraction to a floating-point variable in your C program might store the value as 1.33333337307, a seemingly poor approximation to the correct value. The critical point to remember when using floating-point values on a digital system is that, although they are intended to represent any value to any number of significant digits, these values are stored using a finite number of bits and hence have limitations. Unlike integers, which are limited only in their range, floating-point numbers are also limited in their precision.
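
The effect is easy to reproduce. The following is a minimal sketch, assuming the compiler's float type is the 32-bit IEEE single-precision format; the function names are only illustrative.

```c
#include <stdio.h>

/* Returns 4/3 as it is stored in a float
   (assumed here to be 32-bit IEEE single precision). */
float four_thirds(void)
{
    /* volatile forces the quotient to be rounded to float width */
    volatile float result = 4.0f / 3.0f;
    return result;
}

/* Prints the stored value to 11 decimal places. */
void print_four_thirds(void)
{
    printf("%.11f\n", four_thirds());
}
```

With a 32-bit float, calling print_four_thirds() prints 1.33333337307 rather than 1.33333333333.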

Mantissa and Exponent Components

Digitally stored floating-point numbers consist of two components: the mantissa (or significand) and the exponent, with the value given by the following expression:

value = mantissa × base^exponent

In mathematical calculations, the base is often 10, but when these numbers are stored digitally, the base is usually two. Exactly how these components are encoded is not important for this discussion, but the number of bits allocated to each component affects the range and precision of the values that can be represented. Using more bits to encode the mantissa allows a value to be represented more precisely; using more bits for the exponent allows a larger range of values to be represented.

Floating-Point Values in Compilers

Compilers commonly use an IEEE 754 representation for floating-point values. This standard specifies the number of bits to use for the mantissa and exponent, and it defines several formats. The single-precision format uses a total of 32 bits to represent a floating-point number and consists of eight bits of exponent and 24 bits of signed mantissa (one sign bit and 23 stored mantissa bits). This format is supported by all three MPLAB® XC compilers. The double-precision format uses a total of 64 bits to store a floating-point number and consists of 11 bits of exponent and 53 bits of signed mantissa (one sign bit and 52 stored mantissa bits). This format is supported by MPLAB XC16 and XC32. By default, the XC8 compiler uses a 24-bit floating-point format that is a truncated form of the 32-bit format; it has eight bits of exponent but only 16 bits of signed mantissa (one sign bit and 15 stored mantissa bits).
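
If you are unsure which format your compiler is using, the standard <float.h> macros report it. The following is a small sketch; print_float_format is an illustrative name. Note that FLT_MANT_DIG and DBL_MANT_DIG count the implicit leading one bit of the mantissa rather than the sign bit, so an IEEE single-precision float reports 24 and an IEEE double reports 53.

```c
#include <float.h>
#include <limits.h>
#include <stdio.h>

/* Prints the base, storage size and mantissa precision
   that the compiler uses for float and double. */
void print_float_format(void)
{
    printf("base:                 %d\n", FLT_RADIX);
    printf("float size (bits):    %d\n", (int)(sizeof(float) * CHAR_BIT));
    printf("float mantissa bits:  %d\n", FLT_MANT_DIG);
    printf("double size (bits):   %d\n", (int)(sizeof(double) * CHAR_BIT));
    printf("double mantissa bits: %d\n", DBL_MANT_DIG);
}
```

The values printed depend on the compiler and its options, so this is best run on the target toolchain itself.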

The larger IEEE formats can handle precise numbers that cover a large range of values. However, these formats require more data memory to store values of this type, and the library routines that process these values are very large and slow. Floating-point calculations are always much slower than integer calculations and should be avoided if at all possible, especially if you are using an 8-bit device. This page indicates one alternative you might consider.

Rounding

If you do use floating-point values in your code, remember that even floating-point constants are rounded. For example, if you are using a 32-bit floating-point format and you assign the value 128.0 to a variable of that type, the value can be represented exactly. However, the next larger value that the variable can hold is 128.000015259. If you try to assign a value that falls between these two, it will be rounded up or down. The larger the magnitude of a value, the more widely spaced the exactly representable values become. So, the value 8000000 can be represented exactly by a 32-bit floating-point type, but the next highest representable value is 8000000.5.
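
The spacing between neighboring values can be checked directly. The following is a minimal sketch, assuming float is a 32-bit IEEE single-precision type; next_float_up is a hypothetical helper that is valid only for positive, finite values.

```c
#include <stdint.h>
#include <string.h>

/* Returns the next representable float above a positive, finite value,
   assuming float is a 32-bit IEEE single-precision type. */
float next_float_up(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);   /* copy out the bit pattern */
    bits += 1u;                           /* step to the adjacent encoding */
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

Under that assumption, next_float_up(128.0f) returns 128.0000152587890625 and next_float_up(8000000.0f) returns 8000000.5.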

The following table shows the 32-bit examples we have seen above and their bit sequences that are used to represent those values. These sequences have been split into the exponent and mantissa, so you can see how the similar values differ only by one bit in the mantissa. Again, how these components are encoded is not important when it comes to using these values in a program.

Value            Bit Sequence    Mantissa    Exponent
128.0            0x43000000      0x000000    0x07
128.000015259    0x43000001      0x000001    0x07
8000000.0        0x4af42400      0x742400    0x16
8000000.5        0x4af42401      0x742401    0x16
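
The columns of this table can be reproduced on any system whose float is a 32-bit IEEE single-precision type. The sketch below is illustrative only: FloatFields and split_float are hypothetical names, and the exponent is shown with the IEEE bias of 127 removed, which is how it appears in the table.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t bits;      /* full 32-bit encoding */
    uint32_t mantissa;  /* the 23 stored mantissa bits */
    int      exponent;  /* exponent with the bias of 127 removed */
} FloatFields;

/* Splits a normal (not zero, subnormal, infinite or NaN) float
   into the fields shown in the table. */
FloatFields split_float(float value)
{
    FloatFields fields;
    memcpy(&fields.bits, &value, sizeof fields.bits);
    fields.mantissa = fields.bits & 0x7FFFFFu;
    fields.exponent = (int)((fields.bits >> 23) & 0xFFu) - 127;
    return fields;
}
```

For example, split_float(8000000.5f) yields bits 0x4af42401, mantissa 0x742401 and exponent 22 (0x16).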

Consider the following code which assigns one value to a floating-point variable, then immediately compares that variable with a different value. Although the two floating-point constants that appear in the source code are different, they compare as equal when this code is executed.
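
A minimal sketch of such code, assuming a 32-bit IEEE float, might look like the following. The function name and the constant 8000000.2 are chosen here for illustration. The f suffix matters: without it, the constant has type double, and on a compiler with a 64-bit double the comparison would then be made at the higher precision.

```c
#include <stdbool.h>

/* Assigns one constant, then compares against a different one.
   With a 32-bit float, both constants round to 8000000.0. */
bool different_constants_compare_equal(void)
{
    volatile float myFloat = 8000000.0f;   /* exactly representable */

    if (myFloat == 8000000.2f) {           /* 8000000.2 rounds to 8000000.0 */
        return true;
    }
    return false;
}
```

The function returns true because 8000000.2 lies between 8000000.0 and 8000000.5 and is rounded to the nearer of the two.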

If you are using the 24-bit format with MPLAB XC8, you lose precision and the discrepancy between the intended and actual values can become large. The execution of the function call in the following example would make it seem like something has gone wrong but this is just a manifestation of rounding.
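
The size of this effect can be estimated on a PC with the sketch below. It is not XC8 code: truncate_to_24_bits is a hypothetical helper that assumes a 32-bit IEEE float and imitates the 24-bit format by discarding the low eight mantissa bits. The real compiler might round rather than truncate, so treat the results as approximate.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: keeps only the upper 24 bits of a 32-bit float
   encoding, imitating storage in a truncated 24-bit format. */
float truncate_to_24_bits(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    bits &= 0xFFFFFF00u;                  /* discard the low 8 mantissa bits */
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

Under these assumptions, truncate_to_24_bits(8000100.0f) returns 8000000.0, so a value that is off by 100 would be indistinguishable from 8000000.0.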

Here is how 8000000.0 and the next highest representable 24-bit value are encoded:

Value        Bit Sequence    Mantissa    Exponent
8000000.0    0x4af424        0x7424      0x16
8000128.0    0x4af425        0x7425      0x16

Beware of operations that involve two floating-point values of very different magnitudes. The following code appears to add the value 0.2 to myFloat 100 times.
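
A minimal sketch of such a loop, assuming a 32-bit IEEE float and a starting value of 8000000.0, is shown here. The function wrapper is added only so that the result can be inspected.

```c
/* Adds 0.2 to 8000000.0 one hundred times and returns the final value. */
float add_small_increments(void)
{
    volatile float myFloat = 8000000.0f;

    for (int i = 0; i < 100; i++) {
        myFloat += 0.2f;   /* each sum is rounded back to float on storage */
    }
    return myFloat;
}
```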

It is tempting to think that the result should be 8000020.0, but that is not the case. Certainly, the value 0.2 will be rounded, but the actual value will still be quite close to the intended value. The problem here is that the value added is not large enough to shift the larger operand to the next representable value. The result is that each addition leaves the value of myFloat unchanged and the entire loop has no effect.

Be aware that complex floating-point algorithms are used to perform seemingly simple operations such as addition and multiplication and that rounding can take place during these calculations. Even printing a floating-point value requires a complicated conversion to a string of decimal digits. Never expect the results of complex calculations to exactly match the theoretical result as determined by pen and paper or a high-precision pocket calculator. If you must check the result of a calculation in a program, you might need to ensure that the result is within a range of values rather than being exactly equal to the expected value.
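
One way to perform such a range check is sketched below. The helper name within_tolerance and its parameters are illustrative, and the tolerance that is appropriate depends on the magnitude of the values and on the floating-point format in use.

```c
#include <stdbool.h>

/* Returns true if actual is within +/- tolerance of expected. */
bool within_tolerance(float actual, float expected, float tolerance)
{
    float difference = actual - expected;

    if (difference < 0.0f) {
        difference = -difference;   /* absolute value without <math.h> */
    }
    return difference <= tolerance;
}
```

For example, within_tolerance(4.0f / 3.0f, 1.333333f, 0.00001f) is true even though the two values are not exactly equal.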

© 2024 Microchip Technology, Inc.