Doubles vs Floats

TL;DR

Just use doubles

Overview

Floating point numbers can represent a far wider range of values than integers. A 32-bit float (a variable declared as float) can have a magnitude as large as 3.4e38 and as small as 1.2e-38. Despite this wide range, a 32-bit float sometimes does not offer enough precision, which is why the Quarto supports both 32-bit floats and double-precision, 64-bit floats (a variable declared as double). This application note discusses floating point numbers in general, looks at the limitations of 32-bit floats, and explains why the Quarto examples generally use 64-bit doubles.

What is a Float?

A floating point number consists of three parts:

  • A sign bit that records whether the number is positive or negative
  • A fraction, which represents a value between 1 and 2
  • An exponent, which scales the final number

The value of a float is given by

$$Value = Sign \times Fraction \times 2^{Exponent}$$

In a 32-bit float, 8 bits store the exponent, which ranges from -126 to 127 for ordinary (normal) values; the two remaining exponent codes are reserved for zero and subnormal numbers, and for infinity and NaN. With one bit used to store the sign, that leaves 23 bits to represent the fraction. To understand how the fraction is stored, consider an example with only 3 bits for the fraction. The fraction is always bounded between 1 (inclusive) and 2 (exclusive), and the step size is the reciprocal of 2 raised to the number of bits. In this case, that is $2^{-3} = 0.125$:

Binary Value    Integer Value    Fractional Representation
000             0                1.0
001             1                1.125
010             2                1.25
011             3                1.375
100             4                1.5
101             5                1.625
110             6                1.75
111             7                1.875

The general formula, where Value is the integer stored in the fraction bits and N is the number of fraction bits, is

$$Fraction = 1 + Value \times 2^{-N}$$
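
To make the three fields concrete, here is a minimal sketch in standard C++ (written for a desktop compiler, not specific to the Quarto) that splits a 32-bit float into its sign, exponent, and fraction. The names FloatParts and decodeFloat are hypothetical, chosen for this illustration. The sketch assumes the platform stores a float in the IEEE 754 single-precision layout, and it handles only normal values (not zero, subnormals, infinity, or NaN).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper type for this illustration; not part of any Quarto library.
struct FloatParts {
    int sign;          // +1 or -1
    int exponent;      // unbiased power of two
    uint32_t value;    // raw integer held in the 23 fraction bits
    double fraction;   // 1 + value * 2^-23, always in [1, 2)
};

// Split a normal 32-bit float into sign, exponent, and fraction.
FloatParts decodeFloat(float x) {
    static_assert(sizeof(float) == sizeof(uint32_t), "expects a 32-bit float");
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);       // copy the raw bytes of the float
    uint32_t biased = (bits >> 23) & 0xFFu;    // 8-bit exponent field, stored with a bias of 127
    assert(biased != 0 && biased != 255);      // reserved codes: not a normal number
    FloatParts p;
    p.sign = (bits >> 31) ? -1 : 1;
    p.exponent = static_cast<int>(biased) - 127;
    p.value = bits & 0x7FFFFFu;                // low 23 bits
    p.fraction = 1.0 + p.value / 8388608.0;    // 8388608 = 2^23
    return p;
}
```

For example, decodeFloat(1.5f) yields sign +1, exponent 0, and fraction 1.5, while decodeFloat(16777216.0f) yields exponent 24 and fraction 1.0.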

Dynamic Range of a Float

Because a float is scaled by $2^{Exponent}$, it can be a very large or very small number. However, its dynamic range is set by the number of bits used to represent the fraction. Consider the number $16{,}777{,}216$. If you store that number as a 32-bit float and add 1 to it, the number does not increment:

$$16{,}777{,}216 + 1 = 16{,}777{,}216$$

To understand why, look at how $16{,}777{,}216$ is represented as a float. The first thing to notice is that $16{,}777{,}216 = 2^{24}$. Because the number is a power of two, the fraction bits are all 0, so the fraction is 1.0, and the exponent is 24. If we increase the fraction by its smallest increment, the exponent does not change, and the fraction can only increase by $2^{-23}$ because that is the bit resolution of the fraction in a 32-bit float. That fractional change is also multiplied by the $2^{24}$ scaling of the exponent, so the smallest increment is $2^{1}$, or 2. So a 32-bit float can represent $16{,}777{,}216$ and $16{,}777{,}218$, but nothing in between. If you started with a float twice as big, the smallest increment would be 4. So while the magnitude of a 32-bit float can range from 1.2e-38 to 3.4e38, its dynamic range is only about $2^{-23} \approx 1.2 \times 10^{-7}$, or approximately 7 decimal places.
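
The spacing argument can also be checked numerically on a desktop compiler. The following is a minimal sketch in standard C++, assuming IEEE 754 single precision and the default round-to-nearest mode; floatSpacing and isAbsorbed are hypothetical names used only for this illustration.

```cpp
#include <cmath>

// Gap between x and the next representable 32-bit float above it.
float floatSpacing(float x) {
    return std::nextafter(x, INFINITY) - x;
}

// True when adding step to base in 32-bit float arithmetic leaves base unchanged.
bool isAbsorbed(float base, float step) {
    volatile float sum = base + step;   // volatile forces the sum to be stored as a float
    return sum == base;
}
```

With these helpers, floatSpacing(16777216.0f) returns 2 and floatSpacing(33554432.0f) returns 4, and isAbsorbed(16777216.0f, 1.0f) is true because the exact sum falls halfway between two representable floats and rounds back down.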

If you want to play around with this yourself, here's a simple Quarto program that adds a user-specified input to the number $16{,}777{,}216$ stored as a float.

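#include "qCommand.h"
qCommand qC;

void setup() {
  // Register the serial command "add" and its handler
  qC.addCommand("add", add);
}

void loop() {
  // Check Serial and Serial2 for incoming commands
  qC.readSerial(Serial);
  qC.readSerial(Serial2);
}

// Handler for "add <number>": adds the argument to 2^24 using 32-bit floats
void add(qCommand& qC, Stream& S) {
  float base = 16777216; // 2^24
  if (qC.next() == NULL) {
    S.printf("Please type as an argument the number to add to %f\n", base);
  } else {
    float addend = atof(qC.current()); // parse the argument as a float
    float result = base + addend;      // sum is rounded to the nearest float
    S.printf("%f + %f = %f\n", base, addend, result);
  }
}

Example session (>> is the typed command, << is the response):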
>> add 1
<< 16777216.000000 + 1.000000 = 16777216.000000
>> add 2
<< 16777216.000000 + 2.000000 = 16777218.000000
>> add 1.5
<< 16777216.000000 + 1.500000 = 16777218.000000
>> add 3
<< 16777216.000000 + 3.000000 = 16777220.000000

PID Servo

This matters because the PID Servo Example does exactly the same math in the line

 integral += (newadc - SETPOINT) * 0.01; // integral gain

which can be rewritten as

 integral = integral + (newadc - SETPOINT) * 0.01; // integral gain

For simplicity, let's assume that the SETPOINT is zero. The newadc variable holds the value read from the ADC. If the ADC is configured with a range of ±1.25V, then the smallest nonzero ADC value that can be read is about 40µV. The DAC output is mostly set by the integral variable, so if that can be as large as 10V, a 32-bit float near that value can only change in steps of roughly $10 \times 2^{-23} \approx 1.2\,\mu V$. That seems like it shouldn't be a problem, since the ADC quantization is much larger than 1µV. However, the scaling applied before the addition needs to be taken into account. In this PID Servo example, the integral gain is set to 0.01, so the minimum ADC value of 40µV is scaled down to 400nV before it is added to the variable integral. And 400nV is less than the 1.2µV minimum increment.

What does this mean for the servo performance? If you use 32-bit floats and a low integral gain, your integrator will not see small ADC values. If your loop should be driving the ADC reading to 0V, a value of 40µV could be read over and over again without ever changing the integrator. Effectively you have a higher noise floor, as the loop can only respond to ADC readings of about 100µV or so.
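
The stalled integrator can be reproduced off the hardware. The following is a minimal sketch in standard C++ for a desktop compiler, not the Quarto servo code itself: integrate is a hypothetical helper that repeats the integral update from the text with a chosen storage type. It assumes the compiler evaluates float expressions in single precision, which is the usual case on x86-64 and ARM.

```cpp
// Repeat the update "integral += error * gain" a given number of times,
// storing the integrator in type T (float or double).
template <typename T>
T integrate(T start, T error, T gain, int steps) {
    T integral = start;
    for (int i = 0; i < steps; ++i) {
        integral += error * gain;   // same form as the PID Servo integral line
    }
    return integral;
}
```

Using the numbers from the text (integrator at 10V, a constant 40µV reading, gain of 0.01), integrate<float>(10.0f, 40e-6f, 0.01f, 1000) returns exactly 10, because each 400nV step is rounded away, while integrate<double>(10.0, 40e-6, 0.01, 1000) returns approximately 10.0004.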

Having said that, a PID Servo will usually want a higher integral gain to take advantage of the low latency and high servo performance of the Quarto, and when the integrator gain is not so low, there will be no issues using 32-bit floats to store your integration values. But hopefully this example shows the type of situation where it's worth checking whether 32-bit floats provide enough dynamic range for your application.

Doubles

Doubles use 64 bits to store the number and put 52 of those bits into the fraction. That gives a dynamic range of $2^{-52}$, or about 16 decimal places. This should be more than enough dynamic range for almost any scenario using the 16-bit analog inputs and outputs. Typically, using doubles instead of floats comes at the cost of speed because the calculations are done on 64-bit numbers instead of 32-bit numbers. However, because the Quarto has hardware support for both 64-bit and 32-bit floating point math, the time to do a calculation with a 32-bit or 64-bit float is basically the same. For calculating trigonometric functions or running the PID Servo interrupt routine, switching from floats to doubles introduces almost no additional latency or computation time. For this reason, all the Quarto examples use doubles instead of floats: there is no real cost to using doubles, and it avoids potential issues in applications that need variables with high dynamic range.

Doubles do use twice the storage, so there is a small increase in program size and memory usage from using doubles. However, in most scenarios this increase is very small, and the Quarto has far more memory and program space than is used. The one place the size can matter is streaming data back over USB: if you are limited by USB speed, sending the data as floats instead of doubles halves the required data throughput.
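
One way to get the smaller payload while still computing in doubles is to narrow each value to a float only at the point where it is serialized. The following is a minimal sketch in standard C++, assuming the receiver expects raw 32-bit floats in the sender's byte order; packAsFloats is a hypothetical helper, not a Quarto API, and how the bytes are actually sent over USB is outside the sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Serialize samples as 32-bit floats: 4 bytes per sample instead of 8.
std::vector<uint8_t> packAsFloats(const std::vector<double>& samples) {
    std::vector<uint8_t> out(samples.size() * sizeof(float));
    for (std::size_t i = 0; i < samples.size(); ++i) {
        // Narrowing rounds the value to about 7 significant decimal digits
        float narrowed = static_cast<float>(samples[i]);
        std::memcpy(out.data() + i * sizeof(float), &narrowed, sizeof(float));
    }
    return out;
}
```

The calculations keep full double precision; only the transmitted copy is rounded, which is usually acceptable when the data originates from 16-bit converters.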