Floating-point numbers are essential for representing real numbers in computing, but they come with limits on precision and range.
This Learning Path covers the following:

- How floating-point numbers approximate real numbers
- Normalized and non-normalized representations
- Exponent bias and bitwise encoding
- The IEEE-754 single-precision format
- Subnormal numbers
Floating-point numbers are a finite, discrete approximation of real numbers. They allow functions in the continuous domain to be computed with adequate, but limited, resolution.
A floating-point number is typically expressed as:
± d.dddd...d × B^e
where:

- d.dddd...d is the significand (also called the mantissa), with each digit d in the range [0, B)
- B is the base (or radix)
- e is the exponent
The precision of a floating-point format is the number of digits used to represent the mantissa. It is denoted by p; a system with p digits of precision can distinguish between \( B^p \) different significand values, which is \( 2^p \) when the base is 2.
If the leading digit is non-zero, the number is said to be normalized (also called a normal number).
Fixing B = 2 and p = 24:

0.1 ≈ 1.10011001100110011001101 × 2^-4

is a normalized representation of 0.1, while

0.1 ≈ 0.000110011001100110011001 × 2^0

is a non-normalized representation of the same value. The relation is approximate because 0.1 has no finite binary expansion.
A given value can have multiple non-normalized forms, but only one normalized representation, assuming a fixed base and precision, and a leading digit that is non-zero and strictly less than the base.
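To see this on a real machine, C's `%a` conversion prints a floating-point value in hexadecimal scientific notation, which exposes the normalized binary form directly:

```c
#include <stdio.h>

int main(void) {
    // Prints 0x1.99999ap-4: the hex rendering of
    // 1.10011001100110011001101 x 2^-4, the float closest to 0.1.
    printf("%a\n", 0.1f);
    return 0;
}
```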
Given:

- B (the base)
- p (the precision)
- emax (the largest exponent)
- emin (the smallest exponent)
You can create the full set of representable normalized values.
Example 2: B = 2, p = 3, emax = 2, emin = -1
| Significand | × 2⁻¹ | × 2⁰ | × 2¹ | × 2² |
|---|---|---|---|---|
| 1.00 (1.0) | 0.5 | 1.0 | 2.0 | 4.0 |
| 1.01 (1.25) | 0.625 | 1.25 | 2.5 | 5.0 |
| 1.10 (1.5) | 0.75 | 1.5 | 3.0 | 6.0 |
| 1.11 (1.75) | 0.875 | 1.75 | 3.5 | 7.0 |
For any exponent n, numbers are evenly spaced between 2ⁿ and 2ⁿ⁺¹. However, the gap between consecutive numbers (also called a ULP, which is explained in more detail in the next section) doubles each time the exponent increases by one.
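Here is a short C sketch that enumerates every normalized value in this system and prints the gap at each exponent (the loop bounds hard-code B = 2, p = 3, emax = 2, emin = -1 from Example 2):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    // One line per exponent; the four values per line match one
    // column of the table above.
    for (int e = -1; e <= 2; e++) {
        for (int m = 0; m < 4; m++) {
            // Significands 1.00, 1.01, 1.10, 1.11 -> 1.0 + m/4
            printf("%6.3f", ldexp(1.0 + m / 4.0, e));
        }
        // Within one exponent the spacing is constant: 2^e * 2^-(p-1)
        printf("   gap = %.3f\n", ldexp(1.0, e - 2));
    }
    return 0;
}
```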
Since there are \( B^p \) possible mantissas and \( e_{max} - e_{min} + 1 \) possible exponents, \( \log_2(B^p) + \log_2(e_{max} - e_{min} + 1) + 1 \) bits (the final +1 being the sign bit) are needed to represent a given floating-point number in such a system.
In Example 2, 3+2+1=6 bits are needed.
Based on this, the floating-point’s bitwise representation is defined as:
b0 b1 b2 b3 b4 b5
where
- b0 -> sign (S)
- b1, b2 -> exponent (E)
- b3, b4, b5 -> mantissa (M)
However, this is not enough. In this bitwise definition, the possible values of E are 0, 1, 2, 3. But in the system being defined, only the integer values in the range [-1, 2] are of interest.
E is stored as a biased exponent to allow representation of both positive and negative powers of two using only unsigned integers. In this example, a bias of 1 shifts the exponent range from [0, 3] to [−1, 2]:
x = (-1)^S × M × 2^(E-1)
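As an illustration, here is a minimal C decoder for this 6-bit toy format (the function name `decode_toy` is made up for this sketch; the bit layout and bias follow the definitions above):

```c
#include <math.h>
#include <stdio.h>

// b0 = sign, b1-b2 = biased exponent, b3-b5 = mantissa.
// All p = 3 significand digits are stored explicitly in this toy
// format, so M is read as b3.b4b5 (unlike IEEE-754's implicit 1).
double decode_toy(unsigned bits) {
    int s = (bits >> 5) & 0x1;  // sign bit b0
    int e = (bits >> 3) & 0x3;  // exponent field b1 b2
    int m = bits & 0x7;         // mantissa bits b3 b4 b5
    return (s ? -1.0 : 1.0) * ldexp(m / 4.0, e - 1);  // bias = 1
}

int main(void) {
    // 010110: S = 0, E = 10 (exponent 1), M = 110 (1.5) -> 3.0
    printf("%g\n", decode_toy(0x16));
    return 0;
}
```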
Single precision (also called float) is a 32-bit format defined by the IEEE-754 Floating-Point Standard.
In this format:

- 1 bit stores the sign (S)
- 8 bits store the biased exponent (E)
- 23 bits store the mantissa (M)
The value of a normalized floating-point number in IEEE-754 can be represented as:
x = (−1)^S × (1.M) × 2^(E−127)
The exponent bias of 127 allows storage of actual exponents from -126 to +127. The leading 1 is implicit in normalized numbers, so the 23 stored mantissa bits give a total of 24 bits of precision.
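You can check this interpretation by extracting S, E, and M from a float's bit pattern and re-evaluating the formula. This sketch uses standard C, with memcpy as the portable way to reinterpret the bits:

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    float f = 0.1f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);    // reinterpret the 32 bits

    uint32_t s = bits >> 31;           // 1 sign bit
    uint32_t e = (bits >> 23) & 0xFF;  // 8 biased exponent bits
    uint32_t m = bits & 0x7FFFFF;      // 23 mantissa bits

    // Reconstruct (-1)^S * (1.M) * 2^(E-127) for a normal number.
    // 8388608 = 2^23, so m / 8388608.0 is the fractional part of 1.M.
    double value = (s ? -1.0 : 1.0) * ldexp(1.0 + m / 8388608.0, (int)e - 127);

    printf("S=%u E=%u M=0x%06X -> %.9g (float was %.9g)\n",
           (unsigned)s, (unsigned)e, (unsigned)m, value, (double)f);
    return 0;
}
```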
Since the exponent field uses 8 bits, E ranges from 0 to 2^8 - 1 = 255. However, not all 256 values are used for normal numbers.

If the exponent E is:

- 0, with M = 0, the value is ±0
- 0, with M ≠ 0, the value is subnormal
- 255, with M = 0, the value is ±infinity
- 255, with M ≠ 0, the value is NaN (Not a Number)
- any other value (1 to 254), the value is a normal number
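These cases are exactly what C's fpclassify macro reports, so a small program can exercise each one:

```c
#include <math.h>
#include <stdio.h>

static void classify(float f) {
    switch (fpclassify(f)) {
    case FP_ZERO:      printf("%-12a zero\n", f);      break;
    case FP_SUBNORMAL: printf("%-12a subnormal\n", f); break;
    case FP_INFINITE:  printf("%-12a infinite\n", f);  break;
    case FP_NAN:       printf("%-12a NaN\n", f);       break;
    default:           printf("%-12a normal\n", f);    break;
    }
}

int main(void) {
    classify(0.0f);      // E = 0,   M = 0
    classify(1e-45f);    // E = 0,   M != 0 (rounds to the smallest subnormal)
    classify(1.0f);      // 0 < E < 255
    classify(INFINITY);  // E = 255, M = 0
    classify(NAN);       // E = 255, M != 0
    return 0;
}
```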
Subnormal numbers (also called denormal numbers) are special floating-point values defined by the IEEE-754 standard. They allow the representation of values closer to zero than is possible with the normalized exponent range.

Subnormal numbers do not have an implicit leading 1 in their significand, and they use a fixed exponent of -126, the same as the smallest normal exponent, even though their stored exponent field E is 0.
A subnormal floating-point value in IEEE-754 is interpreted as:
x = (−1)^S × 0.M × 2^(−126)
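A short demonstration: halving FLT_MIN, the smallest normal float, yields a value that is only representable as a subnormal (FLT_MIN comes from the standard float.h header):

```c
#include <float.h>
#include <stdio.h>

int main(void) {
    float smallest_normal = FLT_MIN;     // 2^-126, the smallest normal float
    float sub = smallest_normal / 2.0f;  // 2^-127, only representable as subnormal

    printf("smallest normal:    %a\n", smallest_normal);  // 0x1p-126
    printf("subnormal:          %a\n", sub);              // 0x1p-127
    printf("smallest subnormal: %a\n", 0x1p-149f);        // M = 0...01, i.e. 2^-149
    return 0;
}
```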
If you’re interested in diving deeper into this subject, What Every Computer Scientist Should Know About Floating-Point Arithmetic by David Goldberg is a great place to start.