How Computers Store Floating-Point Numbers
A computer stores floating-point numbers using a standardized format called IEEE 754. This format is designed to represent real numbers in a way that balances range and precision. Here's how it works:
Basic Structure of IEEE 754 Floating-Point Numbers
A floating-point number in a computer is typically represented by three components:
- Sign bit (S): This determines whether the number is positive (`0`) or negative (`1`).
- Exponent (E): This stores the exponent value, which determines the range of the number (i.e., how large or small it can be).
- Mantissa (or Significand) (M): This holds the significant digits of the number, representing its precision.
The general formula for a floating-point number is:
\[ (-1)^{S} \times 1.M \times 2^{(E - \text{bias})} \]

Where:
- `S` is the sign bit (0 for positive, 1 for negative).
- `M` is the mantissa (or significand), typically in normalized form (starting with a leading 1).
- `E` is the stored exponent, adjusted by a bias.
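To make the formula concrete, here is a minimal Python sketch that evaluates it for explicit field values. The function name `ieee754_value` and the sample inputs are illustrative only, and it handles just the normalized case described above.

```python
# A minimal sketch of the IEEE 754 value formula for normalized numbers.
def ieee754_value(sign: int, mantissa_bits: str, stored_exponent: int, bias: int) -> float:
    """Evaluate (-1)^S * 1.M * 2^(E - bias) from explicit field values."""
    # Rebuild the significand 1.M from the stored fraction bits.
    significand = 1.0
    for i, bit in enumerate(mantissa_bits, start=1):
        significand += int(bit) * 2.0 ** -i
    return (-1) ** sign * significand * 2.0 ** (stored_exponent - bias)

# Sign 1, fraction bits 1011, stored exponent 129, bias 127:
# -1.1011 (binary) * 2^2 = -6.75
print(ieee754_value(1, "1011", 129, 127))  # -6.75
```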
Common Floating-Point Formats
The two most common floating-point formats are single precision (32-bit) and double precision (64-bit).
1. Single Precision (32-bit Floating-Point):
- 1 bit for sign (S)
- 8 bits for exponent (E)
- 23 bits for mantissa (M)
A 32-bit floating-point number has the following layout:
| S | E (8 bits) | M (23 bits) |
- Range of exponent: The exponent is stored with a bias of 127, meaning the actual exponent is calculated as `E - 127`.
- Mantissa: The 23 bits store the fractional part. The number is assumed to have a leading `1.` (known as the implicit leading 1), which is not stored explicitly. For example, a stored mantissa of `001` would be interpreted as `1.001`.
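As a concrete check, the following sketch uses Python's standard `struct` module to obtain the raw 32 bits of a float and slice out the three fields; the helper name `decompose_float32` is made up for this example.

```python
import struct

def decompose_float32(x: float) -> tuple[int, int, int]:
    """Return (sign, stored exponent, mantissa) of x as a 32-bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # raw 32 bits, big-endian
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF       # 23 fraction bits (implicit leading 1 not stored)
    return sign, exponent, mantissa

s, e, m = decompose_float32(6.75)
print(s, e - 127, f"{m:023b}")  # 0 2 10110000000000000000000
```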
2. Double Precision (64-bit Floating-Point):
- 1 bit for sign (S)
- 11 bits for exponent (E)
- 52 bits for mantissa (M)
A 64-bit floating-point number has the following layout:
| S | E (11 bits) | M (52 bits) |
- Range of exponent: The exponent is stored with a bias of 1023, meaning the actual exponent is calculated as `E - 1023`.
- Mantissa: The 52 bits store the fractional part, with an implicit leading 1.
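The same trick works for doubles, swapping in the `>d`/`>Q` struct formats and the wider fields; again, `decompose_float64` is just an illustrative name.

```python
import struct

def decompose_float64(x: float) -> tuple[int, int, int]:
    """Return (sign, stored exponent, mantissa) of x as a 64-bit float."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))  # raw 64 bits, big-endian
    sign = bits >> 63                  # 1 bit
    exponent = (bits >> 52) & 0x7FF    # 11 bits, biased by 1023
    mantissa = bits & ((1 << 52) - 1)  # 52 fraction bits
    return sign, exponent, mantissa

s, e, m = decompose_float64(6.75)
print(s, e - 1023)  # 0 2
print(f"{m:052b}")  # 1011 followed by 48 zeros
```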
Example of Single-Precision Float Representation
Suppose we want to store the number `-6.75` as a 32-bit float:

- Convert to binary: `6.75` in decimal is `110.11` in binary (`6 = 110`, and `.75 = .11` in binary).
- Normalize the number: In scientific notation, this is \( -1.1011 \times 2^2 \). This shows the sign bit is `1`, the exponent is `2`, and the mantissa is `1.1011`.
- Set the components:
  - Sign bit: `1` (since the number is negative)
  - Exponent: `2 + 127 = 129` in decimal, which is `10000001` in binary.
  - Mantissa: The leading `1.` is implicit, so we only store `1011`, padded to 23 bits: `10110000000000000000000`.
Thus, the 32-bit representation of `-6.75` is:

`1 10000001 10110000000000000000000`
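You can verify this bit pattern directly; the snippet below packs `-6.75` as a 32-bit float using the standard `struct` module and prints its bits grouped as sign, exponent, and mantissa.

```python
import struct

# Pack -6.75 as a big-endian 32-bit float and reinterpret the bytes as an int.
(bits,) = struct.unpack(">I", struct.pack(">f", -6.75))
b = f"{bits:032b}"
print(b[0], b[1:9], b[9:])  # 1 10000001 10110000000000000000000
```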
Precision and Limitations
- Precision: The more bits in the mantissa, the more precise the number. Single-precision floats are accurate to about 7 decimal digits, while double-precision floats are accurate to about 15-16 decimal digits.
- Range: The exponent allows floating-point numbers to represent a vast range, from very small numbers (close to zero) to very large ones.
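These limits are easy to observe in practice. Python's built-in `float` is a double, and a round-trip through `struct` simulates single-precision storage:

```python
import struct

# Double precision carries roughly 15-16 significant decimal digits,
# so decimal fractions like 0.1 are only approximated:
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False

# Single precision keeps only about 7 digits: store pi as a 32-bit float
# and read it back to see where the accuracy runs out.
pi32 = struct.unpack(">f", struct.pack(">f", 3.141592653589793))[0]
print(pi32)              # 3.1415927410125732 (correct to ~7 digits)
```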
Special Values
IEEE 754 also defines special cases:
- Zero: Represented by all bits in the exponent and mantissa being zero (the sign bit may be 0 or 1, giving both +0 and -0).
- Infinity: Represented by all bits in the exponent being 1, and the mantissa being all 0.
- NaN (Not a Number): Represented by all bits in the exponent being 1, and the mantissa containing non-zero bits.
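These special values can be inspected the same way; the hypothetical helper `bits32` below reuses the `struct` bit-dump shown earlier:

```python
import math
import struct

def bits32(x: float) -> str:
    """Return the single-precision bit pattern of x as a 32-character string."""
    (b,) = struct.unpack(">I", struct.pack(">f", x))
    return f"{b:032b}"

print(bits32(0.0))           # all 32 bits zero
print(bits32(-0.0))          # sign bit set, everything else zero
print(bits32(math.inf))      # exponent all 1s, mantissa all 0s
print(bits32(math.nan))      # exponent all 1s, mantissa non-zero
print(math.isnan(math.nan))  # True: NaN compares unequal even to itself
```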
Summary
- Floating-point numbers are stored in three parts: sign, exponent, and mantissa.
- Single precision uses 32 bits, while double precision uses 64 bits.
- The IEEE 754 standard defines how these components are laid out and how the numbers are calculated, enabling computers to store a wide range of real numbers with a trade-off between precision and range.