Microsoft C++ (MSVC) is consistent with the IEEE numeric standards. The IEEE-754 standard describes floating-point formats, a way to represent real numbers in hardware. There are at least five internal formats for floating-point numbers that are representable in hardware targeted by the MSVC compiler, but the compiler only uses two of them. The *single-precision* (4-byte) and *double-precision* (8-byte) formats are used in MSVC. Single-precision is declared using the keyword **`float`**. Double-precision is declared using the keyword **`double`**. The IEEE standard also specifies *half-precision* (2-byte) and *quadruple-precision* (16-byte) formats, as well as an *double-extended-precision* (10-byte) format, which some C and C++ compilers implement as the **`long double`** data type. In the MSVC compiler, the **`long double`** data type is treated as a distinct type, but the storage type maps to **`double`**. There is, however, intrinsic and assembly language support for computations using the other formats, including the double-extended-precision (10-byte) format, where supported by hardware.
+
Microsoft C++ (MSVC) is consistent with the IEEE numeric standards. The IEEE-754 standard describes floating-point formats, a way to represent real numbers in hardware. There are at least five internal formats for floating-point numbers that are representable in hardware targeted by the MSVC compiler. The compiler only uses two of them. The *single-precision* (4-byte) and *double-precision* (8-byte) formats are used in MSVC. Single-precision is declared using the keyword **`float`**. Double-precision is declared using the keyword **`double`**. The IEEE standard also specifies *half-precision* (2-byte) and *quadruple-precision* (16-byte) formats, and a *double-extended-precision* (10-byte) format, which some C and C++ compilers implement as the **`long double`** data type. In the MSVC compiler, the **`long double`** data type is treated as a distinct type, but the storage type maps to **`double`**. There is, however, intrinsic and assembly language support for computations using the other formats, including the double-extended-precision format, where supported by hardware.
In single-precision and double-precision formats, there is an assumed leading 1 in the fractional part, called the *significand* (and sometimes referred to as the *mantissa*), that is not stored in memory, so the significands are actually 24 or 53 bits, even though only 23 or 52 bits are stored. The double-extended-precision format actually stores this bit.
+
In single-precision and double-precision formats, there's an assumed leading 1 in the fractional part. The fractional part is called the *significand* (sometimes known as the *mantissa*). This leading 1 isn't stored in memory, so the significands are actually 24 or 53 bits, even though one less bit gets stored. The double-extended-precision format actually stores this bit.
-
The exponents are biased by half of their possible value. This means you subtract this bias from the stored exponent to get the actual exponent. If the stored exponent is less than the bias, it is actually a negative exponent.
+
The exponents are biased by half of their possible value. It means you subtract this bias from the stored exponent to get the actual exponent. If the stored exponent is less than the bias, it's actually a negative exponent.
The exponents are biased as follows:
|Exponent|Biased by|
|--------------|---------------|
|8-bit (single-precision)|127|
|11-bit (double-precision)|1023|
-
|15-bit (double-extended-precision)|16383|
-
These exponents are not powers of ten; they are powers of two. That is, 8-bit stored exponents can range from -127 to 127, stored as 0 to 254. The value 2<sup>127</sup> is roughly equivalent to 10<sup>38</sup>, which is the actual limit of single-precision.
+
These exponents aren't powers of ten; they're powers of two. That is, 8-bit stored exponents can range from -127 to 127, stored as 0 to 254. The value 2<sup>127</sup> is roughly equivalent to 10<sup>38</sup>, which is the actual limit of single-precision.
-
The significand is stored as a binary fraction of the form 1.XXX... . This fraction has a value greater than or equal to 1 and less than 2. Note that real numbers are always stored in *normalized form*; that is, the significand is left-shifted such that the high-order bit of the significand is always 1. Because this bit is always 1, it is assumed (not stored) in the single-precision and double-precision formats. The binary (not decimal) point is assumed to be just to the right of the leading 1.
+
The significand is stored as a binary fraction of the form 1.XXX... . This fraction has a value greater than or equal to 1 and less than 2. Real numbers are always stored in *normalized form*. That is, the significand is left-shifted such that the high-order bit of the significand is always 1. Because this bit is *always* 1, it's assumed (not stored) in the single-precision and double-precision formats. The binary (not decimal) point is assumed to be just to the right of the leading 1.
-
The format, then, for the various sizes is as follows:
+
The format for floating-point representation is as follows:
`S` represents the sign bit, the `X`'s are the biased exponent bits, and the `M`'s are the significand bits. Note that the leftmost bit is assumed in single-precision and double-precision formats, but is present as "1" in byte 3 of the double-extended-precision format.
+
`S` represents the sign bit, the `X`'s are the biased exponent bits, and the `M`'s are the significand bits. The leftmost bit is assumed in single-precision and double-precision formats.
To shift the binary point properly, you first unbias the exponent and then move the binary point to the right or left the appropriate number of bits.
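The unbiasing step can be sketched in C++ as follows. This is a minimal illustration, not MSVC's implementation: `FloatFields`, `decode`, and `unbiased_exponent` are hypothetical names introduced here, and the code assumes a 4-byte IEEE-754 `float`.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper type for this sketch; not part of any library.
struct FloatFields {
    std::uint32_t sign;             // 1 bit
    std::uint32_t stored_exponent;  // 8 bits, biased by 127
    std::uint32_t fraction;         // 23 stored significand bits (leading 1 not stored)
};

// Split a single-precision value into its three bit fields.
FloatFields decode(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 4-byte float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // copy out the raw bit pattern
    return { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
}

// Subtract the bias to get the true exponent of a normalized value.
int unbiased_exponent(const FloatFields& fields) {
    return static_cast<int>(fields.stored_exponent) - 127;
}
```

Under these assumptions, `decode(2.0f)` yields a stored exponent of 128 and a fraction of zero, so `unbiased_exponent` returns 1.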
@@ -50,31 +47,31 @@ The floating-point formats include some values that are treated specially.
### Zero
-
Zero cannot be normalized, which makes it unrepresentable in the normalized form of a single-precision or double-precision value. A special bit pattern of all zeroes represents 0. It's also possible to represent -0 as zero with the sign bit set, but -0 and 0 always compare as equal.
+
Zero can't be normalized, which makes it unrepresentable in the normalized form of a single-precision or double-precision value. A special bit pattern of all zeroes represents 0. It's also possible to represent -0 as zero with the sign bit set, but -0 and 0 always compare as equal.
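A short C++ sketch of the signed-zero behavior, assuming IEEE-754 arithmetic with default compiler settings. The function names are hypothetical and only serve the illustration.

```cpp
#include <cmath>

// -0 and 0 compare as equal even though their sign bits differ.
bool zeros_compare_equal() {
    return 0.0f == -0.0f;
}

// Telling the two zeros apart requires looking at the sign bit.
bool is_negative_zero(float value) {
    return value == 0.0f && std::signbit(value);
}
```

Here `is_negative_zero(-0.0f)` is true and `is_negative_zero(0.0f)` is false, while `zeros_compare_equal()` is true.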
### Infinities
-
The +∞ and −∞ values are represented by an exponent of all ones and a significand of all zeroes. Both positive and negative infinities can be represented by using the sign bit.
+
The +∞ and −∞ values are represented by an exponent of all ones, and a significand that's all zeroes. Positive and negative are represented by using the sign bit.
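As an illustration, the infinity bit patterns can be inspected in C++. This sketch assumes a 4-byte IEEE-754 `float`; `float_bits` and `is_infinity_pattern` are hypothetical helper names.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Copy out the raw 32-bit pattern of a float.
std::uint32_t float_bits(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 4-byte float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// Exponent all ones and significand all zeroes; the sign bit is ignored.
bool is_infinity_pattern(std::uint32_t bits) {
    return (bits & 0x7FFFFFFFu) == 0x7F800000u;
}

// Bit pattern of +infinity as reported by the standard library.
std::uint32_t positive_infinity_bits() {
    return float_bits(std::numeric_limits<float>::infinity());
}
```

With these assumptions, `positive_infinity_bits()` is `0x7F800000`, and negating the value only sets the top bit, giving `0xFF800000`.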
### Subnormals
-
It's possible to represent numbers of smaller magnitude than the smallest normalized number. These numbers are known as *subnormal* or *denormal* numbers. If the exponent is all zeroes and the significand is non-zero, then implicit leading bit of the significand is considered to be zero, not one. The precision of subnormal numbers goes down as the number of leading zeroes in the significand goes up.
+
It's possible to represent numbers of smaller magnitude than the smallest number in normalized form. They're called *subnormal* or *denormal* numbers. If the exponent is all zeroes and the significand is non-zero, then implicit leading bit of the significand is considered to be zero, not one. The precision of subnormal numbers goes down as the number of leading zeroes in the significand goes up.
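The subnormal test described above can be sketched in C++. It assumes a 4-byte IEEE-754 `float` and that subnormals aren't flushed to zero; `is_subnormal` is a hypothetical helper, not a library function.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// A value is subnormal when its exponent field is all zeroes
// and its stored significand is non-zero.
bool is_subnormal(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 4-byte float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    const std::uint32_t stored_exponent = (bits >> 23) & 0xFFu;
    const std::uint32_t fraction = bits & 0x7FFFFFu;
    return stored_exponent == 0 && fraction != 0;
}

// Smallest positive subnormal value, as reported by the standard library.
float smallest_subnormal() {
    return std::numeric_limits<float>::denorm_min();
}
```

`is_subnormal(smallest_subnormal())` is true, while `std::numeric_limits<float>::min()`, the smallest normalized value, and zero both return false.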
### NaN - Not a Number
-
It's possible to represent values that are not a real number, such as 0 / 0, in the IEEE floating-point format. A value of this kind is called a *NaN*. A NaN is represented by an exponent of all ones and a non-zero significand. There are two kinds of NaNs, *quiet* NaNs, or QNaNs, and *signaling* NaNs, or SNaNs. Quiet NaNs have a leading one in the significand, and are generally propagated through an expression. They represent an indeterminate value, such as the result of dividing by infinity, or multiplying an infinity by zero. signaling NaNs have a leading zero in the significand. These are used for operations that are not valid, to signal a floating-point hardware exception.
+
It's possible to represent values that aren't real numbers, such as 0 / 0, in the IEEE floating-point format. A value of this kind is called a *NaN*. A NaN is represented by an exponent of all ones and a non-zero significand. There are two kinds of NaNs, *quiet* NaNs, or QNaNs, and *signaling* NaNs, or SNaNs. Quiet NaNs have a leading one in the significand, and get propagated through an expression. They represent an indeterminate value, such as the result of dividing by infinity, or multiplying an infinity by zero. Signaling NaNs have a leading zero in the significand. They're used for operations that aren't valid, to signal a floating-point hardware exception.
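The NaN encodings can be classified from the raw bits alone. The following C++ sketch works on 32-bit single-precision patterns rather than `float` values, so no hardware exception can be raised; all three function names are hypothetical.

```cpp
#include <cstdint>

// NaN: exponent field all ones and a non-zero significand.
bool is_nan_pattern(std::uint32_t bits) {
    return ((bits >> 23) & 0xFFu) == 0xFFu && (bits & 0x7FFFFFu) != 0;
}

// Quiet NaN: the leading (most significant) stored significand bit is one.
bool is_quiet_nan_pattern(std::uint32_t bits) {
    return is_nan_pattern(bits) && (bits & 0x400000u) != 0;
}

// Signaling NaN: the leading stored significand bit is zero.
bool is_signaling_nan_pattern(std::uint32_t bits) {
    return is_nan_pattern(bits) && (bits & 0x400000u) == 0;
}
```

For example, `0x7FC00000` classifies as a quiet NaN and `0x7F800001` as a signaling NaN, while `0x7F800000` (infinity) is not a NaN because its significand is all zeroes.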
## Examples
The following are some examples in single-precision format:
-
- For the value 2, the sign bit is zero, and the stored exponent is 128, or 1000 0000 in binary, which is 127 plus 1. The stored binary significand is (1.) 000 0000 0000 0000 0000 0000, which has an implied leading 1 and binary point, so the actual significand is one.
+
- For the value 2, the sign bit is zero. The stored exponent is 128, or 1000 0000 in binary, which is 127 plus 1. The stored binary significand is (1.) 000 0000 0000 0000 0000 0000, which has an implied leading 1 and binary point, so the actual significand is one.
- The value 6. Same exponent, significand is larger by half — it's (1.) 100 0000 ... 0000 0000, which, since this is a binary fraction, is 1 1/2 because the values of the fractional digits are 1/2, 1/4, 1/8, and so forth.
+
- The value 6. Same exponent, significand is larger by half. It's (1.) 100 0000 ... 0000 0000, which, since it's a binary fraction, is 1 1/2 because the values of the fractional digits are 1/2, 1/4, 1/8, and so forth.
|Value|Formula|Binary representation|Hexadecimal|
|-|-|-|-|
@@ -110,13 +107,13 @@ The following are some examples in single-precision format:
- 1/10 is a repeating fraction in binary. The significand is just shy of 1.6, and the biased exponent says that 1.6 is to be divided by 16 (it is 011 1101 1 in binary, which is 123 in decimal). The true exponent is 123 - 127 = -4, which means that the factor by which to multiply is 2<sup>-4</sup> = 1/16. Note that the stored significand is rounded up in the last bit — an attempt to represent the unrepresentable number as accurately as possible. (The reason that 1/10 and 1/100 are not exactly representable in binary is similar to the reason that 1/3 is not exactly representable in decimal.)
+
- 1/10 is a repeating fraction in binary. The significand is a little less than 1.6, and the biased exponent says that 1.6 is to be divided by 16. (It's 011 1101 1 in binary, which is 123 in decimal.) The true exponent is 123 - 127 = -4, which means that the factor by which to multiply is 2<sup>-4</sup> = 1/16. The stored significand is rounded up in the last bit in an attempt to represent the unrepresentable number as accurately as possible. (The reason that 1/10 and 1/100 aren't exactly representable in binary is similar to the reason that 1/3 isn't exactly representable in decimal.)
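The 1/10 example can be checked with a small C++ sketch. It assumes a 4-byte IEEE-754 `float` and the default round-to-nearest conversion of the literal `0.1f`; `float_bits`, `stored_exponent`, and `stored_fraction` are hypothetical helper names.

```cpp
#include <cstdint>
#include <cstring>

// Copy out the raw 32-bit pattern of a float.
std::uint32_t float_bits(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes a 4-byte float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// Biased 8-bit exponent field.
std::uint32_t stored_exponent(float value) {
    return (float_bits(value) >> 23) & 0xFFu;
}

// 23 stored significand bits, without the implied leading 1.
std::uint32_t stored_fraction(float value) {
    return float_bits(value) & 0x7FFFFFu;
}
```

Under these assumptions, `stored_exponent(0.1f)` is 123, so the true exponent is 123 - 127 = -4, and `stored_fraction(0.1f)` is `0x4CCCCD`, where the final `D` reflects the rounding in the last bit.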