Commit f5782d8

Author: Colin Robertson
Commit message: Acrolinx fixes
1 parent e56aa6a commit f5782d8

5 files changed

Lines changed: 72 additions & 75 deletions

docs/build/ieee-floating-point-representation.md

Lines changed: 16 additions & 19 deletions
@@ -6,41 +6,38 @@ ms.assetid: 537833e8-fe05-49fc-8169-55fd0314b195
66
---
77
# IEEE Floating-Point Representation
88

9-
Microsoft C++ (MSVC) is consistent with the IEEE numeric standards. The IEEE-754 standard describes floating-point formats, a way to represent real numbers in hardware. There are at least five internal formats for floating-point numbers that are representable in hardware targeted by the MSVC compiler, but the compiler only uses two of them. The *single-precision* (4-byte) and *double-precision* (8-byte) formats are used in MSVC. Single-precision is declared using the keyword **`float`**. Double-precision is declared using the keyword **`double`**. The IEEE standard also specifies *half-precision* (2-byte) and *quadruple-precision* (16-byte) formats, as well as an *double-extended-precision* (10-byte) format, which some C and C++ compilers implement as the **`long double`** data type. In the MSVC compiler, the **`long double`** data type is treated as a distinct type, but the storage type maps to **`double`**. There is, however, intrinsic and assembly language support for computations using the other formats, including the double-extended-precision (10-byte) format, where supported by hardware.
9+
Microsoft C++ (MSVC) is consistent with the IEEE numeric standards. The IEEE-754 standard describes floating-point formats, a way to represent real numbers in hardware. There are at least five internal formats for floating-point numbers that are representable in hardware targeted by the MSVC compiler. The compiler only uses two of them. The *single-precision* (4-byte) and *double-precision* (8-byte) formats are used in MSVC. Single-precision is declared using the keyword **`float`**. Double-precision is declared using the keyword **`double`**. The IEEE standard also specifies *half-precision* (2-byte) and *quadruple-precision* (16-byte) formats, and a *double-extended-precision* (10-byte) format, which some C and C++ compilers implement as the **`long double`** data type. In the MSVC compiler, the **`long double`** data type is treated as a distinct type, but the storage type maps to **`double`**. There is, however, intrinsic and assembly language support for computations using the other formats, including the double-extended-precision format, where supported by hardware.
1010

1111
The values are stored as follows:
1212

1313
|Value|Stored as|
1414
|-----------|---------------|
1515
|single-precision|sign bit, 8-bit exponent, 23-bit significand|
1616
|double-precision|sign bit, 11-bit exponent, 52-bit significand|
17-
|double-extended-precision|sign bit, 15-bit exponent, 64-bit significand|
1817

19-
In single-precision and double-precision formats, there is an assumed leading 1 in the fractional part, called the *significand* (and sometimes referred to as the *mantissa*), that is not stored in memory, so the significands are actually 24 or 53 bits, even though only 23 or 52 bits are stored. The double-extended-precision format actually stores this bit.
18+
In single-precision and double-precision formats, there's an assumed leading 1 in the fractional part. The fractional part is called the *significand* (sometimes known as the *mantissa*). This leading 1 isn't stored in memory, so the significands are actually 24 or 53 bits, even though one less bit gets stored. The double-extended-precision format actually stores this bit.
2019

21-
The exponents are biased by half of their possible value. This means you subtract this bias from the stored exponent to get the actual exponent. If the stored exponent is less than the bias, it is actually a negative exponent.
20+
The exponents are biased by half of their possible value. It means you subtract this bias from the stored exponent to get the actual exponent. If the stored exponent is less than the bias, it's actually a negative exponent.
2221

2322
The exponents are biased as follows:
2423

2524
|Exponent|Biased by|
2625
|--------------|---------------|
2726
|8-bit (single-precision)|127|
2827
|11-bit (double-precision)|1023|
29-
|15-bit (double-extended-precision)|16383|
3028

31-
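These exponents are not powers of ten; they are powers of two. That is, the 8-bit stored exponents of normalized values range from 1 to 254, which corresponds to true exponents from -126 to 127. The stored values 0 and 255 are reserved for zero, subnormals, infinities, and NaNs. The value 2<sup>127</sup> is roughly equivalent to 10<sup>38</sup>, which is the actual limit of single-precision.
29+
These exponents aren't powers of ten; they're powers of two. That is, the 8-bit stored exponents of normalized values range from 1 to 254, which corresponds to true exponents from -126 to 127. The stored values 0 and 255 are reserved for zero, subnormals, infinities, and NaNs. The value 2<sup>127</sup> is roughly equivalent to 10<sup>38</sup>, which is the actual limit of single-precision.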
3230

33-
The significand is stored as a binary fraction of the form 1.XXX... . This fraction has a value greater than or equal to 1 and less than 2. Note that real numbers are always stored in *normalized form*; that is, the significand is left-shifted such that the high-order bit of the significand is always 1. Because this bit is always 1, it is assumed (not stored) in the single-precision and double-precision formats. The binary (not decimal) point is assumed to be just to the right of the leading 1.
31+
The significand is stored as a binary fraction of the form 1.XXX... . This fraction has a value greater than or equal to 1 and less than 2. Real numbers are always stored in *normalized form*. That is, the significand is left-shifted such that the high-order bit of the significand is always 1. Because this bit is *always* 1, it's assumed (not stored) in the single-precision and double-precision formats. The binary (not decimal) point is assumed to be just to the right of the leading 1.
3432

35-
The format, then, for the various sizes is as follows:
33+
The format for floating-point representation is as follows:
3634

3735
|Format|byte 1|byte 2|byte 3|byte 4|...|byte n|
3836
|------------|------------|------------|------------|------------|---------|------------|
3937
|single-precision| `SXXXXXXX`|`XMMMMMMM`|`MMMMMMMM`|`MMMMMMMM`|||
4038
|double-precision|`SXXXXXXX`|`XXXXMMMM`|`MMMMMMMM`|`MMMMMMMM`|...|`MMMMMMMM`|
41-
|double-extended-precision|`SXXXXXXX`|`XXXXXXXX`|`1MMMMMMM`|`MMMMMMMM`|...|`MMMMMMMM`|
4239

43-
`S` represents the sign bit, the `X`'s are the biased exponent bits, and the `M`'s are the significand bits. Note that the leftmost bit is assumed in single-precision and double-precision formats, but is present as "1" in byte 3 of the double-extended-precision format.
40+
`S` represents the sign bit, the `X`'s are the biased exponent bits, and the `M`'s are the significand bits. The leftmost bit is assumed in single-precision and double-precision formats.
4441

4542
To shift the binary point properly, you first unbias the exponent and then move the binary point to the right or left the appropriate number of bits.
4643

@@ -50,31 +47,31 @@ The floating-point formats include some values that are treated specially.
5047

5148
### Zero
5249

53-
Zero cannot be normalized, which makes it unrepresentable in the normalized form of a single-precision or double-precision value. A special bit pattern of all zeroes represents 0. It's also possible to represent -0 as zero with the sign bit set, but -0 and 0 always compare as equal.
50+
Zero can't be normalized, which makes it unrepresentable in the normalized form of a single-precision or double-precision value. A special bit pattern of all zeroes represents 0. It's also possible to represent -0 as zero with the sign bit set, but -0 and 0 always compare as equal.
5451

5552
### Infinities
5653

57-
The +∞ and −∞ values are represented by an exponent of all ones and a significand of all zeroes. Both positive and negative infinities can be represented by using the sign bit.
54+
The +∞ and −∞ values are represented by an exponent of all ones, and a significand that's all zeroes. Positive and negative are represented by using the sign bit.
5855

5956
### Subnormals
6057

61-
It's possible to represent numbers of smaller magnitude than the smallest normalized number. These numbers are known as *subnormal* or *denormal* numbers. If the exponent is all zeroes and the significand is non-zero, then the implicit leading bit of the significand is considered to be zero, not one. The precision of subnormal numbers goes down as the number of leading zeroes in the significand goes up.
58+
It's possible to represent numbers of smaller magnitude than the smallest number in normalized form. They're called *subnormal* or *denormal* numbers. If the exponent is all zeroes and the significand is non-zero, then the implicit leading bit of the significand is considered to be zero, not one. The precision of subnormal numbers goes down as the number of leading zeroes in the significand goes up.
6259

6360
### NaN - Not a Number
6461

65-
It's possible to represent values that are not a real number, such as 0 / 0, in the IEEE floating-point format. A value of this kind is called a *NaN*. A NaN is represented by an exponent of all ones and a non-zero significand. There are two kinds of NaNs, *quiet* NaNs, or QNaNs, and *signaling* NaNs, or SNaNs. Quiet NaNs have a leading one in the significand, and are generally propagated through an expression. They represent an indeterminate value, such as the result of dividing an infinity by an infinity, or multiplying an infinity by zero. signaling NaNs have a leading zero in the significand. These are used for operations that are not valid, to signal a floating-point hardware exception.
62+
It's possible to represent values that aren't real numbers, such as 0 / 0, in the IEEE floating-point format. A value of this kind is called a *NaN*. A NaN is represented by an exponent of all ones and a non-zero significand. There are two kinds of NaNs, *quiet* NaNs, or QNaNs, and *signaling* NaNs, or SNaNs. Quiet NaNs have a leading one in the significand, and get propagated through an expression. They represent an indeterminate value, such as the result of dividing an infinity by an infinity, or multiplying an infinity by zero. Signaling NaNs have a leading zero in the significand. They're used for operations that aren't valid, to signal a floating-point hardware exception.
6663

6764
## Examples
6865

6966
The following are some examples in single-precision format:
7067

71-
- For the value 2, the sign bit is zero, and the stored exponent is 128, or 1000 0000 in binary, which is 127 plus 1. The stored binary significand is (1.) 000 0000 0000 0000 0000 0000, which has an implied leading 1 and binary point, so the actual significand is one.
68+
- For the value 2, the sign bit is zero. The stored exponent is 128, or 1000 0000 in binary, which is 127 plus 1. The stored binary significand is (1.) 000 0000 0000 0000 0000 0000, which has an implied leading 1 and binary point, so the actual significand is one.
7269

7370
|Value|Formula|Binary representation|Hexadecimal|
7471
|-|-|-|-|
7572
|2|1 * 2<sup>1</sup>|0100 0000 0000 0000 0000 0000 0000 0000|0x40000000|
7673

77-
- The value -2. Same as +2 except that the sign bit is set. This is true for the negative of all IEEE format floating-point numbers.
74+
- The value -2. Same as +2 except that the sign bit is set. The same thing is true for the negative of all IEEE format floating-point numbers.
7875

7976
|Value|Formula|Binary representation|Hexadecimal|
8077
|-|-|-|-|
@@ -86,7 +83,7 @@ The following are some examples in single-precision format:
8683
|-|-|-|-|
8784
|4|1 * 2<sup>2</sup>|0100 0000 1000 0000 0000 0000 0000 0000|0x40800000|
8885

89-
- The value 6. Same exponent, significand is larger by half — it's (1.) 100 0000 ... 0000 0000, which, since this is a binary fraction, is 1 1/2 because the values of the fractional digits are 1/2, 1/4, 1/8, and so forth.
86+
- The value 6. Same exponent, significand is larger by half. It's (1.) 100 0000 ... 0000 0000, which, since it's a binary fraction, is 1 1/2 because the values of the fractional digits are 1/2, 1/4, 1/8, and so forth.
9087

9188
|Value|Formula|Binary representation|Hexadecimal|
9289
|-|-|-|-|
@@ -110,13 +107,13 @@ The following are some examples in single-precision format:
110107
|-|-|-|-|
111108
|2.5|1.25 * 2<sup>1</sup>|0100 0000 0010 0000 0000 0000 0000 0000|0x40200000|
112109

113-
- 1/10 is a repeating fraction in binary. The significand is just shy of 1.6, and the biased exponent says that 1.6 is to be divided by 16 (it is 011 1101 1 in binary, which is 123 in decimal). The true exponent is 123 - 127 = -4, which means that the factor by which to multiply is 2<sup>-4</sup> = 1/16. Note that the stored significand is rounded up in the last bit an attempt to represent the unrepresentable number as accurately as possible. (The reason that 1/10 and 1/100 are not exactly representable in binary is similar to the reason that 1/3 is not exactly representable in decimal.)
110+
- 1/10 is a repeating fraction in binary. The significand is a little less than 1.6, and the biased exponent says that 1.6 is to be divided by 16. (It's 011 1101 1 in binary, which is 123 in decimal.) The true exponent is 123 - 127 = -4, which means that the factor by which to multiply is 2<sup>-4</sup> = 1/16. The stored significand is rounded up in the last bit in an attempt to represent the unrepresentable number as accurately as possible. (The reason that 1/10 and 1/100 aren't exactly representable in binary is similar to the reason that 1/3 isn't exactly representable in decimal.)
114111

115112
|Value|Formula|Binary representation|Hexadecimal|
116113
|-|-|-|-|
117114
|0.1|1.6 * 2<sup>-4</sup>|0011 1101 1100 1100 1100 1100 1100 1101|0x3DCCCCCD|
118115

119-
- Zero is a special case that uses the formula for the minimum possible representable positive value, which is all zeroes.
116+
- Zero is a special case. It uses the formula for the minimum possible representable positive value, which is all zeroes.
120117

121118
|Value|Formula|Binary representation|Hexadecimal|
122119
|-|-|-|-|
