It is a pity that standards organisations don't make standards documents freely available. They charge in order to recover their costs, but the majority of the costs of the standardisation process are borne by the organisations which are represented on the various committees. It is unfortunate that the administration charges stifle the free flow of information.
The standard floating point formats include the following classes of "numbers":

   normalised numbers,
   denormalised (subnormal) numbers,
   signed zeros (+0 and -0),
   infinities (+Inf and -Inf),
   NaNs ("Not a Number", in quiet and signalling varieties).
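Most of these classes can be produced directly from ordinary "C". The following is a minimal sketch (the exact text printed for the special values varies between C libraries):

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        volatile double zero = 0.0;        /* volatile prevents constant folding */
        double normal = 1.5;               /* an ordinary normalised number */
        double denorm = DBL_MIN / 256.0;   /* below DBL_MIN: denormalised */
        double inf    = 1.0 / zero;        /* +Inf */
        double nan_   = zero / zero;       /* a quiet NaN */

        printf("normal = %g\n", normal);
        printf("denorm = %g\n", denorm);
        printf("-zero  = %g\n", -zero);    /* the signed zero prints as -0 */
        printf("inf    = %g\n", inf);
        printf("nan    = %g\n", nan_);
        return 0;
    }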
The IEEE standards require that the default response to a serious floating point fault during a floating point operation be to return a special value and set a flag. (I don't have a copy of the standards, but Warren Focke < warren@xtepca.gsfc.nasa.gov > quoted the relevant clauses of ANSI/IEEE Std 754-1985 for me.) For example, the "C" statement
    a = 1.0 / 0.0;

will result in the variable "a" being set to the value infinity, and a testable flag will be set.
This behaviour has advantages in some types of computation.
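With a compiler and C library supporting the <fenv.h> interface (standardised in the C99 revision), the flag can be tested directly. The following is a minimal sketch (on Linux, link with -lm):

    #include <stdio.h>
    #include <fenv.h>

    int main(void)
    {
        volatile double zero = 0.0;   /* volatile prevents constant folding */
        double a;

        feclearexcept(FE_ALL_EXCEPT);
        a = 1.0 / zero;               /* default response: a = +Inf, flag set */

        if (fetestexcept(FE_DIVBYZERO))
            printf("division by zero was flagged, a = %g\n", a);
        return 0;
    }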
It is often claimed that the IEEE standards require that the result of a floating point operation on two quantities of a given precision must be correctly rounded to that same precision. On the other hand, it has also been claimed that this is not the case.
The 'C' standards are currently being revised. At least some versions of the proposals from the NCEG (the Numerical C Extensions Group, which worked on the numerical aspects of the revision) explicitly allow for floating point results which are not rounded to the same precision as their operands.
The Intel 8x86 architecture supports 53 bit precision ('C' doubles) and 64 bit precision ('C' long doubles) floating point numbers. Unlike most other architectures, it does not encode the precision of the result in the arithmetic instruction; the precision can only be changed by loading the FPU control register. Practically, this will usually mean that several instructions are needed each time the precision is changed, in a sequence like:
   copy the FPU control register contents to X,
   save X to allow the register to be later restored,
   modify the precision control bits in X,
   load the modified X into the FPU control register.

This is obviously expensive if it is to be done often. Consequently, it is usual to adopt a strategy whereby the precision control bits are changed only infrequently. Under Intel Linux, the FPU is set up with the precision control bits selecting 64 bit precision. This is needed to support long double variables.
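On Intel Linux, the sequence above can be written using the _FPU_GETCW and _FPU_SETCW macros from glibc's <fpu_control.h>. The following is a sketch (the function names are mine; the header is Linux-specific):

    #include <fpu_control.h>

    /* Switch the FPU to 53 bit (double) rounding precision, returning
       the old control word so that the caller can restore it later.
       _FPU_EXTENDED happens to be the mask of the whole precision
       control field, so clearing it zeroes both precision bits. */
    static fpu_control_t set_double_precision(void)
    {
        fpu_control_t old_cw, new_cw;

        _FPU_GETCW(old_cw);              /* copy the control register to old_cw */
        new_cw = (old_cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;  /* modify the bits */
        _FPU_SETCW(new_cw);              /* load the modified word into the FPU */
        return old_cw;                   /* saved so the register can be restored */
    }

    static void restore_precision(fpu_control_t saved_cw)
    {
        _FPU_SETCW(saved_cw);
    }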
Other architectures allow the precision of floating point results to be set relatively cheaply. Consequently, these architectures usually support IEEE style rounding by default. As a result, it is sometimes found that numeric results obtained on Intel Linux are different to those obtained on other architectures. It also sometimes happens that numeric problems which converge on other architectures fail to do so on Intel Linux (although the converse is also possible).
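One such difference can be seen in the following sketch. It relies on the intermediate product staying in an FPU register, so the outcome depends on the compiler and optimisation level; the x87 registers carry a wider exponent range as well as more precision bits, so the intermediate 1e309 need not overflow there:

    #include <stdio.h>

    int main(void)
    {
        volatile double x = 1e308;     /* volatile prevents constant folding */
        double y = x * 10.0 / 10.0;    /* the intermediate overflows a double */

        /* Typically prints "inf" on hardware which rounds every operation
           to double precision and range, but 1e308 if the intermediate
           stays in an x87 register with its extended exponent range. */
        printf("y = %g\n", y);
        return 0;
    }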
One way in which some people try to overcome this problem is to use the -ffloat-store option of gcc. However, this is not a complete solution because it results in two-stage rounding and, as pointed out here, the results are sometimes not the same as those obtained by a single stage of rounding.
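The following sketch shows such a case (on Linux, link with -lm). The exact sum computed is 1 + 2^-53 + 2^-64. Rounded in a single stage to double precision it gives 1 + 2^-52, because the sum lies above the halfway point between 1 and 1 + 2^-52. Rounded first to 64 bit precision it hits a round-to-even tie and becomes 1 + 2^-53, which is itself a tie when subsequently rounded to double, so the two-stage result is exactly 1:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        volatile double x = 1.0;
        /* y = 2^-53 + 2^-64, exactly representable as a double */
        volatile double y = ldexp(1.0, -53) + ldexp(1.0, -64);
        volatile double z = x + y;    /* exact sum: 1 + 2^-53 + 2^-64 */

        /* Single stage rounding gives z - 1 = 2^-52 (about 2.22e-16);
           two-stage (64 bit, then 53 bit) rounding gives z - 1 = 0. */
        printf("z - 1 = %g\n", z - 1.0);
        return 0;
    }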
A better way to solve the problem is to change the setting of the FPU precision control bits at the start of the program, e.g. with the fesetprecision() function of my wmexcep package. This will work if long double precision is never needed by the program (this includes any library functions called explicitly or implicitly by the program). Note that the registers keep their extended exponent range even at 53 bit precision, so results very close to overflow or underflow can still differ from those on other architectures.
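Without wmexcep, the same one-off change can be made directly with glibc's <fpu_control.h>, as in the earlier sketch; presumably fesetprecision() does something equivalent internally:

    #include <fpu_control.h>

    int main(void)
    {
        fpu_control_t cw;

        /* Once, at startup: switch from the Linux default of 64 bit
           rounding precision to 53 bit (double) precision. */
        _FPU_GETCW(cw);
        cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;
        _FPU_SETCW(cw);

        /* ... the rest of the program now rounds results to double,
           provided nothing below resets the control word ... */
        return 0;
    }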
For an example of the effects of rounding precision, see my test program.