Schmorp's POD Blog a.k.a. THE RANT
a.k.a. the blog that cannot decide on a name

This document was first published 2015-11-21 04:08:38, and last modified 2015-11-27 04:06:59.

Questing for IEEE 754 BINARY16 / Half Precision Float Conversion

From time to time, half floats come up in the most unexpected places (i.e., not in graphics programming). For me, such an unexpected encounter was in CBOR::XS, where I had to support it for decoding.

Towards this end, I added an implementation to libecb, called ecb_binary16_to_float, which was portable, but didn't handle all details (such as NaNs). A corresponding ecb_float_to_binary16 wasn't needed at the time, and since it was nontrivial to provide, I skipped it.

Today I set out to change this, and provide adequate conversion functions.

The basic plan was trivial - provide binary16 to binary32 conversion and vice versa, and rely on the existing ecb_float_to_binary32 and ecb_binary32_to_float conversions for the, well, float conversions. This way, the actual binary16 conversions can be complete, and any platform issues will be handled by the existing ecb functions for float comversion.

This reduced the problem only slightly, as the hard work was in the binary16/binary32 conversion, so I did what all lazy coders do: realise that we are standing on the shoulders of giants and tap into the vast reservoir of human culture (and shared source code).

Or so I thought when I started a survey of existing conversion libraries, to see if I can reuse an existing (and hopefully well-tested) one.

Existing Solutions

The following is basically the conversion functions I found. And to warn you in advance: It didn't look good.

Fast Half Float Conversions

This paper by Jeroen van der Zijp describes a fast algorithm using look-up tables. They are "relatively small" (about 10Kb). This would be inconvenient in libecb, as libecb doesn't have an init function and 10kB of dirty RSS would still be way too much.

Fortunately (I might otherwise be tempted to use it) it also suffers from a few flaws, such as sometimes converting a NaN to infinity. Overall, it's not perfect, but it has some attention to details (such as subnormals and NaNs) but it's not a basis for libecb's implementation.


This is a "IEEE 754-based half-precision floating point library" (also, C++), and it comes with all sorts of bells and whistles, such as support for non-IEEE platforms (a must, really), overloads on steroids, erf, fma, everything is there.

With one exception that is so unbelievable that I am still not sure whether I am just too dumb to see it: You can't get IEEE 754 half-precision floating point numbers out of it. That's right, the thing converts between almost anything and it's abstracted half type, but doesn't allow you to access the actualy binary16 representation.

Apparently, the author expects you to rely on nonportable behaviour and use memcpy to copy out the 16 bits (if it breaks, it's your fault for using an "unreasonable" implementation). Why make the data member inaccessible when you are supposed to use memcpy? And what about that weaseling about keeping intermediate values in single-float? Is it safe to use memcpy when the library does so? We can only keep our fingers crossed.

This library is a masterful example of C++ data hiding and abstraction (gone wrong) - the type feels like a binary16, acts like a binary16, rounds like a binary16, but isn't a binary16.

Also, less importantly, but just as weirdly, it defaults to biased rounding, and the only way to override it is to define a macro before including it, changing behaviour globally. Weird, just weird for such an overdesigned template library.

I actually wanted to use it to test my own implementation, but short of patching it I found no way of accessing it, and it suffers from the same problems as the Jeroen van der Zijp paper, so in the end, I couldn't.

OpenEXR's ilmBase

Almost the original home of the binary16 datatype, the OpenEXR library of course comes with conversion functions (in Half/half.h, Half/half.cpp and Half/toFloat.cpp).

They are extremely well documented, and at first I thought, this is it, but not only does it use a humungous table (256kB), it's also extremely, and I mean extremely, buggy - it thinks of everything (subnormals, NaN truncation, rounding, even simulating fp overflow exceptions when appropriate), but gets the details wrong.

Still, it was useful to get a few things right, mostly because it is so well documented.

Phernost's Stackoverflow code

This impressive-looking code in an answer by Phernost on Stackoverflow is impressive looking code, and impressive-sounding as well - it only requires GCC (for -true == ~0 - what I think he really means is two's complement, as the true values he worries about is the value 1, not any true value), and it handles subnormals, NaNs, infinity, in other words, full IEEE support.

Except, when you look at the code more closely you see that it doesn't actually handle subnormals correctly, nor NaNs. It's also more or less directly translated from SSE code, and is very wasteful to avoid branches (something that is required for SIMD, but not for single conversions).

So, things certainly look a bit bleak, and the outcome I feared most had materialised: I had to understand it, and I had to code it myself, with all the extra overhead that involves. And probably will get it wrong, too.

The libecb Implementation

Ok, let's reiterate the cases that need to be handled (wikipedia has useful, but not great, references for binary32 and binary16).

Normalised Values

Normalised values ought to be the most common, followed by zero and subnormals. Normalised are trivial to comvert from binary16 to binary32, as only a few bit shifts are required.

The other direction is harder, as normalised binary32 values can map to 0, subnormal, normal and infinity binary16 values. Worse, I have to properly round these values, and would have to decide on a rounding mode (or make it configurable). I choose round-to-even, because it is bias-free and arguably the best default mode. Rounding, again, can change a normal number into infinity (infinity being even :).


All the binary32 subnormals thankfully map to 0 in binary16, but the reverse is harder, as I have to renormalise the subnormals to normal binary32 values, which means counting leading zero bits.


Infinity is easy to handle, all you have to take care of is the sign. Of course, converting binary32 to binary16 will overflow most values to infinity as well.


NaNs are relatively easy to handle - merely keep the high order bits (which usually decide whether the NaN is signalling or not) and truncate the low bits. With one exception, a NaN with zero mantisse is not actually a NaN but infinity, so when truncating causes the mantisse to become zero, a non-zero value must be substituted.

The code hopefully handles all these - if you are interested, you can have a look at ecb.h (and search for ecb_binary16_to_binary32 and ecb_binary32_to_binary16).

The Fast Path

Both conversion functions have a fast path, namely normalised binary32 numbers that map to normalised binary16 numbers. Let's have a look at these, and how GCC handles them.


The fast path is this one:

ecb_binary16_to_binary32 (uint16_t x)
  unsigned int s = (x & 0x8000) << (31 - 15);
  int          e = (x >> 10) & 0x001f;
  unsigned int m =  x        & 0x03ff;

  if (ecb_expect_false (e == 31))
  else if (ecb_expect_false (!e))

  /* e and m now are normalised, or zero, (or inf or nan) */
  e += 127 - 15;

  return s | (e << 23) | (m << (23 - 10));

So, two tests and a lot of bit-fiddling. GCC thinks this is an optimised translation (and amd64, and the function is usually inlined):

movl    %edi, %edx
movl    %edi, %r9d
andl    $1023, %edi
shrl    $10, %edx
andl    $32768, %r9d
andl    $31, %edx
sall    $16, %r9d
cmpl    $31, %edx
je      .L8
testl   %edx, %edx
je      .L4
addl    $112, %edx
movl    %edi, %eax
sall    $23, %edx
sall    $13, %eax
orl     %edx, %eax
orl     %r9d, %eax

The branches are not taken and statically predicted to not be taken, so are essentially free. The rest of the code looks quite good, too, probably better than I could do it myself.


Here the fast path is even more of a fast path, as it is an if block that handles everything itself and doesn't fall back to the main function code:

ecb_function_ ecb_const uint16_t
ecb_binary32_to_binary16 (uint32_t x)
  unsigned int s =  (x >> 16) & 0x00008000; /* sign bit, the easy part */
  unsigned int e = ((x >> 23) & 0x000000ff) - (127 - 15); /* the desired exponent */
  unsigned int m =   x        & 0x007fffff;

  x &= 0x7fffffff;

  /* if it's within range of binary16 normals, use fast path */
  if (ecb_expect_true (0x38800000 <= x && x <= 0x477fefff))
      /* mantissa round-to-even */
      m += 0x00000fff + ((m >> (23 - 10)) & 1);

      /* handle overflow */
      if (ecb_expect_false (m >= 0x00800000))
          m >>= 1;
          e +=  1;

      return s | (e << 10) | (m >> (23 - 10));


The overflow due to rounding looks painful, and it's a lot of bit fiddling as well. GCC finds this rather nice translation:

movl    %edi, %edx
movl    %edi, %esi
movl    %edi, %r8d
andl    $2147483647, %edi
shrl    $16, %edx
shrl    $23, %esi
leal    -947912704(%rdi), %ecx
andl    $32768, %edx
movzbl  %sil, %esi
andl    $8388607, %r8d
leal    -112(%rsi), %eax
cmpl    $251654143, %ecx
ja      .L2
movl    %r8d, %ecx
shrl    $13, %ecx
andl    $1, %ecx
leal    4095(%r8,%rcx), %ecx
cmpl    $8388607, %ecx
ja      .L10
sall    $10, %eax
shrl    $13, %ecx
orl     %eax, %ecx
orl     %ecx, %edx
movzwl  %dx, %eax

While a bit longer, it's quite succinct still. And while it is not as fast as a table lookup when the table is in your L1 cache, getting the table into your L1 cache is a painfully slow process as well, so this code might even win out over faster code in important situations.


Right. Fears. The main reason not to write your own conversion routines: no matter how good you are, no matter if it is only a couple dozen lines, if you start from scratch, chances are, there will be bugs, debugging, time wasted.

All I can do is hope that the conversion functions mostly work - I already did a few exhaustive loops to check whether all floating point values are convertible and round-trip properly, and I compared the output with other implementations, explaining the differences.

And after all that testing and verification, while writing this article, I found a bug :)

Anyway, I hope that the conversion routines will come in useful to somebody other than me.