Chapter 5. Floating point SIMD using compiler built-ins

Table of Contents

SSE built-ins
SSE2 built-ins
SSE-3 built-ins
3DNow! built-ins

In contrast to the vector extensions described earlier, built-ins are specific to SIMD operations. They will not gracefully fall back to regular floating point instructions, nor will they work on a CPU other than the one for which they were intended.
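Because nothing falls back automatically, code that must also build for other targets has to guard the built-ins itself. The following is a minimal sketch, assuming gcc's predefined __SSE__ macro (set when SSE code generation is enabled). The helper name max4 and the scalar fallback are illustrative only, and the union is assumed to look like the f4vector used in the examples of this chapter.

```c
typedef float v4sf __attribute__ ((vector_size (16)));  /* four packed floats */

union f4vector {
  v4sf v;
  float f[4];
};

/* max4 is a hypothetical helper: element-wise maximum of a and b */
static union f4vector max4(union f4vector a, union f4vector b)
{
  union f4vector r;
#if defined(__SSE__)
  /* SSE enabled at compile time: one maxps handles all four slots */
  r.v = __builtin_ia32_maxps(a.v, b.v);
#else
  /* no SSE: the built-in does not exist here, so spell out the loop */
  int i;
  for (i = 0; i < 4; ++i)
    r.f[i] = a.f[i] > b.f[i] ? a.f[i] : b.f[i];
#endif
  return r;
}
```

Note that the guard only selects code at compile time. A binary built with SSE enabled will still fail on a CPU that lacks it.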

As mentioned above, these built-ins are currently described in the "X86 Built-in Functions" and "PowerPC AltiVec Built-in Functions" chapters of the gcc manual.

SSE built-ins

SSE is available on any Intel processor since the Pentium III, and on AMD processors from the Athlon XP onward. To reiterate, SSE mostly deals with vectors of four single precision numbers each. Additionally, bitwise logic operations are supported on 128 bits at once.

To illustrate, this is part of the main() of example2.c:

int main()
{
  union f4vector a, b, c;

  a.f[0] = 1; a.f[1] =  2; a.f[2] = 3;  a.f[3] = 4;
  b.f[0] = 5; b.f[1] =  6; b.f[2] = 7;  b.f[3] = 8;
  c.f[0] = 9; c.f[1] = 10; c.f[2] = 11; c.f[3] = 12;

  v4sf tmp = __builtin_ia32_mulps (a.v, b.v);   // a * b 
  v4sf e =   __builtin_ia32_addps(tmp, c.v);    // e = (a * b) + c

  printf4vector((union f4vector*)&e);

  /* ... */
}

This code could just as well have been written with the vector extensions as e = a.v * b.v + c.v. However, not all instructions available to us have an equivalent that can be expressed using regular operators.
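For comparison, the operator form might look as follows. This is a sketch that assumes v4sf and union f4vector are declared as shown here (the real declarations live in example2.c); muladd4 is a hypothetical name.

```c
typedef float v4sf __attribute__ ((vector_size (16)));  /* four packed floats */

union f4vector {
  v4sf v;
  float f[4];
};

/* muladd4 is a hypothetical helper: e = (a * b) + c, slot by slot.
   gcc picks suitable instructions for the target by itself. */
static v4sf muladd4(v4sf a, v4sf b, v4sf c)
{
  return a * b + c;
}
```

Called as e = muladd4(a.v, b.v, c.v) with the values from main() above, the four slots come out as 14, 22, 32 and 44.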

For example, to calculate the square root of the four floats comprising c, we could execute:

  e = __builtin_ia32_sqrtps(c.v);
  printf4vector((union f4vector*)&e);

Or to determine the element-wise maximum of a and b:

  e = __builtin_ia32_maxps(a.v, b.v);  // calculate the maximum for each slot in our vector
  printf4vector((union f4vector*)&e);
	

As mentioned before, multiplication is faster than division, so it often pays to calculate the reciprocal of a common factor once and multiply by it afterwards. For some purposes full precision is not needed and we can get away with a (very) good approximation, accurate to about 12 bits of mantissa instead of 24.

To measure the difference in speed, we run the following:

  double now=gettime();

  c.f[0] = c.f[1] = c.f[2] = c.f[3] = 1;
  for(n=0;n < 100000000;++n) 
    e = __builtin_ia32_divps(c.v, b.v); // e = 1/b
  printf("manual way took %f seconds for %d iterations, result is ", gettime()-now, n);
  printf4vector((union f4vector*)&e);

  now=gettime();

  for(n=0;n<100000000;++n) 
    e = __builtin_ia32_rcpps(b.v); // e = 1/b
  printf("approximate way took %f seconds for %d iterations, result is ", gettime()-now, n);
  printf4vector((union f4vector*)&e);

	

On a Pentium M laptop, this outputs:

manual way took 2.632423 seconds for 100000000 iterations, result is 0.200000, 0.166667, 0.142857, 0.125000
approximate way took 0.582726 seconds for 100000000 iterations, result is 0.199951, 0.166626, 0.142822, 0.124969

A rare speed increase of roughly 4.5 times. Note that the speedup disappears once optimization is turned on, but reappears when doing profiled optimization, about which more later.
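The price of the speedup is visible in the last digits of the output. To put a number on it, one could compare each approximate slot against the exact reciprocal. The sketch below does this; rcp_max_rel_error is a hypothetical helper, the union is assumed to match f4vector from example2.c, and the input values are assumed to be positive. Without SSE it simply falls back to an exact division.

```c
typedef float v4sf __attribute__ ((vector_size (16)));  /* four packed floats */

union f4vector {
  v4sf v;
  float f[4];
};

/* rcp_max_rel_error is a hypothetical helper: the largest relative error
   of the approximate reciprocal over the four (positive) slots of b */
static double rcp_max_rel_error(union f4vector b)
{
  union f4vector approx;
  double worst = 0.0;
  int i;

#if defined(__SSE__)
  approx.v = __builtin_ia32_rcpps(b.v);   /* approximate 1/b */
#else
  for (i = 0; i < 4; ++i)                 /* no SSE: exact division */
    approx.f[i] = 1.0f / b.f[i];
#endif

  for (i = 0; i < 4; ++i) {
    double exact = 1.0 / b.f[i];
    double err = (approx.f[i] - exact) / exact;
    if (err < 0)
      err = -err;
    if (err > worst)
      worst = err;
  }
  return worst;
}
```

Intel documents the relative error of rcpps as at most 1.5 * 2^-12, which matches the results printed above: each differs from the exact value by about 0.00025 of that value.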

For further samples, see example2.c and experiment.

SSE2 built-ins

SSE2 is available on all Pentium 4, Athlon 64 and Opteron processors. These gcc built-ins are actually undocumented! A patch will be submitted soon. SSE2 adds double precision arithmetic, although only on two numbers at a time instead of four. It also features 64 bit integer support, again with two numbers per vector. Furthermore, 32 bit integers can be multiplied to a 64 bit result.
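As an illustration of the widening multiply, the following sketch uses __builtin_ia32_pmuludq128, which multiplies slots 0 and 2 of its two arguments as unsigned 32 bit integers and yields two 64 bit products. The helper name widemul2, the two unions and the scalar fallback are assumptions made for this example and are not taken from example3.c.

```c
typedef int v4si __attribute__ ((vector_size (16)));        /* four 32 bit ints */
typedef long long v2di __attribute__ ((vector_size (16)));  /* two 64 bit ints */

union i4vector {
  v4si v;
  unsigned int u[4];
};

union l2vector {
  v2di v;
  unsigned long long u[2];
};

/* widemul2 is a hypothetical helper: full 64 bit products of
   slots 0 and 2 of a and b; slots 1 and 3 are ignored */
static union l2vector widemul2(union i4vector a, union i4vector b)
{
  union l2vector r;
#if defined(__SSE2__)
  r.v = __builtin_ia32_pmuludq128(a.v, b.v);
#else
  /* no SSE2: the same result with plain integer arithmetic */
  r.u[0] = (unsigned long long)a.u[0] * b.u[0];
  r.u[1] = (unsigned long long)a.u[2] * b.u[2];
#endif
  return r;
}
```

For example, with 4000000000 in slot 0 of a and 4 in slot 0 of b, the first result is 16000000000, which would not fit in 32 bits.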

Not all SSE instructions were extended to double precision. Notably, the approximate reciprocal used above has no double precision counterpart.

Some examples are in example3.c.

SSE-3 built-ins

FIXME: Awaiting a Prescott-generation Pentium 4.

3DNow! built-ins

FIXME: Awaiting time to spend with my Athlon.