Chapter 4. First code example using gcc vector support

The GNU Compiler Collection, gcc, offers multiple ways to perform SIMD calculations. There has always been the option of hardcoding assembler instructions within your source, of course. Furthermore, gcc offers so-called 'builtin' functions which translate directly into assembler instructions, but which provide enough 'glue' to make coding easier. These are described in the 'X86 Built-in Functions' and 'PowerPC AltiVec Built-in Functions' chapters of the gcc manual.

Lastly, gcc has recently gained native support for some SIMD operations, whereby the coder requests a vector of specified dimension and content, and then performs operations on that vector. Depending on compiler flags, these operations translate into either SIMD instructions or regular opcodes. This is described in the 'Vector Extensions' chapter of the gcc manual.

We'll start out with this last variant as it is easiest on the eyes, and portable too:

#include <stdio.h>
typedef float v4sf __attribute__ ((mode(V4SF))); // vector of four single floats

union f4vector 
{
  v4sf v;
  float f[4];
};
      

This in itself does nothing; it only defines a union which is suitable for SIMD operations. The typedef creates a more legible name for a vector of four single precision floats; the union enables us to access the individual elements of the vector. Behind the scenes, the cryptic 'mode' attribute also takes care of alignment, about which more later.

The next bit actually does a calculation:

int main(void)
{
  union f4vector a, b, c;

  a.f[0] = 1; a.f[1] = 2; a.f[2] = 3; a.f[3] = 4;
  b.f[0] = 5; b.f[1] = 6; b.f[2] = 7; b.f[3] = 8;

  c.v = a.v + b.v;

  printf("%f, %f, %f, %f\n", c.f[0], c.f[1], c.f[2], c.f[3]);
  return 0;
}

This can be compiled 'as is' by any recent gcc version (3.3 works, 3.4 does too):

	$ gcc -ggdb -c example1.c 
	$ gcc example1.o -o example1
      

When run, it delivers the expected output:

$ ./example1
6.000000, 8.000000, 10.000000, 12.000000

However, we did not tell gcc about our processor, so it will probably have assumed the most basic variant available on your platform (an 80386, or a G3, for example). To verify, run:

$ objdump -dS ./example1.o  | grep -22 c.v | tail -25
  c.v = a.v + b.v;
  8b:	d9 45 e8             	flds   0xffffffe8(%ebp)
  8e:	d8 45 d8             	fadds  0xffffffd8(%ebp)
  91:	d9 5d b8             	fstps  0xffffffb8(%ebp)
  94:	d9 45 ec             	flds   0xffffffec(%ebp)
  97:	d8 45 dc             	fadds  0xffffffdc(%ebp)
  9a:	d9 5d bc             	fstps  0xffffffbc(%ebp)
  9d:	d9 45 f0             	flds   0xfffffff0(%ebp)
  a0:	d8 45 e0             	fadds  0xffffffe0(%ebp)
  a3:	d9 5d c0             	fstps  0xffffffc0(%ebp)
  a6:	d9 45 f4             	flds   0xfffffff4(%ebp)
  a9:	d8 45 e4             	fadds  0xffffffe4(%ebp)
  ac:	d9 5d c4             	fstps  0xffffffc4(%ebp)
  af:	8b 45 b8             	mov    0xffffffb8(%ebp),%eax
  b2:	89 45 c8             	mov    %eax,0xffffffc8(%ebp)
  b5:	8b 45 bc             	mov    0xffffffbc(%ebp),%eax
  b8:	89 45 cc             	mov    %eax,0xffffffcc(%ebp)
  bb:	8b 45 c0             	mov    0xffffffc0(%ebp),%eax
  be:	89 45 d0             	mov    %eax,0xffffffd0(%ebp)
  c1:	8b 45 c4             	mov    0xffffffc4(%ebp),%eax
  c4:	89 45 d4             	mov    %eax,0xffffffd4(%ebp)

  printf("%f, %f, %f, %f\n", c.f[0], c.f[1], c.f[2], c.f[3]);
    

We see a lot of repetitive instructions, indicating that gcc has open-coded the four additions for us with scalar x87 instructions. Now let's recompile, informing gcc of our CPU, and take another look. Note that this example is Intel specific; substitute your proper CPU name. Results will look different on a G3, but are similar in nature.

$ gcc  -ggdb -march=pentium3 -mcpu=pentium3    -c -o example1.o example1.c
$ gcc  -lm example1.o -o example1
$ objdump -dS ./example1.o  | grep -4 c.v | tail -5
  c.v = a.v + b.v;
  8b:	0f 28 45 e8          	movaps 0xffffffe8(%ebp),%xmm0
  8f:	0f 58 45 d8          	addps  0xffffffd8(%ebp),%xmm0
  93:	0f 29 45 c8          	movaps %xmm0,0xffffffc8(%ebp)

  printf("%f, %f, %f, %f\n", c.f[0], c.f[1], c.f[2], c.f[3]);

Here we see our first SSE instructions:

movaps 0xffffffe8(%ebp),%xmm0

'MOVe four Aligned Packed Single precision'. Copies four single precision floats from a memory location to the register XMM0. This memory location is 'a'.

addps 0xffffffd8(%ebp),%xmm0

'ADD four Packed Single precision'. Adds the contents of the four floats at the specified memory location to the SSE register XMM0. This memory location is 'b'.

movaps %xmm0,0xffffffc8(%ebp)

'MOVe four Aligned Packed Single precision'. Copies four single precision floats from the register XMM0 to an aligned memory location. This location is 'c' in our program.

It is probably a good idea to play around a bit with this program, which is called example1.c on disk.

Suggested changes are inducing divisions by zero (which, for floats, silently yield 'inf' rather than an error) and performing timings. Very simple benchmarking can be done by adding for(n=0; n < 1000000000; ++n) before our calculation. For reliable results, do not turn on optimization, as gcc may discover that the calculation is not actually changing, and perform it only once.

Of special note are the speed differences between multiplication and division:

$ time ./example1 
5.000000, 12.000000, 21.000000, 32.000000

real	0m0.562s
user	0m0.542s
sys	0m0.001s

$ emacs example1.c ; make ; time ./example1 
0.200000, 0.333333, 0.428571, 0.500000

real	0m2.634s
user	0m2.611s
sys	0m0.002s

When studying the assembler output, the sole difference turns out to be mulps versus divps, multiplication being a lot faster than division.