An Experiment with SSE

Updated: 6 February 2012.

Techie people, the good stuff—code, results, more info—is below the fold.


Friends, I’ve been in my head a bit recently. That’s not necessarily bad—it’s a nice neighborhood, really—but one of the dark alleys I’ve had to walk past a lot lately involves way too many imposter-type feelings. As I previously mentioned, we’re hiring, and we’re looking for someone to do many of the same job tasks that I do. Being mostly self-taught at software engineering and computer science, I have to remind myself that I have more than a dozen years of experience; whereas, for most of the people whose résumés cross my desk, they do not.

It’s sometimes hard to silence those voices that say “everyone else is more accomplished than you.” Even though it’s not true, I’ve been meaning to pick up some more skills so that I can (a) try to feel like less of an imposter and (b) write more awesome code to make our product more awesome and help us fend off our competitors and make more loot.

So yesterday afternoon, rather than coming up with (yet another) daunting list of all of the places where I feel like I should learn more, I just picked one that I already knew about: Streaming SIMD Extensions, a.k.a. SSE. For me, the best way to learn is to write a program that does something (theoretically) useful, run into real-life obstacles, and workaround the pitfalls I encountered. Practice, practice, practice.


My dear readers who don’t program, you can now safely navigate away for another day without guilt. The rest of the post is rather technical and describes how I took an example I found online, made it work with gcc, and got a 4x speedup by using SSE. A speedup and happier outlook. Not bad for a few hours on a Saturday!


The first real examples I found online were this post from The Supercomputing Blog and this one from LiraNuna. Together they got me started.

The program computes sqrt(x)/x many times and compares the results and performance. On my 2006-era CoreDuo MacBook Pro, I got these results, when evaluating this equation over the range 0 to 64,000 and computing all of the values 10,000 times:

Without SSE: 26.404 seconds
With SSE, using division: 6.705 seconds (3.9x faster)
With SSE, using reciprocals: 4.204 seconds (6.3x faster)

The example code below uses the following functions and datatypes that were new to me:

_aligned_malloc
posix_memalign

__m128

_mm_add_ps
_mm_div_ps
_mm_mul_ps
_mm_rcp_ps
_mm_set_ps
_mm_set1_ps
_mm_sqrt_ps

I compiled, ran, and timed the code using these commands:

g++ -O2 -msse sqrtxdivx.cpp -o sqrtdivx_ssediv
g++ -O2 -msse sqrtxdivx.cpp -o sqrtdivx_sserecip
g++ -O2 -msse sqrtxdivx.cpp -o sqrtdivx_ssenone
time ./sqrtdivx_ssediv ; time ./sqrtdivx_sserecip ; time ./sqrtdivx_ssenone

For more info on SSE:

Without more delay, the code!

#include <stdio.h>      // printf() #include <xmmintrin.h>  // SSE compiler intrinsics #include <stdlib.h>     // posix_memalign() #include <math.h>       // Non-SSE sqrt() //#define USE_SSE  // Define this if you want to run with SSE. //#define USE_DIV   // Define this if you want to use SSE division (slower, accurate). // We will be calculating Y = sqrt(x) / x, for x = 1 to 64000 int main(int argc, char* argv[]) {   // Compute sqrt(x)/x for the first 64,000 nonnegative integers.   const int length = 64000;     // float *pResult = (float*) _aligned_malloc(length * sizeof(float), 16);  /* MSVC */   float *pResult;   posix_memalign(reinterpret_cast<void **>(&pResult), 16, length * sizeof(float));  /* gcc */   printf("Starting calculation...\n");   for (int stress = 0; stress < 10000; ++stress)   {     #ifdef USE_SSE       __m128 *pResultSSE = reinterpret_cast<__m128*>(pResult);  // Intermediate pointer       __m128 x = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);  // Initialize x to <1,2,3,4>.       __m128 xDelta = _mm_set1_ps(4.0f);  // Set the xDelta to <4,4,4,4>.       const int sseLength = length / 4;       for (int i = 0; i < sseLength; ++i)       {         __m128 xSqrt = _mm_sqrt_ps(x);         #ifdef USE_DIV           // Use slower, more accurate SSE division.           pResultSSE[i] = _mm_div_ps(xSqrt, x);         #endif         #ifndef USE_DIV           // Use faster, less accurate reciprocal and multiply.           __m128 xRecip = _mm_rcp_ps(x);           pResultSSE[i] = _mm_mul_ps(xRecip, xSqrt);         #endif         x = _mm_add_ps(x, xDelta);  // Advance x to the next set of numbers.       }     #else  // USE_SSE       float xFloat = 1.0f;       for (int i = 0 ; i < length; ++i)       {         pResult[i] = sqrt(xFloat) / xFloat;         xFloat += 1.0f;       }             #endif // USE_SSE   }   // To prove that the program actually worked, look at the first 20.   for (int i = 0; i < 20; ++i)   {     printf("Result[%d] = %f\n", i, pResult[i]);   } }

The raw output. Notice how the first (SSE with division) and the third (no SSE instructions) gave the same answer, while the second set of results (using SSE and a reciprocal calculation) gave decidedly less accurate computations.

Starting calculation...
Result[0] = 1.000000
Result[1] = 0.707107
Result[2] = 0.577350
Result[3] = 0.500000
Result[4] = 0.447214
Result[5] = 0.408248
Result[6] = 0.377964
Result[7] = 0.353553
Result[8] = 0.333333
Result[9] = 0.316228
Result[10] = 0.301511
Result[11] = 0.288675
Result[12] = 0.277350
Result[13] = 0.267261
Result[14] = 0.258199
Result[15] = 0.250000
Result[16] = 0.242536
Result[17] = 0.235702
Result[18] = 0.229416
Result[19] = 0.223607

real	0m6.705s
user	0m6.678s
sys	0m0.008s

Starting calculation...
Result[0] = 0.999756
Result[1] = 0.706934
Result[2] = 0.577209
Result[3] = 0.499878
Result[4] = 0.447104
Result[5] = 0.408149
Result[6] = 0.377872
Result[7] = 0.353467
Result[8] = 0.333252
Result[9] = 0.316151
Result[10] = 0.301470
Result[11] = 0.288605
Result[12] = 0.277282
Result[13] = 0.267196
Result[14] = 0.258136
Result[15] = 0.249939
Result[16] = 0.242469
Result[17] = 0.235645
Result[18] = 0.229365
Result[19] = 0.223552

real	0m4.204s
user	0m4.196s
sys	0m0.007s

Starting calculation...
Result[0] = 1.000000
Result[1] = 0.707107
Result[2] = 0.577350
Result[3] = 0.500000
Result[4] = 0.447214
Result[5] = 0.408248
Result[6] = 0.377964
Result[7] = 0.353553
Result[8] = 0.333333
Result[9] = 0.316228
Result[10] = 0.301511
Result[11] = 0.288675
Result[12] = 0.277350
Result[13] = 0.267261
Result[14] = 0.258199
Result[15] = 0.250000
Result[16] = 0.242536
Result[17] = 0.235702
Result[18] = 0.229416
Result[19] = 0.223607

real	0m26.404s
user	0m26.360s
sys	0m0.033s

Update — 6 February 2012: On Sunday I wanted to see what would happen when I changed the code to work on 64-bit, double-precision floating-point numbers. While I saw the expected 4x speedup when operating on four 32-bit floats at once, I did not see the same 2x improvement with doubles. The SSE and no-SSE versions each took approximately 26 seconds. Why?

It’s either the processor or an OS issue. When I applied the same changes to the code at work, I saw not only a 2x speed up for doubles, I also got an almost 8x speed up for single-precision. That was surprising. When I looked at the differences between the Linux assembly of the “float” and “double” programs, the only things different were the obvious: minor array size and offset differences and the presence of the “_pd” versions of the routines instead of the “_ps” ones. This was similar to the differences I remember seeing when I compared assemblies built on my MacBook Pro.

The major differences between the two environments:

  • 32-bit MacOS 10.6 versus 64-bit Debian 6 Linux (running under VMware)
  • gcc 4.2.something versus gcc 4.4.5
  • An older Intel CoreDuo processor (ca. 2006) versus an 8-core 3.07 GHz Intel Xeon (late 2009)

Clearly streaming SIMD instructions make a difference for some types, but the effect is limited (or helped) by the processor that executes those instructions. (Of course, I still have to check that GCC or the OS aren’t part of the difference.

This entry was posted in Computing, Fodder for Techno-weenies, General. Bookmark the permalink.

One Response to An Experiment with SSE

  1. LiraNuna says:

    The SSE and no-SSE versions each took approximately 26 seconds. Why?

    There could be many reasons involved:
    * Register pressure (It takes x2 SSE registers for a 4d vector instead of one)
    * Single precession floats are more optimized than doubles
    * Your code may not use __m128d well (You did not post code, so I cannot judge)

    By the way, you can use the unified _mm_malloc to allocate aligned memory, it works on all compilers that support SSE intrinsics.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>