Updated: 6 February 2012.
Techie people, the good stuff—code, results, more info—is below the fold.
Friends, I’ve been in my head a bit recently. That’s not necessarily bad—it’s a nice neighborhood, really—but one of the dark alleys I’ve had to walk past a lot lately involves way too many imposter-type feelings. As I previously mentioned, we’re hiring, and we’re looking for someone to do many of the same job tasks that I do. Being mostly self-taught at software engineering and computer science, I have to remind myself that I have more than a dozen years of experience; whereas, for most of the people whose résumés cross my desk, they do not.
It’s sometimes hard to silence those voices that say “everyone else is more accomplished than you.” Even though it’s not true, I’ve been meaning to pick up some more skills so that I can (a) try to feel like less of an imposter and (b) write more awesome code to make our product more awesome and help us fend off our competitors and make more loot.
So yesterday afternoon, rather than coming up with (yet another) daunting list of all of the places where I feel like I should learn more, I just picked one that I already knew about: Streaming SIMD Extensions, a.k.a. SSE. For me, the best way to learn is to write a program that does something (theoretically) useful, run into real-life obstacles, and workaround the pitfalls I encountered. Practice, practice, practice.
My dear readers who don’t program, you can now safely navigate away for another day without guilt. The rest of the post is rather technical and describes how I took an example I found online, made it work with gcc, and got a 4x speedup by using SSE. A speedup and happier outlook. Not bad for a few hours on a Saturday!
The first real examples I found online were this post from The Supercomputing Blog and this one from LiraNuna. Together they got me started.
The program computes sqrt(x)/x many times and compares the results and performance. On my 2006-era CoreDuo MacBook Pro, I got these results, when evaluating this equation over the range 0 to 64,000 and computing all of the values 10,000 times:
Without SSE: 26.404 seconds With SSE, using division: 6.705 seconds (3.9x faster) With SSE, using reciprocals: 4.204 seconds (6.3x faster)
The example code below uses the following functions and datatypes that were new to me:
_aligned_malloc posix_memalign __m128 _mm_add_ps _mm_div_ps _mm_mul_ps _mm_rcp_ps _mm_set_ps _mm_set1_ps _mm_sqrt_ps
I compiled, ran, and timed the code using these commands:
g++ -O2 -msse sqrtxdivx.cpp -o sqrtdivx_ssediv g++ -O2 -msse sqrtxdivx.cpp -o sqrtdivx_sserecip g++ -O2 -msse sqrtxdivx.cpp -o sqrtdivx_ssenone time ./sqrtdivx_ssediv ; time ./sqrtdivx_sserecip ; time ./sqrtdivx_ssenone
For more info on SSE:
- Art of Assembly Language book — See chapter 11.
- Intel Architecture Software Developer’s Manual (PDF) — See chapter 9.
- Wikipedia’s SSE page
Without more delay, the code!
#include <stdio.h> // printf() #include <xmmintrin.h> // SSE compiler intrinsics #include <stdlib.h> // posix_memalign() #include <math.h> // Non-SSE sqrt() //#define USE_SSE // Define this if you want to run with SSE. //#define USE_DIV // Define this if you want to use SSE division (slower, accurate). // We will be calculating Y = sqrt(x) / x, for x = 1 to 64000 int main(int argc, char* argv[]) { // Compute sqrt(x)/x for the first 64,000 nonnegative integers. const int length = 64000; // float *pResult = (float*) _aligned_malloc(length * sizeof(float), 16); /* MSVC */ float *pResult; posix_memalign(reinterpret_cast<void **>(&pResult), 16, length * sizeof(float)); /* gcc */ printf("Starting calculation...\n"); for (int stress = 0; stress < 10000; ++stress) { #ifdef USE_SSE __m128 *pResultSSE = reinterpret_cast<__m128*>(pResult); // Intermediate pointer __m128 x = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f); // Initialize x to <1,2,3,4>. __m128 xDelta = _mm_set1_ps(4.0f); // Set the xDelta to <4,4,4,4>. const int sseLength = length / 4; for (int i = 0; i < sseLength; ++i) { __m128 xSqrt = _mm_sqrt_ps(x); #ifdef USE_DIV // Use slower, more accurate SSE division. pResultSSE[i] = _mm_div_ps(xSqrt, x); #endif #ifndef USE_DIV // Use faster, less accurate reciprocal and multiply. __m128 xRecip = _mm_rcp_ps(x); pResultSSE[i] = _mm_mul_ps(xRecip, xSqrt); #endif x = _mm_add_ps(x, xDelta); // Advance x to the next set of numbers. } #else // USE_SSE float xFloat = 1.0f; for (int i = 0 ; i < length; ++i) { pResult[i] = sqrt(xFloat) / xFloat; xFloat += 1.0f; } #endif // USE_SSE } // To prove that the program actually worked, look at the first 20. for (int i = 0; i < 20; ++i) { printf("Result[%d] = %f\n", i, pResult[i]); } }
The raw output. Notice how the first (SSE with division) and the third (no SSE instructions) gave the same answer, while the second set of results (using SSE and a reciprocal calculation) gave decidedly less accurate computations.
Starting calculation... Result[0] = 1.000000 Result[1] = 0.707107 Result[2] = 0.577350 Result[3] = 0.500000 Result[4] = 0.447214 Result[5] = 0.408248 Result[6] = 0.377964 Result[7] = 0.353553 Result[8] = 0.333333 Result[9] = 0.316228 Result[10] = 0.301511 Result[11] = 0.288675 Result[12] = 0.277350 Result[13] = 0.267261 Result[14] = 0.258199 Result[15] = 0.250000 Result[16] = 0.242536 Result[17] = 0.235702 Result[18] = 0.229416 Result[19] = 0.223607 real 0m6.705s user 0m6.678s sys 0m0.008s Starting calculation... Result[0] = 0.999756 Result[1] = 0.706934 Result[2] = 0.577209 Result[3] = 0.499878 Result[4] = 0.447104 Result[5] = 0.408149 Result[6] = 0.377872 Result[7] = 0.353467 Result[8] = 0.333252 Result[9] = 0.316151 Result[10] = 0.301470 Result[11] = 0.288605 Result[12] = 0.277282 Result[13] = 0.267196 Result[14] = 0.258136 Result[15] = 0.249939 Result[16] = 0.242469 Result[17] = 0.235645 Result[18] = 0.229365 Result[19] = 0.223552 real 0m4.204s user 0m4.196s sys 0m0.007s Starting calculation... Result[0] = 1.000000 Result[1] = 0.707107 Result[2] = 0.577350 Result[3] = 0.500000 Result[4] = 0.447214 Result[5] = 0.408248 Result[6] = 0.377964 Result[7] = 0.353553 Result[8] = 0.333333 Result[9] = 0.316228 Result[10] = 0.301511 Result[11] = 0.288675 Result[12] = 0.277350 Result[13] = 0.267261 Result[14] = 0.258199 Result[15] = 0.250000 Result[16] = 0.242536 Result[17] = 0.235702 Result[18] = 0.229416 Result[19] = 0.223607 real 0m26.404s user 0m26.360s sys 0m0.033s
Update — 6 February 2012: On Sunday I wanted to see what would happen when I changed the code to work on 64-bit, double-precision floating-point numbers. While I saw the expected 4x speedup when operating on four 32-bit floats at once, I did not see the same 2x improvement with doubles. The SSE and no-SSE versions each took approximately 26 seconds. Why?
It’s either the processor or an OS issue. When I applied the same changes to the code at work, I saw not only a 2x speed up for doubles, I also got an almost 8x speed up for single-precision. That was surprising. When I looked at the differences between the Linux assembly of the “float” and “double” programs, the only things different were the obvious: minor array size and offset differences and the presence of the “_pd” versions of the routines instead of the “_ps” ones. This was similar to the differences I remember seeing when I compared assemblies built on my MacBook Pro.
The major differences between the two environments:
- 32-bit MacOS 10.6 versus 64-bit Debian 6 Linux (running under VMware)
- gcc 4.2.something versus gcc 4.4.5
- An older Intel CoreDuo processor (ca. 2006) versus an 8-core 3.07 GHz Intel Xeon (late 2009)
Clearly streaming SIMD instructions make a difference for some types, but the effect is limited (or helped) by the processor that executes those instructions. (Of course, I still have to check that GCC or the OS aren’t part of the difference.


The SSE and no-SSE versions each took approximately 26 seconds. Why?
There could be many reasons involved:
* Register pressure (It takes x2 SSE registers for a 4d vector instead of one)
* Single precession floats are more optimized than doubles
* Your code may not use __m128d well (You did not post code, so I cannot judge)
By the way, you can use the unified _mm_malloc to allocate aligned memory, it works on all compilers that support SSE intrinsics.