Faithful readers: This dispatch falls into the bucket of “stuff that’s about work and not diabetes.” (Not that diabetes isn’t its own kind of work.) This is what I do for my 9-5. It’s the stuff I do between thinking about diabetes, my next bike outing, where I want to travel someday, and all those other things. Don’t worry, we’ll get back to all that “good stuff” soon enough; but this post is for the lurking coworkers and people searching on the Google.
Here are a few performance-related articles and presentations that I stumbled upon recently:
Igor Ostrovsky at Microsoft wrote a very useful description of various processor cache effects. Here are the notes I took while reading it:
- Array (memory) access is slower than arithmetic; memory access often dominates loop cost
- Data alignment can determine how many cache lines are touched
- Keep modifications/accesses within the same cache level
- Be aware of instruction level dependencies and parallelism opportunities
- Avoid touching too many memory locations that map to the same cache set; it causes contention
- There’s a trade-off between the size of what you’re putting in the cache and the number of elements you touch in it. — What does that mean, Jeff? Not exactly sure, but cache associativity is usually not a huge deal compared to other issues.
- Because cores have their own caches and because memory needs to remain consistent, avoid letting two threads modify data on the same cache line.
- Even when you think you know what you’re doing, there’s other crazy stuff going on.
Isn’t that last one the sad truth about performance optimization?
Igor also has an excellent article about branch prediction. Basically, if you structure your code so that branches are predictably true (or predictably false), the CPU can speculatively start down that code path until it’s proven wrong. But if a branch is close to random, you’ll see performance hits.
Joe Duffy, also at Microsoft, debunks the “premature optimization is evil” myth. Joe summarizes the article himself quite well, so I’ll just quote him.
> In this short article, we’ll look at some important principles that are counter to what many people erroneously believe this [“avoid premature optimization”] statement to be saying. To save you time and suspense, I will summarize the main conclusions: I do not advocate contorting oneself in order to achieve a perceived minor performance gain. Even the best performance architects, when following their intuition, are wrong 9 times out of 10 about what matters. (Or maybe 97 times out of 100, based on Knuth’s numbers.)
>
> What I do advocate is thoughtful and intentional performance tradeoffs being made as every line of code is written. Always understand the order of magnitude that matters, why it matters, and where it matters. And measure regularly! I am a big believer in statistics, so if a programmer sitting in his or her office writing code thinks just a little bit more about the performance implications of every line of code that is written, he or she will save an entire team that time and then some down the road.
>
> Given the choice between two ways of writing a line of code, both with similar readability, writability, and maintainability properties, and yet interestingly different performance profiles, don’t be a bozo: choose the performant approach. Eschew redundant work, and poorly written code. And lastly, avoid gratuitously abstract, generalized, and allocation-heavy code, when slimmer, more precise code will do the trick.
Follow these suggestions and your code will just about always win in both maintainability and performance.
If you do anything that falls under the labels of “multicore,” “multithreaded,” or “multi-processor,” then you should definitely add Multicore Info and Intel’s software blogs to your RSS feed aggregator.
One recent Intel offering is a recording of “Introducing Intel Parallel Building Blocks”. The Intel folks discuss Intel Cilk Plus, Threading Building Blocks (TBB), and Array Building Blocks. Here are some notes from the 45-minute presentation:
- Intel Cilk Plus is a lower-level way to expose parallelism and the potential for vectorization to C++ code. It currently only works with the Intel compiler.
- TBB is awesome. That’s me saying that. I use it to easily exploit data parallelism in our existing codebase. (Because I use it already, I didn’t take many notes on it. Sorry.)
- Array Building Blocks (ArBB) provides the highest level of abstraction, letting you specify array data containers (1-D to N-D, including nested containers and [soon] sparse data) and then perform vector operations on the data in the containers. Sounds familiar.
- Cilk Plus includes reducers to prevent data races, has #pragma simd to empower the compiler to make vector assumptions, and has a mechanism for specifying a range of vector data.
- Parallel Inspector and Parallel Amplifier are alleged to help squeeze performance out of code that uses TBB.
- ArBB uses a two-part compilation process, one of which is a runtime JIT.
- There are collisions between the schedulers in Intel MKL and Cilk (and TBB, etc.). Beware.
And finally, there’s a recording of a webinar about Intel Parallel Advisor that may be of interest. It’s next in my queue of things to watch.