Sunday, January 29, 2012

Compiler Mythology

One thing that I've grown to dislike over the years is the holier-than-thou attitude common on coding question forums like Stack Overflow with regard to optimization / performance related questions.  Often a new programmer will have some interest in learning about the relative speed of a few operations, and will ask something like "which is faster, a for loop or a while loop?" and almost all of the time the typical response will contain one of the following:
#1 Profile it in your app and only optimize it if you need to. 
#2 The compiler will make the best choice, and was written by smarter people than you, so don't worry about those things. 
#3 Why would you ever need to make a program go faster?
#4 Optimization before there's a need is evil, since it is premature....
#5 Optimize your algorithms, not the code itself...

I can't help but want to smack all of these people, even if they do have (partially) valid points.  Most people do not need to optimize heavily (or so they think) and are fine with wasting cycles all over the place, since in many applications this will be lost within the noise of a human-based response to an action.  However, the attitude that someone is an idiot for even wanting to know more about this mystical topic called performance just annoys the heck out of me.  We programmers live in a world where input is translated to output based purely on logic, reason, and math, and yet the compiler is treated as a whimsical source of randomness that cannot be predicted, known, or even guessed at unless you have actually looked at what the oracle of compilation has given you for a specific piece of code.

Yes, compilers vary between gcc, MSVC, Borland, whatever.  Yes, compilation results vary based on your platform and the subset of features you target.  However, you are always compiling for SOME platform, not an abstraction of a platform.  In the game programming world, you generally have a whopping 3 different platforms, and in the PC world, you are generally focused on either x86 or x64.  How could one ever learn about a whole 2 or 3 different compilers and platforms at the same time! Mysticism!

Either way, there is nothing magical that will prevent even a new programmer from learning little tidbits here and there if you tell them what performance is actually about, and how they could begin to gather real knowledge about it.  The typical response is probably based on the fact that those programmers have gotten this far in their career without needing to know much about performance, and therefore no one else needs to know anything about it either.  Yet another reason why computer programs have remained at a similar level of responsiveness (sluggish) even with Moore's law kicking butt all along. There's really no reason for lag in any non-video-game application. None! And yet we tolerate it all over the place.  Dismissing and redirecting the original intent of the question (to seek knowledge) and responding with one of these worthless answers just continues this fear of getting to actually know how computers work. There is not one programmer who would not benefit from a better understanding of the hardware their code will run on, and yet we have these jerks responding as if knowing anything below a high level language is just a waste of time.

A more useful response would be quite simple: shut up, and let those who actually know the answer provide it, or at least explain the variables involved in the situation and why it is actually a difficult question to answer.  To me these people sound just like those who think we should teach less math since we have them fancy calculators to do that for us. Without a basic understanding of how code is transformed into something that a processor can handle, a programmer is much more likely to write inefficient/crappy code, focusing on elegant layers upon layers of abstractions instead of solving the real problems at hand.

Here's my quick response to all 5 example responses to the initial fake question:
#1 Code is likely to generate similar executable code regardless of context, so you can still learn/predict based solely on what is generated.
#2 The compiler has to live with what you told it to do, the C++ standard, etc, and it will not ALWAYS make the best choice with regard to what you actually intend. Also, compilers are written by regular old humans who occasionally do make mistakes.
#3 Blech.
#4 That's like saying you shouldn't learn to swim until you are busy drowning in a river.
#5 I would assume that any professional programmer is smart enough to not even have to be told this. Also, I'll probably do a post one day on the silliness of some decisions made solely to reduce the big O of some code...

End Rant.

If you are interested in whether or not A is faster than B, you will have some digging to do, and sometimes it actually does depend on more than just the little snippet of code around it. A decent first step would be to disassemble your code and look at the instructions generated for the routines in question.  Look up the instructions in the processor's docs, and check out those handy latency and throughput figures for each one if the scale is small enough. Intel is kind enough to publish all kinds of docs and optimization guides online. One key thing is that on a given processor, code will execute the same way, with the same performance characteristics, every time, given the same input (except for the whole OS/threading part, but you'll never have control over that anyway).

Second, time and profile your code.  We don't all have access to excellent tools to help us out with this one, but we all have access to QueryPerformanceCounter or similar timing constructs on our specific platforms.  A fun little exercise is writing isolated code to time in release mode and seeing how aggressive the compiler actually is.   Hmmm, why does my huge long routine get optimized into return 3; ?
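
As a starting point, here's a bare-bones timing sketch using QueryPerformanceCounter on Windows. The workload function is just a placeholder of my own; swap in whatever you actually want to measure, and build in release mode:

#include <windows.h>
#include <cstdio>

// Placeholder workload; replace with the routine you actually care about.
float DoWork()
{
    float sum = 0.0f;
    for ( int i = 1; i < 1000000; ++i )
        sum += 1.0f / (float) i;
    return sum;
}

int main()
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency( &freq );

    QueryPerformanceCounter( &start );
    volatile float result = DoWork();   // volatile so the optimizer can't throw the work away
    QueryPerformanceCounter( &stop );

    double ms = ( stop.QuadPart - start.QuadPart ) * 1000.0 / (double) freq.QuadPart;
    printf( "DoWork took %.3f ms (result %f)\n", ms, result );
    return 0;
}

The volatile is there precisely to avoid the "return 3;" situation above: if the result is never observed, the compiler is free to delete the whole loop.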



Sunday, January 15, 2012

Assumed Optimization: Floating Point Multiply instead of Divide

Many of us have heard that divides are painfully slow and should be avoided. If you don't believe this, refer to the handy section at the end of this post to see why.

Lucky for us, a divide is equivalent to a multiply by the reciprocal of the divisor.

float a = b / 5.0f;
is the same* as 

float a = b * ( 1.0f / 5.0f);
This gives us the same* result, but uses the friendlier FMUL instead of FDIV. Wahoo. 
* Not actually guaranteed to be the same, keep reading...
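
If you want to see that asterisk for yourself, a throwaway brute-force check like the following (my own sketch, not from any particular codebase) will count how many inputs produce a different result on your compiler with its default floating point settings:

#include <cstdio>

int main()
{
    int mismatches = 0;
    for ( int i = 1; i <= 1000000; ++i )
    {
        float b = (float) i;
        volatile float byDiv = b / 5.0f;            // volatile keeps the compiler from folding the two forms together
        volatile float byMul = b * ( 1.0f / 5.0f ); // 1/5 is not exactly representable, so results can differ slightly
        if ( byDiv != byMul )
            ++mismatches;
    }
    printf( "%d of 1000000 values differed\n", mismatches );
    return 0;
}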

So, if this is the case, it would seem logical that all divides by constant values, such as float a = b / 3.5f;, would be replaced by the compiler with their equivalent inverse multiplication ( float a = b * ( 1 / 3.5f ) ) and the expensive FDIV would be avoided entirely.

It turns out that this is not always a safe assumption, and that the compiler will actually check some conditions before replacing an FDIV with an FMUL. First, you told it to do a divide, so it really wants to stick to what you told it.  If it can prove that the operation will produce the same value, it will generally make the replacement. These situations actually come up quite often, as values like 2.0, 4.0, 8.0, etc, are very common.  For divisors whose reciprocals can be stored perfectly in floating point (powers of two, essentially), it seems that the compiler will go ahead and make the swap for you.  Some other interesting things to note:

float a = b / 0.5; turns into float a = b + b; ( Not a multiply by 2, as FADD is faster )
float a = b / 1.0; turns to float a = b; ( I would sure hope so )
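
If you want to verify what your own compiler does for cases like these, one option (MSVC assumed here; gcc has an equivalent with -S) is to ask for an assembly listing and search it for fdiv/fmul (or divss/mulss when targeting SSE):

// divtest.cpp
// MSVC: cl /O2 /FAs divtest.cpp   then inspect divtest.asm
// gcc:  g++ -O2 -S divtest.cpp    then inspect divtest.s
float DivByHalf( float b )          { return b / 0.5f; }   // per the behavior described above, expect an add
float DivByFour( float b )          { return b / 4.0f; }   // reciprocal exactly representable, expect a multiply
float DivByThreePointFive( float b ){ return b / 3.5f; }   // reciprocal not exactly representable, expect a divide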

Unfortunately, most divisors do not have an exactly representable reciprocal, so the compiler takes the more correct route and issues the FDIV instruction, as it keeps more precision and gets you a result that is potentially closer to the correct value. This is what you asked for, so it's going to give it to you.

If you are in a situation where you need to absolutely minimize the computation time and are willing to sacrifice a little bit of accuracy, the solution is to force the issue by manually replacing the divide with a multiply by (1.0 / divisor) as follows:

float a = b / 255.0;  turns to 

float a = b * ( 1.0 / 255.0);
Keeping it in this form shows that you are intending a divide, but also allows the compiler to figure out the correct constant without you having to even break out the slide rule. Or abacus. Or maybe just hitting the calculator button on your keyboard... b * ( 1.0 / 255.0) sure is a lot more understandable than b * 0.003921568627450980392156862745098;
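
If the expression form bothers you, another option is to give the reciprocal a name (the constant name here is just my own choice); the division still happens, but at compile time:

const float kInv255 = 1.0f / 255.0f;  // evaluated by the compiler, no runtime divide
float a = b * kInv255;                // compiles to a multiply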

This example actually brought something else up that I had not expected, and will cover in another post.

So, you may be thinking, FMUL and FDIV, who cares?! Those instructions are hardly necessary with SSE and the XMM-based instructions.  Well, it turns out that the same rules seem to apply to the div and mul swaps.  Divisors with exactly representable reciprocals will turn into MULSD/MULSS, while the rest will stay as DIVSD/DIVSS instructions. In my test cases, the SSE2 versions followed similar performance patterns, with speed ups ranging from 20% to 100% achievable by avoiding divides.

So, in summary, if you are writing code that needs to be fast, and having accuracy down to the very last bit is not that critical, go ahead and replace all possible divisions with multiplies by the reciprocal.  This type of optimization is never premature, since it does not harm clarity and involves essentially zero extra time to code. The only reservation would be the slight loss of accuracy, especially when huge (or ridiculously small) values are involved.  In many cases (probably most) you will save little or no time at all. You will never double your frame rate or make a huge change with something like this (unless you are in a very numeric-processing-intense project), but you may find certain areas that benefit greatly.

The rest of the post examines why divides should be avoided, and includes some real world examples of how using this technique can improve performance by a significant amount. In summary, routines that use a lot of division can be improved by 20-150% by using this method.

To help clarify why the FMUL is faster, I'll bring in some definitions from Intel:

Latency — The number of clock cycles that are required for the execution core to complete the execution of all of the µops that form an instruction.
Throughput — The number of clock cycles required to wait before the issue ports are free to accept the same instruction again. For many instructions, the throughput of an instruction can be significantly less than its latency.



Here's the data for these instructions on the Sandy Bridge processor, which actually improves FDIV quite a bit relative to FMUL compared to older Intel processors:



         Latency   Throughput
FMUL     5         2
FDIV     6         5

In the most flattering case for the FDIV (Sandy Bridge processors), it takes about 20% longer to execute (6 cycles of latency versus 5), but with heavy usage the throughput difference means the divides will on average take more than double the time.
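
To put the throughput numbers in perspective, here is a rough back-of-the-envelope comparison for a long stream of independent operations (ignoring everything else the code is doing, so treat it as an idealized sketch):

100 independent FMULs: one issued every 2 cycles -> roughly 200 cycles
100 independent FDIVs: one issued every 5 cycles -> roughly 500 cycles, about 2.5x as long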


For the SSE Equivalents, it is even worse.

         Latency   Throughput
DIVSD    22        22
MULSD    5         1
DIVSS    14        14
MULSS    5         1

About 2.8x to 14x faster for single precision work. Joy. 

The compiler will generally schedule these operations in a manner to get around the slow throughput of the divide, but when the divide is the main operation, it will influence performance heavily.

To see if it is actually worth worrying about, I wrote up a quick color conversion routine that tried to reflect something reasonable a game programmer may do.  It involved taking an array of 32 bit color values (RGBA), turning them into an array of Float4s, while also doing some color transform based on some input value.  During the RGBA to float4 conversion, one might cast the individual rgba values to floats and divide by 255, like so:

// Assumed layout for FloatColor, based on how it is used below.
struct FloatColor { float r, g, b, a; };

inline FloatColor ConvertToFloat( unsigned char r, unsigned char g, unsigned char b, unsigned char a )
{
    FloatColor result;
    result.r = (float) r / 255.0f;   // divides by 255.0f stay as actual divide instructions (see above)
    result.g = (float) g / 255.0f;
    result.b = (float) b / 255.0f;
    result.a = (float) a / 255.0f;
    return result;
}
The rest of the loop just does some random transform on the colors that is essentially meaningless. The key was to give the loop some other work to do so it was a more meaningful example, instead of just a straight FDIV to FMUL comparison.


for ( int i = 0; i < count; ++i )
{
    processed[i] = ConvertToFloat( original[i].r, original[i].g, original[i].b, original[i].a );
    processed[i].a *= 0.5f;          // do some random color scaling...
    processed[i].r *= inputValue;
    processed[i].g *= 0.1f;
    processed[i].b *= 0.3f;
}
With small count sizes (10 or so), the multiply-only version performed 20% faster than the divide version. Not too bad! However, if you are in the habit of doing nice batching and going over large amounts of data all at once ( >100 ), this 20% can turn into a huge 150%! Granted, the method is dominated by the conversion to float, but it is not an unreasonable approach to some simple color conversion. This was not a hand optimized piece of perfection, but rather an attempt to reproduce what a normal programmer might do to solve a rather straightforward problem.  With just a simple swap that takes no detailed thought, the cost can be cut by more than half.
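
For reference, the multiply-only variant is essentially the same function with the divides swapped for multiplies by the reciprocal (reconstructed here as a sketch; the function name is my own):

inline FloatColor ConvertToFloatMul( unsigned char r, unsigned char g, unsigned char b, unsigned char a )
{
    const float inv255 = 1.0f / 255.0f;   // computed at compile time
    FloatColor result;
    result.r = (float) r * inv255;
    result.g = (float) g * inv255;
    result.b = (float) b * inv255;
    result.a = (float) a * inv255;
    return result;
}

Nothing else in the loop changes; it simply calls this version instead.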

If you want code samples or whatever, just let me know.

Reference:
MSVC Floating Point Model
http://msdn.microsoft.com/en-us/library/aa289157%28v=vs.71%29.aspx
IEEE-754 Analysis
http://babbage.cs.qc.edu/IEEE-754/
Intel Optimization Guide ( see Appendix C for latency / throughput info )
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
What Every Computer Scientist Should Know About Floating-Point Arithmetic
http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

Saturday, January 14, 2012

Introduction

Hello all.  This blog aims to share some of the more interesting things I've run into with programming while also upholding the ideal of keeping things simple.  I'm on the game / simulation programming side of things, so performance critical code is the norm for me.  I am not claiming to be an expert by any means, so feel free to correct/debate me at will.  One of the primary purposes is to push myself to look more closely/dig deeper to get a better understanding of what code is actually doing. Expect lots of disassembly, lots of preaching, and other random stuff.

Over the years, I've done a variety of programming, from good old QBASIC to Python, Java, and pure embedded assembly written for an 8 bit processor. My heart has always been in game programming though, so C++ will be what is generally seen here.  Sure, you can write games in Flash, Ruby on Rails, etc, and it may be great, but pretty much every major title for PC, Xbox, or PS3 in the past 10 years has been based on C or C++, and there is a good reason for that.

If I actually put time into this, and anyone actually reads it, I will try to have a few themes/subjects to hit on regularly. 

Simplicity vs Complexity:
If you can make it simpler, do so.  There are many problems that are complicated and require complicated solutions, but strive for simplicity when possible.  Also keep in mind that not all programmers are as brilliant as you, so when a junior programmer is tasked with adding some new functionality to your clever, complicated solution and turns it into a pile of crap, you share some of the blame.

Assumed Optimizations:
Many times I have heard, and assumed myself, "oh, the compiler will optimize that", generally with a hint of hopefulness and apprehension. Many times the coder is right, but there are plenty of cases where things such as the C++ standard, logic and reason, or silly assumptions prevent this from happening.

OOP for OOPs sake:
Modularity does not equal Object Oriented! Applying a nice set of interfaces and class structures to something that really doesn't need it is wasteful and annoying.  There are many issues with OOP, and unfortunately it's taught to students as if it is the only real way to program.  OOP is a way to make life easier on the programmer, NOT the processor.  See Data Oriented Design for more processor friendly methods.