Sunday, January 15, 2012

Assumed Optimization: Floating Point Multiply instead of Divide

Many of us have heard that divides are painfully slow and should be avoided. If you don't believe this, refer to the handy section at the end of this post to see why.

Lucky for us, a divide is equivalent to a multiply by the inverse of the dividend.

float a = b / 5.0f;
is the same* as 

float a = b * ( 1.0f / 5.0f);
This gives us the same* result, but uses the friendlier FMUL instead of FDIV. Wahoo. 
* Not actually guaranteed to be the same, keep reading...

So, if this is the case, it would seem logical that all divides by constant values, such as float a = b / 3.5f;  would be replaced by the compiler to their equivalent inverse multiplication ( float a = b * ( 1 / 3.5f ) ) and the expensive FDIV would be avoided entirely. 

It turns out that this is not always a safe assumption, and that the compiler will actually check some conditions before replacing an FDIV with an FMUL. First, you told it to do a divide, so it really wants to stick to what you told it.  If it can prove that the operation will result in the same value, it will generally replace it. These situations actually occur quite often, as values like 2.0, 4.0, 8.0, etc, occur quite frequently.  For numbers that can be perfectly stored in floating point, without any loss of precision, it seems that the compiler will go ahead and make the swap for you.  Some other interesting things to note:

float a = b / 0.5; turns into float a = b + b; ( Not a multiply by 2, as FADD is faster )
float a = b / 1.0; turns to float a = b; ( I would sure hope so )

Unfortunately, most values are not able to be perfectly represented in floating point, so the compiler takes the more correct route and issues the fdiv instruction as it will keep more precision and get you a result that will potentially be closer to the correct value. This is what you asked for, so its going to give it to you.

If you are in a situation where you need to absolutely minimize the computation time and are willing to sacrifice a little bit of accuracy, the solution is to force the issue by manually replacing it with a multiply with (1.0 / divisor) as follows:

float a = b / 255.0;  turns to 

float a = b * ( 1.0 / 255.0);
Keeping it in this form shows that you are intending a divide, but also allows the compiler to figure out the correct value without you having to even break out the slide rule. Or Abacus. Or maybe just hitting the calculator button on your keyboard... b * ( 1.0 / 255.0) sure is a lot more understandable then b * 0.003921568627450980392156862745098;

This example actually brought something else up that I had not expected, and will cover in another post.

So, you may be thinking, FMUL and FDIV, who cares?! Those instructions are hardly necessary with SSE and the xmm based instructions.  Well it turns out that the same rules seem to apply with the Div and Mul swaps.  Values that are nicely stored in floating point will turn to MULSD/MULSS, while imprecise values will stay as DIVSD/DIVSS instructions. In my test cases, the SSE2 versions followed similar performance patterns, with speed ups ranging from 20% to 100% achievable by avoiding divides.

So, in summary, if you are writing code that needs to be fast, and having accuracy to 12 decimal points is not that critical, go ahead and replace all possible divisions with multiplies by the inverse.  This type of optimization is never premature, since it does not harm clarity and involves essentially zero extra time to code. The only reservation would be based on the slight loss of accuracy when involving huge (or ridiculously small values).  In many cases ( probably most ) you will save little, or no time at all. You will never double your frame rate, or make a huge change with something like this ( unless you are in a very numeric processing intense project), but you may find certain areas that benefit greatly.

The rest of the post examines why divides should be avoided, and includes some real world examples of how using this technique can improve performance by a significant amount. In summary, routines that use a lot of division can be improved by 20-150% by using this method.

To help clarify why the FMUL is faster, I'll bring in some definitions from Intel:

Latency — The number of clock cycles that are required for the execution core to complete the execution of all of the µops that form an instruction.
Throughput — The number of clock cycles required to wait before the issue
ports are free to accept the same instruction again. For many instructions, the
throughput of an instruction can be significantly less than its latency.



Here's the data for the instructions on the Sandy Bridge processor, which happens to actually greatly improve the FDIV relative to the FMUL compared to older Intel processors:



Latency Throughput
FMUL 5 2
FDIV 6 5

In the most flattering case (Sandy Bridge processors) for the FDIV, it is 17% slower to execute, but with heavy usage, the instructions on average will take more than double the time.  


For the SSE Equivalents, it is even worse.

Latency Throughput
DIVSD 22 22
MULSD 5 1
DIVSS 14 14
MULSS 5 1

About 2.8x to 14x faster for single precision work. Joy. 

The compiler will generally schedule these operations in a manner to get around the slow throughput of the divide, but when the divide is the main operation, it will influence performance heavily.

To see if it is actually worth worrying about, I wrote up a quick color conversion routine that tried to reflect something reasonable a game programmer may do.  It involved taking an array of 32 bit color values (RGBA), turning them into an array of Float4s, while also doing some color transform based on some input value.  During the RGBA to float4 conversion, one might cast the individual rgba values to floats and divide by 255, like so:

inline FloatColor ConvertToFloat( unsigned char r, unsigned char g, unsigned char b, unsigned char a )
{
    FloatColor result;
    result.r = (float) r / 255.0f;
    result.g = (float) g / 255.0f;
    result.b = (float) b / 255.0f;
    result.a = (float) a / 255.0f;
    return result;
}
The rest of the loop just does some random transform on the colors that is essentially meaningless. The key was to give the loop some other work to do so it was a more meaningful example, instead of just a straight FDIV to FMUL comparison.


for ( int i = 0; i < count; ++i)
{
 processed[i] = ConvertToFloat( original[i].r, original[i].g, original[i].b, original[i].a );
 processed[i].a *= 0.5f; //do some random color scaling...
 processed[i].r *= inputValue;
 processed[i].g *= 0.1;
 processed[i].b *= 0.3;
}
With small count sizes, ( 10 or so ), the multiply only version performed 20% faster than the divide version. Not too bad! However, if you are in the habit of doing nice batching and going over large amounts of data all at once ( >100 ), this 20% turns can turn to a huge 150%! Granted, the method is dominated by the conversion to float, but it is not an unreasonable approach to some simple color conversion. This was not a hand optimized piece of perfection, but rather an attempt to reproduce what a normal programmer may do to solve a rather straight forward problem.  With just a simple swap that takes no detailed thought, the cost can be reduced by 3 times. 

If you want code samples or whatever, just let me know.

Reference:
MSVC Floating Point Model
http://msdn.microsoft.com/en-us/library/aa289157%28v=vs.71%29.aspx
IEEE-754 Analysis
http://babbage.cs.qc.edu/IEEE-754/
Intel Optimization Guide ( see Appendix C for latency / throughput info )
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
What Every Computer Scientist Should Know About Floating-Point Arithmetic
http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

No comments:

Post a Comment