Does using dot() on float4 save instructions?

AhSai · Jan 25, 2020

I have read that using dot() with 1 is a faster way to add four numbers together, taking only 1 instruction.
Like this:

Code (CSharp):

// Original
float result = a + b + c + d;

// Using dot()
float result = dot(float4(a, b, c, d), 1);

However, is it also faster when you are adding up four float4 values?
Like this:

Code (CSharp):

// Original
// a, b, c, d are float4
float4 x = a + b + c + d;

// Using dot()
x.x = dot(float4(a.x, b.x, c.x, d.x), 1);
x.y = dot(float4(a.y, b.y, c.y, d.y), 1);
x.z = dot(float4(a.z, b.z, c.z, d.z), 1);
x.w = dot(float4(a.w, b.w, c.w, d.w), 1);

The code looks very ugly and involves a lot of float4 setup. Will this make the code run faster?
If so, how many instructions does it really save?

hippocoder · Jan 25, 2020

Bit of a dangerous assumption in this thread.

The number of instructions has little bearing on the performance of the shader. You should get out of that kind of thinking. Sure, you can reduce instructions, but instructions do not have an equal cost, and the majority of shader performance problems are to do with bandwidth.

Instructions don't all have the same cost!

Also, you'll find some hardware is penalised for the same instructions more than others!

Consider a 20-instruction shader and a 30-instruction shader. The 30 could have a texture lookup but still be considerably quicker to execute on mobile. This is a scenario I see time and time again.

Instead, benchmark it, ideally in a proper game scenario so the GPU is under pressure. You can also draw a transparent screen-sized quad repeatedly (overdraw) to get a rough idea, but that only tests the shader against itself and assumes there is bandwidth to spare.

Also, select the shader, then in the Inspector compile it and show the output to see what your shader actually looks like.

BattleAngelAlita · Jan 26, 2020

Maybe on something like a GeForce 3 dot might be faster, but not on modern GPUs. On modern hardware dot is just an intrinsic instruction, so dot(float4(a, b, c, d), 1) is basically just a * 1 + b * 1 + c * 1 + d * 1. Inside the GPU this becomes MUL MAD MAD MAD. On the other hand, a + b + c + d becomes ADD ADD ADD, or even only one SOP on PowerVR.
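The expansion described above can be sketched on the CPU. This is only an illustration in plain C, not shader code: the float4 struct and the function names sum_via_dot and sum_via_add are hypothetical stand-ins, and the MUL/MAD/ADD comments map each line onto the instruction sequence named in the post rather than onto what any particular GPU compiler emits.

```c
/* Hypothetical plain-C stand-in for an HLSL float4. */
typedef struct { float x, y, z, w; } float4;

/* Literal dot product against (1, 1, 1, 1): one multiply followed by
   three multiply-adds, mirroring the MUL MAD MAD MAD sequence. */
static float sum_via_dot(float4 v)
{
    const float one = 1.0f;
    float r = v.x * one;   /* MUL */
    r = v.y * one + r;     /* MAD */
    r = v.z * one + r;     /* MAD */
    r = v.w * one + r;     /* MAD */
    return r;
}

/* Direct sum: three adds, mirroring ADD ADD ADD. */
static float sum_via_add(float4 v)
{
    return ((v.x + v.y) + v.z) + v.w;
}
```

Both functions return the same value for the same input. The only difference is the extra multiplications by 1, which is the work a compiler may or may not remove.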

bgolus · Jan 26, 2020

There are kind of two questions there. Is using a dot product the fastest way to add 4 numbers together? And is it 1 instruction?

The answer is “sort of” to both.

Is it the fastest? As mentioned above, on some hardware, yes! On some hardware, no. You’ll find a lot of articles online (including my own) talking about how using a dot product is fewer instructions than adding the values together. The thing is, on more modern hardware actually doing a dot product will be slower than adding them together, for the reasons @BattleAngelAlita mentioned. But many modern GPUs will see that dot product in the shader with a hard-coded 1 and know not to do an actual dot product if it is indeed slower, and will instead just add the components together, making it exactly the same cost either way. However, not all platforms are that smart, and some might end up doing it the “slow” way instead.

Now, as @hippocoder mentioned, instructions are not all equal. The question you want to ask is “how long does this take on the GPU”. Is it one instruction? Yes, in a compiled shader it might be, but that instruction might take 4 times longer to actually run than another single instruction.

So is it the fastest? Yes, but most of the time just adding them together is also the fastest. Is it one instruction? Technically yes, but that doesn’t always mean what you think it does.
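For the float4 case from the original question, the same reasoning can be sketched by counting scalar operations. This is again plain C rather than shader code: float4, add4, dot_ones, sum_direct and sum_per_channel_dot are hypothetical names, and the counts assume nothing is optimized away, which a real shader compiler may well do.

```c
/* Hypothetical plain-C stand-in for an HLSL float4. */
typedef struct { float x, y, z, w; } float4;

/* Component-wise add: 4 scalar adds. */
static float4 add4(float4 p, float4 q)
{
    float4 r = { p.x + q.x, p.y + q.y, p.z + q.z, p.w + q.w };
    return r;
}

/* a + b + c + d: three vector adds, i.e. 12 scalar adds. */
static float4 sum_direct(float4 a, float4 b, float4 c, float4 d)
{
    return add4(add4(add4(a, b), c), d);
}

/* Unoptimized dot with (1, 1, 1, 1): 1 multiply + 3 multiply-adds. */
static float dot_ones(float x, float y, float z, float w)
{
    return x * 1.0f + y * 1.0f + z * 1.0f + w * 1.0f;
}

/* One dot per channel: 4 dots, i.e. 16 scalar operations, plus the
   regrouping of components into temporary vectors. */
static float4 sum_per_channel_dot(float4 a, float4 b, float4 c, float4 d)
{
    float4 r;
    r.x = dot_ones(a.x, b.x, c.x, d.x);
    r.y = dot_ones(a.y, b.y, c.y, d.y);
    r.z = dot_ones(a.z, b.z, c.z, d.z);
    r.w = dot_ones(a.w, b.w, c.w, d.w);
    return r;
}
```

Under these assumptions the per-channel dot version does the same additions as the direct sum plus the multiplications by 1, so it has no obvious way to come out ahead. As suggested earlier in the thread, the compiled output and a benchmark on the target hardware are the real test.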
