
Optimizing Unity Physics - Article

Discussion in 'Scripting' started by Todd-Wasson, Jan 13, 2015.

  1. Todd-Wasson


    Joined:
    Aug 7, 2014
    Posts:
    1,077
    I edited a post of mine and put it up on my web site; I think many around here might find it useful. It's about optimizing code for Unity. I got my physics computation code to run about 1000% faster in my boat game this way. If you're doing anything with heavy math or seem to be pushing the boundaries of Unity's physics, you might find this useful.

    http://www.performancesimulations.c...ses-in-unitys-physics-or-any-math-heavy-code/
     
  2. angrypenguin


    Joined:
    Dec 29, 2011
    Posts:
    15,500
    Do you have test numbers to show the improvements this made?

    I'm pretty surprised that Mono isn't inlining calls to basic math operations.
     
  3. Kiwasi


    Joined:
    Dec 5, 2013
    Posts:
    16,860
    There was a Unite video a while back that showed the same thing. Their performance increase was similar. Google is currently hiding the video from me, but if I track it down I'll post it.
     
  4. Todd-Wasson


    Joined:
    Aug 7, 2014
    Posts:
    1,077
    At the bottom of one of the source code plots are a couple numbers for one function that gained about 10 times speed. Maybe it would be a good idea for me to post more numbers, or rewrite the article with some speed tests and results?

    Inline functions: As far as I know this isn't something that you can do in C#. I can't imagine how it would be possible for Unity to go through and somehow convert all that on its own. I don't know where one would even begin to do something like that.
     
  5. angrypenguin


    Joined:
    Dec 29, 2011
    Posts:
    15,500
    Inlining functions isn't something that I'd expect a programmer to do. It's something that C/C++ compilers do which I assumed C# would also just do transparently behind the scenes. What I didn't consider prior to my last post, however, is that in Unity our scripts are compiled to a DLL and the API functions we call are already compiled elsewhere. You can't inline across that.

    For what it's worth, my understanding is that inlining is relatively straightforward to do at compile time under the right circumstances, because all it has to do is replace the call with the actual contents of the function, exactly like you were doing in your examples.
     
  6. Todd-Wasson


    Joined:
    Aug 7, 2014
    Posts:
    1,077
    Interesting, thanks. I didn't know that. So basically even if C# did inline functions, this wouldn't work in Unity because of the separate DLLs? Makes sense.

    Most of my years programming (VRC, VRC Pro) were with PowerBasic, similar to C but with Basic keywords. At the time there was no OOP at all. That language didn't have inline functions either, but it had function macros which from a programmer's perspective are pretty much the same thing. It's just a text replacement done by the compiler which eliminates the actual function call. I used it a little bit, but it was somewhat tricky for me. With math heavy code like VRC's physics engine which I wrote, I couldn't imagine trying to do some function macro every time I add a couple of vectors together. So I did pretty much what you see in the code in my link, hand writing every single dot product, matrix multiplication, and everything. In the later years I started using function macros at least for the matrix multiplications, but that was one of few exceptions.

    When I first started using Unity, I figured at least the overloaded stuff like adding two vectors together would internally do an SSE instruction or similar so the additions are done on all three components simultaneously. I've got a friend that also wrote a physics engine for his car racing simulator, and he hand coded ASM to do just that. Needless to say he got some huge performance improvements. That was all C++ where he used inline functions for things. I could be mistaken, but I think this was quite some time ago when I think you had to write them explicitly that way in order to be inline functions.

    Anyway, what's amusing to me is I'm finding more and more that my C# code in Unity is looking more and more like my old PowerBasic code. Go figure. :rolleyes:
     
  7. angrypenguin


    Joined:
    Dec 29, 2011
    Posts:
    15,500
    To me, the idea of actually writing out the same code dozens or hundreds of times is terrible. I wouldn't do it if I could avoid it at all. For starters, it increases the likelihood of a typo-induced bug while making such bugs far harder to find. Yuck.

    On the topic of inlining, I just did a little reading and it sounds like the C# compiler does indeed do it. It's completely transparent, because the compiler does it based on its own rules. (Which I'm told is more or less true in C++ as well these days - there's a related keyword, but it's used as a "hint" at best.) There's an attribute to prevent it, and in recent versions of .NET there's an attribute to hint that a method should be "inlined if possible".
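    For reference, the attributes being described live in System.Runtime.CompilerServices. A minimal sketch (the VecMath class and its methods are my own illustration, not anything from Unity):

```csharp
using System;
using System.Runtime.CompilerServices;

static class VecMath
{
    // Hint that this method is a good inlining candidate. Note: this enum
    // value was introduced with the .NET 4.5-era frameworks; the Mono
    // bundled with Unity at the time may not support or honor it.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static float Dot(float ax, float ay, float az,
                            float bx, float by, float bz)
    {
        return ax * bx + ay * by + az * bz;
    }

    // The opposite hint: never inline (handy for keeping a method
    // visible to a profiler while measuring).
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static float DotNoInline(float ax, float ay, float az,
                                    float bx, float by, float bz)
    {
        return ax * bx + ay * by + az * bz;
    }
}
```

    Either way these are only hints; the compiler still decides.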

    So... I do have another test to suggest. Make your own math library - so that it's compiled along with your own script code - and, rather than calling Unity's functions, call your own. Do a speed test that vs. manually inlined code vs. Unity's own calls. I'm not sure about Mono's inlining criteria, so it could be worth trying with a few different methods. It would also be worth testing per-platform, since Mono may JIT differently on different CPUs.
     
  8. Todd-Wasson


    Joined:
    Aug 7, 2014
    Posts:
    1,077
    Interesting on the inlining. I must have been misinformed. I don't recall my C# book mentioning it, but perhaps I just missed it.

    I can understand that from a design view. It's not pretty, but realistically, VRC wouldn't have been possible without doing that. Neither would the boat sim I'm doing now. I don't see trading away 10 or so times the performance in critical areas just for the sake of pretty looking code. That's just me.

    That kind of misses the whole point of the article though, doesn't it? The idea is to avoid calling functions to do simple math operations, whether they're Unity's or Microsoft's or my own. My function for adding two vectors together would probably be the same as Unity's or anyone else's. Did I miss something?
     
  9. Todd-Wasson


    Joined:
    Aug 7, 2014
    Posts:
    1,077
    Just spent a couple minutes looking for C# inlining. I'm not finding anything. Do you have any links? From here:

    http://stackoverflow.com/questions/473782/inline-functions-in-c

    This is from 2009 though. Has something in C# changed in this area since then? What I'm talking about doing is something like the first example which he says has no equivalent in C#.
     
  10. angrypenguin


    Joined:
    Dec 29, 2011
    Posts:
    15,500
    There's no equivalent keyword. The compiler does or does not do this for us behind the scenes at its whim. The first attribute on this page is about aggressive inlining. This page has an attribute that ensures inlining of a method does not happen. It's not mentioned much in the docs because, like I said, it's not a language feature so much as it's a compiler feature. We're not meant to have to think about it.

    The idea is that you write your method just like you would any other method. At compile time, if they are compiled at the same time, the compiler may choose to inline the called method into the calling method based on some set of criteria. People are suggesting that the criteria include the called method fitting into 32 bytes of IL, but that's all I found in my short search, and .NET and Mono may be different in any case, especially across platforms. Hence suggesting a test.

    It doesn't miss my point. ;)

    If we're only missing out on inlining because of the interop between our game DLL and the engine, then having a math library compiled along with our game DLL potentially gets the best of both worlds - we'll get some or all of the performance boost from inlining at a much lower productivity overhead (importing and remembering to use a 3rd party math library, rather than hand-writing every math function every time we use it).
     
  11. Todd-Wasson


    Joined:
    Aug 7, 2014
    Posts:
    1,077
    I must be missing something then, perhaps my knowledge isn't quite there. Just to clarify, what I'm talking about is this:


    Code (csharp):
    Vector3 firstTerm = whatever;
    Vector3 secondTerm = whatever;
    Vector3 answer;
    answer = firstTerm + secondTerm;
    ...not being nearly as fast as this:

    Code (csharp):
    Vector3 firstTerm = whatever;
    Vector3 secondTerm = whatever;
    Vector3 answer;
    answer.x = firstTerm.x + secondTerm.x;
    answer.y = firstTerm.y + secondTerm.y;
    answer.z = firstTerm.z + secondTerm.z;
    I'd have to double check to be sure, but I seem to remember the second version being several times faster than the first. So what I'm talking about doing is inlining an add function or overload in place of the Vector3 one in Unity. Can you demonstrate how to do that using the links you've provided?

    I don't understand how my replacing the Vector3 class with my own version of that would make any difference. An addition overload/function is still being called, no? The idea of inlining it is to remove the function call completely which makes it the same as the second code block. You're saying this can be inlined, but I don't understand how to go about that. Are you saying that doing it in a separate library would change that, that the JIT compiler (or whatever it is, I just read that for the first time 2 seconds ago) would then automatically start inlining it and that speed difference might disappear?
     
  12. angrypenguin


    Joined:
    Dec 29, 2011
    Posts:
    15,500
    The first is still making a method call. As Vector3 is not a primitive data type, the "+" operator is implemented as a method, so using it still incurs a method call - essentially equivalent to "Vector3 Vector3.Add(Vector3 lhs, Vector3 rhs)". On the other hand, float is a primitive data type, and in the second example you're just adding three floats - no method call.
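    To illustrate what that means (a sketch with a hypothetical struct of my own, not Unity's actual source): an overloaded "+" is just a static method with a special name, so using it costs a call unless the compiler inlines it.

```csharp
using System;

// Hypothetical value type for illustration only.
struct Vec
{
    public float x, y, z;

    // The C# compiler emits this as a static method named op_Addition,
    // so "a + b" is an ordinary method call under the hood.
    public static Vec operator +(Vec a, Vec b)
    {
        Vec r;
        r.x = a.x + b.x;
        r.y = a.y + b.y;
        r.z = a.z + b.z;
        return r;
    }
}
```

    Adding the three floats directly, by contrast, compiles to plain arithmetic instructions with no call at all.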

    No. I can't demonstrate how to do it because it's not something programmers do.

    Unlike certain older languages there is no way to manually tell C# to inline a method. As I said before, it's something done by the compiler, if and when it chooses to do so.

    I don't know. That's what I'm curious about. Sure the code for an additional function will be there. The question is whether Unity's Mono will inline it for us.

    I don't think I can explain it more than I already have without getting into a lengthy explanation of how compilers work, which I'd rather not do since I'm not an expert on that.

    All I'm saying is that, since it's a compiler thing, and compilers behave differently in different circumstances, it'd be worth seeing what the Mono compiler in Unity does in those three different cases and whether or not it has performance implications. I'm not saying that it will, I'm just curious to know if it does. I only suggested that you do the testing because you've already been doing a bunch, so it seemed to me you'd be interested. I'm happy enough to do it myself later on, since this is arising from my own curiosity.
     
  13. Kiwasi


    Joined:
    Dec 5, 2013
    Posts:
    16,860
    There is an intermediate way to test this quickly, without going to the effort of finding or building a different math library. Simply tack the addition method on to the bottom of one of your classes that you previously benchmarked, and see if there is a performance difference there.
     
  14. fire7side


    Joined:
    Oct 15, 2012
    Posts:
    1,819
    I would have guessed it would be the opposite, because the Vector3 computations were probably done in C++. Also, like someone said, this kind of thing is supposed to be done at the compiler level. That's the whole idea. It's like making us into compilers.
     
  15. Todd-Wasson


    Joined:
    Aug 7, 2014
    Posts:
    1,077
    Thanks for the clarification. I think I understand what you're saying now.

    That's an interesting idea. So what you're suggesting is I could test this by making my own Vector3Todd class, put a dot product or add method in it and compare speed with Vector3 to see if Mono inlines the Vector3Todd calls on its own? Sounds easy enough. I've done a little bit more reading since then about the automatic inlining stuff. This was something I knew nothing about until yesterday. Thanks for the education.

    I'll be really surprised if there's any difference (why would it inline my class function calls but not any others?), but it's worth a try. I don't think I've ever seen a case where doing a method call to do anything was as fast as skipping the call, but there's only one way to know for sure. That's what inlining is supposed to do, after all.

    If it does indeed work, I'm wondering if it might have to unroll the loops in order to work. Like maybe it might work for a couple lines, but putting it in a loop that's 10,000 iterations might make it stop because it'd have to unroll it all? I don't know very much about this, maybe I have totally the wrong idea there.

    Thanks everyone for the suggestions. I'll try it and report back what I find. It's a bummer that we can't do inline assembly in C# (or can we?) so we could take advantage of SSE instructions and so forth to do vector computations in parallel. I suppose a DLL could be made to do that with functions, although I have no idea if operator overloads could be done that way so you could keep the same syntax. I'd prefer to use + and - over add() and subtract().

    I wouldn't think so. This is C# code, I'd think doing what you're talking about would mean the compiler would be translating parts of it into C++ and leaving the rest in C#. I can't imagine that's the case.
     
  16. Todd-Wasson


    Joined:
    Aug 7, 2014
    Posts:
    1,077
    Ok, I just tried the tests and the results were what I was expecting. It didn't inline my version of the class method calls.

    There are five tests here with addition and dot products including results from the profiler. This is all done in FixedUpdate with enough loops to fill up about 80% of the CPU time in an empty scene, but not enough to take longer than the time between FixedUpdate() calls.

    I wanted to try operator overloading with an addition op in my Vector3Alternate class, but I couldn't figure out how to do it and didn't want to spend all night on this. Given these results I wouldn't expect it to be much different anyway; unrolling the computations and skipping the method calls was still almost 6 times faster (600%) for dot products and 20 times faster (2000%) for additions on Vector3s.

    TestVector3Dot() - This calls the Vector3.Dot method.

    TestVector3AlternateDot() - This is my version of a simple Vector3 class that has its own Dot method. This ran slightly faster (3% or so) than the Vector3 version. My guess would be because my x,y,z are public floats rather than properties accessed through getters/setters. If it were being inlined it should be closer to 600% faster, I'd think.

    TestUnrolledDot() - Here we skip the method call on the Vector3.Dot, unroll it manually, and get a ton more speed. In this test it was in the neighborhood of 500-600% faster.

    TestVector3Addition() - Addition operator overload on Vector3 class (adding two vectors without breaking them into components).

    TestAdditionUnrolled() - Adding two Vector3s by unrolling it and skipping the operator overload (breaking them into components). This was a whopping 2000% faster than using the overload.

    Results from one sample in the profiler, eyeballed to be about an average value:
    TestVector3Dot() - 2.66ms
    TestVector3AlternateDot() - 2.59 ms
    TestUnrolledDot() - 0.46 ms

    TestVector3Addition() - 5.28 ms
    TestAdditionUnrolled() - 0.27 ms (WOW)

    Code (csharp):
    using UnityEngine;
    using System.Collections;

    // A plain class (not a MonoBehaviour), so it can safely be created with `new`.
    public class Vector3Alternate
    {
        public float x;
        public float y;
        public float z;

        public static float Dot(Vector3Alternate arg1, Vector3Alternate arg2)
        {
            return arg1.x * arg2.x + arg1.y * arg2.y + arg1.z * arg2.z;
        }
    }

    public class TestMath : MonoBehaviour
    {
        private Vector3[] vect = new Vector3[3];
        private Vector3Alternate[] vectAlternate = new Vector3Alternate[3];
        const int numberOfLoops = 35000;

        // Use this for initialization
        void Start()
        {
            vect[0].x = 1;
            vect[0].y = 1;
            vect[0].z = 1;

            vect[1].x = 2;
            vect[1].y = 2;
            vect[1].z = 2;

            vect[2].x = 3;
            vect[2].y = 3;
            vect[2].z = 3;

            vectAlternate[0] = new Vector3Alternate(); // Why do I have to do this here but not in a Vector3? Is something missing in Vector3Alternate class?
            vectAlternate[0].x = 1;
            vectAlternate[0].y = 1;
            vectAlternate[0].z = 1;

            vectAlternate[1] = new Vector3Alternate(); // Why do I have to do this here but not in a Vector3? Is something missing in Vector3Alternate class?
            vectAlternate[1].x = 2;
            vectAlternate[1].y = 2;
            vectAlternate[1].z = 2;

            vectAlternate[2] = new Vector3Alternate(); // Why do I have to do this here but not in a Vector3? Is something missing in Vector3Alternate class?
            vectAlternate[2].x = 3;
            vectAlternate[2].y = 3;
            vectAlternate[2].z = 3;
        }

        void TestVector3Dot()
        {
            float dot;

            for (int i = 0; i < numberOfLoops; i++)
            {
                dot = Vector3.Dot(vect[0], vect[1]);
            }
        }

        void TestVector3AlternateDot()
        {
            float dot;

            for (int i = 0; i < numberOfLoops; i++)
            {
                dot = Vector3Alternate.Dot(vectAlternate[0], vectAlternate[1]);
            }
        }

        void TestUnrolledDot()
        {
            float dot;

            for (int i = 0; i < numberOfLoops; i++)
            {
                dot = vect[0].x * vect[1].x +
                      vect[0].y * vect[1].y +
                      vect[0].z * vect[1].z;
            }
        }

        void TestVector3Addition()
        {
            for (int i = 0; i < numberOfLoops; i++)
            {
                vect[2] = vect[0] + vect[1];
            }
        }

        void TestAdditionUnrolled()
        {
            for (int i = 0; i < numberOfLoops; i++)
            {
                vect[2].x = vect[0].x + vect[1].x;
                vect[2].y = vect[0].y + vect[1].y;
                vect[2].z = vect[0].z + vect[1].z;
            }
        }

        // Update is called once per frame
        void FixedUpdate()
        {
            TestVector3Dot();
            TestVector3AlternateDot();
            TestUnrolledDot();

            TestVector3Addition();
            TestAdditionUnrolled();
        }
    }
    So yeah, call unrolling everything terrible practice if you want, but to me it's worth doing to get a nearly 2000% speed improvement in addition operations and over 500% in dot products. These are really massive gains, too big for me to ignore or trade away for the sake of prettier looking code. Like I wrote in my article, the boat simulator I'm doing simply wouldn't be possible on the CPU at those resolutions without going through the trouble to do this kind of thing. I don't think VRC Pro would have worked either.

    It's unfortunate, but it would appear there is no inlining going on with any of this.
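    Incidentally, a likely answer to the question in the code comments about needing `new`: the Vector3Alternate above is a class (reference type), while Unity's Vector3 is a struct (value type). A struct version (a sketch; Vec3 is my own naming) behaves like Vector3 in that respect, and also shows the addition operator overload that was tricky to set up:

```csharp
// Sketch: a struct-based alternative. As a value type it needs no `new`
// before assigning its fields, just like Unity's Vector3.
public struct Vec3
{
    public float x, y, z;

    public static float Dot(Vec3 a, Vec3 b)
    {
        return a.x * b.x + a.y * b.y + a.z * b.z;
    }

    // Operator overload: this is the method that "a + b" calls.
    public static Vec3 operator +(Vec3 a, Vec3 b)
    {
        Vec3 r;
        r.x = a.x + b.x;
        r.y = a.y + b.y;
        r.z = a.z + b.z;
        return r;
    }
}
```

    Whether Mono inlines these any better than the class version would still need to be measured.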
     
    Last edited: Jan 13, 2015
  17. Kiwasi


    Joined:
    Dec 5, 2013
    Posts:
    16,860
    Unrolling everything is bad practice.

    Bad practice in hot code is good game development.
     
  18. angrypenguin


    Joined:
    Dec 29, 2011
    Posts:
    15,500
    It's not necessarily bad practice. I'd definitely call it terrible code, but that's based on what I think it'd be like to work with, not what I think about the person who wrote it. I've written some shocking code in my time, but being a pragmatic coder sometimes means writing icky code to get the job done in less-than-ideal circumstances. A compiler not inlining for you where you need the performance would definitely fall in that bucket.

    I'm pretty disappointed at the lack of inlining. I'd answer some of your questions above, Todd, but am on a tablet right now and it's not my fastest/favourite typing experience. Might come back later if I remember.
     
  19. hippocoder


    Digital Ape Moderator

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    It's not a unity physics article, it's a function overhead article.
     
  20. Todd-Wasson


    Joined:
    Aug 7, 2014
    Posts:
    1,077
    Haha, ok, I can live with that. :)

    Yes, that's true of course. I'll probably keep the article that way because I had Unity developers in mind when writing it. It should apply elsewhere too of course.

    One more test, this time a cross product with a dot. What I see a lot of is people packing lots of cross and dot products into one line like this:

    Code (csharp):
    void TestVector3CrossAndDot()
    {
        float dot;

        for (int i = 0; i < numberOfLoops; i++)
        {
            dot = Vector3.Dot(Vector3.Cross(vect[0], vect[1]), vect[2]);
        }
    }

    void TestVector3CrossAndDotUnrolled()
    {
        float dot;
        Vector3 cross;

        for (int i = 0; i < numberOfLoops; i++)
        {
            cross.x = vect[0].y * vect[1].z - vect[0].z * vect[1].y;
            cross.y = vect[0].z * vect[1].x - vect[0].x * vect[1].z;
            cross.z = vect[0].x * vect[1].y - vect[0].y * vect[1].x;

            dot = cross.x * vect[2].x +
                  cross.y * vect[2].y +
                  cross.z * vect[2].z;
        }
    }
    The unrolled version ran somewhere around 1100% to 1240% faster, so it appears likely that this just gets worse as you pack more stuff into a single line. Of course to be really thorough I should just do a Cross by itself, but I think the general idea is good anyway. There's no inlining happening anywhere.
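    For anyone wanting the Cross-by-itself comparison, the unrolled component formulas above can be lifted into a plain function and sanity-checked outside Unity (a sketch; the V3 struct and CrossCheck names are my own):

```csharp
using System;

// Minimal stand-in for a vector type, just to exercise the formulas.
struct V3 { public float x, y, z; }

static class CrossCheck
{
    // The unrolled cross product from the test above, as a plain function.
    public static V3 Cross(V3 a, V3 b)
    {
        V3 c;
        c.x = a.y * b.z - a.z * b.y;
        c.y = a.z * b.x - a.x * b.z;
        c.z = a.x * b.y - a.y * b.x;
        return c;
    }
}
```

    A timing test of this against Vector3.Cross would follow the same pattern as the earlier loops.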

    What I see a lot of in the forums here are people advising folks to revisit their basic algorithms when they're getting into their big bottlenecks. This is good of course, but what I don't see anyone doing is telling them to unroll their math and quit packing a million things into a giant line filled with cross and dot products and "new Vector3" statements with math overloads and so on. As though one line of code runs faster than two or ten. There is often more to be gained in simply unrolling things to be more CPU friendly than there is in rethinking the algorithms. Of course it's best to do both, but this whole area seems missing from the conversations here which I've always found rather mystifying.
     
  21. hippocoder


    Digital Ape Moderator

    Joined:
    Apr 11, 2010
    Posts:
    29,723
    To be fair, your situation is most certainly a niche case. These are all good findings, but the numbers would have to be more than a hundred calls or so, which would be the upper-end maximum per frame for typical games. In your situation you are probably dealing with thousands, which is far from normal, and the practices you outline (manually unrolling) aren't at all surprising.

    But I really don't recommend people do this unless they know they're dealing with a larger amount of calls.
     
  22. Todd-Wasson


    Joined:
    Aug 7, 2014
    Posts:
    1,077
    I guess this one comes down more to opinion and priorities than anything else. In VRC Pro every car has 76 separately animated parts that matched what the physics engine was doing exactly, something I've never seen anyone else do. They usually fake their suspension animations which aren't particularly linked to what the physics engine is actually doing. This is true even in iRacing, arguably the grand-daddy of racing sims these days.

    My personal philosophy is that I'm trying to create something, and I'll do whatever I can to create it. What people think is good or bad code to me is not so much an issue. If some guy's "good code" means he can't do an animated suspension system at all, or it runs at 1/10th the speed of my "bad code," I'd suggest their own ideas about good and bad need rethinking.

    I don't want to get into a big philosophical discussion about programming. People can call it what they want, if method A is 10 or 20 times faster than method B, I'd call method A "good" and method B "bad," especially if method B is not even capable of doing the job at hand. Method B looks nice and is easier to work with, well, who cares if it can't do what it's supposed to do anyway? What's good about that?

    Anyway, we're probably better off in this thread discussing specific optimizations than having some abstract conversation about good and bad programming practices.
     
    Last edited: Jan 14, 2015
  23. Todd-Wasson


    Joined:
    Aug 7, 2014
    Posts:
    1,077
    Oh, for sure. Like I said in the article, if you're just doing Angry Birds or something like that, then no, don't bother. It's only worth looking at if it's becoming a bottleneck and you're looking to eke out more performance. We're in agreement on that one.
     
  24. angrypenguin


    Joined:
    Dec 29, 2011
    Posts:
    15,500
    Yes, if code doesn't do what it's meant to do then it's not "good". As I said, when I mentioned "terrible" I was referring to what it'd be like to work with, not the quality of the output. As a pragmatic coder, getting the job done is first and foremost.

    Next is robustness and maintainability, though. If I can increase those things without decreasing my ability to get the job done then I will, because that increases my ability to get future jobs done better, faster, and/or to higher standards.

    Also, it's worth noting that even you said yourself in the article that an increase in code speed does not necessarily correlate to an increase in application speed. That's well worth remembering. I'm not going to hand-code a function in dozens of separate places to increase the execution speed of a piece of code if it doesn't significantly increase the speed of my app. Nobody cares how fast the code is, all they care about is that the user experience is smooth and looks/sounds good.
     
  25. angrypenguin


    Joined:
    Dec 29, 2011
    Posts:
    15,500
    I don't understand this comment. What's getting worse? The compiler doesn't care about white space or how you break things up over lines.
     
  26. kdubnz


    Joined:
    Apr 19, 2014
    Posts:
    177
    Just a note regarding your loop counting.

    I'm wondering if they should be constructed like this:
    Code (CSharp):
    for (var i = 0; i < numberOfLoops; i++)
    {
        TestVector3CrossAndDot();
    }
    for (var i = 0; i < numberOfLoops; i++)
    {
        TestVector3CrossAndDotUnrolled();
    }
    rather than with the loop in the method being tested.

    Regards,
    Kerry
     
  27. angrypenguin


    Joined:
    Dec 29, 2011
    Posts:
    15,500
    That would add a method call overhead to every iteration. The point of the tests is to see the benefit of removing that overhead.
     
  28. Todd-Wasson


    Joined:
    Aug 7, 2014
    Posts:
    1,077
    Yep, I agree with all that. There's no point of course in complicating code or spending any time trying to get something that represents 0.001% of the CPU load to run twice as fast. It's just not worth the trouble then, and in that case maintainability and so forth would take precedence. In the boat sim I didn't start thinking about and experimenting with this stuff until my end of the physics code had finally hit about 80% of the CPU. I got it down to about 8%, including adding quite a bit of new stuff. I don't bother for stuff that gets calculated one time, only for stuff that will make a noticeable difference. Production time matters too, of course.
     
  29. kdubnz


    Joined:
    Apr 19, 2014
    Posts:
    177
    Yes, and in practice that is what happens: the method gets called (once) ... and usually with parameters and a return value.
     
  30. angrypenguin


    Joined:
    Dec 29, 2011
    Posts:
    15,500
    Measuring how long it takes something small to execute once can't be done accurately. So you measure many iterations (thousands), with as little additional overhead per iteration as possible.
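    Outside the Unity profiler, the same many-iterations approach can be sketched with System.Diagnostics.Stopwatch (the Bench class, iteration count, and workload here are my own illustration):

```csharp
using System;
using System.Diagnostics;

static class Bench
{
    // Times many iterations of a tiny operation and prints the average,
    // since one iteration is far too small to measure on its own.
    // Returns the computed value so the work can't be optimized away.
    public static float TimedDot(int iterations)
    {
        float ax = 1, ay = 2, az = 3, bx = 4, by = 5, bz = 6;
        float dot = 0;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            dot = ax * bx + ay * by + az * bz; // unrolled dot product
        }
        sw.Stop();

        Console.WriteLine("avg per iteration: " +
            (sw.Elapsed.TotalMilliseconds / iterations) + " ms");
        return dot;
    }
}
```

    The per-iteration overhead of the loop itself is shared by every variant being compared, so it mostly cancels out.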

    But also note that this particular case is not about stuff that only gets called once.
     
  31. fire7side


    Joined:
    Oct 15, 2012
    Posts:
    1,819
    Thanks for the demonstration. I'll keep it locked away in case I do need to optimize something. The unrolled method didn't look too bad, anyway.
     
  32. Todd-Wasson


    Joined:
    Aug 7, 2014
    Posts:
    1,077
    I meant packing together multiple functions into one statement via nested function calls with "new" statements and computations included, etc., not that it's literally on one line. Poor choice of word on my part there.
     
  33. Todd-Wasson


    Joined:
    Aug 7, 2014
    Posts:
    1,077
    For instance, in my last example:

    Code (csharp):
    1.  
    2. dot =Vector3.Dot(Vector3.Cross(vect[0], vect[1]), vect[2]);
    3.  
    Here's a simple case of what I meant by cramming stuff into one line (I should have said "statement"): a Cross() inside of a Dot().

    Maybe the compiler is smart enough to figure that out and make it as fast as if it were two separate calls, I don't know; I'd have to test the Cross by itself to know. But I can't help thinking that as people nest them further and further, at some point they're bound to start slowing things down. I haven't tested this kind of thing other than that last bit tonight, so I can't really say for sure. What I see a lot of in the forums are things looking something like the following, where many people seem to try to pack as much functionality into one statement as they can:

    Code (csharp):
    1.  
    2. Vector3 finalVector = new Vector3(transform.right.x * 0.3f, transform.up.y * 0.6f, transform.forward.z * 1.2f) * Vector3.Dot(Vector3.Cross(vect[0]*a, vect[1]*b), vect[2]*c);
    3.  
    That type of thing just can't be good even without the transform properties and "new" in there. I prototype stuff like this all the time, but eventually unroll it if it's speed critical stuff.

    Another thing I found a few months ago was that Mathf.Abs() is very slow for some reason. I'd have to double check exactly what I did, but I seem to remember this:

    Code (csharp):
    1.  
    2. float absValue = Mathf.Sqrt(value * value);
    3.  
    ...being several times faster than Mathf.Abs(). Strange, but that's what the profiler said. A friend pointed me to an unsafe alternative that was almost as fast as the sqrt() trick:

    Code (csharp):
    1.  
    2.     unsafe float myAbs(float x)
    3.     {
    4.         // copy and re-interpret as 32 bit integer
    5.         int casted = *(int*)&x;
    6.         // clear highest bit
    7.         casted &= 0x7FFFFFFF;
    8.  
    9.         // re-interpret as float
    10.         return *(float*)&casted;
    11.     }
    12.  
    I'd have to double check to be sure, but I seem to remember this being about the same speed as the sqrt() trick, only around 10% difference (I forget which was faster now). And people get all stuffy about sqrt() because they think it's so slow. Heck no, it's faster than taking the absolute value of a number, at least with the Mathf class.
     
  34. angrypenguin

    angrypenguin

    Joined:
    Dec 29, 2011
    Posts:
    15,500
    I don't understand what you think the compiler is doing there. That literally is just two function calls, and no special smarts are required to recognise that. There's nothing special about a call being a parameter for another call: when the inner call returns, its return value is passed to the outer call just as if it were any other parameter. I don't believe compilers do anything special for nested calls, and I don't think anything changes for either call to accommodate the nesting.

    If I may, I recommend doing some reading on the various stages of how compilers work. It's not magic, and it'll clear up a lot of stuff for you which you seem to be on the cusp of understanding (certainly you're asking questions and looking to understand stuff many people just overlook).

    As for the sqrt thing, I'd hazard a guess that it's about the cost of branching rather than the calculation itself. In tight loops, CPUs are generally way faster when there's no branching. If you think about a computationally cheap way to implement Abs(x), you do something like "if the input is less than zero, multiply by negative one; return the input". That has a branch in it, and if the CPU doesn't correctly predict which branch to take it'll stall. On the other hand, even though the calculation is more expensive, there is no branch in "return the root of input * input", so it will never stall. Check out "Branch Predictor" on Wikipedia.

    And with regard to sqrt being slow, what's really slow is memory access. A couple of decades ago it made sense to store the values of stuff like sqrts in lookup tables or to cache calculated values. CPU speed increases have far exceeded memory access speed increases, though, so things are different now - it's often faster to re-calculate than to fetch something not in the CPU cache. I remember a PS3 optimisation expert telling me that something like 5 sqrts could be done in the time it took for one non-cached memory fetch.
     
    Last edited: Jan 15, 2015
  35. Todd-Wasson

    Todd-Wasson

    Joined:
    Aug 7, 2014
    Posts:
    1,077
    The dot/cross was just the simplest example I could come up with to illustrate the point of nesting functions. What you were supposed to do is imagine extending that to something much more complicated, not get all pedantic on me. ;)

    What I was attempting to communicate was the general idea of nesting functions too deeply in a single statement potentially being slower sometimes with some compilers, not so much the specific example of two functions nested, one being a dot and the other being a cross. What I had in mind was the idea of one line (sorry, "statement," I'm 40, the lingo has changed a bit over the decades) being faster than many. I went on to make the example of a much bigger nested function to remake the point. Granted, I didn't post any test results, so here we go:

    Code (csharp):
    1.  
    2. void TestBigUglyStatement()
    3.     {
    4.         float a = 0.1f;
    5.         float b = 0.2f;
    6.         float c = 0.3f;
    7.         float d = 0.4f;
    8.         float e = 0.5f;
    9.         float f = 0.6f;
    10.  
    11.         for (int i = 0; i < numberOfLoops; i++)
    12.         {
    13.             Vector3 finalVector = new Vector3(a*d, b*e, c*f) * Vector3.Dot(Vector3.Cross(vect[0] * d, vect[1] * e), vect[2] * f);
    14.         }
    15.  
    16.     }
    17.  
    18.     void TestBigUglyStatementBrokenDown()
    19.     {
    20.         Vector3 finalVector;
    21.         Vector3 vectorTerm1;
    22.         Vector3 cross;
    23.         float a = 0.1f;
    24.         float b = 0.2f;
    25.         float c = 0.3f;
    26.         float d = 0.4f;
    27.         float e = 0.5f;
    28.         float f = 0.6f;
    29.         float dot;
    30.  
    31.         for (int i = 0; i < numberOfLoops; i++)
    32.         {
    33.             vectorTerm1.x = a*d;
    34.             vectorTerm1.y = b*e;
    35.             vectorTerm1.z = c*f;
    36.             cross = Vector3.Cross(vect[0] * d, vect[1] * e);
    37.             dot = Vector3.Dot(cross,vect[2] * f);
    38.             finalVector = vectorTerm1 * dot;
    39.         }
    40.     }
    41.  
    42.     void TestBigUglyStatementUnrolled()
    43.     {
    44.         Vector3 finalVector;
    45.         Vector3 vectorTerm1;
    46.         Vector3 cross;
    47.         float a = 0.1f;
    48.         float b = 0.2f;
    49.         float c = 0.3f;
    50.         float d = 0.4f;
    51.         float e = 0.5f;
    52.         float f = 0.6f;
    53.         float dot;
    54.  
    55.         for (int i = 0; i < numberOfLoops; i++)
    56.         {
    57.             vectorTerm1.x = a * d;
    58.             vectorTerm1.y = b * e;
    59.             vectorTerm1.z = c * f;
    60.  
    61.             Vector3 v0;
    62.             v0.x = vect[0].x * d;
    63.             v0.y = vect[0].y * d;
    64.             v0.z = vect[0].z * d;
    65.  
    66.             Vector3 v1;
    67.             v1.x = vect[1].x * e;
    68.             v1.y = vect[1].y * e;
    69.             v1.z = vect[1].z * e;
    70.  
    71.             Vector3 v2;
    72.             v2.x = vect[2].x * f;
    73.             v2.y = vect[2].y * f;
    74.             v2.z = vect[2].z * f;
    75.  
    76.             // Cross product written out by hand.
    77.             cross.x = v0.y * v1.z - v0.z * v1.y;
    78.             cross.y = v0.z * v1.x - v0.x * v1.z;
    79.             cross.z = v0.x * v1.y - v0.y * v1.x;
    80.  
    81.             // Dot product written out by hand.
    82.             dot = cross.x * v2.x +
    83.                   cross.y * v2.y +
    84.                   cross.z * v2.z;
    85.  
    86.             finalVector.x = vectorTerm1.x * dot;
    87.             finalVector.y = vectorTerm1.y * dot;
    88.             finalVector.z = vectorTerm1.z * dot;
    89.         }
    90.     }
    91.  
    Three tests here. The original function was 3.71ms, the "BrokenDown" version was 3.44ms where it's not in one line anymore. Granted, that speedup is probably because I removed the "new" every cycle and skipped the vector overload op, so it probably doesn't really say anything about the compiler other than it doesn't inline the cross and dot calls. The compiler does a good job with these two nested functions.

    The big hairy one at the end is just a super long version of the original line if I didn't make any mistakes (entirely likely). That one took 0.21ms, some 1700% faster than the original one, but that's just illustrating the function overhead from before and doesn't say anything about the compiler stuff you're talking about. I.e., it's not because the nested function calls were separated, it's because they were removed.

    Anyway, I was just hoping this would evolve into a friendly discussion about speed tips and these kinds of optimizations. I came from a forum where such discussions were quite commonplace; here in the Unity forums that doesn't appear to be the case so much. I started programming at a time when this kind of deep nesting of function calls into a single line (statement) was to be avoided like the plague. Today it's not as important, I know that, but I still keep an eye on it and at least test it if it's speed critical, just to be sure. I don't blindly trust any compiler to always do things the fastest possible way where speed is important. Let's remember that pretty much everybody here assumed their Vector3 ops were all getting inlined... So I figure why not check to be sure? :)
     
    Kiwasi likes this.
  36. Todd-Wasson

    Todd-Wasson

    Joined:
    Aug 7, 2014
    Posts:
    1,077
    So does that suggest that maybe the Vector3 class has an "if" statement in it? Ok, that makes sense.

    I must admit I have read very little about compilers. Most of what I've learned about them has been from veteran game programmers, one of whom has been at this since the '80s and writes assets for Unity now. What I got from him is basically that compilers are always changing and they're all different from each other, so you can't ever be sure what a compiler is doing unless you test it. So that's what I do when speed matters: experiment and see what's fastest without making it so stupidly complicated I can't keep track of it. I do reel it in just a tad now and then. Most of the time I can beat the compiler on the speed end. In Unity it's easy, of course, because of the method calls everywhere.

    The physics engine I wrote for VRC Pro wasn't even done in an object-oriented language, if you can believe that. It was a bit more than 100,000 lines of procedural code with no classes at all, so all functions were global. The benefit was that all arrays were accessible from everywhere in the code, which made it easy (for me) to add stuff. You had to remember what everything did. Not too hard when you write it yourself, and it was a DLL so nobody else ever saw or worked on the code.

    Granted, if I handed it to someone else there is just no way they could do anything with it. I imagine if you saw it you'd barf, it's just filled with computations all broken down like I've got here. So it's nasty to look at, but it's pretty darned fast considering the amount of work it does. That language had a timer in it that could read back individual CPU cycles which made optimizing things fun.



    I've only been programming in an OOP language for a bit over a year now. It's interesting and I enjoy it, but I am still learning the ropes. I don't see going back to my old language.
     
    Last edited: Jan 15, 2015
  37. Todd-Wasson

    Todd-Wasson

    Joined:
    Aug 7, 2014
    Posts:
    1,077
    By the way, if you have any good links on this I would appreciate it. You're correct, there's a lot there that I don't know. :)
     
  38. angrypenguin

    angrypenguin

    Joined:
    Dec 29, 2011
    Posts:
    15,500
    Yeah, where performance is important that's always the best approach. Test specific approaches in your specific target environment and choose one based on the results.

    I wasn't trying to get pedantic. I didn't (and still don't) understand what you're getting at, so I'm trying to clarify. And in any case, when talking about micro-optimization like this, details are critical.

    On that note, I'm not sure that your code examples for nested vs. not nested are equivalent. (Edit: Or perhaps they are. I can't see what made me think that anymore...) I did the following...

    Code (csharp):
    1.  
    2. System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
    3.  
    4.         Vector3 vec1 = new Vector3(Random.Range(0.0f, 10.0f), Random.Range(0.0f, 10.0f), Random.Range(0.0f, 10.0f));
    5.         Vector3 vec2 = new Vector3(Random.Range(0.0f, 10.0f), Random.Range(0.0f, 10.0f), Random.Range(0.0f, 10.0f));
    6.         Vector3 intermediate;
    7.         float result;
    8.         Debug.Log ("Vector 1: " + vec1 + " Vector 2: " + vec2);
    9.  
    10.  
    11.         sw.Reset ();
    12.         sw.Start();
    13.         for (int i = 0; i < 100000000; i++) {
    14.             intermediate = Vector3.Cross(vec1, vec2);
    15.             result = Vector3.Dot(intermediate, vec2);
    16.         }
    17.         sw.Stop();
    18.         Debug.Log ("Split calls: " + sw.ElapsedMilliseconds);
    19.  
    20.         sw.Reset();
    21.         sw.Start();
    22.         for (int i = 0; i < 100000000; i++) {
    23.             result = Vector3.Dot(Vector3.Cross(vec1, vec2), vec2);
    24.         }
    25.         sw.Stop();
    26.         Debug.Log ("Nested calls: " + sw.ElapsedMilliseconds);
    ...where you'll notice that the only difference in the code is that in the "split" version I store the result of the Cross in an intermediate variable which is then passed to the second function, whereas in the "nested" version I put the method call directly inside the Dot call.

    Unlike the pasted code I'm actually running the loops three times each, with consistent results approximately as follows:
    Code (csharp):
    1. Split calls: 2456
    2. Nested calls: 2229
    I also tried with the intermediate variable declared locally in the loop. This may have provided a small speed increase, but the difference in numbers is too small for me to be sure (eg: "Split calls: 2443").

    In any case, the difference is small, and the nested calls come out in front very slightly, probably because there's no additional intermediate variable to write to.
     
    Last edited: Jan 15, 2015
  39. angrypenguin

    angrypenguin

    Joined:
    Dec 29, 2011
    Posts:
    15,500
    Alas all of my knowledge on the matter came from university, so I couldn't recommend any links you wouldn't be able to find for yourself.
     
  40. Todd-Wasson

    Todd-Wasson

    Joined:
    Aug 7, 2014
    Posts:
    1,077
    You're just back to doing a Dot and Cross... Forget it, I give up for today. ;)
     
  41. angrypenguin

    angrypenguin

    Joined:
    Dec 29, 2011
    Posts:
    15,500
    Who cares what they're doing? Your assertion is that nested calls may be slower than calls that are split out. Why make the test any more complicated and/or obfuscated than it has to be?

    Edit: On that note, my tests could probably be improved by using methods that are more trivial, to minimise the time spent doing dot/cross product calculations.
     
    Last edited: Jan 15, 2015
    Kiwasi likes this.
  42. Todd-Wasson

    Todd-Wasson

    Joined:
    Aug 7, 2014
    Posts:
    1,077
    You're only going one level deep, the minimum nesting that can be done. The compiler can handle that just fine. For some reason you latched onto that dot/cross example that's only one level deep, which I only posted to illustrate nesting, and I can't seem to get you off that.

    Try going a lot deeper with it, throw in some "new" statements and math like people are doing everywhere in their scripts, then see if you can make it faster by making it more than one statement.
     
  43. Todd-Wasson

    Todd-Wasson

    Joined:
    Aug 7, 2014
    Posts:
    1,077
    I'm not sure if this is really the same thing, but worth a look anyway:

    http://www.dotnetperls.com/method-call-depth

    Again, the compiler appears to sort this out fine for our simplest possible dot/cross example; there's no point to be made there. What I'm suggesting is that going a lot further, as many people do, may not be as good on the performance side.
     
  44. angrypenguin

    angrypenguin

    Joined:
    Dec 29, 2011
    Posts:
    15,500
    That's not the same thing.

    In this thread we've been talking about this:
    Code (csharp):
    1. result = FunctionA(FunctionB());
    The article you've linked to is talking about this:
    Code (csharp):
    1. returnType FunctionA() {
    2.     FunctionB();
    3. }
    4.  
    5. returnType FunctionB() {
    6.     // Some stuff...
    7. }
    8.  
    The latter, from the linked article, means that one function is called from within another function. This of course increases the call stack depth, because that's how the stack pointer works - the calling function can't return until the called function returns.

    The former, what we've been talking about here, simply has the functions called sequentially in innermost to outermost order. This doesn't increase the call stack depth because the functions are executed one at a time - each is fully returned before the next begins.

    Note that, in an attempt not to be overly pedantic, I called the outer method call the "calling function" earlier, which isn't strictly correct.

    Going deeper is fine. Throwing in some "new" etc. is not, because that's overhead in areas other than what's being tested. You can easily enough write an expanded test that goes several levels deep, though note that as I said above you shouldn't get the call stack confused with calls-as-parameters.
     
  45. Todd-Wasson

    Todd-Wasson

    Joined:
    Aug 7, 2014
    Posts:
    1,077
    Ok, I see the difference. Thanks. So you're saying it doesn't matter at all if you do FunctionA(FunctionB(FunctionC())) for 50 functions? I'm sure at some point you'd hit some kind of limit, but that's not what I mean. I mean below that limit, say 10 or 20 functions maybe, performance-wise it's always the same?

    Unfortunately my test had some unrolling even in the simple case, so it didn't really prove anything. Maybe next time I'll try a deeper one. I could have sworn there was a time in history where this mattered quite a bit and depended somewhat on the function call details and the compiler.
     
  46. angrypenguin

    angrypenguin

    Joined:
    Dec 29, 2011
    Posts:
    15,500
    Well, there are different "calling conventions", so between that and the possibility of different compilers not behaving exactly the same way there probably are limits in some circumstances. Certainly in C# I've never had to worry about it, though.

    Well, even my trivial test showed that it's not the same. It's quite possible that the characteristics change under different circumstances or usage styles. If it's critical then it really comes back to testing on a per-case, as-needed basis for the platform at hand.
     
  47. Todd-Wasson

    Todd-Wasson

    Joined:
    Aug 7, 2014
    Posts:
    1,077
    Ok, so I may not be wrong or completely imagining it? Perhaps it's just not as big a deal as it used to be. I am fairly new to C#, just over a year of playing with it now. It may be a somewhat different animal than I'm used to. I appreciate the info.