Could Batched Vector Ops Speed Up Unity?

Discussion in 'General Discussion' started by Arowx, Jun 28, 2014.

  1. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    OK, you've written your game, the bullets are flying and the enemies are moving, but then you need to speed things up a bit, so you pull up the profiler and start optimising the problems.

    But what about the meta view? Each frame you are updating or manipulating a load of Vectors or Quaternions. In some cases you might even have an Enemy manager that loops through all of them and updates them.

    As I found out with a simple benchmark, each of these updates takes time (Unity was taking 12ms to get and set transform.position on 2000 objects on PC using Deep Profile). Part of that time is the overhead of calling the functions and passing the values, doing the calculation and returning the result.

    So could Unity benefit from batched getters, setters and operations, e.g. TransformArray.positions = positionArray;

    Or Vector3Array += scalarArrayOrVectorArray;

    Then Unity could under the hood take advantage of platform specific speed ups, e.g. SIMD instructions (on the CPU not the mono SIMD library), optimisations, multi-threading or GPU processing depending on platform.
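
    To make the idea concrete, here is a minimal sketch in plain C++ of a per-element add versus a batched add over a whole array. It is not Unity code: Vec3, add_one and add_batched are hypothetical names invented for illustration, and whether the batched loop actually gets vectorised with SIMD depends on the compiler and platform.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical plain vector type for illustration; not Unity's Vector3.
struct Vec3 {
    float x, y, z;
};

// Per-element style: one function call per vector, similar in spirit to a
// script loop that touches each object individually.
inline Vec3 add_one(Vec3 v, Vec3 delta) {
    return {v.x + delta.x, v.y + delta.y, v.z + delta.z};
}

// Batched style: one call handles the whole array. A tight loop over
// contiguous data like this is a candidate for compiler auto-vectorisation.
inline void add_batched(std::vector<Vec3>& values, const std::vector<Vec3>& deltas) {
    assert(values.size() == deltas.size());  // arrays must line up element by element
    for (std::size_t i = 0; i < values.size(); ++i) {
        values[i].x += deltas[i].x;
        values[i].y += deltas[i].y;
        values[i].z += deltas[i].z;
    }
}
```

    With 2000 vectors, the per-element style makes 2000 calls while the batched style makes one, which is where the saving in call overhead would come from.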

    Would you use batch-based features for performance optimisation if Unity provided them?
     
    Last edited: Jun 29, 2014
  2. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Update: for example, I was able to recreate the simple benchmark's 2000 sprites as a particle system with world collision, and on my PC the frame rate went from 270 to 700 fps.

    More info in this post http://forum.unity3d.com/threads/speed-comparison-2d-unity-vs-monkey.248926/#post-1681930

    It is only a simple example, but before you run off and rewrite your game using Shuriken (Unity's particle system), note that the trick to the speedup was removing the Update() method and its getter and setter overhead, passing all the work over to the Unity game engine's optimised core!

    So what speedups could we see with batched, platform-optimised operators?
     
  3. Whippets

    Whippets

    Joined:
    Feb 28, 2013
    Posts:
    1,775
    Whilst in no way do I understand what you're on about, you show some mighty impressive figures - I hope someone smarter than me is also reading this thread.
     
  4. Dantus

    Dantus

    Joined:
    Oct 21, 2009
    Posts:
    5,667
    @Arowx, did you try to set the localPosition in the original script instead of the position?
     
  5. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    I did, but it had no performance impact. In this case the sprites were not child objects, so I'm thinking there was no parent-level transform to affect them?!

    The code is in the thread; give it a whirl and see if you can improve the performance.
     
  6. Dantus

    Dantus

    Joined:
    Oct 21, 2009
    Posts:
    5,667
    The localPosition idea was the last bit I could find. As I found it kind of obvious, I expected that you had already tried it out.

    I believe right now it would be a waste of resources to concentrate on that kind of batching, because there is a huge potential in the upcoming il2cpp. In the best case we get SIMD as a side effect of it.
     
  7. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    I can see where you're coming from. If you have a less populated world/scene, then the move to C++ and native compilation will reduce the cost of code-to-engine function calls.

    But every function call has an impact. Most CPUs and GPUs have features that can speed up batches of calculations, and these features are platform specific. So I still believe that batched operators that can tap into these optimisations (where they make enough of a performance improvement) could allow developers to really push and show off what Unity can do.

    Now it might be that, until you crank up to hundreds or thousands of units battling in a huge arena, the performance improvements will not be significant, or could even be negative, e.g. because of the overhead of setting up the GPU to process the data.

    Even research into optimisation and performance improvements in this area could produce increased performance in other areas of the engine.
     
  8. angrypenguin

    angrypenguin

    Joined:
    Dec 29, 2011
    Posts:
    15,500
    The primary point of batching things into arrays is to avoid cache misses, and because of the way Transforms in Unity work I don't think that what you're suggesting would have any impact on that.
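
    As a rough illustration of the cache point, the sketch below contrasts reading positions from separately allocated objects with reading them from one packed array. It is plain C++ with hypothetical names (Entity, sum_x_scattered, sum_x_contiguous) and makes no claim about how Unity actually stores Transforms; it only shows the two memory layouts being compared.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct Vec3 {
    float x, y, z;
};

// Hypothetical game object: a position plus unrelated per-object data.
struct Entity {
    Vec3 position;
    char other_state[240];  // stands in for other fields that share the object
};

// Scattered layout: every object is its own heap allocation, so the loop
// follows one pointer per element and may touch a new cache line each time.
inline float sum_x_scattered(const std::vector<std::unique_ptr<Entity>>& entities) {
    float total = 0.0f;
    for (const auto& e : entities) {
        total += e->position.x;
    }
    return total;
}

// Contiguous layout: positions are packed back to back, so the loop reads
// memory sequentially and several positions share each cache line.
inline float sum_x_contiguous(const std::vector<Vec3>& positions) {
    float total = 0.0f;
    for (const Vec3& p : positions) {
        total += p.x;
    }
    return total;
}
```

    Both functions compute the same result; only the memory access pattern differs. A batched API that still has to visit scattered objects internally would not gain the contiguous-layout benefit.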

    Also, your comparison between Monkey and Unity isn't particularly useful, because they're completely different types of system. With that in mind, since you haven't even posted code (or have you?) we can't even see if you're comparing similar operations within those very different systems.
     
  9. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    @angrypenguin OK, this is about Unity, forget Monkey. In the course of optimising code to compete with Monkey I came up with the idea of batching common 3D operations.

    I was seeing in Deep Profile mode that every frame when I updated 2000 objects it was taking about 6ms to get the transform positions and another 6ms to set them.

    But if Unity had batched operations I could send an array of transforms and an array of vectors and it would within the engine update them with the aim to reduce the overhead that occurs when our C# scripts call into Unity's C++ core.
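
    A minimal sketch of that call pattern, in plain C++ rather than Unity code: MockEngine is a hypothetical stand-in that only counts how many times the caller crosses into it, and set_position / set_positions are invented names, not Unity API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 {
    float x, y, z;
};

// Hypothetical stand-in for an engine core. It stores positions and counts
// how many calls arrive from the "script" side.
class MockEngine {
public:
    explicit MockEngine(std::size_t count)
        : positions_(count, Vec3{0.0f, 0.0f, 0.0f}) {}

    // Per-object setter: one boundary crossing per object.
    void set_position(std::size_t id, Vec3 p) {
        assert(id < positions_.size());
        ++boundary_calls_;
        positions_[id] = p;
    }

    // Batched setter: one boundary crossing for the whole array.
    void set_positions(const std::vector<std::size_t>& ids,
                       const std::vector<Vec3>& new_positions) {
        assert(ids.size() == new_positions.size());
        ++boundary_calls_;
        for (std::size_t i = 0; i < ids.size(); ++i) {
            assert(ids[i] < positions_.size());
            positions_[ids[i]] = new_positions[i];
        }
    }

    std::size_t boundary_calls() const { return boundary_calls_; }
    Vec3 position(std::size_t id) const { return positions_.at(id); }

private:
    std::vector<Vec3> positions_;
    std::size_t boundary_calls_ = 0;
};
```

    Updating 2000 objects costs 2000 counted calls through set_position and a single counted call through set_positions. The sketch says nothing about how expensive each real C#-to-C++ crossing is; that would have to be measured.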

    I also thought that, since a lot of the time we are also doing calculations (e.g. the Update for an enemy does a look-at rotation, and there could be hundreds or thousands of enemies), batched operations could speed things up and also, and this is the important part, take advantage of platform-specific instruction sets or even GPU features.

    I'm wondering if batched operations could also improve memory performance, as intermediate calculation steps would not be created in mono, so maybe it could reduce the impact on the GC?!

    So to test it I removed the Update() function and turned the sprites into a particle system with world collision.

    On my PC the Optimised version can hit about 270 fps and the particle version can hit 700 fps.

    On Android the Optimised version hit about 40 fps and the particle version hit 52 fps.

    Here is the Unity package with the two versions in it https://dl.dropboxusercontent.com/u/19148487/Temp/QuickSpriteTest.unitypackage

    PS: It also contains my made-in-2-hours version of Pong! ;)
     
  10. calmcarrots

    calmcarrots

    Joined:
    Mar 7, 2014
    Posts:
    654
    If we could somehow reduce the processing time to 2 ums we could finally have 4K 360fps. With your solution, we are one step closer.
     
  11. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    LOL @calmcarrots, cool 4k at 360 fps or 3 x 4k screens at 120 fps for the full surround screen experience. :p
     
  12. calmcarrots

    calmcarrots

    Joined:
    Mar 7, 2014
    Posts:
    654
    UGH GENIUS! WHY CAN'T I HAVE A BRAIN LIKE YOURS?
     
  13. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194

    :p

    So, Unity, how about GPU-based Vector3 ops while the GPU is waiting for rendering commands?