IL2CPP - ReversePInvokeWrapper causing massive slowdowns

Discussion in 'Windows' started by fholm, Aug 4, 2019.

  1. fholm

    Joined:
    Aug 20, 2011
    Posts:
    2,052
    So I've spent a lot of time trying to figure out why we were seeing massive slowdowns when passing C# delegates to be invoked from a native plugin (using delegates + Marshal.GetFunctionPointerForDelegate). We were seeing something on the order of a 6x slowdown when the work is executed from the native callback vs calling it directly from C#.

    Initially I thought it had to do with P/Invoke overhead (i.e. when C# passes the delegate down to the native side), but that turned out to be a non-issue.

    So after a lot of debugging and massive amounts of profiling, I managed to drill down: the ReversePInvokeWrapper that IL2CPP generates is taking 6x longer than the actual work that needs to be done. We're looking at about 0.1ms of time per call consumed by the reverse P/Invoke wrapper vs 0.15ms of actual work being done inside the callback.

    I managed to instrument and profile a debug build running in IL2CPP, see the screenshot attached (this is using Orbit Profiler).

    entry 0: RadixSAP_BucketJobProcessorEnki is the actual work being done (avg ~21us)
    entry 1: ReversePInvokeWrapper_RadixSAP_BucketJobProcessorEnki is the reverse P/Invoke wrapper (avg ~138us)

    This means that the reverse P/Invoke wrapper is taking around 6x the time of the actual method being called.

    To add to the confusion, this only seems to be happening for some methods: if I call an empty method, there's no massive overhead. I'm at a loss as to how to figure out what's going wrong.

    Edit: So far I have only tested this on Windows and OSX IL2CPP builds, and it happens on both of them. I have not had time to dig into mobile/consoles yet.
     

  2. fholm

    Joined:
    Aug 20, 2011
    Posts:
    2,052
    Did some more digging. It seems like whenever a managed delegate is invoked from a native thread (one without a managed thread equivalent), it goes through something called ScopedThreadAttacher, which is causing insane slowdowns.
     
  3. Tautvydas-Zilys

    Unity Technologies

    Joined:
    Jul 25, 2013
    Posts:
    10,679
    ScopedThreadAttacher initializes the garbage collector and IL2CPP threading infrastructure for that thread. It should be really fast if it's already initialized, but if it's the first managed frame on that thread, it will do the full initialization (and it will do full uninitialization once it's done).
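    As a rough mental model of that behavior (this is not IL2CPP's actual source; ScopedAttachModel, ThreadRuntimeState and InvokeThroughWrapper are hypothetical names invented for illustration), the guard can be pictured as an RAII object that only does the expensive work when it is the outermost one on the thread:

```cpp
#include <cassert>

// Hypothetical stand-in for per-thread runtime state; not IL2CPP's real data.
struct ThreadRuntimeState {
    bool attached = false;
    int attachCount = 0;  // number of full initializations on this thread
    int detachCount = 0;  // number of full uninitializations on this thread
};

inline ThreadRuntimeState& CurrentThreadState() {
    thread_local ThreadRuntimeState state;
    return state;
}

// Models the described behavior: attach only if the thread is not attached
// yet, and detach only in the guard that did the attaching (the outermost one).
class ScopedAttachModel {
public:
    ScopedAttachModel()
        : state_(CurrentThreadState()), didAttach_(!state_.attached) {
        if (didAttach_) {
            state_.attached = true;  // stands in for the expensive GC/threading setup
            ++state_.attachCount;
        }
    }
    ~ScopedAttachModel() {
        if (didAttach_) {
            state_.attached = false;  // stands in for the expensive teardown
            ++state_.detachCount;
        }
    }
    ScopedAttachModel(const ScopedAttachModel&) = delete;
    ScopedAttachModel& operator=(const ScopedAttachModel&) = delete;

private:
    ThreadRuntimeState& state_;
    bool didAttach_;
};

// Invokes `callback` `count` times, each call behind its own guard, the way a
// reverse P/Invoke wrapper would. Returns how many full attaches this caused.
inline int InvokeThroughWrapper(void (*callback)(), int count) {
    assert(callback != nullptr);
    const int before = CurrentThreadState().attachCount;
    for (int i = 0; i < count; ++i) {
        ScopedAttachModel guard;
        callback();
    }
    return CurrentThreadState().attachCount - before;
}
```

    In this model, calling InvokeThroughWrapper on a bare native thread pays the full attach/detach once per call, while the same loop executed with an outer guard already alive pays nothing extra.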

    You have several options:

    1. Create the thread from managed code, and P/Invoke into your native thread entry point.
    2. Before calling managed methods many times where performance matters, make a native -> managed -> native call first, and then invoke your performance-sensitive managed callbacks from inside that inner native call, so there's already a managed stack frame on the stack. That way, ScopedThreadAttacher will practically be a no-op.
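    The native side of option 2 could be sketched roughly as follows. All names here (ManagedEntry, RunHotLoop, RunBatchViaManagedEntry, the Fake* stand-ins) are hypothetical and not part of any Unity or IL2CPP API. The assumption is that the C# side passes in two delegate pointers obtained through Marshal.GetFunctionPointerForDelegate: a trampoline that does nothing except call back into the native loop function it is handed, and the hot callback itself.

```cpp
#include <cassert>

// Performance-sensitive managed callback (assumed to be a delegate pointer).
using HotCallback = void (*)(void* userData);
// Native function that the managed trampoline calls straight back into.
using RunLoopFn = void (*)(void* loopContext);
// Managed trampoline: its only job is to call runLoop(loopContext).
using ManagedEntry = void (*)(RunLoopFn runLoop, void* loopContext);

struct LoopContext {
    HotCallback hot;
    void* userData;
    int iterations;
};

// Native loop. It runs while the trampoline's managed frame is still on the
// stack, so each hot callback finds the thread already attached.
void RunHotLoop(void* loopContext) {
    auto* ctx = static_cast<LoopContext*>(loopContext);
    for (int i = 0; i < ctx->iterations; ++i) {
        ctx->hot(ctx->userData);
    }
}

// Called on the native worker thread: one native -> managed transition up
// front, then managed -> native into RunHotLoop, then the hot calls.
void RunBatchViaManagedEntry(ManagedEntry entry, HotCallback hot,
                             void* userData, int iterations) {
    assert(entry != nullptr && hot != nullptr);
    LoopContext ctx{hot, userData, iterations};
    entry(&RunHotLoop, &ctx);
}

// Stand-ins for the managed side so the sketch runs without a C# runtime.
void FakeManagedEntry(RunLoopFn runLoop, void* loopContext) {
    runLoop(loopContext);
}
void FakeHotCallback(void* userData) {
    ++*static_cast<int*>(userData);
}
```

    With this shape the expensive attach and detach would be paid once per batch instead of once per callback, at the cost of routing each batch through the extra trampoline call.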
     
  4. fholm

    Joined:
    Aug 20, 2011
    Posts:
    2,052
    Hey! Thanks for these ideas. Yeah, currently this is butchering performance, because the ScopedThreadAttacher destructor detaches the thread again after every reverse P/Invoke call is over.

    Both of these ideas are somewhat cumbersome given the task and threading library I'm using. Is there no way to tell IL2CPP to keep the managed thread infrastructure and GC initialization around, since I will be using it over and over? Or is there a way to tell IL2CPP not to initialize it at all? The code I'm invoking is a pure static function (it depends on no outside values) and does not touch any C# reference types; it only works with native memory blocks and pointers.
     
  5. Tautvydas-Zilys

    Unity Technologies

    Joined:
    Jul 25, 2013
    Posts:
    10,679
    Unfortunately there isn't. The thread needs to detach upon leaving the last managed stack frame, because otherwise the runtime can deadlock on exit: it cannot uninitialize threads while they're still running.

    And there isn't any mechanism to tell IL2CPP not to initialize a thread - that would be dangerous and could lead to random crashes if you were to accidentally touch the managed heap or do any operation that calls into IL2CPP runtime.
     
  6. adammpolak

    Joined:
    Sep 9, 2018
    Posts:
    450
    @fholm I am about to go on a similar journey to yours, using reverse p/invoke to stream native data into managed Unity.

    Any chance you have an example of how you ended up getting it done? :D