IL2CPP - ReversePInvokeWrapper causing massive slowdowns

fholm · Aug 4, 2019

So I've spent a lot of time trying to figure out why we were seeing massive slowdowns when passing C# delegates to be invoked from a native plugin (using delegates + Marshal.GetFunctionPointerForDelegate). We were seeing something on the order of a 6x slowdown when the work is executed from the native callback vs calling it directly from C#.
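For readers who want to picture the native side of this setup, here is a minimal sketch, assuming a plugin that receives a plain function pointer and calls it from a thread it created itself. The names `JobCallback` and `run_jobs_on_native_thread` are hypothetical and do not come from the plugin discussed in this thread; on the C# side the pointer would be the one returned by Marshal.GetFunctionPointerForDelegate.

```cpp
#include <cassert>
#include <thread>

// Hypothetical callback signature. On the C# side this would correspond to a
// delegate whose pointer came from Marshal.GetFunctionPointerForDelegate.
using JobCallback = void (*)(void* user_data);

// Hypothetical plugin entry point: invokes `callback` `count` times on a
// thread created by native code, i.e. a thread the managed runtime did not
// create and knows nothing about until the first callback arrives.
void run_jobs_on_native_thread(JobCallback callback, void* user_data, int count) {
    assert(callback != nullptr);
    std::thread worker([callback, user_data, count] {
        for (int i = 0; i < count; ++i) {
            callback(user_data);  // under IL2CPP this enters the reverse P/Invoke wrapper
        }
    });
    worker.join();
}
```

Every `callback(user_data)` call here is a separate native -> managed transition on a thread with no managed frames on its stack, which is the situation being profiled in this post.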

Initially I thought it had to do with PInvoke (i.e. when C# passes the delegate down to native side) overhead, but that turned out to be a non-issue.

So after a lot of debugging and massive amounts of profiling, I managed to drill down: the ReversePInvokeWrapper that IL2CPP generates is taking 6x longer than the actual work that needs to be done. We're looking at about 0.1ms of time per call consumed by the reverse pinvoke wrapper vs 0.015ms of actual work being done inside the callback.

I managed to instrument and profile a debug build running in IL2CPP, see the screenshot attached (this is using Orbit Profiler).

entry 0: RadixSAP_BucketJobProcessorEnki is the actual work being done (avg ~21us)
entry 1: ReversePInvokeWrapper_RadixSAP_BucketJobProcessorEnki is the pinvoke wrapper (avg ~138us)

This means that the reverse p-invoke wrapper is taking around 6x the time of the actual method being called.

To add to the confusion, this only seems to be happening for some methods - if I call an empty method, there's no massive overhead. I'm at a loss as to how to figure out what's going wrong.

Edit: So far I have only tested this on Windows and OSX IL2CPP builds, it happens on both of them - not had time to dig into mobile/consoles yet.

fholm · Aug 4, 2019

Did some more digging, it seems like whenever a managed delegate is invoked from a native thread (without a managed thread equivalent) it goes through something called ScopedThreadAttacher... which is causing insane slowdowns.

Tautvydas-Zilys · Aug 4, 2019

ScopedThreadAttacher initializes the garbage collector and IL2CPP threading infrastructure for that thread. It should be really fast if it's already initialized, but if it's the first managed frame on that thread, it will do the full initialization (and it will do full uninitialization once it's done).
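As a rough illustration of the behaviour described above, here is a toy model, not the IL2CPP source. It assumes the attach logic can be approximated by a per-thread count of managed frames: the expensive setup runs only when that count goes from 0 to 1, and the expensive teardown runs when it drops back to 0. All names (`ThreadState`, `ScopedAttachModel`, `wrapped_callback`, `count_inits_for_calls`) are made up for this sketch.

```cpp
#include <cassert>

// Toy model only: counts how often the expensive per-thread setup and
// teardown would run. It performs no real GC or threading work.
struct ThreadState {
    int managed_depth = 0;   // managed frames currently on this thread's stack
    int full_inits = 0;      // times the expensive initialization ran
    int full_teardowns = 0;  // times the expensive uninitialization ran
};

thread_local ThreadState g_thread_state;

// RAII guard standing in for the attach/detach step of a reverse P/Invoke wrapper.
class ScopedAttachModel {
public:
    ScopedAttachModel() {
        if (g_thread_state.managed_depth++ == 0) {
            ++g_thread_state.full_inits;  // first managed frame: full init
        }
    }
    ~ScopedAttachModel() {
        assert(g_thread_state.managed_depth > 0);
        if (--g_thread_state.managed_depth == 0) {
            ++g_thread_state.full_teardowns;  // last managed frame left: full uninit
        }
    }
    ScopedAttachModel(const ScopedAttachModel&) = delete;
    ScopedAttachModel& operator=(const ScopedAttachModel&) = delete;
};

// Stand-in for a generated wrapper around a managed method.
void wrapped_callback() {
    ScopedAttachModel attach;
    // ... the managed work would run here ...
}

// Calls the wrapper `n` times with no managed frame underneath and returns
// how many full initializations that cost in this model.
int count_inits_for_calls(int n) {
    g_thread_state = ThreadState{};
    for (int i = 0; i < n; ++i) {
        wrapped_callback();
    }
    return g_thread_state.full_inits;
}
```

In this model, `count_inits_for_calls(n)` returns `n`: each call from a bare native thread is the first managed frame, so each one pays for the full init and the full teardown.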

You have several options:

1. Create the thread from managed code, and P/Invoke into your native thread entry point
2. Before calling managed methods many times where performance matters, do a native -> managed -> native call and then call your performance sensitive managed callbacks from there, so there's a managed stack frame on the stack. That way, ScopedThreadAttacher will practically be a no-op.
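Option 2 can be sketched with a small simulation. This is a toy model under stated assumptions, not IL2CPP code: `AttachModel` approximates the attach cost with a per-thread frame counter, and `hot_callback` and `outer_managed_trampoline` are plain C++ stand-ins for what would be managed methods behind reverse P/Invoke wrappers. All names are hypothetical.

```cpp
#include <cassert>

// Toy model only: per-thread count of managed frames and of expensive attaches.
thread_local int g_managed_depth = 0;
thread_local int g_full_attaches = 0;

struct AttachModel {
    AttachModel()  { if (g_managed_depth++ == 0) ++g_full_attaches; }
    ~AttachModel() { --g_managed_depth; }
};

using Callback = void (*)(void* user_data);

struct Batch {
    Callback callback;  // the performance sensitive callback
    void* user_data;
    int count;          // how many times to invoke it
};

// Stand-in for the hot managed callback behind its wrapper.
void hot_callback(void* user_data) {
    AttachModel attach;
    ++*static_cast<int*>(user_data);
}

// Native inner loop: invokes the hot callback `count` times.
void native_run_batch(const Batch& batch) {
    for (int i = 0; i < batch.count; ++i) {
        batch.callback(batch.user_data);
    }
}

// Stand-in for an outer managed method that immediately calls back into
// native code, so a managed frame stays on the stack for the whole batch.
void outer_managed_trampoline(void* user_data) {
    AttachModel attach;
    native_run_batch(*static_cast<Batch*>(user_data));
}

// native -> managed, repeated: every call is the first managed frame.
int attaches_direct(int calls) {
    g_full_attaches = 0;
    int work = 0;
    Batch batch{hot_callback, &work, calls};
    native_run_batch(batch);
    assert(work == calls);
    return g_full_attaches;
}

// native -> managed -> native -> managed: only the outer frame attaches.
int attaches_via_trampoline(int calls) {
    g_full_attaches = 0;
    int work = 0;
    Batch batch{hot_callback, &work, calls};
    outer_managed_trampoline(&batch);
    assert(work == calls);
    return g_full_attaches;
}
```

In this model the direct path pays one full attach per call, while the trampoline path pays exactly one for the whole batch, which is the effect option 2 is after. How cheap the nested calls are in a real IL2CPP build would still need to be measured.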

fholm · Aug 5, 2019

Tautvydas-Zilys said: ↑

ScopedThreadAttacher initializes the garbage collector and IL2CPP threading infrastructure for that thread. It should be really fast if it's already initialized, but if it's the first managed frame on that thread, it will do the full initialization (and it will do full uninitialization once it's done).

You have several options:

1. Create the thread from managed code, and P/Invoke into your native thread entry point
2. Before calling managed methods many times where performance matters, do a native -> managed -> native call and then call your performance sensitive managed callbacks from there, so there's a managed stack frame on the stack. That way, ScopedThreadAttacher will practically be a no-op.

Hey! Thanks for these ideas... yeah currently this is butchering performance, because the ScopedThreadAttacher destructor detaches the thread again after every pinvoke call is over...

So both of these ideas are... somewhat cumbersome due to the task and threading library I'm using. Is there no way to tell IL2CPP to keep the managed thread infra/GC init around, as I will be using it over and over? Or is there a way of telling IL2CPP not to initialize this at all? The code I'm invoking is a pure static function (depends on no outside values) and does not touch any C# reference types at all; it only touches native memory blocks and works with pointers.

Tautvydas-Zilys · Aug 5, 2019

Unfortunately there isn't. It needs to detach upon leaving the last managed stack frame, as otherwise it can deadlock on exit, since it cannot uninitialize threads while they're still running.

And there isn't any mechanism to tell IL2CPP not to initialize a thread - that would be dangerous and could lead to random crashes if you were to accidentally touch the managed heap or do any operation that calls into the IL2CPP runtime.

adammpolak · Mar 26, 2021

@fholm I am about to go on a similar journey to yours, using reverse p/invoke to stream native data into managed Unity.

Any chance you have an example of how you ended up getting it done?

Attached Files:

orbit_2019-08-04_13-35-00.png
