Search Unity

RSP/RBP-related crashes in windows standalone x64 build

Discussion in 'Windows' started by Leander1P, Jul 11, 2017.

  1. Leander1P


    Jun 30, 2017
    We've developed a "kiosk" application using Unity 5.5.0f3, a Windows (non-UWP) standalone non-development x64 build, and our client is occasionally seeing crashes (perhaps 1x/day). We've also reproduced the issue on a UWP build.

    We've reproduced these crashes under 5.6.1f1, both in standalone and in Editor. We also have been able to reduce the time-to-repro to about 10-20 minutes by speeding up timeouts.

    The issue so far always manifests as various corruptions of RSP/RBP -- perhaps associated with coroutines? We've most often seen it with a move/click of the mouse that triggers the InterruptFade and StopAllCoroutines code below, but it's not apparent exactly what the root cause is.

    Exception is always on the main thread, often similar to:

    0xC0000005: Access violation writing location 0xFFFFFFFFFFFFFFD8.

    Registers from one set:

    RAX = 0000000000000000 RBX = 0000000000A042B0 RCX = 000000000E0175B0
    RDX = 000000000E0175B0 RSI = 000000000E0175B0 RDI = 0000000000A01C28
    R8 = 000000000E011680 R9 = 000000000DEE4FA0 R10 = 0000000000A042B0
    R11 = 00007FFED7C432A0 R12 = 000000000E0175B0 R13 = 0000000000000000
    R14 = 00007FFED7D435B0 R15 = 0000000000C7CB00 RIP = 0000000004B619A8
    RSP = FFFFFFFFFFFFFFE0 RBP = 0000000000A01F20 EFL = 00010283

    "RSP" is out of range. Disassembly leading up the faulting "call" instruction:

    0000000004B61990 mov r12,rcx
    0000000004B61993 mov rdi,rsp
    0000000004B61996 mov rsp,qword ptr [rcx+28h]
    0000000004B6199A sub rsp,20h
    0000000004B6199E mov r11,7FFED7C432A0h
    0000000004B619A8 call r11

    The code we would have jumped to at r11 had RSP not been hosed is in

    00007FFED7C432A0 mov ecx,dword ptr [appdomain_thread_id
    00007FFED7C432A6 jmp qword ptr [__imp_TlsGetValue (07FFED7D9E318h)]

    RSP is loaded from 28h after RCX -- looks like a heap address -- which is in the middle of a block of zeros in our full dump; RSP comes in as 0x0, then the "sub rsp,20h" happens, and it ends up at the poison FFFFFFFFFFFFFFE0 value just before the "call r11". Peeking around the heap address I'm not seeing anything that screams buffer overflow or similar, but it's hard to tell.

    The repro on 5.6.1f1 manifested a little differently: both RBP and RSP appear to be hosed (pointing at an unwritable page). WinDbg did give us a little better stack:

    mono!mono_arch_find_jit_info_ext+0x30c [c:\buildslave\mono\build\mono\mini\exceptions-amd64.c @ 839]
    mono!mono_find_jit_info_ext+0xa9 [c:\buildslave\mono\build\mono\mini\mini-exceptions.c @ 358]
    mono!mono_jit_walk_stack_from_ctx_in_thread+0x125 [c:\buildslave\mono\build\mono\mini\mini-exceptions.c @ 748]
    mono!win32_handle_stack_overflow+0x9e [c:\buildslave\mono\build\mono\mini\mini-windows.c @ 208]
    mono!seh_vectored_exception_handler+0x249 [c:\buildslave\mono\build\mono\mini\exceptions-amd64.c @ 194]

    (nothing below here, sadly) this may have blown up trying to use more stack while handling a stack overflow? Haven't traced backward much to see if I can determine cause on this one.

    We have about 10 other minidumps (~10MB ea) with similar crashes, and one or two local full heap dumps (~2GB, the 4K textures on this large-screen-kiosk project add up quickly).

    We have seen occasional messages in logs (once in client logs, once locally):


    ...and this message a few times as well:

    Assertion failed on expression: '!m_CoroutineEnumeratorGCHandle.HasTarget()'

    Not sure if these are related but it seems plausible that they're effects of the same condition. These aren't always immediately followed by a crash.

    Our next steps may be to move away from coroutines entirely to something Update-driven; presumably there's something here we're not seeing re chaining them or how they're triggered generally that might be causing runaway stack use, leak, or race condition somewhere.

    Here's the most likely source file -- if anyone smells anything suspect in here or we've missed something, would love a heads up! Similarly if anyone can suggest any next steps/tools for debugging. We can likely do more with the dumps/repros we have, but it can be time-consuming to work backwards from the crash point (or even impossible, depending on the sequence of events, due to lost info) for things like this, in my experience.

    The general purpose is to continuously fade between a "slide carousel" of images (as a kiosk attract loop), but interrupt (and instantly undo / finish any crossfade) if someone wanders by and moves mouse / touches screen.

    Code (csharp):
    1. using System.Collections;
    2. using System.Collections.Generic;
    3. using UnityEngine;
    4. using UnityEngine.UI;
    6. [RequireComponent(typeof(InactivityTimer))]
    7. [RequireComponent(typeof(DetectAnyInput))]
    8. public class ImageCarousel : MonoBehaviour {
    9.    public float ImageFadeOutDuration;
    10.    public float DelayBetweenFades;
    12.    private CircularListNode<GameObject> m_currentImage;
    13.    private InactivityTimer m_inactivityTimer;
    14.    private Vector3 m_previousMousePosition;
    16.    void Start() {
    17.        // Hook into the inactivity timer.
    18.        m_inactivityTimer = GetComponent<InactivityTimer> ();
    19.        m_inactivityTimer.OnInactivityTimeout += OnInactivityTimeout;
    21.        // Activate the starting image so it can receive clicks.
    22.        if (m_currentImage != null) {
    23.            m_currentImage.Data.SetActive (true);
    24.        }
    25.    }
    27.    public void InsertImage(GameObject obj) {
    28.        if (m_currentImage == null) {
    29.            m_currentImage = new CircularListNode<GameObject> (obj);
    30.        } else {
    31.            m_currentImage = m_currentImage.Append (obj);
    32.        }
    33.    }
    35.    private void InterruptFade() {
    36.        // Attempt to remove ImageFadeIn.
    37.        CanvasGroupFadeIn fadeInScript = m_currentImage.Data.GetComponent<CanvasGroupFadeIn> ();
    38.        if (fadeInScript != null) {
    39.            Destroy (fadeInScript);
    40.        }
    42.        // Attempt to remove ImageFadeOut.
    43.        CanvasGroupFadeOut fadeOutScript = m_currentImage.Data.GetComponent<CanvasGroupFadeOut> ();
    44.        if (fadeOutScript != null) {
    45.            Destroy (fadeOutScript);
    46.        }
    48.        {
    49.            DetectAnyInput script = gameObject.GetComponent<DetectAnyInput>();
    50.            script.OnInputDetected -= InterruptFade;
    51.        }
    53.        // Kill running coroutines so we don't have two images visible.
    54.        StopAllCoroutines ();
    56.        // Make the interrupted image visible.
    57.        {
    58.            CanvasGroup script = m_currentImage.Data.GetComponent<CanvasGroup> ();
    59.            script.alpha = 1.0f;
    60.        }
    62.        // Enable the image so that it can receive clicks.
    63.        m_currentImage.Data.SetActive (true);
    65.        // Enable the inactivity timer.
    66.        m_inactivityTimer.enabled = true;
    67.    }
    69.    private void OnInactivityTimeout() {
    70.        if (m_currentImage != null) {
    71.            // Stop the inactivity timer.
    72.            m_inactivityTimer.enabled = false;
    74.            // Grab the mouse position when the inactivity prompt triggered.
    75.            m_previousMousePosition = Input.mousePosition;
    77.            FadeCurrentImage ();
    78.        }
    79.    }
    81.    private void FadeCurrentImage() {
    82.        // Make the backing image visible.
    83.        {
    84.            GameObject next = m_currentImage.Next.Data;
    86.            // Activate the GameObject.
    87.            next.SetActive (true);
    89.            // Update the Image alpha.
    90.            CanvasGroup script = next.GetComponent<CanvasGroup> ();
    91.            script.alpha = 1.0f;
    92.        }
    94.        // Ensure the current image is at the top of the hierarchy, and fade it.
    95.        {
    96.            GameObject current = m_currentImage.Data;
    98.            // Ensure that the fading image is at the top of the hierarchy.
    99.            current.transform.SetAsLastSibling ();
    101.            // Fade out the current image.
    102.            CanvasGroupFadeOut fadeOutScript = m_currentImage.Data.AddComponent<CanvasGroupFadeOut> ();
    103.            fadeOutScript.FadeDuration = ImageFadeOutDuration;
    104.            fadeOutScript.OnImageFadeOut += OnImageFadeOut;
    105.        }
    107.        // Detect input so that we can interrupt the fade
    108.        {
    109.            DetectAnyInput script = gameObject.GetComponent<DetectAnyInput>();
    110.            script.OnInputDetected += InterruptFade;
    111.        }
    112.    }
    114.    private void OnImageFadeOut() {
    115.        // Disable the image now that it isn't visible.
    116.        m_currentImage.Data.SetActive (false);
    118.        // Advance to the next image.
    119.        m_currentImage = m_currentImage.Next;
    121.        // Delay until we are ready to fade the next image.
    122.        StartCoroutine (DelayThenFadeNextImage ());
    123.    }
    125.    private IEnumerator DelayThenFadeNextImage() {
    126.        yield return new WaitForSeconds(DelayBetweenFades);
    127.        FadeCurrentImage ();
    128.    }
    129. }
  2. Tautvydas-Zilys


    Unity Technologies

    Jul 25, 2013
    It indeed looks like you're hitting a stack overflow somewhere. Do you have a fuller picture of the stacktrace? If you reproduced it on UWP, it might be easier to read stacktraces there since it doesn't use Mono - did you try capturing dumps from that?
  3. Leander1P


    Jun 30, 2017
    I'll dig in the crashdumps we have a bit more, but I was having a heck of a time getting VS to cooperate with stacktracing (probably due to RBP/RSP being corrupt). Lots of "one frame only". WinDbg seemed to be better about mining possible stack space for symbols, but I didn't apply it everywhere. I'm *pretty* sure I have a .dmp from the UWP build, but need to check.

    I've used a trick where I just symbolicate an arbitrary chunk of memory I assume is the stack, before, but I'm much more familiar with how to do that on ARM than x86/x64.
  4. Tautvydas-Zilys


    Unity Technologies

    Jul 25, 2013
    Could you send any of those dumps to me privately by any chance?
  5. Leander1P


    Jun 30, 2017
    I'll get a link to you via a DM in just a moment, thanks. =)
  6. Leander1P


    Jun 30, 2017
    Thanks to Tautvydas-Zilys' analysis of the .dmp files pointing the finger at a very long chain of InterruptFade delegates attached to an event (it was indeed a stack overflow):

    It looks like the script.OnInputDetected += InterruptFade; line may have been the culprit; the only time we're removing ( -= InterruptFade ) is in InterruptFade itself, so as we carousel through the images, we'll stack up several hundred long chain on OnInputDetected, which exhausts stack when that event is actually triggered. We likely need to remove them when the fade is finished (balanced with the add) or perhaps better yet not add a new delegate to the chain if there's already one there.

    A colleague was nearly finished with a solution last night, so we should know soon. I'll come back here and let everyone know how it went, but figured I'd update the thread on the general character of the issue in case anyone else was experiencing similar.
  7. Tautvydas-Zilys


    Unity Technologies

    Jul 25, 2013
    Leander wanted me to post it here, so here's the steps on how I found that offending function from the crash dump.

    The top frame of the dump is in this method:

    0000000004AA1993 mov rdi,rsp
    0000000004AA1996 mov rsp,qword ptr [rcx+28h]
    0000000004AA199A sub rsp,20h
    0000000004AA199E mov r11,7FFBE2B832A0h
    0000000004AA19A8 call r11
    0000000004AA19AB add rsp,20h
    0000000004AA19AF mov rax,qword ptr [rax+60h]
    0000000004AA19B3 sub rsp,20h
    0000000004AA19B7 mov r8d,0
    0000000004AA19BD mov rdx,rax
    0000000004AA19C0 mov rcx,r12
    0000000004AA19C3 mov r11,7FFBE2C85E28h
    0000000004AA19CD call r11
    0000000004AA19D0 add rsp,20h
    0000000004AA19D4 mov rsp,rdi
    0000000004AA19D7 ret

    As the very first thing that function does, is saves current rsp register to rdi and then takes the rsp value from somewhere else. That means at the point of the crash (call r11 instruction), the real stack pointer is in rdi. From there, I just printed memory contents starting rdi for 16 KB. And after some useless things that were on the stack from exception handling, I noticed a very short loop of values in the stack:

    [0x000006b8] 0x00000000004f2d52 void *
    [0x000006b9] 0x00000000053bbe90 void *
    [0x000006ba] 0x00000000116f0070 void *
    [0x000006bb] 0x00000000055d28c8 void *
    [0x000006bc] 0x00000000116f00e0 void *
    [0x000006bd] 0x00000000008d5610 void *
    [0x000006be] 0x00000000053bbf08 void *
    [0x000006bf] 0x0000000004a44d48 void *
    [0x000006c0] 0x00000000009cf350 void *
    [0x000006c1] 0x00000000116f00e0 void *
    [0x000006c2] 0x00000000004f2d52 void *
    [0x000006c3] 0x00000000053bbe90 void *
    [0x000006c4] 0x00000000116f00e0 void *
    [0x000006c5] 0x00000000055d28c8 void *
    [0x000006c6] 0x00000000116f0150 void *
    [0x000006c7] 0x00000000008d5660 void *
    [0x000006c8] 0x00000000053bbf08 void *
    [0x000006c9] 0x0000000004a44d48 void *
    [0x000006ca] 0x00000000009cf350 void *
    [0x000006cb] 0x00000000116f0150 void *
    [0x000006cc] 0x00000000004f2d52 void *
    [0x000006cd] 0x00000000053bbe90 void *
    [0x000006ce] 0x00000000116f0150 void *
    [0x000006cf] 0x00000000055d28c8 void *
    [0x000006d0] 0x00000000116f01c0 void *
    [0x000006d1] 0x00000000008d56b0 void *
    [0x000006d2] 0x00000000053bbf08 void *
    [0x000006d3] 0x0000000004a44d48 void *
    [0x000006d4] 0x00000000009cf350 void *
    [0x000006d5] 0x00000000116f01c0 void *
    [0x000006d6] 0x00000000004f2d52 void *
    [0x000006d7] 0x00000000053bbe90 void *
    [0x000006d8] 0x00000000116f01c0 void *

    [0x000006d9] 0x00000000055d28c8 void *
    [0x000006da] 0x00000000116f0230 void *
    [0x000006db] 0x00000000008d5700 void *
    [0x000006dc] 0x00000000053bbf08 void *
    [0x000006dd] 0x0000000004a44d48 void *
    [0x000006de] 0x00000000009cf350 void *
    [0x000006df] 0x00000000116f0230 void *
    [0x000006e0] 0x00000000004f2d52 void *
    [0x000006e1] 0x00000000053bbe90 void *
    [0x000006e2] 0x00000000116f0230 void *

    [0x000006e3] 0x00000000055d28c8 void *
    [0x000006e4] 0x00000000116f02a0 void *
    [0x000006e5] 0x00000000008d5750 void *
    [0x000006e6] 0x00000000053bbf08 void *
    [0x000006e7] 0x0000000004a44d48 void *

    The highlighted value 0x00000000053bbe90 is delegate invocation function, and the value after that points to a delegate. I first located the delegate invocation function, since it's easy to check if a pointer is a pointer to executable memory - just check if disassembly at that address is an executable function. After that, I dug onto mono jit tables to figure out which managed method it was, and I found it was the InputDetectedAction.Invoke method. I basically just manually did what this function does:

    Then, I decided to check if any of the addresses between the frames pointed to delegate objects, and luckily they did. Finally, the delegate objects have a field which shows which method it points to.
  8. Leander1P


    Jun 30, 2017
    To drop a final update in the thread: Balancing the delegate adds and removes worked as a quick solution. We're looking into better practices and/or wrappers to help avoid this ever happening in the future.