Official Decrease in Events - 2/21

Discussion in 'Unity Analytics' started by ap-unity, Feb 22, 2018.

Thread Status:
Not open for further replies.
  1. ap-unity

    Unity Technologies

    Joined: Aug 3, 2016
    Posts: 1,519
    We have identified an issue and have launched a root cause investigation.

    I will update this thread when we have more information.
     
  2. ap-unity

    Unity Technologies

    We have concluded our root cause analysis of the data issues that occurred on Feb 20th.
    • There was a roughly 40-hour window during which we did not correctly ingest data received from the Asia-Pacific region.
    • This data loss began Feb 20 17:00 UTC and ended Feb 22 08:59 UTC.
    • You will likely notice a sharp drop in DAU for Feb 20 - 22 among players in the Asia-Pacific region.
    Our Analytics system automatically caches events and tries to re-send them when it receives an error response, so some of the data from this time frame may eventually have been sent and processed correctly.
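
    For readers unfamiliar with the cache-and-retry pattern described above, here is a minimal sketch of the general idea. It is not the actual Unity Analytics client code; the class name, the `send` callback, and the `max_cached` limit are all hypothetical and exist only for illustration.

```python
from collections import deque
from typing import Callable, Deque, Dict

Event = Dict[str, object]


class RetryingEventBuffer:
    """Hypothetical client-side buffer that keeps events until a send succeeds."""

    def __init__(self, send: Callable[[Event], bool], max_cached: int = 1000) -> None:
        # `send` is an assumed callback: True on success, False on an error response.
        self._send = send
        # Bounded cache: when it is full, the oldest event is dropped first.
        self._pending: Deque[Event] = deque(maxlen=max_cached)

    def track(self, event: Event) -> None:
        """Cache the event, then try to deliver everything that is pending."""
        self._pending.append(event)
        self.flush()

    def flush(self) -> int:
        """Send cached events in order, stopping at the first failure.

        Returns the number of events delivered on this attempt.
        """
        sent = 0
        while self._pending:
            if not self._send(self._pending[0]):
                break  # keep this event and all later ones for the next attempt
            self._pending.popleft()
            sent += 1
        return sent

    @property
    def pending(self) -> int:
        """Number of events still waiting to be delivered."""
        return len(self._pending)
```

    In a sketch like this, events tracked while the endpoint returns errors stay in the buffer and go out on a later `flush()`. Because the cache is bounded, a long outage can still lose the oldest events, which is one reason only some of the data might be recovered.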

    Timeline:
    On Feb 13th, our server cluster in the Asia-Pacific region failed. Traffic was automatically shifted to other regions and handled correctly, so rebuilding the cluster was delayed until Feb 20th.

    On Feb 20th, the Asia-Pacific server cluster was rebuilt; however, it was not routing traffic correctly. This meant that data from this region was not sent to our Analytics system. We became aware of the missing data on Feb 22 at 06:12 UTC. At 14:54 UTC we identified the underlying cause. At that point, we restarted the routing service on the server and made sure it was running correctly.

    Remediation steps:
    - Increase monitoring to check that data is routed correctly
    - Add canary monitoring that transmits messages to each region and verifies message flow to the end destination
    - Rebuild certain infrastructure components to improve resiliency
    - Debug the routing service to uncover why it failed to start
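
    To illustrate what the canary monitoring step could look like, here is a rough sketch, not our production tooling. The `send` and `arrived` callbacks, the region names in the usage note, and the timeout values are hypothetical. The idea is to push a uniquely tagged message through each region's entry point and confirm that it shows up at the end destination.

```python
import time
import uuid
from typing import Callable, Dict, Iterable


def run_canary(
    regions: Iterable[str],
    send: Callable[[str, str], None],
    arrived: Callable[[str], bool],
    timeout_s: float = 30.0,
    poll_s: float = 1.0,
) -> Dict[str, bool]:
    """Send one tagged message per region and report which ones arrived.

    `send(region, token)` and `arrived(token)` are assumed hooks into the
    ingestion entry point and the end destination, respectively.
    """
    # One unique token per region, so arrivals can be attributed to a region.
    tokens = {region: uuid.uuid4().hex for region in regions}
    for region, token in tokens.items():
        send(region, token)

    results = {region: False for region in tokens}
    deadline = time.monotonic() + timeout_s
    while True:
        for region, token in tokens.items():
            if not results[region] and arrived(token):
                results[region] = True
        # Stop once every region is confirmed or the timeout has expired.
        if all(results.values()) or time.monotonic() >= deadline:
            return results
        time.sleep(poll_s)
```

    A region whose value is still `False` when the timeout expires accepted the message but never delivered it, and that is the condition to alert on. For example, `run_canary(["us", "eu", "apac"], send, arrived)` would return `False` for `"apac"` if that region's routing were broken.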

    Data accuracy is our highest priority, and we take these issues very seriously. We apologize for any inconvenience this may have caused.
     