[Showcase] ENet + Unity ECS (5000 real time player simulation)

Discussion in 'Data Oriented Technology Stack' started by wobes, Dec 31, 2018.

  1. wobes

    wobes

    Joined:
    Mar 9, 2013
    Posts:
    678
    Happy holidays everyone.

    I would like to show you my progress in creating a high-level API based on Unity ECS on top of the ENet network library.

    The test was performed with 2 standalone client builds (right side, 2500 connections each). The server processes approximately 320000 messages per second with a tick-rate of 64 ticks per second. Each of the clients sends its position to the server.

    The current project relies heavily on the following:

    * Serialization: https://github.com/nxrighthere/NetStack
    * Network library: ENet-CSharp https://github.com/nxrighthere/ENet-CSharp
    * Unity's ECS system
    * Non-blocking queue RingBuffer https://github.com/dave-hillier/disruptor-unity3d
    * Span, ReadOnlySpan

    Below you can see the demo and the chart that explains how it works.





    Special thanks to @nxrighthere
     
    Last edited: Jan 1, 2019
  2. Antypodish

    Antypodish

    Joined:
    Apr 29, 2014
    Posts:
    5,778
  3. wobes

    wobes

    Joined:
    Mar 9, 2013
    Posts:
    678
    Thank you for your feedback!

    The goal of my test was to demonstrate the power of processing messages in Unity ECS. However, the most expensive part is serialization / deserialization, which is performed in the C# logic thread via BitBuffer.
    As for the techniques, it does not use any delta compression, because storing a history of world snapshots for each of the 5000 clients would consume a lot of memory. However, you might want to read http://www.gamasutra.com/view/feature/129854/ (this article explains how to synchronize entities under bandwidth constraints). Recently I've got good results with 6 bytes for position and 4 bytes for rotation. Here you can see some statistics for receiving the data of 500 clients (~20 KB/s) with a tick rate of 20 packets per second:


    That means a server needs about 80 Mbps of throughput for all 500 clients to be within each other's zone of interest.

    Cheers!
     
    Last edited: Dec 31, 2018
  4. Antypodish

    Antypodish

    Joined:
    Apr 29, 2014
    Posts:
    5,778
    80 Mbps is quite a lot, but I think it is fairly average these days for a typical Joe.
    Of course, I mean if that is for the client.
    For a server, that is quite decent.

    Surely, adding prediction and ensuring determinism could increase the number of concurrent moving instances.

    Either way, nice presentation so far.

    PS. I will check the Gamasutra link about sync when I have time.
     
    wobes likes this.
  5. wobes

    wobes

    Joined:
    Mar 9, 2013
    Posts:
    678
    For a client it is 160 kbps to see all 500 network entities.
    Much appreciated.
     
    Last edited: Dec 31, 2018
    Antypodish likes this.
  6. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    7,130
    What will be the impact of having them shoot at each other, with health and ammo updates?
     
    perevezentsev likes this.
  7. Antypodish

    Antypodish

    Joined:
    Apr 29, 2014
    Posts:
    5,778
    With an authoritative server and that many clients, for a deterministic system with prediction, the impact is rather negligible, since you only send user commands and receive character sync back anyway.
     
  8. fholm

    fholm

    Joined:
    Aug 20, 2011
    Posts:
    2,033
    Well, I wouldn't say this is true, to be honest. In a fully deterministic system this would absolutely hold, but for a non-deterministic game like an FPS, accounting for all player actions that can happen in the world can be very expensive in terms of bandwidth.

    How are you transferring the data from the ECS/game thread to the logic thread (and vice versa, I suppose)? It's possible to set up a jobified serializer in Unity ECS without too much effort, allowing you to easily multi-thread the serialization.

    This seems really low? 20 KB/s for 500 clients with 20 updates per second? Just simple math of:

    Code (csharp):
    500 /* client count */ * 10 /* position and rotation byte size */ * 20 /* packets per second */ // = 100,000 bytes per second
    Would put you at almost 100 KB/s per client?


    Edit:

    Really great work, but I also want to add: be careful about measuring how many clients the system can handle over localhost or a local LAN connection, as those are much more stable and performant than what you will see online.
     
    Rog likes this.
  9. wobes

    wobes

    Joined:
    Mar 9, 2013
    Posts:
    678
    A RingBuffer of IntPtr, single producer and single consumer. Trust me, it's faster than scheduling a job.


    The updates do not all arrive in one tick. Each object sends its position at a different time, but at the same interval, so you end up with less than that. Also, it's not exactly 20: the closest entities are synced at this tick rate, while those farther away send half as many updates or even fewer.
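
    A rough sketch of how distance-tiered update rates reduce the payload one client receives. The distance thresholds and the half / quarter rates are hypothetical; the thread only says that far entities send half as many updates or fewer. Packet headers are ignored.

```csharp
using System;

public static class InterestRates
{
    // Hypothetical tiers (assumption): near entities get the full rate,
    // mid-range entities half of it, far entities a quarter.
    public static int UpdatesPerSecond(float distance, int fullRate)
    {
        if (distance < 50f) return fullRate;       // near: full rate
        if (distance < 150f) return fullRate / 2;  // mid: half rate
        return fullRate / 4;                       // far: quarter rate
    }

    // Payload bytes per second one client receives for the given entity distances.
    public static int BytesPerSecond(float[] distances, int bytesPerUpdate, int fullRate)
    {
        int total = 0;
        foreach (float d in distances)
            total += UpdatesPerSecond(d, fullRate) * bytesPerUpdate;
        return total;
    }
}
```

    With every one of 500 entities in the near tier at 20 updates per second and 10 bytes each, this gives the 100,000 bytes per second upper bound; moving entities into the slower tiers pulls the total down.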


    The system would be as stable as in the LAN demo, because it does not use reliable messages for position updates. The server would not wait for anything if there were packet loss.
     
  10. fholm

    fholm

    Joined:
    Aug 20, 2011
    Posts:
    2,033
    Are you simply passing an IntPtr to the entity components then? But you're serializing in just one thread then, right?


    Ah.


    Well, it's not just about reliable messages; it's simply that a localhost/LAN socket can send and receive much more data much faster, especially localhost.
     
  11. wobes

    wobes

    Joined:
    Mar 9, 2013
    Posts:
    678
    You serialize with Span when you produce and deserialize with ReadOnlySpan when you consume. All of that allows C# threads to talk to each other without blocking. For network messages I use BitBuffer; the deserialized struct is then converted to an IntPtr and passed to an entity as a NetworkPacket component. Later on I can call UnsafeUtility.CopyPtrToStructure in either Burst or non-Burst jobs to get the packet data.
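
    For readers unfamiliar with the pattern, here is a minimal single-producer / single-consumer ring buffer. It is only a sketch of the general technique, not the code from disruptor-unity3d or from this project; the class and member names are made up for illustration.

```csharp
using System;
using System.Threading;

// Illustrative SPSC ring buffer: exactly one thread may call TryEnqueue
// and exactly one (other) thread may call TryDequeue.
public sealed class SpscRingBuffer<T>
{
    private readonly T[] _items;
    private readonly int _mask;
    private long _head; // next slot to read, written only by the consumer
    private long _tail; // next slot to write, written only by the producer

    public SpscRingBuffer(int capacityPowerOfTwo)
    {
        if (capacityPowerOfTwo <= 0 || (capacityPowerOfTwo & (capacityPowerOfTwo - 1)) != 0)
            throw new ArgumentException("Capacity must be a power of two.");
        _items = new T[capacityPowerOfTwo];
        _mask = capacityPowerOfTwo - 1;
    }

    // Producer thread only. Returns false instead of blocking when full.
    public bool TryEnqueue(T item)
    {
        long tail = _tail;                        // only this thread writes _tail
        long head = Volatile.Read(ref _head);     // see the consumer's progress
        if (tail - head == _items.Length) return false;
        _items[tail & _mask] = item;
        Volatile.Write(ref _tail, tail + 1);      // publish after the slot is filled
        return true;
    }

    // Consumer thread only. Returns false instead of blocking when empty.
    public bool TryDequeue(out T item)
    {
        long head = _head;                        // only this thread writes _head
        long tail = Volatile.Read(ref _tail);     // see the producer's progress
        if (head == tail) { item = default(T); return false; }
        item = _items[head & _mask];
        Volatile.Write(ref _head, head + 1);      // hand the slot back to the producer
        return true;
    }
}
```

    A logic thread could push IntPtr values with TryEnqueue while the game thread drains them with TryDequeue once per frame; neither side takes a lock or waits for the other.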
     
    Last edited: Jan 2, 2019
  12. wobes

    wobes

    Joined:
    Mar 9, 2013
    Posts:
    678
    If we are talking about bandwidth, then of course 5k players in one zone would require 8000 Mbps of output from the server.

    But again, the purpose of the demo is to demonstrate that modern CPUs can handle that many packets per second with a proper threading setup.
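
    The jump from 80 Mbps to 8000 Mbps follows from the figures quoted earlier in the thread if every client sees every player, so server output grows with the square of the player count. A small sketch of that arithmetic; the per-entity cost is derived from the 160 kbps for 500 entities figure and the helper names are made up:

```csharp
using System;

public static class ZoneBandwidth
{
    // Per-entity cost implied by the thread's numbers: 160 kbps for 500 visible entities.
    public const double KbpsPerVisibleEntity = 160.0 / 500.0;

    // Downstream rate for one client that sees every player in the zone.
    public static double ClientKbps(int players)
    {
        return players * KbpsPerVisibleEntity;
    }

    // Total server output when every client sees every player (grows as players squared).
    public static double ServerMbps(int players)
    {
        return players * ClientKbps(players) / 1000.0;
    }
}
```

    ServerMbps(500) comes out at roughly 80 and ServerMbps(5000) at roughly 8000: ten times the players, a hundred times the output.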
     
    Last edited: Jan 2, 2019
  13. fholm

    fholm

    Joined:
    Aug 20, 2011
    Posts:
    2,033
    Sure, but wouldn't you get much higher throughput by not having to pass things to and from a single background thread and simply running the serializer as a parallel job? This would allow you to make use of many more threads. Maybe you are already using a multi-threaded serializer and I misunderstood.

    I've done a lot of testing around this myself, and been able to reach staggering amounts of serialization throughput on my 32 thread CPU, like hundreds of thousands of entities.
     
  14. wobes

    wobes

    Joined:
    Mar 9, 2013
    Posts:
    678
    The logic thread performs serialization / deserialization of network messages, so the game thread and the job threads are free of it. With BitBuffer, Serialize / Deserialize are quite fast operations (faster than any high-level serializer such as MessagePack or Protobuf). But yes, I see your idea. It is still possible to have a jobified serializer / deserializer; however, I think it would perform better only when you have to serialize ~10,000 messages in one frame.
     
    Last edited: Jan 2, 2019
  15. wobes

    wobes

    Joined:
    Mar 9, 2013
    Posts:
    678
    I have a good example of how long it takes to process a million network packets at once in a Burst / non-Burst job:

    Non burst:



    Burst :

     
    Kirsche, Micz84 and Antypodish like this.
  16. fholm

    fholm

    Joined:
    Aug 20, 2011
    Posts:
    2,033
    It's pretty hard to say what these figures mean without the code that goes with them.
     
  17. wobes

    wobes

    Joined:
    Mar 9, 2013
    Posts:
    678
    As far as I remember, it's just an IJobProcessComponentData with a ReadOnly attribute on the NetworkEvent component and a switch statement on the event type enum (OnConnected, OnDisconnected, OnTimeout) inside the job.
     
  18. fholm

    fholm

    Joined:
    Aug 20, 2011
    Posts:
    2,033
    Well, you usually have a set CPU budget for each game instance on a server, so it doesn't really matter how you split the code over the threads.

    I would never dream of using something like a high level serializer for network packets :)

    Not sure I follow here, I ran my tests with the target of having 1000 live entities in a world with 100 players, where you have 100*1000 candidate pairs for prioritization, serialization, etc. Of course all entities can not be written to each client, so there's a lot more logic going on than just writing packet data, and being able to do this within the job system works really well.
     
    wobes likes this.
  19. wobes

    wobes

    Joined:
    Mar 9, 2013
    Posts:
    678
    I've just decided to go with a regular C# threading approach but use Unity's Jobs for actually processing the data and running the game logic. Thank you for your feedback.
     
    Last edited: Jan 2, 2019
  20. ScottPeal

    ScottPeal

    Joined:
    Jan 14, 2013
    Posts:
    31
    Hi @wobes,

    Nice work!

    I am looking at ENet based on the feedback from @nxrighthere's performance GitHub project and the assessment of the new Unity networking stack. Any chance you are sharing your test code? Seems like you have found a great combination.

    VR Architect
     
  21. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    538
    I recently wrote some stuff regarding multi-threading with ENet, and here you can find more information. Everything else, like utilizing C# Jobs and ECS, was discussed in the Discord over time.

    Some of my favorite resources regarding multi-threading which I shared with @wobes a while ago:
    Scalable Architecture
    General Recipe
    First Things First
    Your Arsenal
    Scalability Prerequisites
    Reader-Writer Problem
    Producer-Consumer Queues
    Introduction to the Disruptor
    Concurrency with LMAX
    Locks Aren't Slow; Lock Contention Is
     
  22. snacktime

    snacktime

    Joined:
    Apr 15, 2013
    Posts:
    2,403
    I like your general approach. With Unity, the challenge I had when I tried using it on the server was the game loop.

    It starts with messages having to wait on the game loop to get in and out. Having Unity on both ends compounds this.
    So in the worst case you get a latency hit of two frame times: a message may have to wait a full Update cycle to get in and another to get out. That's just on one side. The client has the same issue, so that's four frame times for a worst-case round trip. The chances of hitting the worst case differ between client A -> server -> client A and client A -> server -> client B.

    But the point is that even the average additional latency is quite high, and the standard deviation is downright ugly. Most game developers never notice this. Even on large production games I've seen this go completely unnoticed, and interestingly enough, most of the cases I have seen were outside of Unity. It's a problem with engine game loops generally, and with the fact that the game loop is so often used server side as well.
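
    To put numbers on the worst case described above, a small sketch. It assumes a message can wait up to one full Update cycle on the way in and one on the way out on each machine, and it ignores network transit time entirely; the helper names are made up.

```csharp
using System;

public static class LoopLatency
{
    // Duration of one Update cycle in milliseconds.
    public static double FrameMs(double framesPerSecond)
    {
        return 1000.0 / framesPerSecond;
    }

    // Worst case on one machine: a full frame to get in plus a full frame to get out.
    public static double WorstCaseOneSideMs(double framesPerSecond)
    {
        return 2.0 * FrameMs(framesPerSecond);
    }

    // Worst-case added round trip when both client and server are bound to a game loop.
    public static double WorstCaseRoundTripMs(double clientFps, double serverFps)
    {
        return WorstCaseOneSideMs(clientFps) + WorstCaseOneSideMs(serverFps);
    }
}
```

    With both sides running at 50 updates per second, the loop alone can add up to 80 ms to a round trip before any network delay is counted.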
     
    wobes and nxrighthere like this.
  23. wobes

    wobes

    Joined:
    Mar 9, 2013
    Posts:
    678
    Such a good explanation of why game-loop-based networking is a bad approach. As for framing, it was also important to have the ECS processing systems update after the core system that creates an entity with a one-frame lifetime, because we want to work with a message in exactly the frame in which we receive it, not in the next Update cycle. And what is powerful about it is that you can add a message to a SendQueue at any time and it will be processed by the logic thread instantly, because they run in parallel in a non-blocking manner.
    So my game loop update looks like this:

    * ENetServerSystem (creates an entity with the data)
    * Processing system 1
    * Processing system 2
    ...
    * NetworkGarbageCollector (disposes the memory, destroys the entities)
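
    The ordering above can be illustrated without Unity at all. The following is a single-threaded stand-in, not the project's code: a plain Queue replaces the lock-free receive queue, byte arrays replace the one-frame entities, and all names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public sealed class FramePipeline
{
    private readonly Queue<byte[]> _incoming = new Queue<byte[]>();    // stand-in for the receive queue
    private readonly List<byte[]> _frameMessages = new List<byte[]>(); // stand-in for one-frame entities
    private readonly List<Action<byte[]>> _systems = new List<Action<byte[]>>();

    public void Receive(byte[] packet) { _incoming.Enqueue(packet); }
    public void AddSystem(Action<byte[]> system) { _systems.Add(system); }
    public int LiveMessages { get { return _frameMessages.Count; } }

    public void Update()
    {
        // 1. Server system: turn everything received so far into per-frame messages.
        while (_incoming.Count > 0) _frameMessages.Add(_incoming.Dequeue());

        // 2. Processing systems run in registration order, within the same frame.
        foreach (Action<byte[]> system in _systems)
            foreach (byte[] message in _frameMessages)
                system(message);

        // 3. Garbage collector step: nothing survives past the frame.
        _frameMessages.Clear();
    }
}
```

    Because the cleanup step runs last, every processing system sees a message exactly once, in the frame it arrived.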

    Thank you.
     
  24. wobes

    wobes

    Joined:
    Mar 9, 2013
    Posts:
    678
    Thank you for your feedback. As for sharing the code, I am still thinking about fixing some stuff. But it is always better to start your own journey when it comes to developing a decent networking game.
     
  25. starikcetin

    starikcetin

    Joined:
    Dec 7, 2017
    Posts:
    248
    Pretty impressive.
     
  26. wobes

    wobes

    Joined:
    Mar 9, 2013
    Posts:
    678
    Much appreciated.
     
  27. Opeth001

    Opeth001

    Joined:
    Jan 28, 2017
    Posts:
    294
    Nice job wobes!
    Why didn't you use a multithreaded networking library like Forge Networking Remastered?
    I think you can optimise bandwidth by changing some parts of the logic:
    1) Instead of sending positions from the client side at a tick rate of 60/s, send only the inputs when they change. In your case it's just location and rotation, which can be replaced by 2 sbytes (x and y) for the movement input and 2 sbytes for the rotation input, knowing that the inputs are between -1 and 1. E.g. a -1f becomes -10, a 0.7f becomes 7, and so on. (You can get higher precision by using sbytes from -100 to 100.)
    2) Send the inputs from the server to the clients too, instead of sending positions and rotations. If you send inputs to clients and use them to extrapolate instead of interpolate, it will give you a better result with less bandwidth and a lower tick rate. (You can send positions once per second in case some players didn't extrapolate correctly over time. Don't send rotations, because they are driven automatically by the rotation inputs.)

    This approach will make it authoritative by default, which will save you some CPU time.
    Instead of using Vector3.Distance(serverResultPosition, clientInputPosition), you will just check whether math.abs(your input) > 10 or 100, depending on the precision you use. (The Burst compiler will appreciate it.)
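
    A minimal sketch of the input quantization described in point 1, using the -100 to 100 variant. The class and method names are made up for illustration; clamping out-of-range values is an added assumption.

```csharp
using System;

public static class InputQuantizer
{
    // Maps an axis value in [-1, 1] to a signed byte in [-100, 100].
    // Values outside the range are clamped first (assumption).
    public static sbyte Quantize(float axis)
    {
        float clamped = Math.Max(-1f, Math.Min(1f, axis));
        return (sbyte)Math.Round(clamped * 100f);
    }

    // Restores an approximate axis value; the rounding error is at most 0.005.
    public static float Dequantize(sbyte value)
    {
        return value / 100f;
    }
}
```

    Two movement axes and two rotation axes quantized this way take 4 bytes per input sample.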
     
    Last edited: Jan 8, 2019
  28. wobes

    wobes

    Joined:
    Mar 9, 2013
    Posts:
    678
    First of all, thank you for your feedback.

    Forge is in fact pseudo multi-threaded. It operates on managed sockets, while ENet uses native sockets, which are 10x faster. ENet is a fairly low-level library, while Forge has a ton of unnecessary overhead, such as lock contention (around 30 locks: https://github.com/BeardedManStudios/ForgeNetworkingRemastered/search?q=lock&unscoped_q=lock), GC allocations, and a slow ConcurrentQueue that creates allocations over time. What my project has is completely lock-free inter-thread messaging, 0 GC allocations, and native serialization / deserialization. I am pretty sure that you can't achieve 5000 CCU in one world with Forge. Plus, they still have an open issue from last year and they are not even trying to fix it (https://github.com/BeardedManStudios/ForgeNetworkingRemastered/issues/126).

    As for your optimization strategy: the demo is a stress test of my project, so I will not change the tick rate, because it is 64 on purpose. As for sending inputs, I am sorry, but that is ridiculous. First of all, you have 5000 CCU in one world, and simulating all 5000 with physics on the server is quite a costly operation. Even World of Warcraft accepts positions from its clients. Sending inputs requires deterministic movement, and with floating-point errors determinism is hard to achieve. Extrapolation in that scenario can create laggy / jittery movement, because the momentum of a player can change at any time. So we use simple snapshot interpolation with simple client-side prediction.

    Cheers.
     
    Last edited: Jan 8, 2019
    raul3d, Antypodish and e199 like this.
  29. Opeth001

    Opeth001

    Joined:
    Jan 28, 2017
    Posts:
    294
    Hi wobes ^_^
    I personally never used ENet :p so I'll try it. (Thanks)
    As for the strategy of sending inputs, it's not mine XD. Practically all authoritative games use it. I just suggested that you send inputs as bytes only when they change, for bandwidth optimisation, instead of sending positions + rotations 60/s per player.
    Thinking about it, the position and rotation certainly change each frame if the player is moving, but it is practically impossible for a real human player to change inputs 60 times per second.

    For physics, you don't even need a simulation on the server side. In your case you just need 1 capsule per entity, which is a primitive collider and highly performant. You don't need a rigidbody; just use collisions and raycasts to implement your own simple physics logic, like gravity.
    Unity had a limitation of 65k colliders per scene, which was removed after the upgrade to PhysX 3.4.

    So yes, I think Unity can handle 5k moving capsule colliders easily.

    And I don't see how you can check for cheating by comparing server-side simulated positions with client positions if clients don't send their inputs.

    About determinism (BurstCompile is the solution):

    Unity build pipeline must be deterministic. Users can choose if all simulation code should run deterministically.

    You should always get the same results with the same inputs, no matter what device is being used. This is important for networking, replay features and even advanced debugging tools.

    To do this we will leverage our Burst compiler to produce exact floating point math between different platforms. Imagine a linux server & iOS device running the same floating point math code. This is useful for many scenarios particularly for connected games, but also debugging, replay etc.
     
    Last edited: Jan 8, 2019
    hippocoder likes this.
  30. e199

    e199

    Joined:
    Mar 24, 2015
    Posts:
    98
    It is not deterministic yet.
    He may use a heavier player controller in a real project.
     
    wobes likes this.
  31. fholm

    fholm

    Joined:
    Aug 20, 2011
    Posts:
    2,033
    This is simply not true, and your statement is way too generic.

    There are a few options you can use, but sending only input requires determinism on all sides, which Unity doesn't have. And 'most' authoritative games do not do this, especially since it requires the client computers to be fast enough to simulate that many character controllers (for example, if you receive input from 70-80 players in one packet, you then need to step them locally on all clients). Unreal Engine by default does something *similar* to this with its networked character controller, but it had to be dropped for Fortnite and PUBG since it was too expensive for the clients to locally simulate all players, and they fell back to just sending position/rotation to the clients.

    Note that Unreal Engine's physics/CC is not deterministic, so AFAIK they send both the starting state (position/rotation/velocities/etc.) and the input, which consumes even more bandwidth than just sending position/rotation.

    Obviously, when sending input commands to the server in an authoritative environment, the client who is trying to move his character is going to send a few bytes describing his input. But that is only between that one client and the server.
     
    wobes, elcionap and nxrighthere like this.
  32. Opeth001

    Opeth001

    Joined:
    Jan 28, 2017
    Posts:
    294
    Locally simulating all clients using ECS + C# jobs will not be heavy to process, knowing that culled players will not be animated or rendered; it's just about moving player entities. I think it can be handled easily even on mobile devices. Anyway, that's just my opinion.
    All games using extrapolation have to send inputs to clients too in order to simulate it.
     
    hippocoder likes this.
  33. fholm

    fholm

    Joined:
    Aug 20, 2011
    Posts:
    2,033
    It completely depends on what you're doing. A few things to consider:

    1) You still need to be able to run on 2-core machines
    2) You need to make the ECS physics engine (whenever that comes) deterministic, which is going to be slower since it will most likely disable a lot of floating-point optimizations on individual CPUs
    3) Something like the ECS equivalent of the PhysX character controller is not going to be magically faster than the PhysX version
     
    wobes likes this.
  34. PhilSA

    PhilSA

    Joined:
    Jul 11, 2013
    Posts:
    1,084
    I agree if we're talking about "both client and server only send inputs to each other", but:
    The "Clients send only input, Server sends entire world state" model is used in a lot of top-grade fast-paced online games, though, and does not require determinism. It's the model used by CSGO, Overwatch, Quake 3, Tribes 2, etc.... and countless more.
     
  35. Opeth001

    Opeth001

    Joined:
    Jan 28, 2017
    Posts:
    294
    Totally true!!!
    But I think extrapolation is highly needed in order to get smooth multiplayer gameplay, and it can't be done without inputs coming from the server side.
    Plus, if the server sends inputs only when they change, and positions 1-2/s (even 10/s if needed) to fix any position difference related to determinism, I don't think there will be any visible inconsistency, and clients and server will save a significant amount of bandwidth.

    (Movement and rotation inputs at 20/s, which is a high rate for a real human player)
    (4 bytes * 20/s) + (6 bytes * 10/s) = 140 bytes/s per player
    versus
    (position and rotation) (6 bytes + 4 bytes) * 60/s = 600 bytes/s, without even using extrapolation,
    which is a kind of client-side prediction for all the moving players.

    !!! It's much, much easier for an authoritative server to prevent cheating when it receives inputs instead of positions and rotations !!!

    I'm not talking about physics-based networked games!!!
    I'm just talking about wobes's example.
     
  36. wobes

    wobes

    Joined:
    Mar 9, 2013
    Posts:
    678
    I am sorry, but how many times should I say that the purpose of the demo is to stress test my API? I did not and do not want to reduce my tick rate for this demo. For the last time: the purpose of my example is to show that ENet + ECS can handle 5k connections at a 64 tick rate and 320000 messages per second with 0 GC allocations in a fully multi-threaded way. What you describe is the last thing a developer has to worry about.
    First you have to make the thing work, and then you have to optimize it. If I have handled the 64-tick scenario, that means anything below it can be easily achieved.
     
    raul3d, nirvanajie, fholm and 5 others like this.
  37. fholm

    fholm

    Joined:
    Aug 20, 2011
    Posts:
    2,033
    Read the last paragraph, where i specifically say what you are saying.
     
  38. Micz84

    Micz84

    Joined:
    Jul 21, 2012
    Posts:
    241
    What an attitude. I would like to see the valuable projects you have shared. This kind of post is valuable because it shows what can be done with ECS. Maybe he plans to put it on the Asset Store in the future, so he does not want to share the source code.
     
    nxrighthere likes this.
  39. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    538
    Those who were really looking for experience and code got it; there are no secrets. He gladly shares stuff across this forum quite often (especially in this forum section); you only need to ask.
     
    Last edited: Mar 5, 2019
    raul3d likes this.
  40. PhilSA

    PhilSA

    Joined:
    Jul 11, 2013
    Posts:
    1,084
    I think it's time to take a chill pill, everyone
     
  41. hippocoder

    hippocoder

    Digital Ape Moderator

    Joined:
    Apr 11, 2010
    Posts:
    25,620
    I agree! I've cleaned up the thread so discussion can continue. I did it one side now this side :) peace all.
     
    raul3d, bhseph, Knightmore and 5 others like this.
  42. wobes

    wobes

    Joined:
    Mar 9, 2013
    Posts:
    678
    Much appreciated.
     
    hippocoder likes this.
  43. eizenhorn

    eizenhorn

    Joined:
    Oct 17, 2016
    Posts:
    1,522
    @hippocoder you're needed here
    EDIT: Lol, I just hadn't refreshed the page; you're actually here :D
     
    hippocoder likes this.
  44. GCat

    GCat

    Joined:
    Jul 31, 2012
    Posts:
    176
    Thanks wobes & nxrighthere, good learning material!
     
    Last edited: Feb 23, 2019
    wobes, Knightmore and nxrighthere like this.
  45. wobes

    wobes

    Joined:
    Mar 9, 2013
    Posts:
    678
    @PhilSA hi there. Do you have any plans of making an ECS version of Kinematic CC?
     
    e199 likes this.
  46. PhilSA

    PhilSA

    Joined:
    Jul 11, 2013
    Posts:
    1,084
    (I am guessing you are asking this in the context of this network test)

    It's something I really want to do, but these days it's hard for me to find time for it. I also wouldn't want it to become a support nightmare, so I might wait a little until ECS matures.

    And lastly, I don't think CapsuleCastCommand can return multiple hits yet, and OverlapCapsuleCommand and ComputePenetrationCommand don't exist either. These would be a requirement if we want to jobify the most expensive part of the controller.
     
    Last edited: Feb 25, 2019
    Vincenzo, nxrighthere, wobes and 2 others like this.
  47. wobes

    wobes

    Joined:
    Mar 9, 2013
    Posts:
    678
    Thank you for the informative reply.
     
  48. Staakman

    Staakman

    Joined:
    May 15, 2013
    Posts:
    6
    Hi @wobes and @nxrighthere. So I found this benchmark (https://github.com/nxrighthere/BenchmarkNet/wiki/Benchmark-Results) where they compare several networking libraries. As ENet has the best performance, I would like to learn more about that library (although this benchmark could be a bit biased as @nxrighthere wrote it :D). Unfortunately, there's not much documentation, nor are there any examples. So, randomly clicking links in Google to find more information or examples, I finally ended up here.

    Is there any way you could share the code? Thanks in advance.
     
  49. nxrighthere

    nxrighthere

    Joined:
    Mar 2, 2014
    Posts:
    538
    Because ENet is written in C, with all the benefits that carries, and is based on the well-known, efficient polled event model. Those are two of several reasons why it's fast.


    Check your PM.
     
    Last edited: Mar 27, 2019
    Staakman, Knightmore and wobes like this.
  50. unity_ryP9OKfyYBOFHw

    unity_ryP9OKfyYBOFHw

    Joined:
    May 18, 2019
    Posts:
    2
    Can anyone please explain how @wobes got 6 bytes for position and 4 bytes for rotation?

    gafferongames.com is not available anymore :(
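
    The thread never states which scheme was used, so the following is only one common way to arrive at those sizes: quantize each position axis to 16 bits inside a known range (3 * 2 = 6 bytes), and pack the rotation with the "smallest three" technique, storing the index of the largest quaternion component in 2 bits and the other three components in 10 bits each (32 bits = 4 bytes). The world bounds and all names below are assumptions for illustration.

```csharp
using System;

public static class Compression
{
    // Position: map one axis from [min, max] to 16 bits. Three axes take 6 bytes.
    public static ushort QuantizeAxis(float value, float min, float max)
    {
        float t = (Math.Max(min, Math.Min(max, value)) - min) / (max - min);
        return (ushort)Math.Round(t * 65535f);
    }

    public static float DequantizeAxis(ushort q, float min, float max)
    {
        return min + (q / 65535f) * (max - min);
    }

    // Once the largest component of a unit quaternion is dropped,
    // the remaining three lie within [-1/sqrt(2), 1/sqrt(2)].
    private const float Range = 0.70710678f;

    // Rotation: 2-bit index of the dropped component + 3 * 10 bits = 32 bits.
    public static uint PackRotation(float x, float y, float z, float w)
    {
        float[] q = { x, y, z, w };
        int largest = 0;
        for (int i = 1; i < 4; i++)
            if (Math.Abs(q[i]) > Math.Abs(q[largest])) largest = i;

        // q and -q describe the same rotation, so force the dropped component positive.
        float sign = q[largest] < 0f ? -1f : 1f;

        uint packed = (uint)largest;
        for (int i = 0; i < 4; i++)
        {
            if (i == largest) continue;
            float t = (q[i] * sign + Range) / (2f * Range); // normalize to [0, 1]
            t = Math.Max(0f, Math.Min(1f, t));
            packed = (packed << 10) | (uint)Math.Round(t * 1023f);
        }
        return packed;
    }

    // Writes x, y, z, w into q (length 4); the dropped component is rebuilt
    // from the unit-length constraint.
    public static void UnpackRotation(uint packed, float[] q)
    {
        int largest = (int)(packed >> 30);
        float sumSquares = 0f;
        int shift = 20;
        for (int i = 0; i < 4; i++)
        {
            if (i == largest) continue;
            float t = ((packed >> shift) & 1023u) / 1023f;
            q[i] = t * 2f * Range - Range;
            sumSquares += q[i] * q[i];
            shift -= 10;
        }
        q[largest] = (float)Math.Sqrt(Math.Max(0f, 1f - sumSquares));
    }
}
```

    Three ushort values plus one uint add up to the 10 bytes per update used in the bandwidth estimates earlier in the thread. The price is precision: with bounds of -1000 to 1000, each position axis resolves to about 3 cm, and each stored quaternion component to about 0.0014.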