Search Unity

runtime ai scripting in burst

Discussion in 'Entity Component System' started by nijnstein, Aug 19, 2021.

  1. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    hi, anyone any experience with a light weight scripting engine that can be used from burst code with only minimal symantics? Im currently developing something that compiles a c-like script into a struct with a fixedbyte[512] for the code segment (8 bit words) and a fixedfloat[16] for the data stack. (enough for what i want for now) with a simple interpreter im able to use it in burst jobs and while i havent tested any performance it feels very fast.. currrently i have support for scripts with simple arithmetic and im adding support from the functions from unity.mathematics but i guess after that ill be adding the next thing...

    it feels like reinventing the wheel. what do you guys use in burstable code? everything i seem to find looks way to heavy and impossible or not performant enough to use from burst?
     
  2. Ashkan_gc

    Ashkan_gc

    Joined:
    Aug 12, 2009
    Posts:
    1,124
    The data structures required for script execution are usually full of pointers and ASTs and string manipulation, parsing and ... you will probably will not benefit too much from coding the interpreter in burst, however it can create a set of assembly like instructions to be executed by a burst runtime.
     
  3. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    i compile the script into bytecode much like assembly, the burst interpreter for that bytecode is very simple. no pointers, no strings and no variable length structures in the native side of things:

    Code (CSharp):
    1.  struct BurstScriptExecutable
    2. {
    3.     /// <summary>
    4.     /// unique script id
    5.     /// </summary>
    6.     public int Id;
    7.  
    8.                
    9.     /// <summary>
    10.     /// byte code compiled from script
    11.     /// </summary>
    12.     public fixed byte code[512];
    13.  
    14.     /// <summary>
    15.     /// size of code in bytes
    16.     /// </summary>
    17.     public int code_size;
    18.  
    19.     /// <summary>
    20.     /// index into bytecode, next point of execution, if == code_size then end of script is reached
    21.     /// </summary>
    22.     public int code_pointer;
    23.  
    24.     /// <summary>
    25.     /// index into dataregister, starts from data[start_stack_index]
    26.     /// </summary>
    27.     public int stack_pointer;
    28.  
    29.     /// <summary>
    30.     /// input, output and scratch data fields
    31.     /// </summary>
    32.     public fixed float data[32];
    33.  
    34.     /// <summary>
    35.     /// nr of data registers
    36.     /// </summary>
    37.     public int data_size;
    }

    performance is already as i want it in burst jobs, its only that it feels like im making stuff already made.. if not ill probably put in on github or something when ready.. but id rather use a proven solution.

    Using the burst as a jit like how it would be possible with c# to compile in runtime into il seems not possible for now, id really like to be able to do that. I could outout IL instead of bytecode but could i have the burst compiler compile it in runtime and give me back a function pointer?? no idea
     
  4. Ashkan_gc

    Ashkan_gc

    Joined:
    Aug 12, 2009
    Posts:
    1,124
    It is not likely that people built scripting runtime in burst or at least built and shared it online too. Less than 100 real companies probably are using it directly at this moment and most probably only for calculation heavy stuff.
     
  5. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    got it working and its fast, 10 million times running a script is so fast i had to check my timins tripple and again:

    my results for the following test:

    #bscript test
    a = 1 * 2; # a = 2
    b = -1 * 2; # b = -2
    c = a / b; # c = -1
    d = c * 100; # d = -100
    e = c + 1 * 100; # e = 0 (no muplication rules yet - should be simple node-reorder)
    f = c + (1 * 100); # f = 99

    This is compiled in into 44 bytes of bytecode and a stack of 9 floats and put into the struct in the previous post, some prelimenary results in the editor: (i9900k, gtx1080)

    (note that on the 2nd run compilation is largely skipped)




    im suprised that even with a clumsy interpretor i get better performance then i wanted :) i guess my cpu likes that everythin fits in a single cache line

    next ill be adding some control flows, vectors (float3) and functions and it should be usable.

    anybody an idea how to have burst compile a jump table from a switch statement on a byte enum? i guess that theres some codestyle to adhere to for ncpp to convert it in a jumptable.. it wont for now.. its the biggest slowdown in the interpreteror and using arrays of function pointers.. i dont know if and optimizer will benefit from that

    10.000.000 executions of this in less then 3 ms in editor, i wasnt expecting that even for such simple script


    edit: results in runtime are even more dramatic, 0.1ms in burst over 16 cores, for 10M executions of the above script:
     
    Last edited: Aug 21, 2021
  6. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    i think i will have it support something like this, it is based on a tokenizer but there is no grammer (intentionally)

    Code (CSharp):
    1. #
    2. #  burst script test 6
    3. #
    4. # - no declaring variables, this is not needed
    5. # - NO SCOPE
    6. # - no strings
    7. # - no allocations
    8.  
    9.  
    10. a = 1 * 2;             # a = 2
    11. b = -1 * 2;          # b = -2
    12. c = a / b;             # c = -1
    13. d = c * 100;         # d = -100
    14. e = c + 1 * 100;     # e = 0   (todo - reorder nodes in parsetree)
    15. f = c + (1 * 100);      # f = 99
    16.  
    17. #
    18. # functions
    19. #
    20. g = abs(c) * f + (max(a, b, c, d) * 4)
    21.  
    22. #
    23. # conditions
    24. #
    25. w = (a > b) & (c < d) ^ (e > f);   # note single char operators, executed left to right
    26. w = !w;
    27.  
    28. #
    29. # control flow
    30. #
    31. if(b > 0)
    32.     f = 16;
    33. else
    34.     f = 4;
    35.  
    36. if(f = 4) {
    37.     h = 50 - min(1, 2, a, b, c, d, e, f) * 2;
    38.     i = 1;
    39. }
    40. else
    41. {
    42.     h = 1
    43.     i = 1
    44. }
    45.  
    46. while( a > b )    # with support for break and continue
    47. {
    48.   b = b * 2;
    49.  a = b;
    50. }
    51.  
    52. #
    53. # jumps
    54. #
    55.  
    56. @label1:
    57.  
    58. goto label1;             # i like goto
    59. jump_z(a, label1);    #jump if a = 0 to label1
    60. jump_gz(a, label1);     # jump if a greater > 0 to label1
    61. jump_nz(a, label1);     # jump if a not zero to label1
    62. jump_lz(a, label1);     # jump if a smaller < 0 to label1
    63.  
    64. #
    65. # vectors
    66. #
    67. j = [1, 2, 3];
    68. k = [3, 2, 1];
    69.  
    70. m = j * k;
    71.  
    72. n = [1, 2, 3] * 4;
    73.  
    74. #
    75. # vector ops
    76. #
    77. o = normalize(n);
    78. p = min(n);           # should return smallest element
    79. q = min(n, m);        # should return smallest vector n or m
    80. r = min_arg(n, m); # should return smallest element of each vector as a [,,]
    81.  
    82. #
    83. # functions
    84. #
    85. function add1(a) { return a + 2; }
    86. function add2(a) { a = a + add(2); a = a + add(1); return a; }
    87.  
    88. s = add2(1))a
    89.  
    90. #
    91. # signals
    92. #
    93. signal_a = 80808080;
    94.  
    95. set(signal_a, 1);
    96.  
    97. t = wait_for_signal(signal_a);       # will yield until signal > 0
    98.  
    99. #
    100. # yield
    101. #
    102. yield;        # return back execution untill next frame
    103. yield a;      # yield for n frames
    104. return;       # return indefenately
    105.  
    106. #
    107. # stack stuff (more the 32 (config) push will result in allocations by the interpretor
    108. #
    109. pop a
    110. push a
    111. peek a
    112.  
    113. #
    114. # constants
    115. #
    116. u = deltatime;
    117. v = elapsedtime;
    external functions can be added as compiled fragments of bytecode for the interpretor based on the code the compiler finds, or externally from some library of scripts while some will be built in (math functions etc), float3 and quaternions will be compatible with vectors of size 3 resp. 4

    what do other people think i really miss? scoping inside a fragment like this is probably something i dont want/need/see much use for it as functions will be executed in a different context but hey maybe other poeple really want it.. there are some performance issues implementing that while it ass little for simple scripts

    strings and allocations wont be supported
     
    Last edited: Aug 21, 2021
  7. Ashkan_gc

    Ashkan_gc

    Joined:
    Aug 12, 2009
    Posts:
    1,124
    Depends on users of your scripts but probably having jumps and goto and jze and ... is not a good idea if those who use this are designers/high level programmers
    if you are compiling a higher level language to this byte-cocde then my point is mute of course.
     
  8. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    during startup it compiles in managed code something like:"
    Code (CSharp):
    1. f = 1 + 2 * 2 + 5;
    into:
    Code (CSharp):
    1. 001 128 017 130 129 002 131 000
    readable:
    Code (CSharp):
    1. assign <f> fma #2 #2 #1 + #5 nop
    it correctly re-ordered for multiplication while analysing the parsetree, then the optimizer used pattern recognition to replace 'a + b * c' (in any order) into fma b c a :p, this results in a 8 bytes of bytecode and only 3 switches because it can 'prefetch' parameters for an operation like fma (from 5 to 1 switch). this is getting fun :eek: i could also void the whole call because the parameters are constant but i dont want to make that optimization yet, ill save that for last.

    guess ill add if/then for and while loops as nobody i asked knew what jz ment...

    burst blasts through the bytecode this and attaching the resulting scripts with ibufferelementdata entities gives me a fast and clean solution for having millions of entities with unique behaviour (if i could only render that on my 1080)



    i have a hard time myself believing my own benchmarks, note that this is for 10M executions of the script in burst.
     
    Last edited: Aug 22, 2021
    apkdev likes this.
  9. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    burst really likes scripting like this, the cache would probably glow red hot on running it 10m times... now i need to find out a nice method to get the data from IComponentData... i could point the stack and map fields without allocations or copies but that seems nasty..

    functions, controlflow, i dont really need it right now but they will follow


    Code (CSharp):
    1. Reading scriptfile: ai/scripts/test5
    2.  
    3. CODE:
    4.  
    5. a = 1 * -2;                # -2
    6. b = -1 * 2;                # -2
    7. c = a / b;                # 1
    8. d = c * 100;            # 100
    9. e = 1 + 2 * 2;             # 5
    10. f = 1 + 2 * 2 + 5;        # 10
    11. g = 1 + 2 / 6;            # 1.3333
    12. h = 1 + -2 / 6;            # 0.6666
    13. i = 1 - 2 * 2 + 5;      # 2
    14. j = 1 + 2 * 2 - 5;      # 0
    15. k = 1 +-2 * 2 + 5;      # 2
    16.  
    17. COMPILE: 13,0137 ms
    18.  
    19. CODE SEGMENT: 83 bytes
    20.  
    21. 001 128 129 005 003 130 000 001 131 003 129 005 130 000 001 132 128
    22. 004 131 000 001 133 132 005 134 000 001 135 017 130 130 129 000
    23. 001 136 017 130 130 129 002 137 000 001 138 017 130 144 129 000
    24. 001 140 018 130 144 129 000 001 141 018 130 130 129 002 137 000
    25. 001 142 017 130 130 129 003 137 000 001 143 018 130 130 129 002
    26. 137 000
    27.  
    28. ASSEMBLY:
    29.  
    30. set <a> #1 * - #2 nop set <b> - #1 * #2 nop set <c> <a> / <b> nop set <d> <c> * #100 nop set <e> fma #2 #2 #1 nop set <f> fma #2 #2 #1 + #5 nop set <g> fma #2 #0,166666672 #1 nop set <h> fsm #2 #0,166666672 #1 nop set <i> fsm #2 #2 #1 + #5 nop set <j> fma #2 #2 #1 - #5 nop set <k> fsm #2 #2 #1 + #5 nop
    31.  
    32. DATA SEGMENT: 17 floats
    33. var 1    name = a                                 value = -2               refcount = 2
    34. var 2    constant                                 value = 1                refcount = 18
    35. var 3    constant                                 value = 2                refcount = 28
    36. var 4    name = b                                 value = -2               refcount = 2
    37. var 5    name = c                                 value = 1                refcount = 2
    38. var 6    name = d                                 value = 100              refcount = 1
    39. var 7    constant                                 value = 100              refcount = 2
    40. var 8    name = e                                 value = 5                refcount = 1
    41. var 9    name = f                                 value = 10               refcount = 1
    42. var 10   constant                                 value = 5                refcount = 8
    43. var 11   name = g                                 value = 1,33333337       refcount = 1
    44. var 12   constant                                 value = 6                refcount = 1
    45. var 13   name = h                                 value = 0,6666666        refcount = 1
    46. var 14   name = i                                 value = 2                refcount = 1
    47. var 15   name = j                                 value = 0                refcount = 1
    48. var 16   name = k                                 value = 2                refcount = 1
    49. var 17   constant                                 value = 0,166666672      refcount = 4
    50.  
    51. BENCHMARK: running 10.000.000x
    52.  
    53. execution benchmark (single thread c#):   27.149 ms    
    54. execution benchmark (parallel burst):        0.1333 ms        
    55.  
    56.  

    the referencecounts for constants are bugged... they are mainly used to remove divisions and replace the initial value with its inverse
     
    apkdev likes this.
  10. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    as expected im getting carried away a bit with stuff, this simple script:

    Code (CSharp):
    1.  
    2. input a 0 4;
    3. output a 0 4;
    4.  
    5. a = a + 1 * (1 / 2) * 2;
    6.  
    will convert into the following 11 bytes of bytecode:

    Code (CSharp):
    1.  
    2.  
    3. 001 128 128 002 031 004 131 131 133 132 000
    4.  
    5. // set <a> <a> + mula / #1 #1 #0,5 #2 nop
    6.  
    7.  
    through several steps, the input code is parsed into a token tree, then an analyser gets to check and optimize nodes according to some rules after which the compiler will output bytecode. The bytecode is then further optimized using pattern recognition. In this snippet the following things happen after parsing the token tree:

    1. analyser ensures order of operation
    Code (CSharp):
    1. a = a + (a * (1 / 2) * 2)
    2. the division is converted into a multiplication with its inverse as the division was determined to be with a constant
    Code (CSharp):
    1. a = a + (a * (1 * 0.5) * 2)
    note that the analyser will use reference counting to either replace the original value in the stack or to create a new constant value on the interpretor stack.
    3. the analyser removes unneeded parenthesis
    Code (CSharp):
    1. a = a + (a * 1 * 0.5 * 2)
    4. the optimizer kicks in and sees multiple multiplications and an addition and replaces it
    Code (CSharp):
    1. a = a + (mula(a, 1, 0.5, 2))
    5. finally, the parenthesis are not needed anymore as now the multiplications are 1 function and operator precedence is not a problem anymore
    Code (CSharp):
    1. a = a + mula(a, 1, 0.5, 2)
    the interpretor can now very efficiently produce an execution path in runtime/burst from the 9 bytes of code the script resulted in
    Code (CSharp):
    1. 128 128 002 031 004 131 131 133 132
    The assingment statement 'set' and all 'nop's are not actually needed to correcly execute any code the interpretor outputs.

    This all works better then expected and theres a lot more then this... i keep finding methods to improve the execution of the script by the interpretor.

    The input/output at the start of the script defines the mapping for io vars, currently it directly maps with pointers
    but it should become optional to copy or direct reference, bytesize is used not to determine the parameter types as those are inferred from their usage, it is used to determine if all parameters can be set on starting the script using a single native memset as they would be located in one block

    Code (CSharp):
    1.  
    2. input <id> [offset] [bytesize];  // map identifier to memory offset from the data used to run the script on
    3. output <id> [offset] [bytesize] // map identifier back
    4.  
    Next up is control flows and vector types.


    Some results:

    The following code compiles into 59 bytes and a stacksize of 13 floats (this could be a lot smaller but it is not optimized yet), i takes 0.33ms on average to compile in single threaded c#
    Code (CSharp):
    1.  
    2. a = a + 1 * (1 / 2) * 2;    
    3. b = 2 * (2 * 2 * 2) * 2;    
    4. c = 2 - 2 - 2 - 2 - 2 - 2;  
    5. x = 1 & -2 | 1;
    6. y = 1 & 2;  
    7. z = 1 & 2 & 3 & 4 & 5;      
    8.  
    and results in the following benchmark:

    1st run



    2nd run



    that is 10M times executed in 0.014s :)

    just maybe ill get scripting paralel in burst just as fast or faster as doing it direcly in c# without script or burst, thats just awesome. A really large optimization is that burst makes a jump table of all switches as in c# i wont get it compile like that.

    - note that for testing it evaluates expresions leading to non-referenced values aka useless calculations are performed, this is for evaluation purposes only, in release you would never calculate anything as not a single value is returned via output.
     
    Last edited: Aug 28, 2021
    Bivens32 likes this.
  11. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    - should be 10m times in 0.016 ms, i keep getting a spam error on editing the post

    - the second run has more stuff in close in memory, it shows how with ecs patterns it becomes much easier to have efficient memory acces by default
     
    Last edited: Aug 28, 2021
  12. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    its comming together, vectors are in:

    Code (CSharp):
    1. # define a float4
    2. f = (1 2 3 4);            
    3.  
    4. # perform scalar * vector * scalar operation
    5. g = 2 * f * 2;
    6.  
    7. COMPILE: 0,4084 ms
    8.  
    9. CODE SEGMENT: 15 bytes
    10.  
    11. 001 128 132 133 134 135 000 001 136 133 005 128 005 133 000
    12.  
    13. ASSEMBLY:
    14.  
    15. set f 1 2 3 4 nop set g 2 * f * 2 nop
    16.  
    17. DATA SEGMENT: 12 floats
    18. var 1    name = f [4]                             value = 1 2 3 4                          refcount = 3
    19. var 2    constant                                 value = 1                                refcount = 2
    20. var 3    constant                                 value = 2                                refcount = 6
    21. var 4    constant                                 value = 3                                refcount = 2
    22. var 5    constant                                 value = 4                                refcount = 2
    23. var 6    name = g [4]                             value = 4 8 12 16                        refcount = 1
    24.  


    A vector is deduced from the data following an operation.

    Most trouble was with the optimizer to check data sizes of variables while matching on bytecode patterns, the interpretor has very little knowledge of it other then it builds from the byte and a float[] with data. each operation > 128 is actually the offset (-128) into that table, maybe its best to have code and data share the same fixedbyte[] (with data using memorya backward offsetted from end index) as for the same sized icomponentdata the range of supported scripts can be larger or the number of parameters/stack vars can be larger for the same sized data block;

    Im not sure if it should be just 1 IScriptData with a preset size defined per project, or if i should have different IComponentDatas with different sizes.. for me 192 (i think i can get away with 64 bytes total for most usages for me, remember, unique script/entity) byte of shared code/data memory would probably be more then enough. Chunks like the smaller ones but i really want to test soon how much structure size influances performance.

    next up:
    - constant vectors, could ditch most of the scalars then from stack (only the '2' remains)
    - functions, math utilities mosly, anything to get smaller code for repeating tasks would be nice too.. option to map a few byte opcodes to native function pointers for flexibility or is that way too much..
    - control flow, ifthen, while etc.
    - small constant values should use non-used operand and varid opcode space, these would then not needed to be fetched but it would only work for small integers and booleans; the interpretor should be able to figure it out by using 1 byte as index for the id operation (128 now)
    - booleans, its very tempting to use 1 64 bit datatype and intrinsics to manipulate it and make 64 bools available to the script for the cost of 4 bytes of memory, boolean operations work but they use a float as backing field..
    - lot of testing.
    - it wont work on mobiles as i used sse intrinsics in the interpretor and plan to use it a lot more for vector math, i dont know if a neon code path would be nice or just a safe version for burst to figure out... too much questions time to sleep and work tomorrow :p
     
    Last edited: Aug 30, 2021
    apkdev likes this.
  13. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    i guess my biggest question is how to organize this nicely for reuse, when it has a data stack and it shares chunk memory space with your entities then it seems very wastefull to have duplicates of data being copied to and from the script, pointing to everything would force unsafe code (it is now but should not be later), having properies on each script means probably generating code to compile with your project for easy data access if the only place it is stored is in the script's stack..
     
    apkdev likes this.
  14. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    small update ;)

    Code (CSharp):
    1.  
    2. [B]CODE:[/B]
    3. aa = 1 + 1 * (1 / 2) * 2;  
    4. ab = 2 * (2 * 2 * 2) * 2;  
    5. ac = 2 - 2 - 2 - 2 - 2 - 2;
    6. ax = 1 & -2 | 1;              
    7. az = 1 & 2 & 3 & 4 & 5;  
    8. aw = 1 * 10 * 3 * (3 + 4 * 2);
    9.  
    10. [B]CODE SEGMENT: 72 bytes[/B]
    11. 024 018 037 004 085 085 107 086 019 001 128 085 002 025 000 001
    12. 129 037 005 086 086 086 086 086 000 001 130 039 086 086 086 086
    13. 086 086 000 000 001 131 085 006 003 086 007 085 000 001 132 040
    14. 005 085 086 087 088 133 000 024 018 033 088 086 087 019 001 134
    15. 037 003 085 090 087 005 025 000
    16.  
    17. [B]ASSEMBLY:[/B]
    18. push ( mula / 1 1 0,5002 ) set aa 1 + pop nop set ab mula * 2 2 2 2 2 nop set ac suba 2 2 2 2 2 2 nop nop set ax 1 &- 2 |1 nop set az all * 1 2 3 4 5 nop push ( fma 4 2 3 ) set aw mula - 1 10 3 * pop nop
    19.  
    20. [B]DATA SEGMENT: 7 floats, 28 bytes[/B]
    21.  var 1    name = aa                                value = 2                                refcount = 1
    22.  var 2    name = ab                                value = 32                               refcount = 1
    23.  var 3    name = ac                                value = -8                               refcount = 1
    24.  var 4    name = ax                                value = 1                                refcount = 1
    25.  var 5    name = az                                value = 1                                refcount = 1
    26.  var 6    constant                                    value = 5                                refcount = 1
    27.  var 7    name = aw                                value = 330                              refcount = 1
    28.  
    • all nesting is converted into linear execution using a stack, this wil void all recursive paths in the interpretor so there is no cost in maintaining function stacks by the os for deep recursive calls, there is some extra stack use but that is lighning fast compared to a function call nesting yet another compound iteration. From now on the interpretor follows a completely flat execution path.
    • constant values are reference counted and the most common are encoded in opcodes, saving data space while costing no additional code space, a future update will allow an extended instruction to load constant vectors from the interpretor, i bet that by analysing all scripts in a project the compiler can make a very neat list of 255 vectors[n] that are used most often and make sence as a constant encoded via an opcode; this will save a lot of memory in the data segment
     
    apkdev and Lukas_Kastern like this.
  15. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    Hard too see in bytecode but this is the result for a nested If then / while and a switch statement, flattened and with understanding of vectors of max size 4 x 63:

    Code (CSharp):
    1. set aa 2 nop push ( aa - 1 ) jz @jmp_0 25 ( aa & pop nop ) ( set aa 42 nop set ab mula yield 42 1 1 1 nop ) jump 6 @jmp_1 ( @label_0 set aa 2 nop ) @label_1 set ab 1 nop push @label_3 ( 2 - 10 ) jz 23 @jmp_2 ( aa > pop nop ) ( set aa aa - 1 nop set ab ab + aa nop ) jump_back 29 @jmp_3 jz @label_2 13 @jmp_4 ( ab >= 826 ) ( set ac 2020,2020 nop jump 22 ) @label_4 jz 14 @jmp_5 ( ab >= - 34 ) ( set ac 1010,1010 nop jump 6 ) @label_5 set ac - 100 nop
    and its getting fast.. i cant run a benchmark (next update) at the moment as im completely refactoring all code (it became a bit messy) and i forgot some 'minor' thingy... indexers like . and [], after that its time to start thinking about packaging blobs and making it simple to use. We could seperate code data and stack and depending on the task at hand their final memory location might be very important for maximized performance so i really want to have decent control about layout and destinations in several configurations. You might want to run script code from a manager using only data from a chunk and a stack based on a local stackallock [], or you might want something completely different.

    Then finally for now, the kicker: 2-n wide execution, with simple scripts and very, very many of them we are bound to be able to sort and select execution paths were only data changes, and we should not have to do that every frame akin to how the ecs systems organize themselves, we should be able to burst through that as long as scripts dont use conditional flow control. (and even then, on an unlikely jump that mismatches, we could just take out that 1 path and run it single. The interpretor is flat for ease of coding that, i dont see any problem not solvable but i doo see many fun ones :p

    My purpose was to build somekind of tool to run many changing behaviours as fast as possible with the behaviour determined elsewere but i guess it will still be a while.. there are many topics on maintaining data structures for changing behaviour and they are all complicated because they want AND performance AND flexibility AND are prone to change during development AND with the rigid setups needed to run fast code though c# can be extremely time consuming and not very rewarding given the time spent on it. With a very fast scripting language i hope to solve that problem for my company as i can have just some instuctions at each node based on some higher level representation elsewere. When code is deciding it can always use the higher level repros but on stepping each frame many small scripts in chunks seem a clean solution somewhere midway, with the benefit you can completely pull out and replace its content without code changes. it all becomes data seperate from other processes. We will probably release a compiler/interpretor package to the store in a while for the price of a hamburger or something.. we will also be bringing in a compiler specialist who will help improving the compiler before releasing it as we damn wel be sure its output is always correct and/or validated), that wont be free im afraid.
     
    apkdev likes this.
  16. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    small update for anyone reading;

    I started to make the parts that make it usable, its purpose is to on demand change functionality, either from input, ai or some server. While the interpretor got very fast, i will never be as fast as native code and one big drawback of using a script interpretor for core functionality is performance, most parts of the scripts will be static and running it with an interpretor would be a waste of cpu-cycles, hard coding it would mean making the code more complex if you want to keep supporting on demand scripts.

    For this reason the scriptcompiler has some stages added that allow it to output c# burstcompatible code compiled from the scripts during design time. It will use functionpointers to call the native burst code if the script was known at compile time.

    This has several nice advantage's: a) speed, it is much much faster, b) we can treat all scripts equal regardless what they are compiled into resulting in clean and compact code.
    We can just use the same method everywere and still update parts during runtime without losing performance or getting complex code paths.

    At some point a static repo is kept which is scanned with reflection for scripts:


    Then we can use them everwhere from a manager:

    Code (CSharp):
    1.  
    2.   int res_hpc = mgr.Execute(script_id, data);
    3.  
    Performance is very enjoyable (note the warmup on the first run, each is 1000x, with time div 1000):


    As expected, the interpretor is much slower, needing 0,141 ys for 1 iteration, while the native bursted version takes about 0,004ys :). (EDIT: the benchmark is off.. il2cpp replaced the whole script with a packed move.. forgot it applies constant analysis and i dont for testing... time is higher but still very much faster, as would be expected) The interpretor will get faster by executing multiple equal scripts with different data wide but that will probably halt at 8x speedup and depening on the jumps in the script it might take some instruction padding (nops) to keep all in sync if even possible (many conditional jumps will be hard to solve and it probably will be useless so..), otherwise it will drop paths forcing them back in single execution.

    It might seem wierd at first glance to first do all this trouble to end up with c# code again but this ensures we always have the same behaviour and most importantly: there is no drawback to using the script generally when only small parts are expected to change during runtime, and in a weekly application update important script might become hardcoded f who knows, in the meanwhile a lot of clients on an older code base while still executing the same functionality although with a (in real life) neglegible performance hit.

    The script used in this test is very simple, 4 float multiplication statements, the compiler stage to c# isnt very finished as i wanted to see it work first :- also the handling of data could use some love, i guess we need to play a little with it in order to find out the nicest way to use it, ideally it would just read/write into a component data and just use a pointer to it but there are more usecases.. this needs some time toying with it.
     
    Last edited: Sep 26, 2021
    Ashkan_gc, Guedez and apkdev like this.
  17. Guedez

    Guedez

    Joined:
    Jun 1, 2012
    Posts:
    827
    What is the data layout for the scripts like? Is it trivial to serialize and deserialize the scripts mid execution during runtime?
     
  18. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    all data is put in a byte array as follows:
    Code (CSharp):
    1.  
    2.        /// [....code...|...variables&constants...|.datasizes.|...stack.....................]
    the package when compiled into 1 combined segment looks like this, but you can have each segment seperate so if you want you can have your code in some global repo and the data tightly packed in componentdata (or the other way around), for saving space, the interpretor can use a stack local to its stack saving space as to fit more datasegments into 1 chunk. (you cant use yield in that situation because it uses the stack to preserve state) We want it to be as flexile as possible but at the same time we dont want 1 big mess so this is kinda open atm in how to do this cleanly for all users.

    Code (CSharp):
    1.  
    2. public struct BlastPackageInfo
    3. {
    4.         public BlastPackageMode package_mode;
    5.         public byte allocator;
    6.         public byte unused_1;
    7.         public byte unused_2;
    8.         public ushort code_size;
    9.         public ushort stack_size;
    10.         public ushort data_size;
    11.         public ushort package_size;
    12. }
    13. public struct BlastPackage96
    14. {
    15.        public BlastPackageInfo info;
    16.        unsafe public fixed byte segment[96];
    17. }
    18.  
     
  19. Guedez

    Guedez

    Joined:
    Jun 1, 2012
    Posts:
    827
    I see, so it should be very easy to save it all, be it runtime code or not.
    I wonder:
    Now that you can import runtime burst assemblies, couldn't this be done in an unloaded assembly at runtime and then load it? I think it was introduced in 2021.1?
     
  20. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    I saw there is support now for adding unloaded assemblies but i dont see a way to compile with burst at a client, we use csc.CompileAssemblyFromSource on windows (some testing path), could possibly use compilerservices in mono builds but it will be plain c#. we have codepaths that use the .net c# compiler to generate il in runtime for the scripts compiled ast instead of byte code but for now it will only function on windows and compile time is so slow that it gives management trouble we dont want now, in cases it takes seconds to compile a script in runtime and to justify that it would need to execute said script many 1000s of times. Its gonna be a tradeoff and at this point in the project we are looking how we can have a consistant interface for all of this.

    to be honest we have assumed that using the burst compiler from runtime is impossible and havent even looked into it

    update: if i only assign a constant in a test script the difference between interpretation and native is only about 15x. with wide execution we can probably get very close. it will be interesting to see what normal managed c# would do with the same script, can wide execution of small scripts be faster then managed c#? i think/hope so. but that test will take some time as it will need most components finished
     
    Last edited: Sep 27, 2021
    Guedez likes this.
  21. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    after some sideroads its finally becomming a solid library, 1 feature left to complete and ill put it up in the asset store. I guess nobody will guess that feature :p no one did so far although its in plain sight



    i guess the biggest time eater in this was attemting to make a fitting solution for using it in unity only to realise im not a game developer and i wont know how its used. So i went back to the core and will deliver on the bare metal works for others to make fitting solutions.


    EDIT: well everbody who sees SSMD thinks its a spelling error for SIMD..

    i tried to make that work but it wont easily. instead it assumes that in a massive simulation with millions of scripts running that some scripts, probably many are the same with different data. So SSMD stands for Single Script Multiple Data.



    Its about 33x faster on a single thread without any further optimizations regarding vectorization nor burst :) branches make life a lot less perfect however so mitigating these for the most part and handling outliers is the last thing left to finish.
     
    Last edited: Jan 15, 2022
  22. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    this is getting insane, with burst disabled im getting native speeds running things in large groups

    1,2 ns for the expression a = b * c;

    end up being memory bandwidth limited at only 4 threads..




    With burst in unity :):

     
    Last edited: Jan 17, 2022
  23. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    inching towards a release into the asset store, i never though it would be so hard to say STOP that will be in version 2 but it has been done and somewhat feature limited version of BLAST will hit the asset store very soon, below is a demo of the scene that will be provided with the package, now lets hope someone other then me is interested in using this :p

     
  24. Bivens32

    Bivens32

    Joined:
    Jul 8, 2013
    Posts:
    36
    I am definitely interested. Although I don't have any time to mess around with it rightnow. I have a DOTS based visual scripting solution in the works and I feel my tool will need something like what you have created for it to have maximum flexibility. Maybe if I ever get mine working, I can try and combine the them.
     
  25. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    Interesting, ive been thinking about visual scripting too, in my view it would generate the nodes that normally would be generated from the text in the solution and compile that into bytecode for executing it against a library of function pointers that make up the api.
     
  26. Bivens32

    Bivens32

    Joined:
    Jul 8, 2013
    Posts:
    36
    That sounds complicated to me, but I am sure you can do it. Haha. My solution is much higher lvl api wise. My tool is is meant to control when ecs systems turn on and off by providing a way to combine boolean conditions in any way you want. It's more of a control system than a granular visual programming system if that makes sense.
     
  27. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    luckily it only sounds complicated, it should become simple to use blast, for example, if you can write a burstcompatible function to enable/disable systems and register it then it would be simple to accomplish:

    in c#:
    Code (CSharp):
    1. Blast.Register(TurnOff, TurnOn); // psuedo code to register some function with the scripting api
    2. Blast.Execute(control.bs, componentdata);
    The componentdata maps to the datastack of the script, where control.bs would be script like:

    Code (CSharp):
    1. if(condition) then TurnOn(parameters); else TurnOff(parameters);
    the only caveat is that my current Blast.Register procedure is far from easy atm. Thanks for your comment it got me thinking into how others might be using blast.
     
    Neiist likes this.
  28. MicCode

    MicCode

    Joined:
    Nov 19, 2018
    Posts:
    59
    This looks really interesting, probably will pick it up once it hit asset store.
    Seems useful for game with some programming or for light modding.
    How would it interact with the ecs system?
     
  29. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    if you look at this script and then parameters deduced from it in the blue field, these align in memory and if used with ecs blast should be set to packagemode = ecs where it will use the icomponentdata associated with this script (in the system it takes a pointer to it).

    The icomponent data would have to be handcrafted though to match the data (or just use fixed float arrays, and us e the blastpackage to translate data in/out.

    Maintanance functions for entities are natively coded in the script api and can be used from script code as if they were math functions. These are to be implemented with the external function support in blast (and at the moment isnt :p) and can be extended at design/runtime by users.

    From there many different implementations can apply depending on whats wanted, im will be making some examples for it in the next few weeks before finally releasing, a statemachine in dots, maybe a particle system too as that would show its power very well and some more simple usecases for gameobjects as well.
     

    Attached Files:

  30. MicCode

    MicCode

    Joined:
    Nov 19, 2018
    Posts:
    59
    That sounds great, demo with use case will definitely make the tool easier to understand and integrate. Looking forward to your release.
     
  31. davenirline

    davenirline

    Joined:
    Jul 7, 2010
    Posts:
    982
    Is there a language reference?
     
  32. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    davenirline likes this.
  33. davenirline

    davenirline

    Joined:
    Jul 7, 2010
    Posts:
    982
  34. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    "I don't see support for functions yet" some small section has been added but i will elaborate on how to extende the script api with user defined functions there.
     
  35. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    In my (past) view those were always kindof predefined and have no room in script and initially i wanted to implement them exclusively in c# and register them as an external function and call them from script for performance.

    That said; it would be nice to have so it will definitively be added after the first release.
     
  36. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    you know this is why it takes so long for me to finally release :) i cant go to sleep tonight thinking about functions not being supported initially :p
     
  37. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    well latest game releases didnt help with devtime.... at 2 MB of c# code im finishing of the documentation, lets hope unity accepts the blast package in its store :)

     
    Antypodish and Bivens32 like this.
  38. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    Ashkan_gc and Bivens32 like this.
  39. nijnstein

    nijnstein

    Joined:
    Feb 6, 2021
    Posts:
    78
    version v1.0.4c finally succeeds in the plan originally put down: execute many scripts compiled at runtime in simd instructions, finally i get back to using dots again to put it to work where it belongs as its so much easier to have very many objects at once

    The script: a = a * b; has the following assembly executed after some indexing magic, it returns after needing about 1.5ns per dataset in a 64 wide array.

    mul_loop_indexed_aligned.PNG


    Now the real pain starts.. making this easy to use for a wide range of things, the first person to use it used it to transfrom geolocation data from csv files and.. well i would have never guessed that i would really like some input on how people would use a scripting language in their project, especially one that could make their code faster.

    I use it because i like to torment myself with a debugger (i really do), that voids me as a good use case..
     
  40. sinaari

    sinaari

    Joined:
    Jan 28, 2019
    Posts:
    47
    Hey, this is a very impressive project! How is it going?
    I notice the Github project was not updated for a while, but on the other hand it seems to be pretty well set already.
    What are your plans regarding BLAST further? :)
    I'm still researching it, but it looks very interesting.

    I did have an idea to create a DOTS-based script interpreter for a while, but I absolutely didn't anicipate that someone already tried it, and it turned out so great tbh.
    I was thinking more about something like Lua or Scheme, for general scripting, mainly to avoid Unity compilation time delays and be able to work with DOTS. And of course such language would've been much slower than the succinct computation cores like you demonstrated with BLAST. But! I imagine that if such language was centered around BLAST, as an additional external layer maybe, then it could still have that power for the cases when you need it.

    BTW, I came across BLAST absolutely by accident, searching for some assets that are not in any way related to something like it. Neither Google nor the Asset Store ever showed it to me even when I was directly searching for "Unity scripting", "Unity DOTS scripting", and so on quite a lot.
     
  41. sinaari

    sinaari

    Joined:
    Jan 28, 2019
    Posts:
    47
    Antypodish likes this.
  42. Antypodish

    Antypodish

    Joined:
    Apr 29, 2014
    Posts:
    10,770
    Hey, it looks like great asset.
    It works mostly on my end in Unity 2022.

    Any approach to make it work with jobs?
    It has managed data types.