Discussion BURST and AVX512

Discussion in 'Burst' started by laurentlavigne, Oct 20, 2022.

  1. laurentlavigne

    laurentlavigne

    Joined:
    Aug 16, 2012
    Posts:
    5,992
I noticed that AVX-512 was being removed from Intel chips and added to AMD's.
As I'm about to upgrade the old 8400, I was wondering if AVX-512 is even something that Burst supports?
(What about asset import or build?)
     
  2. Neto_Kokku

    Neto_Kokku

    Joined:
    Feb 15, 2018
    Posts:
    1,751
    The main issue with avx512 is the large clock throttling that kicks in when it's enabled. This means that you can actually get a net performance reduction if most of your threads aren't constantly pumping long avx512 workloads.

    Games need to perform a bunch of heterogeneous tasks to produce a frame, only a fraction of which can benefit from vectorization, so the loss in clock speed can negate the jump to 512bit SIMD. Thus Intel saves some coin by limiting or removing it from their consumer CPUs (production workloads are where avx512 can truly shine).

I don't know if AMD's implementation overcomes the issues Intel had; I haven't looked at Zen 4 benchmarks in detail.
     
  3. Mortuus17

    Mortuus17

    Joined:
    Jan 6, 2020
    Posts:
    105
AVX512 is an insane instruction set. The ternary logic instruction (vpternlog) is my favorite: it can take any boolean expression of three variables, like
(bool a, bool b, bool c) => a ^ (b ? c : !a)
and execute 3 of them in one clock cycle, which would take up to 9 mostly interdependent instructions in AVX2 code. Another honorable mention is true vector processing with masked operations. AVX512 instructions also provide "overloads" for 128- and 256-bit vectors, respectively, which means the down-throttling is not as relevant as it seems at first.

Speaking of throttling: it is nothing new. SSE had it, and especially AVX did (and still does!!!), though it was always limited to the first couple of architectures supporting the new instruction set extension.

What I personally think is the biggest problem is the absurd fragmentation of AVX512. There is AVX512 Foundation, which is a baseline, but also many extensions to the AVX512 extension, which not every AVX512-capable CPU supports. A CPU might support AVX512DQ and AVX512BW, the next one only supports one of those, and yet another supports neither. This is a nightmare for the Burst team to implement, as well as a nightmare with regard to build target choice and build size.

But no doubt: AVX512 is the future and will not be surpassed by another vector size increase for decades. I mean, 512-bit registers are the exact size of a cache line, and the masked operations use 64-bit mask registers: one bit for each of the 64 byte lanes in a 512-bit register. It's a perfect fit ^^
     
  4. laurentlavigne

    laurentlavigne

    Joined:
    Aug 16, 2012
    Posts:
    5,992
  5. Mortuus17

    Mortuus17

    Joined:
    Jan 6, 2020
    Posts:
    105
@laurentlavigne Low-level code and micro-optimizations are my jam (as opposed to algorithmic optimizations, although they can be fun too).
Working on a WW2-related project with 1 coder + outsourcing everything else. Entity instance counts of 20 million are not rare in it, which pretty much requires DoD and optimal machine code, as well as maximum data compression with minimal performance trade-offs (for example, an enum with only "None", "Poor", "Mediocre" and "Good" requires just 2 bits of storage...). Glad you liked it!
     
    laurentlavigne likes this.
  6. TheOtherMonarch

    TheOtherMonarch

    Joined:
    Jul 28, 2012
    Posts:
    791
    Last edited: Oct 22, 2022
  7. Mortuus17

    Mortuus17

    Joined:
    Jan 6, 2020
    Posts:
    105
    Also Linus Torvalds:
    "Anyway, I would like to point out (again) that I'm very very biased. I seriously have an irrational hatred of vector units and FP benchmarks. I'll be very open about it. I think they are largely a complete waste of transistors and effort, and I think the amount of time spent on them - both by hardware people and by software people trying to use them - has been largely been time wasted."

    He dislikes SIMD hardware in general, which is irrational.

    https://www.realworldtech.com/forum/?threadid=193189&curpostid=193203
     
  8. laurentlavigne

    laurentlavigne

    Joined:
    Aug 16, 2012
    Posts:
    5,992
Micro opts are cool! Amiga!
I'm not familiar with the bit ops that your repo does because I don't understand how they can be useful for a game besides layer-mask compares. Can you give some examples of how you use them in your game?
And 20M entities... wow. Are you simulating an entire country's population?

    That settles it.
     
    Last edited: Oct 23, 2022
  9. Ryiah

    Ryiah

    Joined:
    Oct 11, 2012
    Posts:
    20,124
    Zen 4 overcame them. Higher performance with very minimal changes to clock frequency, temperature, and power.

    https://www.phoronix.com/review/amd-zen4-avx512/
     
    Last edited: Oct 23, 2022
    Neto_Kokku likes this.
  10. Mortuus17

    Mortuus17

    Joined:
    Jan 6, 2020
    Posts:
    105
@laurentlavigne I do also "simulate", i.e. cheat each country's population into existence (they're just numbers and some code), but it's a game focused on the military, where the 20M+++ entity count refers to the number of soldiers on planet Earth during WW2. These actually hold data (mostly booleans, actually stored as 1 bit each), which is what I'd define as an entity. Other data is something like the enum I mentioned, and x86 SIMD instructions are OK at unpacking and consequently manipulating them.

This is also the most obvious way in which my libs, which almost only grew as needed, are useful: manipulating and (un)packing bit fields. The SIMD lib is mostly math, though, so I'm not actually sure what you're referring to :D
It is still useful to be aware of specific instructions, as that is the only way you can exploit them, which sometimes requires a bit of up-front work (just like DoD does, when trying to utilize caches and SIMD).
Since this response is already SO MUCH off-topic, I won't go into details, but I can recommend the chess programming wiki (https://www.chessprogramming.org/BMI2), which finds a use for almost each and every x86 instruction in the craziest ways imaginable; it should make your brain explode and hint at the optimization possibilities for almost any game there is.