
Could it be worthwhile for Unity to support half precision (fp16) variables?

Discussion in 'General Discussion' started by Arowx, Mar 4, 2020.

  1. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    It looks like half (fp16) precision could be becoming more popular on CPUs and GPUs due to its usage in Machine Learning.

    In games could half precision be useful, in theory a SIMD instruction could pack twice as many 16 bit half precision operations into a cycle compared to 32 bit float instructions.

    There are inherent range and accuracy issues but as games tend to use lots of activity in proximity to the player maybe this could be utilised to provide more things near to the player.

    Also most particle systems could probably fit within a half precision range with the potential for a boost in performance.

    Do most modern hardware platforms support half precision SIMD and does the Burst compiler provide access to this?

    I know in Unity that about >8k from origin is where floats hit their precision limits, what range would a half have?
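    One way to answer the range question empirically: Python's stdlib can round-trip values through IEEE 754 half precision via the struct 'e' format. A minimal sketch (not Unity code, just an illustration of the format):

```python
import struct

def roundtrip_half(x: float) -> float:
    # Pack into IEEE 754 binary16 (struct's 'e' format) and unpack,
    # yielding x rounded to the nearest representable half value.
    return struct.unpack('<e', struct.pack('<e', x))[0]

print(roundtrip_half(65504.0))       # 65504.0, the largest finite half
print(roundtrip_half(1.0009765625))  # 1.0009765625: the 2^-10 step above 1.0 survives
print(roundtrip_half(8192.5))        # 8192.0: steps are 8 wide in the 8192..16384 range
```

    So half covers roughly ±65504 overall, but with step sizes that grow quickly with magnitude.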
     
    Last edited: Mar 5, 2020
  2. MadeFromPolygons

    MadeFromPolygons

    Joined:
    Oct 5, 2013
    Posts:
    3,875
  3. MrArcher

    MrArcher

    Joined:
    Feb 27, 2014
    Posts:
    106
    Correct me if I'm wrong, but float is single precision (32bit).

    EDIT: OP has edited the original post. It originally specified single precision floats.
     
    Last edited: Mar 4, 2020
    Arowx likes this.
  4. MadeFromPolygons

    MadeFromPolygons

    Joined:
    Oct 5, 2013
    Posts:
    3,875
    I think by single precision he actually meant SIMD? As in single precision SIMD but again I am not really sure
     
  5. MrArcher

    MrArcher

    Joined:
    Feb 27, 2014
    Posts:
    106
    So single precision multiple data structs like Vector3, Color, Matrix4x4 and the like? We're already using those. I think OP was under the impression that float isn't single precision.

     
    MadeFromPolygons likes this.
  6. sxa

    sxa

    Joined:
    Aug 8, 2014
    Posts:
    741
    from https://docs.microsoft.com/en-us/do...e/builtin-types/floating-point-numeric-types#

     
    MadeFromPolygons and MrArcher like this.
  7. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    To clarify it's 16 bit floating point, is this 'half-precision' data type even supported by C#?
     
    Last edited: Mar 4, 2020
  8. XCPU

    XCPU

    Joined:
    Nov 5, 2017
    Posts:
    145
    Hardware-wise it's been supported since about the late-2000s, i7-era processors included.
    DirectX and OpenGL use it, but you almost never see it as a general programming type.
    A little assembly library could easily be made and included in a DLL.
    You'd have to have a very specific use case to go to all that bother.
    In the embedded world it's quite common, which is where I'm from;
    I've used several formats over the years, still do.
     
  9. Ryiah

    Ryiah

    Joined:
    Oct 11, 2012
    Posts:
    20,124
    According to the manual entry on shader data types and precision, standalone graphics cards always process everything in full 32-bit floating point precision regardless of the precision you specify. I don't know how accurate this entry is but it's marked with a recent date.

    https://docs.unity3d.com/Manual/SL-DataTypesAndPrecision.html
     
    Last edited: Mar 4, 2020
    MadeFromPolygons likes this.
  10. MadeFromPolygons

    MadeFromPolygons

    Joined:
    Oct 5, 2013
    Posts:
    3,875
    Yup as far as I am aware that is still the case. @bgolus or @jbooth would be the go to minds on that subject to confirm though
     
  11. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    So in theory a half precision transform would be good out to 60,000 meters (assuming 1 unit is a meter); that does not sound right, as the current 32 bit float transforms seem to get shaky around 8k-10k.

    Or does physics need 4-5 digits of precision to keep things stable?

    And other sources mention that half precision only has accuracy to the nearest 32 when you get to 65k. Has anyone tried using half precision as a 3D transform in game development?

    I wonder if it could really boost 2D graphics as a 16 bit float could easily provide good accuracy within display resolutions.
     
  12. Murgilod

    Murgilod

    Joined:
    Nov 12, 2013
    Posts:
    9,745
    It genuinely does not matter. The gains from this would be so minimal as to be entirely pointless.
     
  13. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,321
    It isn't. Smallest floating point type is float/System.Single, takes 4 bytes/32 bits.
    https://docs.microsoft.com/en-us/do...ce/builtin-types/floating-point-numeric-types

    Half precision will be good up to 2 (two) meters, not 60000.

    Floating point types do not operate in METERS. You start from the smallest precision you want, and then you get the maximum volume before the errors get bigger than the smallest unit.
    System.Single has a precision of 6..9 decimal digits, or 23 bits for the fractional part. If your smallest measurable distance is 1 millimeter, that gives you 1 km (using 10^6 mm) or 8 km (using 2^23 mm) from the origin.

    In the case of HALF, you get 3 decimal digits, or 11 bits. This gives you between 1 meter (10^3 mm) and 2.048 meters (2^11 mm).

    Past that distance you'll be losing precision and things get shaky.

    Using half on a modern CPU is a waste of time, because floating point processor doesn't even have instructions for them, so operations would need to be emulated. Meaning half will be running slower than single.

    So, long story short: with half precision you'll get a couple of meters of usable range, and past that point there will be precision loss.
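    The "smallest step determines usable volume" reasoning above can be sketched in a few lines (an illustration, not a library API; float32 has 23 mantissa bits, half has 10):

```python
import math

def max_range(mantissa_bits: int, smallest_step: float) -> float:
    # In the binade [2^e, 2^(e+1)) consecutive floats are 2^(e - mantissa_bits)
    # apart; return the upper edge of the last binade whose step is still
    # no larger than smallest_step.
    e = math.floor(math.log2(smallest_step)) + mantissa_bits
    return 2.0 ** (e + 1)

print(max_range(23, 0.001))  # 16384.0 -> float32 keeps ~1 mm steps out to ~16 km
print(max_range(10, 0.001))  # 2.0     -> half keeps ~1 mm steps only out to 2 m
```

    The float32 figure lands in the same ballpark as the 8 km estimate (2^23 mm counted from the millimetre side), and the half figure matches the 2 meter limit.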
     
  14. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Intel article on the benefits of using half precision calculations... source


    Given Unity's drive for performance (e.g. DOTS) and the limited cache sizes of modern CPUs, could half precision be used to boost throughput in systems that can fit their data into 16 bits?

    The article mentions the F16C instruction set -> https://en.wikipedia.org/wiki/F16C

    These instructions convert four half precision floating point variables to single precision and are available in most modern Intel and AMD CPUs (see link above).

    So from a DOTS memory and cache bandwidth perspective this could provide a way to 'pump' more data to the CPU whilst running high precision calculations within a limited 'range'.
     
    Last edited: Mar 6, 2020
  15. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    That quote is from the Unity documentation?
     
  16. MrArcher

    MrArcher

    Joined:
    Feb 27, 2014
    Posts:
    106
    The article linked is mainly talking about a bunch of conversion code in the intel compiler which was created for that specific purpose. As others have pointed out, most hardware has been optimized for single precision. If you're asking if there are specific cases where half precision floats may provide some benefit in unity, then the answer is yes, possibly. But that's something you can try for yourself to see if the benefits of switching to half outweigh the hardware optimizations already in place. Can you think of a situation where they'd be useful in a gamedev setting?
     
  17. MadeFromPolygons

    MadeFromPolygons

    Joined:
    Oct 5, 2013
    Posts:
    3,875
    I mean, half precision floats are super useful in compute shaders, I use them all the time for that purpose! Outside of shaders though I have never even tried, but could be a fun exercise!
     
    Last edited: Mar 6, 2020
  18. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,321
    It is from reading the specification for half precision floats, which have 11 bits for the fractional part, and then applying logic and programming knowledge.

    All things considered, half precision is 2 bytes, meaning it can represent 65536 different numbers total. Because there are only 16 bits. It doesn't matter what those numbers are, but there can be no more than 65536 different numbers, because 65536 is the total number of states that can be stored in 2 bytes.

    Representing distances up to 60000 meters with 1 millimeter precision requires 60000 m / 1 mm ==> 60,000,000 different values, which is 26 bits (log2(60000000) ≈ 25.84 and 2^26 = 67,108,864) and that does not fit into 2 bytes.

    In this day and age there's no sane reason to switch to half precision floats when you're programming for PC. Situations where you could possibly want half precision floats would likely involve custom hardware and FPGAs. Not unity engine.
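    The counting argument can be redone in a couple of lines; note that 60,000 m at 1 mm resolution is 60,000,000 distinct values, which needs 26 bits:

```python
import math

values_needed = 60_000 * 1000                     # 60,000,000 millimetres of range
bits_needed = math.ceil(math.log2(values_needed))
print(bits_needed)        # 26
print(bits_needed <= 16)  # False: nowhere near fitting into 2 bytes
```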
     
    Japsu, ShilohGames and Ryiah like this.
  19. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,236
    When it comes to Unity shaders, yes. The half and fixed precision types are ignored and everything is always a full 32 bit float on desktop GPUs.

    That’s not to say desktop GPUs can’t do 16 bit floating point math, in fact they can now. For a long time they’d gotten rid of that in favor of raw 32 bit speed. A few GPUs, like the Pascal based Tesla P100 and the Maxwell based Tegra X1 (as in the Switch), have hardware support for 16 bit precision at double rate execution, like many mobile GPUs do. I believe most Pascal (GTX 1000 series) and all Turing (RTX 2000 series) GPUs have it. Plus the new Tensor Cores only do 16 bit math. It’s primarily for CUDA and AI related stuff that it gets used.

    The latest AMD GPUs also have native 16 bit float support, with Vega (RX Vega 64, Radeon VII) and Navi (RX 5000 series).

    Unfortunately outside of GLES and compute, half precision is only available when using D3D12 or Vulkan, and only when support is enabled... which I don’t think Unity does.

    Performance gains on the GPU can be massive.
    Performance gains on the CPU ... less so.

    I think someone recently tweeted a thing about using 16 bit for storage, 32 bit for execution was the sweet spot for perf on modern CPUs.

    No. That’s just how half precision floats work.

    “3 decimal digits” means the first three digits of a number, not 3 decimal places after the dot. Half precision has a range of -65504.0 to +65504.0, but between 32768 and 65504 it can only represent numbers in steps of 32.0. So it can store 32768 and 32800, but if you try to do 32768 + 10.25, the number won’t change because it can’t store the value 32778.25, and the closest value it can store is 32768. Different ranges of numbers have different levels of precision, that’s just how floating point values work.
    https://blog.demofox.org/2017/11/21/floating-point-precision/

    TLDR: a 32 bit float between 8192.0 and 16384.0 is the same precision a 16 bit float has between 2.0 and 4.0.
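    The 32768 + 10.25 example is easy to reproduce with Python's struct 'e' (binary16) format; a quick sketch:

```python
import struct

def half(x: float) -> float:
    # Round x to the nearest representable half precision value.
    return struct.unpack('<e', struct.pack('<e', x))[0]

print(half(32768.0 + 10.25))  # 32768.0: the step here is 32 wide, so +10.25 is lost
print(half(32800.0))          # 32800.0: a multiple of 32, so it is representable
```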
     
    Last edited: Mar 6, 2020
    Goularou, angrypenguin and Ryiah like this.
  20. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Probably the most specific use case would be particle effects, where you have lots of particles within their own limited volume. So your game is running, hundreds of bullets are flying, and ricochets and particle effects abound. With thousands, or who knows, millions of particles you will start to hit cache bandwidth performance limits on the hardware.

    If you can get away with dropping from single to half precision data you have potentially doubled your bandwidth to the CPU via memory and cache. And instructions like F16C allow the CPU to batch convert half precision data to single precision data on the CPU then process that data via SIMD and then convert it back.

    As shown by the Intel graph above, on higher throughput, more demanding tasks you can gain a significant boost in performance.

    Another use case could be sprites in a 2D game, or massive armies of 2D impostors, Total War style. It could even be useful for short range player bullets.
     
    Last edited: Mar 6, 2020
  21. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Side thought: if memory bandwidth is one of the biggest problems with CPUs, and the reason we need so many levels of cache now, could we have dynamically compressible precision data (like half precision) with smart caches that can decompress and compress the data to and from main memory, thereby boosting bandwidth?
     
  22. MrArcher

    MrArcher

    Joined:
    Feb 27, 2014
    Posts:
    106
    True, so long as all this excitement is happening within a few units of the origin, otherwise the imprecision starts to creep in. Even if you're doing all your simulation math in half precision range, you'll still have to convert to float at some point to render it in the game world, where the imprecision will come through.

    And again, that graph was specifically about the conversion code in the intel compiler, which was optimized for that specific purpose. It's not a generic graph showing performance gains across the board from using half precision floats.
     
  23. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    With half precision available on the GPU, could you at least transfer the data to the GPU as fp16 and then have the shaders do the work to render it in an fp32 world? That way you are gaining bandwidth and throughput to the GPU as well as CPU/RAM.

    Edit: @Joachim_Ante For instance, could the Mega City demo work faster if some of the smaller details were fp16 and rendered on the GPU in fp16? Could it work as a hierarchical GPU LODding boost?

    However nearly all modern x86 CPUs hit bandwidth limits due to cache sizes and latency overheads. For example, notice the stepped nature of the latency of a high end AMD CPU depending on the volume of data (below); it basically maps to the CPU's caches.


    So it might be that below a certain data volume fp16 is a moot point; as you can see in the Intel graph, only at high volumes of data do you gain the most from the overhead of converting vs the boosted throughput.

    However if future CPUs adopt ML cores or add more SIMD FP16 instruction sets the potential doubling in performance for game engines could be useful.
     
    Last edited: Mar 6, 2020
  24. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,321
    To be more specific, floating point formats in general store numbers in
    "A * 2^B" format,
    where A is the fractional part (mantissa) and B is the exponent.
    It is similar to 3.4 * 10^7 notation.
    "Decimal digits" refers to precision in the "A" part.
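    Python's math.frexp exposes exactly this decomposition, returning m and e with x == m * 2**e and 0.5 <= |m| < 1 (a normalised variant of the A * 2^B form above):

```python
import math

m, e = math.frexp(6.5)
print(m, e)              # 0.8125 3, i.e. 6.5 == 0.8125 * 2**3
print(math.ldexp(m, e))  # 6.5: ldexp reassembles the number
```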
     
    MadeFromPolygons likes this.
  25. MadeFromPolygons

    MadeFromPolygons

    Joined:
    Oct 5, 2013
    Posts:
    3,875
    I found that super interesting. I didn't realise floating point numbers were stored using what is basically SI notation underneath, although thinking about it, it makes perfect sense.
     
  26. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,321
    In a language like C++, a floating point value can be represented as a bitfield structure.
    https://en.cppreference.com/w/cpp/language/bit_field

    Single precision float would be something like this:
    Code (csharp):
    struct Single {
        unsigned int fraction: 23;
        int exponent: 8;
        unsigned char signFlag: 1;
    };
    https://en.wikipedia.org/wiki/Floating-point_arithmetic
    This would explicitly demonstrate all the parts of floating point number.

    C# does not have bitfields, but something similar can be implemented with manual bitmasks and properties.
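    The same bitmask-and-shift idea works in any language without bitfields; here is a sketch in Python that uses struct to reinterpret a float's 32 bits (the helper name is just illustrative):

```python
import struct

def float32_fields(x: float):
    # Reinterpret the float's 32 bits as an unsigned int, then mask out the
    # sign (1 bit), biased exponent (8 bits) and fraction (23 bits).
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # stored with a bias of 127
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction

print(float32_fields(1.0))   # (0, 127, 0): exponent 127 - 127 = 0, fraction 0
print(float32_fields(-2.0))  # (1, 128, 0): sign set, exponent 128 - 127 = 1
```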
     
    MadeFromPolygons likes this.
  27. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,321
    Now, fun part.

    Apparently if you're using standard unity shaders, then "fixed" and "half" can be ignored, at least for d3d11 and GLSL. They're converted to floats.

    I'm not sure if maybe under some circumstances they actually turn into different data type, but
    Code (csharp):
    Shader "Unlit/Precision"
    {
        Properties
        {
            _MainTex ("Texture", 2D) = "white" {}
        }
        SubShader
        {
            Tags { "RenderType"="Opaque" }
            LOD 100

            Pass
            {
                CGPROGRAM
                #pragma vertex vert
                #pragma fragment frag
                // make fog work
                #pragma multi_compile_fog

                #include "UnityCG.cginc"

                struct appdata
                {
                    float4 vertex : POSITION;
                    float2 uv : TEXCOORD0;
                };

                struct v2f
                {
                    float2 uv : TEXCOORD0;
                    float3 worldPos: TEXCOORD1;
                    float4 vertex : SV_POSITION;
                };

                sampler2D _MainTex;
                float4 _MainTex_ST;

                v2f vert (appdata v){
                    v2f o;
                    o.vertex = UnityObjectToClipPos(v.vertex);
                    o.worldPos = mul(v.vertex, unity_ObjectToWorld).xyz;
                    o.uv = TRANSFORM_TEX(v.uv, _MainTex);
                    return o;
                }

                float4 frag (v2f i) : SV_Target
                {
                    float4 col = 1.0;
                    float l1 = sqrt(dot(i.worldPos, i.worldPos));
                    //int l2 = l1;
                    //half l2 = l1;
                    fixed l2 = l1;
                    float scale = 1.0;
                    col.x = l1;
                    col.y = l2;
                    col.z = abs(l1-l2)*scale;
                    return col;
                }
                ENDCG
            }
        }
    }
    Produces this kind of output for the fragment part, for example:
    Code (csharp):
    in  vec3 vs_TEXCOORD1;
    layout(location = 0) out vec4 SV_Target0;
    float u_xlat0;
    void main()
    {
        u_xlat0 = dot(vs_TEXCOORD1.xyz, vs_TEXCOORD1.xyz);
        SV_Target0.xy = sqrt(vec2(u_xlat0));
        SV_Target0.zw = vec2(-0.0, 1.0);
        return;
    }
    Basically, it converted fixed to float, realized that subtracting a float from itself results in zero and... optimized the comparison between float and fixed away.
    Trying to use half to transfer data between vertex and fragment shader results in floats being generated.
    And GLSL does not even have half as a data type.

    Fun.
     
  28. MadeFromPolygons

    MadeFromPolygons

    Joined:
    Oct 5, 2013
    Posts:
    3,875
    Yeah, I only really use halves and fixed in compute shaders where they do work (unless I have misunderstood compute shaders this entire time and literally written garbage code without realising....)

    But this sort of info should really be in the docs! There's a ton of Unity shaders out there that use those types and probably don't realise they don't make a difference.
     
  29. MrArcher

    MrArcher

    Joined:
    Feb 27, 2014
    Posts:
    106
    @Ryiah linked to it earlier:

    They're still useful when working on mobile projects, though!
     
  30. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,321
    Tried to use a compute shader with "half" variables to check precision differences...
    Nope! They're still converted to floats and trying to do "var - half(var)" ends up being optimized away and replaced with a zero.

    Apparently...
    https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/dx-graphics-hlsl-scalar

    They aren't really supported on GPUs.
    There's a thread mentioning this as well:
    https://forum.unity.com/threads/dat...fer-floats-into-half-fixed-data-types.390542/
     
  31. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,236
    That.

    Mobile GLSL, which is actually GLSL ES (sometimes called ESSL), does have precision qualifiers. It's one of the main differences between OpenGL and OpenGL ES. The actual qualifiers that fixed, half, and float originally mapped to (lowp, mediump, and highp) exist in GLSL 1.2, but are only "reserved" and don't actually do anything, whereas in GLSL ES 1.0 they actually do something. However it's up to the GPU to determine what it does with them.

    Very few modern mobile GPUs have real fixed precision support, but the precision qualifiers are simply the requested minimum precision. A full 32 bit float fulfills the minimum requirements of both fixed and half, so it's valid to use that, but if the hardware has support for the lower precision types then it usually comes with significant performance & power benefits. These days Unity totally ignores fixed in shaders and always converts it to half or float depending on the compilation target.

    One curious thing about IEEE 754 half precision: it doesn't actually use all 65536 possible numbers. The -65504 to +65504 range uses something like 64,510 of those possible numbers, which includes both positive and negative 0 (since the sign is its own bit). Then there's +/- infinity.

    And then there's NaN. There are 2046 possible NaNs in a half precision float, and most of them go unused. This is because both infinity and NaN have a binary exponent of 11111 (31) to denote a value outside of the valid range, and the 10 bits of mantissa only really have the binary values of 0000000000 == infinity, and any other value == NaN. Some programming languages actually exploit this fact and internally store other data as "signaling NaNs". Like JavaScript, where ever value in the language that's not a number is stored as a NaN.
    https://anniecherkaev.com/the-secret-life-of-nan

    Again, modern GPUs totally do support half precision floats. The problem is Direct3D 10 and desktop OpenGL don't. When I said "compute" I really mean CUDA, and OpenCL.

    Direct3D 11 sort of does, but since most GPUs didn't when Unity added support for that, it probably wasn't given much thought, and it isn't converting half to the min16float that Direct3D 11 actually uses. The funny thing is Unity does convert half (and fixed) to min16float when compiling for mobile!
    https://github.com/TwoTailsGames/Un...blob/master/CGIncludes/HLSLSupport.cginc#L179

    So if you use min16float in your shader, you might actually get real half precision float support when compiling to Direct3D 11! Unfortunately that's just a "might", as it's still up to the compiler and drivers as to whether or not it chooses to actually use it. Explicit support requires Vulkan or Direct3D 12, and some optional settings on the shader compiler to enable them, which I don't think Unity does.
    https://therealmjp.github.io/posts/shader-fp16/
     
    Findeh, neginfinity and Ryiah like this.
  32. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,236
    Back to the original topic of "Unity adding 16 bit support". Lots of internal data in Unity is using 16 bit precision, and sometimes less, for storage reasons. Lots of people will use bitpacking to put several numbers of different precision levels into a single float or int. This is good practice for storage on CPUs and GPUs!

    GPUs support use of different precision types for data storage as well, the best example being Unity's vertex colors being stored as a byte per channel. Direct3D also supports 16 bit float values for the vertex buffer, but I don't think Unity chose to use those, though the on disc data for the mesh may be stored at that precision or less. There's also all of the 16 bit texture formats.
     
    Ryiah likes this.
  33. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,321
    The choice of "number" was not exactly correct on my part. If it has 2 bytes, it has 65536 possible states. Not all of those states might be valid and not all of them will map to real numbers, as described by you.
     
    Ryiah and bgolus like this.
  34. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,236
    I also got the "64510 number of numbers" totally wrong. Heh. Not that it matters that much. It's closer to 63488 (65536 - 2048, which is all infinity and NaN values).
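    The state counting is small enough to brute-force; a sketch that walks all 65536 half precision bit patterns via Python's struct module:

```python
import struct

nan = inf = finite = 0
for pattern in range(1 << 16):
    # Reinterpret the 16-bit pattern as an IEEE 754 half.
    x = struct.unpack('<e', struct.pack('<H', pattern))[0]
    if x != x:                              # only NaN compares unequal to itself
        nan += 1
    elif x in (float('inf'), float('-inf')):
        inf += 1
    else:
        finite += 1

print(nan, inf, finite)  # 2046 2 63488
```

    This confirms both the 2046 NaN patterns mentioned earlier and the 63488 finite values (65536 - 2048).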
     
    angrypenguin likes this.
  35. neoshaman

    neoshaman

    Joined:
    Feb 11, 2011
    Posts:
    6,469
    I love these threads, thank you to all the teachers above!
     
  36. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Could 3D model mesh or terrain data use fp16 to compress most of their data into neighbouring chunks linked to fp32 'root' nodes?
     
  37. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,321
    As mentioned earlier, fp16 cannot express sub-millimeter precision outside a range of 2 meters from the origin. So I wouldn't use it; actually, using 16 bit signed integer fixed point would yield better precision for such models (0.06 millimeter precision for up to 2 units away from the origin). I think something similar was actually used in Quake 1, which used tweening for animating its models. For terrain heightmaps, a 16 bit integer is also a commonly recommended choice.

    All things considered, this data type seems to be only good in scenarios where you need to express small values in the range 0..1.0 and closer to zero.
    However, when actually using such values in practice, people often switch to signed/unsigned byte and signed/unsigned short (int/uint8 and int/uint16 respectively).

    So, minimum usability, as far as I can tell, unless you're dealing with specialized hardware of sorts.

    This type would be useful for floating point rendertargets, deferred rendering and so on. For model compression, not so much.
     
  38. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Found this great source which shows you how to work out the accuracy of floating point numbers

    From this I calculated the following table:

    Code (csharp):
    fp16
    exponent    range            accuracy
    0           1, 2             0.0009765625
    1           2, 4             0.001953125
    2           4, 8             0.00390625
    3           8, 16            0.0078125
    4           16, 32           0.015625
    5           32, 64           0.03125
    6           64, 128          0.0625
    7           128, 256         0.125
    8           256, 512         0.25
    9           512, 1024        0.5
    10          1024, 2048       1
    11          2048, 4096       2
    12          4096, 8192       4
    13          8192, 16384      8
    14          16384, 32768     16
    15          32768, 65536     32
    fp16 provides:
    millimeter (0.001) level accuracy only in the 1-2 meter range*.
    centimeter (0.01) level accuracy in the 2-4 meter range.
    decimeter (0.1) level accuracy in the 4-8 meter range.

    So in theory FPS guns/hands and magazine and objects/items could be fp16.**
    Doors and furniture could fit into the millimeter/centimeter range and scale 1-4 meters.

    Also inversely LOD could use fp16 to reduce the bandwidth of objects at a distance.
    Terrain furniture e.g. flowers, plants, trees, rocks, stones, boulders could probably fit within fp16 accuracy ranges quite well.

    *Assuming 1 unit is a meter.

    **Given the level of detail applied to these meshes in games a fp16 doubling of bandwidth could be a huge boost to performance.
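    The table can be regenerated in a few lines: with 10 mantissa bits, the step inside each range [2^e, 2^(e+1)) is 2^(e-10). (A sketch; note the top range actually ends at 65504, not 65536.)

```python
# Step size of IEEE 754 half precision per power-of-two range.
for e in range(16):
    lo, hi = 2 ** e, 2 ** (e + 1)
    step = 2.0 ** (e - 10)
    print(f"{e:2d}  {lo:5d}..{hi:<5d}  step {step}")
```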
     
    Last edited: Mar 8, 2020
    Jakky27 likes this.
  39. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,236
    Which is why I linked to that page in my first response to this thread.

    Which is why Unity already does this in the places it matters. You can enable mesh compression on import and in the player settings to reduce the build size and by extension load times. But mesh data is rarely the thing that takes up considerable memory space; a 10,000 vertex mesh with UVs, normals, and tangents uses less memory than a single 1024x1024 DXT1 texture. And on the GPU, while you could send the initial position data to the shader in 16 bit precision, the actual math would need to be done at 32 bit precision or you’ll start to get PS1-like visual anomalies in the rendering, with vertices appearing to vibrate when the camera moves. Using 16 bit floats to calculate the screen space position would not have enough precision to render to a 1920x1080 render target without visible issues.

    Sure, normals and UVs can get away with less, which is why they are often limited to 16 bit on mobile. It’s also why AAA console devs make heavy use of data packing on interpolated data. AMD GPUs do the interpolation in the fragment shader, so if you have low enough access to the shaders you can output custom packed data from the vertex shader and get correctly interpolated data in the fragment shader. That’s not really plausible outside of consoles though.
     
    Ryiah likes this.
  40. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    I think a test would be needed here as FPS guns/hands tend to be fixed to the camera.

    And distant LODs would be range and resolution dependent, e.g. 1k, 2k, 4k resolutions would need different fp16 ranges to allow for pixel sizes. The final distant render mesh would be in 32 bit; it's the benefit of having lots of items in memory and moving between CPU and GPU with half the bandwidth that could provide a performance boost.

    CPU to GPU bandwidth is often a limiting factor in frame rates.
     
  41. neoshaman

    neoshaman

    Joined:
    Feb 11, 2011
    Posts:
    6,469
    Another fun thing would be to test the precision against the size of a pixel in screen space at the near plane, to find the minimal visible delta, i.e. size of the near plane / resolution. Using the FOV you can infer the precision needed at every distance; you would need twice the precision to account for Nyquist sampling, and probably a bit more due to rotation.
    Designing meshes, not just compressing them, to account for these resolutions would help too.
     
  42. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,321
    And where are negative exponents in this table?
    The precision should be much higher close to zero.
    I also see one millimeter precision as low.

    I feel that for the amount of geometry data used by animated guns/hands switching to fp16 will mostly waste time and produce no benefit.

    The format makes sense for rendertargets, because an offscreen HDR surface can gobble up a huge amount of memory. For example, an ARGB float offscreen surface for 4K will take 3840 * 2160 * 4 * 4 ==> 132,710,400 bytes. That's a lot. Switching to fp16 will cut that amount in half.
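    That arithmetic checks out; as a two-line sanity check:

```python
# 4K ARGB offscreen surface: width * height * channels * bytes per channel.
fp32_bytes = 3840 * 2160 * 4 * 4
fp16_bytes = 3840 * 2160 * 4 * 2
print(fp32_bytes, fp16_bytes)  # 132710400 66355200 (~126.6 MiB vs ~63.3 MiB)
```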
     
  43. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,236
    Almost never is this a factor in limiting frame rate. Hence why PCIe 4.0 GPUs (like the Radeon 5700 XT) see almost zero improvement in frame rate vs using PCIe 3.0, and is even sometimes slower (which basically just means the difference is almost purely within the margin of error).
    https://www.techpowerup.com/review/pci-express-4-0-performance-scaling-radeon-rx-5700-xt/24.html

    3DMark has benchmarks showing off PCIe 4.0 having huge improvements over PCIe 3.0, but they’re extremely contrived demos where a CPU is driving millions of objects while also re-uploading unnecessarily large images every frame to saturate the bandwidth. In reality most of the work would be done on the GPU, or with minor optimizations work over PCIe 3.0 just as well or better.


    The way Unity’s built in rendering paths & URP work, which is transform the object into world space, then transform that into clip space, would mean disaster when using 16 bit floats. Stepping a few meters away from the origin would immediately mean your FPS model turns into polygon soup.

    The HDRP’s camera centric space avoids the worst parts of this, as it transforms from object space to camera-relative world space rather than real world space. But that only avoids the worst problem. There still isn’t enough precision in 16 bit floats to render at 1920x1080 accurately even avoiding all other precision issues. Just using 16 bit clip space alone would be problematic. To avoid rendering artifacts, graphics APIs require hardware to have at least 8 bits of sub-pixel precision. At 1920x1080, 16 bit clip space would only provide about 2 bits of precision for the outer half of the range.

    Sub 1.0 precision isn’t on the chart, but it’s easy to calculate. Each range of 2^x to 2^(x+1) has a precision of (2^(x+1) - 2^x) / 2^(mantissa bits). Half precision has 10 bits of mantissa, and a possible exponent range of -14 to 15.

    So that means between the range of 0.25 to 0.5 it has a precision of 1/4096, or 0.00024414062. Between 0.125 and 0.25 is 1/8192, or 0.00012207031, etc. That goes all the way down to the range of 2^-13 to 2^-12. Once you get to 2^-14 floats change to “subnormal numbers”, which have the same precision as the next highest range, but they actually go from 0.0 to the next power (2^-13), rather than from 2^-14 to 2^-13. So from 0.0 to 2^-12 it has the same precision across that whole range.

    Generally speaking the precision of sub-1.0 values in any floating point format is more than enough for most use cases, even with 16 bit precision. It just happens that 3D transforms for rendering at modern resolutions are not one of them. I mean, once you get to 4K resolutions you’re getting near fixed integer precision for most of the screen (which is what the PS1 did), only it’s not aligned to the pixels, so the artifacts will potentially be even more bizarre.
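    A quick way to check these numbers (a sketch; the -14 clamp handles the subnormal range described above, and the function name is made up):

```python
import math

# Step size (ulp) between adjacent half-precision values near x:
# 2^(exponent - 10), since half precision has a 10-bit mantissa.
def half_ulp(x):
    if x == 0.0:
        return 2.0 ** -24                           # subnormal spacing: 2^-14 / 2^10
    e = max(math.floor(math.log2(abs(x))), -14)     # exponents below -14 are subnormal
    return 2.0 ** (e - 10)

# The 0.25..0.5 range quoted above: 1/4096.
step = half_ulp(0.3)  # 0.000244140625
```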
     
    Ryiah likes this.
  44. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,236
    For the lazy, that’s ~126.5 MB. A lot for old GPUs, not really a lot for today’s 6 GB+ GPUs.

    For reference against PCIe 3.0 x16, the usual connection today’s GPUs have, it has enough bandwidth to transfer 124 UHD ARGBFloat textures per second. Or one of those plus everything else usually needed for rendering most games at 120 fps without any issue. Most modern games really don’t stress the CPU to GPU connection all that much.
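    For anyone checking the arithmetic (the ~15.75 GB/s PCIe 3.0 x16 figure is the assumption here; MB vs MiB rounding is why you can land anywhere from ~119 to ~124 textures per second):

```python
# 4K ARGBFloat texture size, and how many fit through PCIe 3.0 x16 per second.
tex_bytes = 3840 * 2160 * 4 * 4        # 132,710,400 bytes (~126.5 MiB)
pcie3_x16 = 15.754e9                   # bytes/second, approximate peak
textures_per_second = pcie3_x16 / tex_bytes  # ~119
```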
     
    neoshaman and Ryiah like this.
  45. neginfinity

    neginfinity

    Joined:
    Jan 27, 2013
    Posts:
    13,321
    I was thinking about post-processing effects, deferred rendering and the like.

    For screen output the game would require (A)RGB buffer, Depth/Stencil buffer, and multiple offscreen buffers. Basically numbers go up along with resolution, and I do feel like it may be easy enough to hit bandwidth problems by going crazy with effects that rely on offscreen rendering.

    However, this will not be CPU to GPU bandwidth, but GPU bandwidth.
     
  46. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,236
    Oh yeah. For GPU bandwidth that size texture can become an issue, because the GPU has to do a lot in each frame. Consoles in the past have often used B10G11R11 or A2R10G10B10 for HDR to reduce the memory & bandwidth usage even further.

    An RTX 2080 Super's memory bandwidth is around 496 GB/s. 126.5 MB doesn't seem like a lot, but it would still take ~0.25ms to write that texture with a single full screen triangle. That's not the time it takes to render the triangle, just the time to copy the output of the shader into the framebuffer. 0.25 ms doesn't sound like a lot, but when you only have ~16 ms to render everything on screen (or less if aiming for higher than 60 fps) every little bit counts. And if you have a single particle system with 16 particles that you can get really close to, that's now 4 ms, or 1/4th of your entire rendering time ... just writing the data to memory.
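    A back-of-envelope version of that timing (the 496 GB/s figure is the assumption, and real sustained throughput is lower than peak):

```python
# Time to write one 4K ARGBFloat render target at ~496 GB/s of VRAM bandwidth.
tex_bytes = 3840 * 2160 * 4 * 4        # 132,710,400 bytes
vram_bw = 496e9                        # bytes/second (RTX 2080 Super, approx peak)
write_ms = tex_bytes / vram_bw * 1000  # ~0.27 ms per full-screen write
overdraw_ms = write_ms * 16            # 16 full-screen particles -> ~4.3 ms
```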
     
    neoshaman likes this.
  47. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    PCIe 3.0 x 16 has about 15,760 MB/s but in game developer terms that's only 15.76 MB a millisecond.

    Or 8 milliseconds to transfer your 4k ARGB screenshot giving you a max of 125 fps @ 4k.

    However we need to put this bandwidth into perspective (source)
    Code (CSharp):
    SSD             560 MB/s
    HDMI          2,250 MB/s
    M.2           3,500 MB/s
    PCIe16       15,800 MB/s
    RAM          25,600 MB/s
    VRAM         45,640 MB/s
    L2 Cache    175,000 MB/s
    PCIe 3.0 x 16 is a bottleneck when compared to RAM (0.61x) and VRAM (0.346x), running at roughly two-thirds and one-third of their respective speeds.

    Mind you PCIe 4.0 aims to be twice as fast, which would put it slightly ahead of the RAM figure above but still well below most VRAM.

    Even RAM and VRAM can be considered bottlenecks when compared to L2 cache on die bandwidths.
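    Plugging the table's figures in (a sketch; the dictionary just mirrors the table above):

```python
# Bandwidth figures from the table above, in MB/s.
bandwidth = {
    "SSD": 560, "HDMI": 2_250, "M.2": 3_500, "PCIe16": 15_800,
    "RAM": 25_600, "VRAM": 45_640, "L2 Cache": 175_000,
}

# PCIe 3.0 x16 relative to RAM and VRAM (the 0.61x / 0.346x figures quoted).
ram_ratio = bandwidth["PCIe16"] / bandwidth["RAM"]     # ~0.617
vram_ratio = bandwidth["PCIe16"] / bandwidth["VRAM"]   # ~0.346
```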
     
    Last edited: Mar 10, 2020
  48. bgolus

    bgolus

    Joined:
    Dec 7, 2012
    Posts:
    12,236
    Sure ... but modern games just don't send that much information to the GPU that often for it to be a limiting factor for frame rate. Most data is sent to the GPU once, either during startup or level load, or occasionally in smaller chunks for things like open world games with streaming content.

    The limiting factor for streamed content is still how long it takes to get data off of the drive, not how long it takes to send the data to the GPU across the PCIe connection. With PCIe 4.0 NVMe SSDs, those are limited to PCIe 4.0 x4, which is still only half of the bandwidth of the PCIe 3.0 x16 that GPUs currently use. Plus the Gen 4 SSDs are only just barely getting above PCIe 3.0 x4 speeds, and then only in very specific tests that don't reflect real world use cases. Like your example above, 560 MB/s is far, far off from PCIe 3.0 x16, and even that speed is pretty far off from what you'll see in more real world uses, even with PCIe 4.0 NVMe SSDs.

    Another way to think about it is PCIe 3.0 16x has enough bandwidth to update the positions of many millions of objects while still doing >120 fps, because really that's the only information the CPU is sending the GPU.
     
    MadeFromPolygons, Ryiah and neoshaman like this.
  49. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    Good point:

    If PCIe 3.0 x16 has 15,800 MB/s
    That's 15.8 MB a millisecond, or 15,800,000 bytes.
    A float3 transform is 12 bytes, so you can have 1,316,666 transforms per millisecond, or 8.3 times that per frame @ 120 fps.
    A float4 rotation is 16 bytes, so you could only have 564,285 transforms and rotations (28 bytes) per millisecond.

    But what about animation? In a AAA game, characters, weapons, vehicles, flora and water can weigh in at tens if not hundreds of thousands of vertices each, animated every frame.

    Let's say you have a 100,000 polygon budget per character, now you are down to 5.6 characters per millisecond or about 46.9 @ 120 fps.

    If each character has a weapon with a 50,000 polygon budget that's 3.7619 characters per millisecond or about 31.3479 characters @ 120 fps.

    Of course you can have lower polygon budgets and introduce LODding to gain performance in larger scenes.

    Anyway, think of character animation: most of the movement occurs within a < 1 m range per joint. If you could move to a half precision animation system you could double your animation bandwidth and potentially gain from the extra processing throughput on hardware that supports FP16.

    Even if your animation is pre-calculated or baked, FP16 halves its memory footprint, and most modern AMD/Intel CPUs have FP16 SIMD conversion instructions (F16C).
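    A minimal sketch of the storage-halving idea using Python's struct format 'e' (IEEE 754 half precision); the joint positions are made-up values within the ±1 m range mentioned above:

```python
import struct

def to_half(values):
    """Pack floats as IEEE 754 half precision (struct format 'e', 2 bytes each)."""
    return struct.pack(f'{len(values)}e', *values)

def from_half(data):
    return list(struct.unpack(f'{len(data) // 2}e', data))

# Hypothetical joint-local positions in metres, all within +/- 1 m.
pos32 = [0.9123, -0.4567, 0.00321]
packed = to_half(pos32)            # half the bytes of float32 storage
roundtrip = from_half(packed)

# Worst-case rounding error in [0.5, 1.0) is half an ulp: 2^-12 ~= 0.00024 m.
max_err = max(abs(a - b) for a, b in zip(pos32, roundtrip))
```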

    And this is the key point FP16 is growing because it has a big impact in performance for Machine Learning, something CPU and GPU manufacturers are chasing.

    Now if only we could figure out a game development use for FP8 we could see a 4x performance boost... wait a minute, if we run a game faster then things move less per frame, therefore we need less precision, so we can boost performance. :):cool:o_O
     
    Last edited: Mar 11, 2020
  50. Arowx

    Arowx

    Joined:
    Nov 12, 2009
    Posts:
    8,194
    PS Remind me: a quaternion's raw X Y Z W values are limited in range, so could FP16 be used?