Questions about packet compression (Huffman Stream)

Discussion in 'FPS.Sample Game' started by Kichang-Kim, Nov 1, 2018.

  1. Kichang-Kim

    Joined:
    Oct 19, 2010
    Posts:
    1,011
    Hi, I have been looking through FPSSample. Thanks for this great sample.

    I have some questions about packet compression, especially HuffmanInputStream / HuffmanOutputStream.

    My basic understanding of Huffman encoding is as follows (a rough code sketch is included after the list):
    1. Calculate the symbol probability distribution of the input data
    2. Construct a binary tree and generate a symbol-code table from it
    3. Encode the input data using the symbol-code table
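
    For concreteness, here is a minimal standalone C# sketch of those three steps. It is my own toy illustration, not the FPSSample code, and all identifiers in it are made up:

    ```csharp
    using System.Collections.Generic;
    using System.Linq;

    static class HuffmanSketch
    {
        class Node { public byte Symbol; public int Count; public Node Left, Right; }

        // Returns a symbol -> bit-string table built from the input's own statistics.
        public static Dictionary<byte, string> BuildCodeTable(byte[] input)
        {
            // 1. Symbol probability distribution (raw counts are enough here).
            var nodes = input.GroupBy(b => b)
                             .Select(g => new Node { Symbol = g.Key, Count = g.Count() })
                             .ToList();

            // 2. Build the binary tree by repeatedly merging the two rarest nodes.
            while (nodes.Count > 1)
            {
                nodes.Sort((a, b) => a.Count - b.Count);
                var merged = new Node { Count = nodes[0].Count + nodes[1].Count,
                                        Left = nodes[0], Right = nodes[1] };
                nodes.RemoveRange(0, 2);
                nodes.Add(merged);
            }

            // 3. Walk the tree to produce the symbol-code table.
            var table = new Dictionary<byte, string>();
            void Walk(Node n, string code)
            {
                if (n.Left == null) { table[n.Symbol] = code.Length > 0 ? code : "0"; return; }
                Walk(n.Left, code + "0");
                Walk(n.Right, code + "1");
            }
            if (nodes.Count == 1) Walk(nodes[0], "");
            return table;
        }
    }
    ```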

    And here are my questions:
    1. Is it true that NetworkCompressionModel uses a pre-defined symbol-code table? It seems there is no code for calculating the table from the actual data.
    2. What are the roles of the following?
    NetworkCompressionConstants.k_BucketOffsets
    NetworkCompressionConstants.k_BucketSizes
    "int context" argument of every read/write method

    Thanks.
     
  2. stubbesaurus

    Unity Technologies

    Joined:
    Oct 7, 2016
    Posts:
    3
    Hi,

    Things got a little hectic up to Unite, but we definitely want to revisit this code and document it.

    It seems your basic understanding of Huffman is correct. Symbols are coded using a variable number of bits based on their expected frequency, to minimize the expected number of bits that need to be sent.

    1. The default is to use a pre-defined symbol-code table, but there is functionality to generate a model from statistics. NetworkCompressionCapture is a helper class that you can pipe network traffic into and then call AnalyzeAndGenerateModel on to obtain a serialized Huffman model, which you can use to construct a NetworkCompressionModel specifically for that type of traffic.
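
    Roughly, that workflow fits together like this. The constructor arguments and return types below are my assumptions based on the description above, so check the NetworkCompression sources in the sample for the real signatures:

    ```csharp
    // Assumed shape of the capture -> model workflow described above.
    var capture = new NetworkCompressionCapture(/* context count / baseline model: see the sample */);

    // ... pipe representative network traffic into `capture` while the game runs ...

    byte[] modelData = capture.AnalyzeAndGenerateModel();      // serialized Huffman model
    var tunedModel = new NetworkCompressionModel(modelData);   // model tuned for that traffic
    ```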

    This functionality is also exposed through the console: 'beginnetworkprofile' starts a recording session, 'endnetworkprofile myfile' ends the recording session and stores the model to a file, and finally 'loadcompressionmodel myfile' can be used on the server to load and use a model. Clients connecting after the model has been loaded will receive and use it as they connect.

    2. When you send something like a 32-bit integer over the network, it is impractical to create a Huffman tree that encompasses every possible value the integer can take, as it would require a tree with 2^32 leaves! To make this more practical we lump values into a manageable number of power-of-two-sized buckets, Huffman-code only the bucket index, and code the position within the bucket using a number of raw bits corresponding to the size of the bucket. The buckets are arranged so they are small around 0 and become progressively larger as you move away from zero. The rationale is that since most data consists of deltas against predictions, we expect values to be small, and we expect most of the redundancy to be in the size of the error rather than in exactly which value of that size we end up hitting.

    NetworkCompressionConstants.k_BucketSizes specifies the sizes of the buckets in bits, so a bucket of n bits has 2^n values.
    NetworkCompressionConstants.k_BucketOffsets specifies the starting position of each bucket.

    Bucket n starts at k_BucketOffsets[n] and ends at k_BucketOffsets[n] + (1 << k_BucketSizes[n]).
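
    To make that concrete, here is a small illustrative C# sketch of the split. The tiny tables are made up for the example; the real ones live in NetworkCompressionConstants:

    ```csharp
    static class BucketSketch
    {
        // Made-up example tables: bucket n covers
        // [k_BucketOffsets[n], k_BucketOffsets[n] + (1 << k_BucketSizes[n])).
        static readonly int[]  k_BucketSizes   = { 0, 1, 2, 4, 8 };   // raw bits per bucket
        static readonly uint[] k_BucketOffsets = { 0, 1, 3, 7, 23 };  // start of each bucket

        // Split a value into the part that gets Huffman-coded (the bucket index)
        // and the part that is written as raw bits (the offset inside the bucket).
        public static void SplitValue(uint value, out int bucket, out uint rawBits)
        {
            bucket = k_BucketSizes.Length - 1;
            for (int i = 0; i < k_BucketSizes.Length; i++)
            {
                if (value < k_BucketOffsets[i] + (1u << k_BucketSizes[i]))
                {
                    bucket = i;   // first bucket whose range still contains the value
                    break;
                }
            }
            rawBits = value - k_BucketOffsets[bucket];   // written with k_BucketSizes[bucket] raw bits
        }
    }
    ```

    With these example tables, the value 10 falls into bucket 3 (range 7..22), so it is sent as the Huffman code for "bucket 3" plus 4 raw bits encoding the offset 3.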

    The context can be thought of as a submodel that has its own statistics and uses its own Huffman tree. The context used to read and write a specific value must always be the same. The benefit of using multiple contexts is that it allows you to separate the statistics of things that have different expected distributions, which leads to more precise statistics, which in turn yields better compression. For example, the FPS Sample assigns unique contexts to all fields of all property sheets. Using more contexts does, however, come at the marginal cost of a slightly larger model.
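
    To illustrate the symmetry rule, usage looks something like the sketch below. The packed read/write method names and the context constants are just illustrative here, not necessarily the exact FPSSample API:

    ```csharp
    const int k_ContextHealth   = 10;   // one context per field keeps its statistics separate
    const int k_ContextPosition = 11;

    void WriteState(HuffmanOutputStream output, uint health, uint posX)
    {
        output.WritePackedUInt(health, k_ContextHealth);
        output.WritePackedUInt(posX,   k_ContextPosition);
    }

    void ReadState(HuffmanInputStream input, out uint health, out uint posX)
    {
        // Must read in the same order, with the same contexts, as the write side.
        health = input.ReadPackedUInt(k_ContextHealth);
        posX   = input.ReadPackedUInt(k_ContextPosition);
    }
    ```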

    I hope this is helpful.
     
  3. Kichang-Kim

    Joined:
    Oct 19, 2010
    Posts:
    1,011
    Thanks for the detailed explanation!
     
  4. CPlusSharp22

    Joined:
    Dec 1, 2012
    Posts:
    111
    Hello @stubbesaurus

    Do you have more information on this? Especially on the AnalyzeAndGenerateModel details. It's largely undocumented and somewhat of an enigma to me. Is there a resource these models were derived from, like a paper? I'm familiar with Huffman encoding, but not with these buckets or the way contexts are used.

    In the FPS Sample, there's `maxFieldsPerSchema = 128`, which contributes to the `maxcontexts` of the whole compression model. If I increase this to 256, it creates an order-of-magnitude bigger max contexts and many more buckets. I'm curious what this is actually doing and how I can perhaps use the profiling functions to improve it.

    In short, the reason I'm doing this is that I've added the ability to use arrays in the fields of a schema. However, each entry of an array is technically its own "field", which means a 1:1 context-to-field ratio. I'm thinking this is the wrong direction to take, and I may end up condensing the arrays to use only one context.