
Parallel reduction

Discussion in 'Entity Component System' started by Pyromuffin, Jan 21, 2019.

  1. Pyromuffin

    Pyromuffin

    Joined:
    Aug 5, 2012
    Posts:
    85
    Hi, excuse me if this has been addressed elsewhere, but is it possible to do a parallel reduction using the job system?

    I see we have Interlocked.Add(), but that only works for integer types. For example, if I am trying to sum up a large list of floats using the job system, then it seems I would have to lock the native array element that I would be writing to. The lock statement only works on reference types, so I am unsure of what to do here.
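The closest workaround I can think of in plain C# is a compare-exchange loop (a sketch only; `AtomicFloat` is an illustrative name, and I have not verified whether this works under Burst):

```csharp
using System.Threading;

static class AtomicFloat
{
    // Sketch: emulates Interlocked.Add for floats with a compare-exchange loop.
    // Managed C# only; not verified under Burst.
    public static float Add(ref float target, float value)
    {
        float initial, computed;
        do
        {
            initial = target;
            computed = initial + value;
        }
        // Retry until no other thread changed target between our read and our write.
        while (Interlocked.CompareExchange(ref target, computed, initial) != initial);
        return computed;
    }
}
```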

    Thanks!
     
  2. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
Just do it in a single job thread. Honestly, Burst is so fast at this type of basic math that the overhead of scheduling parallel jobs is probably higher than just crunching it all in a single thread.
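Something like this (a sketch; the names are illustrative):

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Jobs;

// Sketch: a single-threaded, Burst-compiled sum over a NativeArray.
[BurstCompile]
struct SumJob : IJob
{
    [ReadOnly]
    public NativeArray<float> Values;

    // Length-1 array used to return the result from the job.
    public NativeArray<float> Result;

    public void Execute()
    {
        float sum = 0f;
        for (int i = 0; i < Values.Length; i++)
            sum += Values[i];
        Result[0] = sum;
    }
}
```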
     
  3. Pyromuffin

    Pyromuffin

    Joined:
    Aug 5, 2012
    Posts:
    85
Hi tertle,

The example is not exactly what I am doing. Really, I am calculating the volume of a dynamic mesh using the shoelace algorithm, so there is actually quite a lot of work to be done.
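Per triangle, the work looks roughly like this (a sketch of the signed-tetrahedron form; the identifiers are illustrative, not my actual code):

```csharp
using Unity.Mathematics;

static class MeshVolumeSketch
{
    // Sketch: accumulate the signed volume of the tetrahedron formed by
    // each triangle and the origin; the sum is the mesh volume for a
    // closed mesh (divergence theorem / 3D shoelace).
    public static float Volume(float3[] vertices, int[] triangles)
    {
        float volume = 0f;
        for (int i = 0; i < triangles.Length; i += 3)
        {
            float3 a = vertices[triangles[i]];
            float3 b = vertices[triangles[i + 1]];
            float3 c = vertices[triangles[i + 2]];
            volume += math.dot(a, math.cross(b, c)) / 6f;
        }
        return math.abs(volume);
    }
}
```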
     
  4. tertle

    tertle

    Joined:
    Jan 25, 2011
    Posts:
    3,761
Well, if you must do it from multiple threads, you could compute local sums first (a separate element per thread) and then do the final sum over that array in a single thread at the end. This is roughly how some of the NativeContainer concurrent methods work.
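A sketch of that pattern (illustrative names; [NativeSetThreadIndex] injects the worker thread's index so each thread writes only its own slot):

```csharp
using Unity.Burst;
using Unity.Collections;
using Unity.Collections.LowLevel.Unsafe;
using Unity.Jobs;

[BurstCompile]
struct PartialSumJob : IJobParallelFor
{
    [ReadOnly]
    public NativeArray<float> Values;

    // One accumulator per worker thread, indexed by the injected thread
    // index, so no two threads ever write the same element.
    [NativeDisableParallelForRestriction]
    public NativeArray<float> Partials;

    [NativeSetThreadIndex]
    int m_ThreadIndex;

    public void Execute(int i)
    {
        Partials[m_ThreadIndex] += Values[i];
    }
}
```

Allocate Partials with length JobsUtility.MaxJobThreadCount (cleared to zero), schedule, Complete(), then sum the handful of partials on the main thread or in a tiny single-threaded job.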

    But out of interest, have you benchmarked single threaded job performance?
     
  5. Pyromuffin

    Pyromuffin

    Joined:
    Aug 5, 2012
    Posts:
    85
Yes, performance is fine with 16-ish meshes, but when I scale up I start spending 20 ms in just this function.

I think I can probably do one mesh per thread and see if that works; I guess it's the same amount of work.

I'm coming from compute shaders, so I am used to dealing with much, much higher thread counts and poor single-threaded performance. I still think there should be a way to do some kind of cooperative parallel algorithm.
     
  6. jpvanoosten

    jpvanoosten

    Joined:
    Nov 20, 2009
    Posts:
    50
I know this post is old, but I was curious about the same thing and was looking for solutions. I didn't find any implementations, so I thought I'd attempt my own.

This is an implementation of a generic parallel reduce using Unity's job system and the Burst compiler. I've also done some performance testing: reducing (summing) 100k values takes 6.5 ms (average) in a single thread and 0.41 ms with this parallel reduce; summing 1M values takes 67 ms (average) single-threaded and 2.8 ms with the parallel reduce (Intel i7-8700K).

    Code (CSharp):
using Unity.Burst;
using Unity.Jobs;
using Unity.Collections;

/// <summary>
/// An interface that defines a binary operation.
/// The operation can be any binary operation (sum, difference, multiply, min, max, etc...).
/// </summary>
/// <typeparam name="T">The type of the values to perform the operation on.</typeparam>
public interface IBinaryOperator<T> where T : struct
{
    T Operator(T a, T b);
}

[BurstCompile(CompileSynchronously = true)]
public struct ParallelReduceJob<T, U> : IJobParallelForBatch
    where T : struct
    where U : struct, IBinaryOperator<T>
{
    // The step rate of the source array.
    public int Step;

    [ReadOnly]
    public NativeSlice<T> Src;

    [WriteOnly]
    public NativeSlice<T> Dst;

    /// <summary>
    /// The operation to perform on the values of the array.
    /// </summary>
    public U Operator;

    /// <summary>
    /// Serial reduction.
    /// </summary>
    /// <param name="src">The source array to reduce.</param>
    /// <param name="step">The stride between the elements to reduce.</param>
    /// <param name="op">The binary operation to apply.</param>
    /// <returns>The reduced value.</returns>
    public static T1 Reduce<T1, U1>(in NativeSlice<T1> src, int step, U1 op)
        where T1 : struct
        where U1 : struct, IBinaryOperator<T1>
    {
        T1 val = src[0];
        for (int i = step; i < src.Length; i += step)
        {
            val = op.Operator(val, src[i]);
        }
        return val;
    }

    public void Execute(int startIndex, int count)
    {
        Dst[startIndex] = Reduce(Src.Slice(startIndex, count), Step, Operator);
    }
}

public static class ParallelReduce
{
    /// <summary>
    /// Swap the arrays. This is very efficient for NativeArray as only the internal
    /// memory pointer is swapped, not the values of the array.
    /// </summary>
    /// <typeparam name="T">The value type of the arrays being swapped.</typeparam>
    /// <param name="a">The first array to swap with the second.</param>
    /// <param name="b">The second array to swap with the first.</param>
    private static void Swap<T>(ref NativeArray<T> a, ref NativeArray<T> b)
        where T : struct
    {
        var t = a;
        a = b;
        b = t;
    }

    /// <summary>
    /// Perform a parallel reduction on the elements of the array.
    /// </summary>
    /// <typeparam name="T">The type of the elements to be reduced.</typeparam>
    /// <typeparam name="U">The type of the binary operation to perform on each element of the array.</typeparam>
    /// <param name="values">The values to be reduced.</param>
    /// <param name="op">The operation to perform on the elements of the array.</param>
    /// <returns>The result of the reduction.</returns>
    public static T Reduce<T, U>(in NativeArray<T> values, U op)
        where T : struct
        where U : struct, IBinaryOperator<T>
    {
        // The number of values to reduce per thread batch.
        const int BATCH_SIZE = 1024;

        // The step rate for the reduction.
        // On the first iteration, this is every value of the source array.
        int stepRate = 1;

        // How many values to reduce in the current batch.
        int batchSize = BATCH_SIZE;

        JobHandle job = default;

        var src = new NativeArray<T>(values.Length, Allocator.TempJob, NativeArrayOptions.UninitializedMemory);
        var dst = new NativeArray<T>(values, Allocator.TempJob);

        while (stepRate < values.Length)
        {
            Swap(ref src, ref dst);

            job = new ParallelReduceJob<T, U>
            {
                Src = src,
                Dst = dst,
                Step = stepRate,
                Operator = op,
            }.ScheduleBatch(values.Length, batchSize, job);

            // Double the step rate and batch size for the next pass.
            stepRate = batchSize;
            batchSize *= 2;
        }

        job.Complete();

        T res = dst[0];

        src.Dispose();
        dst.Dispose();

        return res;
    }
}
    And a performance test case:

    Code (CSharp):
using NUnit.Framework;
using Unity.Collections;
using Unity.PerformanceTesting;

public class ParallelReduceTest
{
    struct SumInt : IBinaryOperator<int>
    {
        public int Operator(int a, int b)
        {
            return a + b;
        }
    }

    [Test, Performance]
    public void Test()
    {
        const int NUM_VALUES = 1000000;

        var a = new NativeArray<int>(NUM_VALUES, Allocator.TempJob, NativeArrayOptions.UninitializedMemory);
        for (int i = 0; i < NUM_VALUES; ++i)
        {
            a[i] = i;
        }

        int sum0 = 0;
        Measure.Method(() =>
        {
            sum0 = 0;
            for (int i = 0; i < NUM_VALUES; ++i)
            {
                sum0 += a[i];
            }
        }).SampleGroup($"Serial ({NUM_VALUES})").Run();

        int sum1 = 0;
        Measure.Method(() =>
        {
            sum1 = ParallelReduce.Reduce(a, new SumInt());
        }).SampleGroup($"Parallel ({NUM_VALUES})").Run();

        Assert.AreEqual(sum0, sum1);

        a.Dispose();
    }
}
     
  8. Chris-Herold

    Chris-Herold

    Joined:
    Nov 14, 2011
    Posts:
    116
A problem like this lends itself very well to being solved with compute shaders, especially with larger datasets.
     
  9. amarcolina

    amarcolina

    Joined:
    Jun 19, 2014
    Posts:
    65
I worked on a small utility a while back that might be of use here. It's pretty old, but it may at least serve as inspiration for writing a new version. It avoided interlocked operations entirely; instead, each job thread writes to its own local buffer, and the buffers get combined at the end when the result is read.

It's called NativeResult<T, Op>, which allows a parallel reduction of a set of T values using a defined Op. For example, you could have a NativeResult<int, Sum>, which lets you parallel-reduce a set of integers using addition. But it supports extension to any kind of reduce operation you might want, as long as it is commutative.
     