VectorTraits

English | Chinese(中文)

VectorTraits: SIMD Vector type traits methods (SIMD向量类型的特征方法).

This library provides many important arithmetic methods (e.g. Shift, Shuffle, NarrowSaturate) and constants for vector types, making it easier to write cross-platform SIMD code. It takes full advantage of the intrinsic functions of the X86 and Arm architectures to achieve hardware acceleration, and it benefits from inline compilation optimization.

Commonly Used Types:

  • Vectors: For vector types, common tool functions are provided, e.g. Create(T/T[]/Span/ReadOnlySpan), CreatePadding, CreateRotate, CreateByFunc, CreateByDouble ... It also provides traits methods for vectors, e.g. ShiftLeft, ShiftRightArithmetic, ShiftRightLogical, Shuffle ...
  • Vectors<T>: For vector types, constants are provided for various element types. e.g. Serial, SerialDesc, XyzwWMask, MantissaMask, MaxValue, MinValue, NormOne, FixedOne, E, Pi, Tau, VMaxByte, VReciprocalMaxSByte ...
  • Vector64s/Vector128s/Vector256s/Vector512s: Common tool functions and traits methods are provided for vectors of fixed bit width (Vector64/Vector128/Vector256/Vector512).
  • Vector64s<T>/Vector128s<T>/Vector256s<T>/Vector512s<T>: Provides constants of various element types for vectors of fixed bit width.
  • Scalars: For scalar types, various tool functions are provided. e.g. GetByDouble, GetFixedByDouble, GetByBits, GetBitsMask ...
  • Scalars<T>: For scalar types, a number of constants are provided. e.g. ExponentBits, MantissaBits, MantissaMask, MaxValue, MinValue, NormOne, FixedOne, E, Pi, Tau, VMaxByte, VReciprocalMaxSByte ...
  • VectorTextUtil: Provides text utility functions for vectors, e.g. GetHex, Format, WriteLine ...

Traits methods:

  • Support for .NET Standard 2.1 new vector methods: ConvertToDouble, ConvertToInt32, ConvertToInt64, ConvertToSingle, ConvertToUInt32, ConvertToUInt64, Narrow, Widen.
  • Support for .NET 5.0 new vector methods: Ceiling, Floor.
  • Support for .NET 6.0 new vector methods: Sum.
  • Support for .NET 7.0 new vector methods: ExtractMostSignificantBits, Shuffle, ShiftLeft, ShiftRightArithmetic, ShiftRightLogical.
  • Support for .NET 8.0 new vector methods: WidenLower, WidenUpper.
  • Provides vector methods for narrow saturate: YNarrowSaturate, YNarrowSaturateUnsigned.
  • Provides vector methods for rounding: YRoundToEven, YRoundToZero.
  • Provides vector methods for shuffle: YShuffleInsert, YShuffleKernel, YShuffleG2, YShuffleG4, YShuffleG4X2. Also provides the ShuffleControlG2/ShuffleControlG4 enums.
  • Provides vector methods for de-interleave: YGroup2Unzip, YGroup2UnzipEven, YGroup2UnzipOdd, YGroup3Unzip, YGroup3UnzipX2, YGroup4Unzip, YGroup6Unzip_Bit128.
  • Provides vector methods for interleave: YGroup2Zip, YGroup2ZipHigh, YGroup2ZipLow, YGroup3Zip, YGroup3ZipX2, YGroup4Zip, YGroup6Zip_Bit128.
  • ...
  • Full list: TraitsMethodList
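
To make "narrow saturate" concrete, here is a minimal sketch that uses only BCL methods (Vector.Min, Vector.Max, Vector.Narrow). Each Int16 element is clamped to the SByte range before narrowing, so out-of-range values saturate instead of wrapping around. The class and method names are hypothetical; this illustrates the semantics and is not how YNarrowSaturate is implemented.

```csharp
using System;
using System.Numerics;

public static class NarrowSaturateSketch {
    // Hypothetical helper: narrows two Int16 vectors into one SByte vector,
    // clamping each element to [-128, 127] first (saturation, not wrap-around).
    public static Vector<sbyte> NarrowSaturate(Vector<short> lower, Vector<short> upper) {
        Vector<short> min = new Vector<short>(sbyte.MinValue);
        Vector<short> max = new Vector<short>(sbyte.MaxValue);
        Vector<short> lo = Vector.Min(Vector.Max(lower, min), max);
        Vector<short> hi = Vector.Min(Vector.Max(upper, min), max);
        // After clamping, the truncating Narrow no longer loses information.
        return Vector.Narrow(lo, hi);
    }
}
```

For instance, narrowing a vector filled with 300 yields 127 in every lane, whereas the plain Vector.Narrow would truncate 300 (0x012C) to 44.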

Supported instruction sets:

  • X86 (requires .NET Core 3.0+)
    • 128-bit vector: Sse, Sse2, Sse3, Ssse3, Sse41, Sse42, and the 128-bit instructions of the Avx family.
    • 256-bit vector: Avx, Avx2, and the 256-bit instructions of Avx512VL.
    • 512-bit vector: Avx512BW, Avx512DQ, Avx512F, Avx512Vbmi.
  • Arm (requires .NET 5.0+)
    • 128-bit vector: AdvSimd.
  • Wasm (requires .NET 8.0+)
    • 128-bit vector: PackedSimd.

Purpose

SIMD instruction sets are known to accelerate multimedia processing (graphics, images, audio, video, ...), artificial intelligence, scientific computing, etc. However, traditional SIMD programming suffers from the following pain points.

  • Hard to port across platforms. Different CPU architectures provide different SIMD instruction sets; for example, the SIMD instruction sets of the X86 and Arm platforms differ in many ways. To port a program to another platform, you need to study that platform's SIMD instruction set manual and develop the code again.
  • Bit widths are difficult to upgrade. Even on a single platform, instruction sets with wider bit widths are added over time. For example, the X86 platform, in addition to the obsolete 64-bit MMX instructions, provides the 128-bit SSE instruction set and the 256-bit AVX instruction set, and some high-end processors are starting to support the 512-bit AVX-512 instruction set. Algorithms written with 128-bit SSE instructions have to be redeveloped to take full advantage of the 256-bit AVX instruction set.
  • Poor code readability and a high development threshold. Many modern C compilers expose intrinsic functions that map to SIMD instructions, which is much easier and more readable than writing assembly code. However, because the function names use obscure abbreviations, C does not support function name overloading, and the C language itself is complex, readability remains poor and development remains difficult.

.NET Core 1.0, released in 2016, added vector types such as Vector<T>, which largely solve the above pain points.

  • Easy cross-platform support. Programs on the .NET platform are run by a JIT (Just-In-Time) compiler. You write only one set of algorithms based on vector methods and compile it into only one program. When that program later runs on a different platform, the JIT compiles the vector methods into platform-specific SIMD instructions, thus taking full advantage of hardware acceleration.
  • Bit width is upgraded automatically. The length of the Vector<T> type is not fixed; it matches the longest vector register of the processor. Specifically, if the CPU supports the AVX instruction set (strictly, AVX2 and above), Vector<T> is 256 bits; if the CPU only supports the SSE instruction set (strictly, SSE2 and above), Vector<T> is 128 bits. Simply put, you can write your program using only the Vector<T> type, and when the program runs, the JIT automatically uses the widest SIMD instruction set.
  • The code is more readable, which lowers the development threshold. On the .NET platform, the method names of vector types are composed of complete English words and make full use of C# syntax features such as function name overloading, so the names are both concise and clear. Code readability is greatly improved.
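
As a small illustration of the automatic bit-width upgrade, the following sketch sums an int array with Vector<T> using only BCL APIs. The loop step is Vector<int>.Count, so the same code processes 4 elements per iteration on 128-bit hardware and 8 on 256-bit hardware. The class and method names are hypothetical.

```csharp
using System;
using System.Numerics;

public static class VectorSumSketch {
    // Sums an int array; the stride adapts to the register width at run time.
    public static int Sum(int[] data) {
        int width = Vector<int>.Count; // 4 for 128-bit vectors, 8 for 256-bit vectors.
        Vector<int> acc = Vector<int>.Zero;
        int i = 0;
        for (; i <= data.Length - width; i += width) {
            acc += new Vector<int>(data, i); // Load `width` elements, add lane-wise.
        }
        int total = 0;
        for (int j = 0; j < width; j++) total += acc[j]; // Horizontal sum of the lanes.
        for (; i < data.Length; i++) total += data[i];   // Scalar tail.
        return total;
    }
}
```

No source change is needed when the program moves from an SSE-only machine to an AVX2 machine; only the value of Vector<int>.Count differs.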

Although the vector type Vector<T> is well designed, it lacks many important vector functions such as Ceiling, Sum, Shift, and Shuffle, which makes many algorithms difficult to implement with vector types. New .NET versions sometimes add vector methods; .NET 7.0, released in 2022, added ShiftRightArithmetic, Shuffle, and others. However, vector methods are still scarce; for example, saturation processing is missing. To address this shortage, .NET Core 3.0 began to support intrinsic functions. They let developers use SIMD instruction sets directly, but this brings back the problems of cross-platform porting and bit-width upgrades. More intrinsic functions are added as the .NET platform evolves; for example, .NET 5.0 added intrinsic functions for the Arm platform. A library usually cannot target only .NET 7.0; it has to support multiple .NET versions, which means tedious version checking and conditional processing. And the highest version of .NET Standard (2.1) still does not support vector methods such as Ceiling, which makes version checking even more tedious.
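
The version checking mentioned above typically looks like the following sketch: a hypothetical helper that uses the BCL method where it exists and falls back to an equivalent vector operation elsewhere. The helper name and the multiply-based fallback are illustrative assumptions, not this library's implementation.

```csharp
using System;
using System.Numerics;

public static class ShiftCompatSketch {
    // Hypothetical helper: shifts each Int32 element left by n, where 0 <= n <= 31.
    public static Vector<int> ShiftLeft(Vector<int> value, int n) {
#if NET7_0_OR_GREATER
        return Vector.ShiftLeft(value, n); // BCL method, available since .NET 7.0.
#else
        // Older targets: multiplying by 2^n yields the same bits as a left shift.
        return value * new Vector<int>(1 << n);
#endif
    }
}
```

Every method that is missing on some target needs a branch like this, which is the bookkeeping a shared library can take over.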

This library is dedicated to solving these problems, so that you can write cross-platform SIMD algorithms more easily. Features:

  • Support for older .NET targets (.NET Standard 1.1, .NET Core 1.0, .NET Framework 4.5, ...). Programs on older .NET versions can use the latest vector functions, for example ShiftRightArithmetic and Shuffle, which are new in .NET 7.0.
  • Powerful functions. In addition to the vector methods of higher .NET versions, this library provides many useful vector methods modeled on intrinsic functions, e.g. ShiftLeft_Fast, YNarrowSaturate ...
  • High performance. This library takes full advantage of the intrinsic functions of the X86 and Arm architectures for hardware acceleration of vector computations, and it benefits from inline compilation optimization. It also supplies hardware-accelerated algorithms for BCL vector methods (e.g. Multiply, Shuffle) that are not hardware-accelerated on some platforms.
  • Software algorithms are also fast. When a vector method is not hardware-accelerated, the .NET BCL falls back to a software algorithm, but many of its software algorithms contain branching statements, so their performance is poor. The software algorithms of this library are highly optimized and branchless.
  • Easy to use. This library supports not only Vector<T>, but also Vector128<T>/Vector256<T> and other vector types. The names of the tool classes are easy to remember (Vectors/Vector64s/Vector128s/Vector256s), and many common vector constants are provided through generic classes of the same names.
  • Each traits method comes with properties that report information about it, e.g. _AcceleratedTypes, _FullAcceleratedTypes.

Tip: The Disassembly window in Visual Studio allows you to view the assembly code at runtime. For example, when running on a machine that supports the Avx instruction set, Vectors.ShiftLeft_Const is inlined and optimized to use the vpsllw instruction, and the constant value (1) is encoded as the immediate operand of the instruction. (Figure: Vectors.ShiftLeft_use_inline.png)

Example 2: Using Vectors.ShiftLeft_Args and Vectors.ShiftLeft_Core, you can move some of the operations outside the loop so that they are processed earlier. For example, when running on a machine that supports the Avx instruction set, xmm1 is set outside the loop and then used by the vpsllw instruction in the inner loop. The figure also shows that inline compilation optimization eliminates redundant xmm/ymm conversions. (Figure: Vectors.ShiftLeft_Core_use_inline.png)

Getting started

1) Install via NuGet

Either open the 'Package Manager Console' and enter the following command, or use the built-in GUI.

NuGet: PM> Install-Package VectorTraits

2) Usage examples

The static class Vectors provides some methods, e.g. CreateRotate, ShiftLeft, Shuffle. The generic structure Vectors<T> provides fields for commonly used constants.

The example code is in the samples/VectorTraits.Sample folder. The source code is as follows.

using System;
using System.IO;
using System.Numerics;
#if NETCOREAPP3_0_OR_GREATER
using System.Runtime.Intrinsics;
#endif
using Zyl.VectorTraits;

namespace Zyl.VectorTraits.Sample {
    class Program {
        private static readonly TextWriter writer = Console.Out;
        static void Main(string[] args) {
            writer.WriteLine("VectorTraits.Sample");
            writer.WriteLine();
            VectorTraitsGlobal.Init(); // Initialization.
            TraitsOutput.OutputEnvironment(writer); // Output environment info. It depends on `VectorTraits.InfoInc`. This line can be deleted when only VectorTraits is used.
            writer.WriteLine();

            // -- Start --
            Vector<short> src = Vectors.CreateRotate<short>(0, 1, 2, 3, 4, 5, 6, 7); // The `Vectors` class provides some methods. For example, `CreateRotate` fills the vector by cycling through the given values.
            VectorTextUtil.WriteLine(writer, "src:\t{0}", src); // Besides formatting the string, it also shows the hexadecimal value of each element on the right, which makes vector data easy to inspect.

            // ShiftLeft. It is a new vector method in `.NET 7.0`
            const int shiftAmount = 1;
            Vector<short> shifted = Vectors.ShiftLeft(src, shiftAmount); // shifted[i] = src[i] << shiftAmount.
            VectorTextUtil.WriteLine(writer, "ShiftLeft:\t{0}", shifted);
#if NET7_0_OR_GREATER
            // Compare BCL function .
            Vector<short> shiftedBCL = Vector.ShiftLeft(src, shiftAmount);
            VectorTextUtil.WriteLine(writer, "Equals to BCL ShiftLeft:\t{0}", shifted.Equals(shiftedBCL));
#endif
            // ShiftLeft_Const
            VectorTextUtil.WriteLine(writer, "Equals to ShiftLeft_Const:\t{0}", shifted.Equals(Vectors.ShiftLeft_Const(src, shiftAmount))); // If shiftAmount is a constant, you can also use the ShiftLeft_Const method of Vectors. It is faster in many scenarios.
            writer.WriteLine();

            // Shuffle. It is a new vector method in `.NET 7.0`
            Vector<short> desc = Vectors<short>.SerialDesc; // The generic structure `Vectors<T>` provides fields for commonly used constants. For example, `SerialDesc` holds the element indices in descending order.
            VectorTextUtil.WriteLine(writer, "desc:\t{0}", desc);
            Vector<short> dst = Vectors.Shuffle(shifted, desc); // dst[i] = shifted[desc[i]].
            VectorTextUtil.WriteLine(writer, "Shuffle:\t{0}", dst);
#if NET7_0_OR_GREATER
            // Compare BCL function.
            Vector<short> dstBCL = default; // Since `.NET 7.0`, Vector128/Vector256 provide the Shuffle method, but Vector does not provide it yet.
            if (Vector<short>.Count == Vector128<short>.Count) {
                dstBCL = Vector128.Shuffle(shifted.AsVector128(), desc.AsVector128()).AsVector();
            } else if (Vector<short>.Count == Vector256<short>.Count) {
                dstBCL = Vector256.Shuffle(shifted.AsVector256(), desc.AsVector256()).AsVector();
            }
            VectorTextUtil.WriteLine(writer, "Equals to BCL Shuffle:\t{0}", dst.Equals(dstBCL));
#endif
            // Shuffle_Args and Shuffle_Core
            Vectors.Shuffle_Args(desc, out var args0, out var args1); // Methods with the `Args` suffix compute the arguments: they transform the parameters in advance. Call them outside the loop.
            Vector<short> dst2 = Vectors.Shuffle_Core(shifted, args0, args1); // Methods with the `Core` suffix do the core calculation based on the cached arguments. Call them inside the loop to improve performance.
            VectorTextUtil.WriteLine(writer, "Equals to Shuffle_Core:\t{0}", dst.Equals(dst2));
            writer.WriteLine();

            // Show AcceleratedTypes.
            VectorTextUtil.WriteLine(writer, "ShiftLeft_AcceleratedTypes:\t{0}", Vectors.ShiftLeft_AcceleratedTypes);
            VectorTextUtil.WriteLine(writer, "Shuffle_AcceleratedTypes:\t{0}", Vectors.Shuffle_AcceleratedTypes);
        }
    }
}

3) Example results

.NET8.0 on X86

Program: VectorTraits.Sample

VectorTraits.Sample

IsRelease:	True
Environment.ProcessorCount:	16
Environment.Is64BitProcess:	True
Environment.OSVersion:	Microsoft Windows NT 10.0.22631.0
Environment.Version:	8.0.8
Stopwatch.Frequency:	10000000
RuntimeEnvironment.GetRuntimeDirectory:	C:\Program Files\dotnet\shared\Microsoft.NETCore.App\8.0.8\
RuntimeInformation.FrameworkDescription:	.NET 8.0.8
RuntimeInformation.OSArchitecture:	X64
RuntimeInformation.OSDescription:	Microsoft Windows 10.0.22631
RuntimeInformation.RuntimeIdentifier:	win-x64
IntPtr.Size:	8
BitConverter.IsLittleEndian:	True
Vector.IsHardwareAccelerated:	True
Vector<byte>.Count:	32	# 256bit
Vector<float>.Count:	8	# 256bit
Vector128.IsHardwareAccelerated:	True
Vector256.IsHardwareAccelerated:	True
Vector512.IsHardwareAccelerated:	True
Vector<T>.Assembly.CodeBase:	file:///C:/Program Files/dotnet/shared/Microsoft.NETCore.App/8.0.8/System.Private.CoreLib.dll
GetTargetFrameworkDisplayName(VectorTextUtil):	.NET 8.0
GetTargetFrameworkDisplayName(TraitsOutput):	.NET 8.0
VectorTraitsGlobal.InitCheckSum:	-2122844161	# 0x8177F7FF
VectorEnvironment.CpuModelName:	AMD Ryzen 7 7840H w/ Radeon 780M Graphics
VectorEnvironment.SupportedInstructionSets:	Aes, Avx, Avx2, Avx512BW, Avx512CD, Avx512DQ, Avx512F, Avx512Vbmi, Avx512VL, Bmi1, Bmi2, Fma, Lzcnt, Pclmulqdq, Popcnt, Sse, Sse2, Sse3, Ssse3, Sse41, Sse42, X86Base
Vector128s.Instance:	WVectorTraits128Avx2	// Sse, Sse2, Sse3, Ssse3, Sse41, Sse42, Avx, Avx2, Avx512VL
Vector256s.Instance:	WVectorTraits256Avx2	// Avx, Avx2, Sse, Sse2, Avx512VL
Vector512s.Instance:	WVectorTraits512Avx512	// Avx512BW, Avx512DQ, Avx512F, Avx512Vbmi, Avx, Avx2, Sse, Sse2
Vectors.Instance:	VectorTraits256Avx2	// Avx, Avx2, Sse, Sse2, Avx512VL
Vectors.BaseInstance:	VectorTraits256Base

src:    <0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7>        # (0000 0001 0002 0003 0004 0005 0006 0007 0000 0001 0002 0003 0004 0005 0006 0007)
ShiftLeft:      <0, 2, 4, 6, 8, 10, 12, 14, 0, 2, 4, 6, 8, 10, 12, 14>  # (0000 0002 0004 0006 0008 000A 000C 000E 0000 0002 0004 0006 0008 000A 000C 000E)
Equals to BCL ShiftLeft:        True
Equals to ShiftLeft_Const:      True

desc:   <15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0>  # (000F 000E 000D 000C 000B 000A 0009 0008 0007 0006 0005 0004 0003 0002 0001 0000)
Shuffle:        <14, 12, 10, 8, 6, 4, 2, 0, 14, 12, 10, 8, 6, 4, 2, 0>  # (000E 000C 000A 0008 0006 0004 0002 0000 000E 000C 000A 0008 0006 0004 0002 0000)
Equals to BCL Shuffle:  True
Equals to Shuffle_Core: True

ShiftLeft_AcceleratedTypes:     SByte, Byte, Int16, UInt16, Int32, UInt32, Int64, UInt64        # (00001FE0)
Shuffle_AcceleratedTypes:       SByte, Byte, Int16, UInt16, Int32, UInt32, Int64, UInt64, Single, Double        # (00007FE0)

Note: The text before Vectors.BaseInstance is the environment information output by TraitsOutput.OutputEnvironment. The text starting from src is the output of the main code of the example. Since the CPU supports the X86 Avx2 instruction set, Vector<byte>.Count is 32 (256bit), and Vectors.Instance is VectorTraits256Avx2.
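
The per-element formulas in the sample comments (shifted[i] = src[i] << shiftAmount and dst[i] = shifted[desc[i]]) can be checked against a plain scalar reference. The sketch below is such a reference with hypothetical names; returning 0 for an out-of-range index follows the documented behavior of the BCL Vector128.Shuffle and is an assumption as far as this library is concerned.

```csharp
using System;

public static class ScalarReferenceSketch {
    // dst[i] = src[i] << shiftAmount
    public static short[] ShiftLeft(short[] src, int shiftAmount) {
        short[] dst = new short[src.Length];
        for (int i = 0; i < src.Length; i++) {
            dst[i] = (short)(src[i] << shiftAmount);
        }
        return dst;
    }

    // dst[i] = src[indices[i]]; an out-of-range index yields 0.
    public static short[] Shuffle(short[] src, short[] indices) {
        short[] dst = new short[indices.Length];
        for (int i = 0; i < indices.Length; i++) {
            int idx = indices[i];
            dst[i] = ((uint)idx < (uint)src.Length) ? src[idx] : (short)0;
        }
        return dst;
    }
}
```

With src = 0..7 repeated twice, a shift of 1, and indices running from 15 down to 0, the reference reproduces the Shuffle row of the output above: 14, 12, 10, 8, 6, 4, 2, 0, repeated twice.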

.NET8.0 on Arm

Program: VectorTraits.Sample

VectorTraits.Sample

IsRelease:	True
Environment.ProcessorCount:	2
Environment.Is64BitProcess:	True
Environment.OSVersion:	Unix 6.8.0.1015
Environment.Version:	8.0.7
Stopwatch.Frequency:	1000000000
RuntimeEnvironment.GetRuntimeDirectory:	/home/ubuntu/.dotnet/shared/Microsoft.NETCore.App/8.0.7/
RuntimeInformation.FrameworkDescription:	.NET 8.0.7
RuntimeInformation.OSArchitecture:	Arm64
RuntimeInformation.OSDescription:	Ubuntu 22.04.2 LTS
RuntimeInformation.RuntimeIdentifier:	linux-arm64
IntPtr.Size:	8
BitConverter.IsLittleEndian:	True
Vector.IsHardwareAccelerated:	True
Vector<byte>.Count:	16	# 128bit
Vector<float>.Count:	4	# 128bit
Vector128.IsHardwareAccelerated:	True
Vector256.IsHardwareAccelerated:	False
Vector512.IsHardwareAccelerated:	False
Vector<T>.Assembly.CodeBase:	file:///home/ubuntu/.dotnet/shared/Microsoft.NETCore.App/8.0.7/System.Private.CoreLib.dll
GetTargetFrameworkDisplayName(VectorTextUtil):	.NET 8.0
GetTargetFrameworkDisplayName(TraitsOutput):	.NET 8.0
VectorTraitsGlobal.InitCheckSum:	-2122844159	# 0x8177F801
VectorEnvironment.CpuModelName:	Neoverse-N1
VectorEnvironment.CpuFlags:	fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
VectorEnvironment.SupportedInstructionSets:	AdvSimd, Aes, ArmBase, Crc32, Dp, Rdm, Sha1, Sha256
Vector128s.Instance:	WVectorTraits128AdvSimdB64	// AdvSimd
Vectors.Instance:	VectorTraits128AdvSimdB64	// AdvSimd
Vectors.BaseInstance:	VectorTraits128Base

src:	<0, 1, 2, 3, 4, 5, 6, 7>	# (0000 0001 0002 0003 0004 0005 0006 0007)
ShiftLeft:	<0, 2, 4, 6, 8, 10, 12, 14>	# (0000 0002 0004 0006 0008 000A 000C 000E)
Equals to BCL ShiftLeft:	True
Equals to ShiftLeft_Const:	True

desc:	<7, 6, 5, 4, 3, 2, 1, 0>	# (0007 0006 0005 0004 0003 0002 0001 0000)
Shuffle:	<14, 12, 10, 8, 6, 4, 2, 0>	# (000E 000C 000A 0008 0006 0004 0002 0000)
Equals to BCL Shuffle:	True
Equals to Shuffle_Core:	True

ShiftLeft_AcceleratedTypes:	SByte, Byte, Int16, UInt16, Int32, UInt32, Int64, UInt64	# (00001FE0)
Shuffle_AcceleratedTypes:	SByte, Byte, Int16, UInt16, Int32, UInt32, Int64, UInt64, Single, Double	# (00007FE0)

The computed values match the X86 ones; only the environment information and the vector width differ. Since the CPU supports Arm's AdvSimd instruction set, Vector<byte>.Count is 16 (128bit) and Vectors.Instance is VectorTraits128AdvSimdB64.

.NET Framework 4.5 on X86

Program: VectorTraits.Sample.NetFw.

VectorTraits.Sample

IsRelease:	True
Environment.ProcessorCount:	16
Environment.Is64BitProcess:	True
Environment.OSVersion:	Microsoft Windows NT 6.2.9200.0
Environment.Version:	4.0.30319.42000
Stopwatch.Frequency:	10000000
RuntimeEnvironment.GetRuntimeDirectory:	C:\Windows\Microsoft.NET\Framework64\v4.0.30319\
RuntimeInformation.FrameworkDescription:	.NET Framework 4.8.9277.0
RuntimeInformation.OSArchitecture:	X64
RuntimeInformation.OSDescription:	Microsoft Windows 10.0.22631 
IntPtr.Size:	8
BitConverter.IsLittleEndian:	True
Vector.IsHardwareAccelerated:	True
Vector<byte>.Count:	32	# 256bit
Vector<float>.Count:	8	# 256bit
Vector<T>.Assembly.CodeBase:	file:///E:/zylSelf/Code/cs/base/VectorTraits/tests/VectorTraits.Benchmarks.NetFw/bin/Release/System.Numerics.Vectors.DLL
GetTargetFrameworkDisplayName(VectorTextUtil):	.NET Standard 1.1
GetTargetFrameworkDisplayName(TraitsOutput):	.NET Framework 4.5
VectorTraitsGlobal.InitCheckSum:	-25396097	# 0xFE7C7C7F
VectorEnvironment.CpuModelName:	AMD Ryzen 7 7840H w/ Radeon 780M Graphics
VectorEnvironment.SupportedInstructionSets:	
Vectors.Instance:	VectorTraits256Base	// 
Vectors.BaseInstance:	VectorTraits256Base

src:    <0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7>        # (0000 0001 0002 0003 0004 0005 0006 0007 0000 0001 0002 0003 0004 0005 0006 0007)
ShiftLeft:      <0, 2, 4, 6, 8, 10, 12, 14, 0, 2, 4, 6, 8, 10, 12, 14>  # (0000 0002 0004 0006 0008 000A 000C 000E 0000 0002 0004 0006 0008 000A 000C 000E)
Equals to ShiftLeft_Const:      True

desc:   <15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0>  # (000F 000E 000D 000C 000B 000A 0009 0008 0007 0006 0005 0004 0003 0002 0001 0000)
Shuffle:        <14, 12, 10, 8, 6, 4, 2, 0, 14, 12, 10, 8, 6, 4, 2, 0>  # (000E 000C 000A 0008 0006 0004 0002 0000 000E 000C 000A 0008 0006 0004 0002 0000)
Equals to Shuffle_Core: True

ShiftLeft_AcceleratedTypes:     SByte, Byte, Int16, UInt16, Int32, UInt32       # (000007E0)
Shuffle_AcceleratedTypes:       None    # (00000000)

ShiftLeft/Shuffle of Vectors works fine. Since the CPU supports the X86 Avx2 instruction set, Vector<byte>.Count is 32 (256bit). Vectors.Instance is VectorTraits256Base. It is not VectorTraits256Avx2 because intrinsic functions were not supported until .NET Core 3.0. The value of ShiftLeft_AcceleratedTypes contains types such as Int16, which means that ShiftLeft is hardware-accelerated when using these types. The library makes clever use of vector algorithms to achieve hardware acceleration even without intrinsic functions.
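
The hexadecimal numbers printed after the type lists (000007E0, 00001FE0, 00007FE0) look consistent with one bit per System.TypeCode value: TypeCode.SByte is 5, and 0x7E0 has exactly bits 5 to 10 set. This is an inference from the output above, not something the text states, and the decoder below is a hypothetical sketch built on that assumption.

```csharp
using System;
using System.Collections.Generic;

public static class TypeFlagsSketch {
    // Assumption: bit k of the mask corresponds to (TypeCode)k.
    public static string Decode(int mask) {
        var names = new List<string>();
        for (int bit = 0; bit < 32; bit++) {
            if ((mask & (1 << bit)) != 0) {
                names.Add(((TypeCode)bit).ToString());
            }
        }
        return names.Count == 0 ? "None" : string.Join(", ", names);
    }
}
```

Under that assumption, Decode(0x7E0) returns "SByte, Byte, Int16, UInt16, Int32, UInt32", which matches the ShiftLeft_AcceleratedTypes line of the .NET Framework run.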

Results of benchmark

Unit of data: Million operations per second. The larger the number, the better the performance.

ShiftLeft

ShiftLeft: Shifts each element of a vector left by the specified amount. It is a new vector method in .NET 7.0.

ShiftLeft - X86 - AMD Ryzen 7 7840H

| Type | Method | .NET Framework | .NET Core 2.1 | .NET Core 3.1 | .NET 5.0 | .NET 6.0 | .NET 7.0 | .NET 8.0 |
| :--- | :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Byte | SumSLLScalar | 1062.046 | 1025.936 | 1287.865 | 1265.446 | 1445.575 | 1416.712 | 1693.330 |
| Byte | SumSLLNetBcl | | | | | | 1344.738 | 1109.752 |
| Byte | SumSLLNetBcl_Const | | | | | | 1281.901 | 1164.382 |
| Byte | SumSLLTraits | 11312.499 | 10715.920 | 28897.868 | 28611.234 | 28219.205 | 34068.741 | 57456.802 |
| Byte | SumSLLTraits_Core | 55791.675 | 52165.732 | 53563.421 | 68653.359 | 59916.622 | 67868.291 | 74889.177 |
| Byte | SumSLLConstTraits | 13408.916 | 12604.412 | 38925.388 | 57842.081 | 57095.294 | 62012.692 | 62729.225 |
| Byte | SumSLLConstTraits_Core | 56843.523 | 55673.528 | 53642.484 | 62674.397 | 65797.708 | 50869.840 | 73873.979 |
| Int16 | SumSLLScalar | 1081.716 | 999.767 | 1261.475 | 1198.111 | 1218.767 | 1365.754 | 1547.294 |
| Int16 | SumSLLNetBcl | | | | | | 32011.646 | 34816.284 |
| Int16 | SumSLLNetBcl_Const | | | | | | 39975.924 | 37368.541 |
| Int16 | SumSLLTraits | 6752.349 | 6185.968 | 25221.856 | 26382.708 | 27125.955 | 32617.944 | 36448.716 |
| Int16 | SumSLLTraits_Core | 34727.283 | 31457.238 | 31800.310 | 32231.553 | 35687.996 | 37750.305 | 30731.745 |
| Int16 | SumSLLConstTraits | 6037.367 | 6498.819 | 27783.526 | 37605.559 | 40699.914 | 39598.663 | 36242.630 |
| Int16 | SumSLLConstTraits_Core | 37678.435 | 34784.616 | 32625.543 | 33694.338 | 40019.325 | 39380.404 | 36914.775 |
| Int32 | SumSLLScalar | 1369.140 | 1315.852 | 1514.690 | 1521.516 | 2284.670 | 2484.407 | 2409.358 |
| Int32 | SumSLLNetBcl | | | | | | 17373.567 | 15954.004 |
| Int32 | SumSLLNetBcl_Const | | | | | | 17967.080 | 15983.409 |
| Int32 | SumSLLTraits | 3762.374 | 3511.433 | 13343.304 | 12906.293 | 12661.423 | 17279.760 | 15886.410 |
| Int32 | SumSLLTraits_Core | 17324.275 | 15468.381 | 14587.937 | 17407.823 | 17886.651 | 18052.162 | 14126.571 |
| Int32 | SumSLLConstTraits | 3910.600 | 3724.412 | 12646.545 | 15290.340 | 17745.992 | 17829.078 | 15991.615 |
| Int32 | SumSLLConstTraits_Core | 16235.154 | 14216.598 | 15282.565 | 16088.400 | 17940.330 | 15961.166 | 16378.506 |
| Int64 | SumSLLScalar | 1394.719 | 1281.156 | 1517.938 | 1441.160 | 2270.521 | 2508.577 | 2421.558 |
| Int64 | SumSLLNetBcl | | | | | | 7528.184 | 8530.835 |
| Int64 | SumSLLNetBcl_Const | | | | | | 8743.504 | 8471.981 |
| Int64 | SumSLLTraits | 483.430 | 494.335 | 6677.544 | 6570.711 | 6635.070 | 6891.705 | 7469.236 |
| Int64 | SumSLLTraits_Core | 479.761 | 488.827 | 7758.515 | 8525.784 | 8596.290 | 8267.855 | 7879.060 |
| Int64 | SumSLLConstTraits | 509.585 | 525.195 | 7036.223 | 6787.101 | 8246.601 | 8254.880 | 8526.022 |
| Int64 | SumSLLConstTraits_Core | 512.652 | 528.381 | 8229.954 | 8747.125 | 8711.523 | 8871.948 | 8647.339 |

Description.

  • SumSLLScalar: Use the scalar algorithm.
  • SumSLLNetBcl: Use the BCL method (Vector.ShiftLeft) with variable arguments. Note that this method is only available in .NET 7.0 and later.
  • SumSLLNetBcl_Const: Use the BCL method (Vector.ShiftLeft) with constant arguments. Note that this method is only available in .NET 7.0 and later.
  • SumSLLTraits: Use this library's normal method (Vectors.ShiftLeft) with variable arguments.
  • SumSLLTraits_Core: Use this library's Core suffixed methods (Vectors.ShiftLeft_Args, Vectors.ShiftLeft_Core) with variable arguments.
  • SumSLLConstTraits: Use this library's Const suffixed method (Vectors.ShiftLeft_Const) with constant arguments.
  • SumSLLConstTraits_Core: Use this library's ConstCore suffixed methods (Vectors.ShiftLeft_Args, Vectors.ShiftLeft_ConstCore) with constant arguments.

When the BCL method (Vector.ShiftLeft) runs on the X86 platform, only Int16/Int32/Int64 are hardware-accelerated, while Byte is not. This is probably because the Avx2 instruction set only has left shift instructions for 16 to 64 bit elements and none for other element sizes, so the BCL falls back to a software algorithm. For such types, this library substitutes efficient algorithms built from combinations of other instructions. For example, for the Byte type, SumSLLConstTraits_Core in .NET 8.0 has the value 73873.979, which is 73873.979/1693.330≈43.6264 times the performance of the scalar algorithm and 73873.979/1164.382≈63.4448 times the performance of the BCL method. Because X86 intrinsic functions have only been available since .NET Core 3.0, hardware acceleration for the Int64 type is only available from .NET Core 3.0 onward.

For ShiftLeft, performance is generally better when shiftAmount is a constant than when it is a variable. This is true for both the BCL and this library's methods. The Core suffixed methods of this library optimize performance by moving some operations out of the loop so that they are processed earlier. When the CPU provides instructions with constant parameters (the technical term is "immediate parameters"), those instructions generally perform better, so the library also provides ConstCore suffixed methods, which select the fastest instruction for the platform. Performance sometimes fluctuates because of "CPU Turbo Boost", "other processes taking CPU resources", etc. But rest assured: inspecting the assembly instructions of the Release build at runtime shows that it is already using the best hardware instructions. An example of this is the following figure.

Vectors.ShiftLeft_Core_use_inline.png
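
To illustrate both ideas from this section (building a Byte shift from other operations, and hoisting loop-invariant work out of the loop), here is a minimal sketch based on BCL methods only. It shifts 16-bit lanes and then masks off the bits that leaked in from the neighbouring byte, which is one well-known way to emulate a Byte shift; whether this library uses the same approach is an assumption. All names are hypothetical, shiftAmount must be in 0..7, and Vector.ShiftLeft requires .NET 7.0 or later.

```csharp
using System;
using System.Numerics;

public static class ByteShiftSketch {
    // "Args" step (loop-invariant): builds the mask that clears the low bits
    // which a 16-bit shift carries over from the neighbouring byte.
    public static Vector<byte> ShiftLeftArgs(int shiftAmount) {
        return new Vector<byte>((byte)(0xFF << shiftAmount));
    }

    // "Core" step (per iteration): shifts 16-bit lanes, then masks each byte.
    public static Vector<byte> ShiftLeftCore(Vector<byte> value, int shiftAmount, Vector<byte> mask) {
        Vector<ushort> wide = Vector.ShiftLeft(Vector.AsVectorUInt16(value), shiftAmount);
        return Vector.BitwiseAnd(Vector.AsVectorByte(wide), mask);
    }

    // Shifts every element in place; the mask is computed once, outside the loop.
    public static void ShiftLeftInPlace(byte[] data, int shiftAmount) {
        Vector<byte> mask = ShiftLeftArgs(shiftAmount);
        int width = Vector<byte>.Count;
        int i = 0;
        for (; i <= data.Length - width; i += width) {
            ShiftLeftCore(new Vector<byte>(data, i), shiftAmount, mask).CopyTo(data, i);
        }
        for (; i < data.Length; i++) {
            data[i] = (byte)(data[i] << shiftAmount); // Scalar tail.
        }
    }
}
```

The split mirrors the Args/Core naming described above: the mask depends only on shiftAmount, so it is built once, and the loop body is left with one shift and one AND.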

ShiftLeft - Arm - AWS Arm t4g.small

| Type | Method | .NET Core 3.1 | .NET 5.0 | .NET 6.0 | .NET 7.0 | .NET 8.0 |
| :--- | :--- | ---: | ---: | ---: | ---: | ---: |
| Byte | SumSLLScalar | 606.721 | 607.751 | 674.256 | 890.878 | 1238.814 |
| Byte | SumSLLNetBcl | | | | 19585.982 | 19831.927 |
| Byte | SumSLLNetBcl_Const | | | | 19564.840 | 19840.232 |
| Byte | SumSLLTraits | 5541.532 | 13075.259 | 13190.705 | 13209.927 | 19844.497 |
| Byte | SumSLLTraits_Core | 14048.511 | 16947.485 | 15828.571 | 19589.430 | 19841.525 |
| Byte | SumSLLConstTraits | 9734.870 | 15699.315 | 15853.772 | 19511.952 | 19811.385 |
| Byte | SumSLLConstTraits_Core | 13007.028 | 16817.247 | 15838.060 | 19422.222 | 19839.627 |
| Int16 | SumSLLScalar | 606.135 | 603.800 | 605.734 | 820.880 | 1031.035 |
| Int16 | SumSLLNetBcl | | | | 9943.220 | 9803.495 |
| Int16 | SumSLLNetBcl_Const | | | | 9937.639 | 9837.136 |
| Int16 | SumSLLTraits | 4215.369 | 6547.514 | 6558.299 | 9923.088 | 9839.256 |
| Int16 | SumSLLTraits_Core | 7918.688 | 8431.934 | 7892.235 | 9939.469 | 9839.496 |
| Int16 | SumSLLConstTraits | 6568.606 | 7829.860 | 7887.842 | 9925.988 | 9839.534 |
| Int16 | SumSLLConstTraits_Core | 8494.550 | 8416.796 | 7902.444 | 9914.384 | 9823.608 |
| Int32 | SumSLLScalar | 747.656 | 746.013 | 749.108 | 1406.122 | 1410.137 |
| Int32 | SumSLLNetBcl | | | | 4926.651 | 4826.909 |
| Int32 | SumSLLNetBcl_Const | | | | 4917.732 | 4840.232 |
| Int32 | SumSLLTraits | 3293.943 | 3269.129 | 3278.303 | 4925.488 | 4836.941 |
| Int32 | SumSLLTraits_Core | 4210.811 | 3930.619 | 3927.408 | 4923.867 | 4844.083 |
| Int32 | SumSLLConstTraits | 3275.986 | 3249.809 | 3923.176 | 4926.463 | 4846.238 |
| Int32 | SumSLLConstTraits_Core | 4205.245 | 4199.155 | 4156.634 | 4925.448 | 4844.679 |
| Int64 | SumSLLScalar | 739.137 | 729.158 | 741.673 | 1372.480 | 1296.655 |
| Int64 | SumSLLNetBcl | | | | 2477.025 | 2264.032 |
| Int64 | SumSLLNetBcl_Const | | | | 2473.102 | 2251.272 |
| Int64 | SumSLLTraits | 486.734 | 1638.835 | 1636.233 | 1985.596 | 2285.512 |
| Int64 | SumSLLTraits_Core | 489.554 | 2075.273 | 1967.902 | 2474.105 | 2289.521 |
| Int64 | SumSLLConstTraits | 467.393 | 1930.821 | 1968.798 | 2471.124 | 2308.745 |
| Int64 | SumSLLConstTraits_Core | 466.293 | 2074.656 | 1968.834 | 2476.602 | 2281.018 |

Description.

  • SumSLLScalar: Use the scalar algorithm.
  • SumSLLNetBcl: Use the BCL method (Vector.ShiftLeft) with variable arguments. Note that this method is only available in .NET 7.0 and later.
  • SumSLLNetBcl_Const: Use the BCL method (Vector.ShiftLeft) with constant arguments. Note that this method is only available in .NET 7.0 and later.
  • SumSLLTraits: Use this library's normal method (Vectors.ShiftLeft) with variable arguments.
  • SumSLLTraits_Core: Use this library's Core suffixed methods (Vectors.ShiftLeft_Args, Vectors.ShiftLeft_Core) with variable arguments.
  • SumSLLConstTraits: Use this library's Const suffixed method (Vectors.ShiftLeft_Const) with constant arguments.
  • SumSLLConstTraits_Core: Use this library's ConstCore suffixed methods (Vectors.ShiftLeft_Args, Vectors.ShiftLeft_ConstCore) with constant arguments.

On the Arm platform, the BCL method (Vector.ShiftLeft) is hardware-accelerated for integer types: the AdvSimd instruction set provides dedicated left shift instructions for 8 to 64 bit integers. This library uses the same instructions when running on the Arm platform, so the performance is close. Because Arm intrinsic functions have only been available since .NET 5.0, hardware acceleration for the Int64 type is only available from .NET 5.0 onward.

ShiftRightArithmetic

ShiftRightArithmetic: Shifts each element of a vector right by the specified amount, filling the vacated bits with the sign bit (signed shift). It is a new vector method in .NET 7.0.
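
The difference between the arithmetic and the logical right shift is that the former copies the sign bit into the vacated positions. The sketch below rebuilds an Int64 arithmetic shift from a logical shift, an XOR, and a subtraction, which is a classic branchless sign-extension trick. It is shown only to illustrate the semantics; the names are hypothetical, it is not claimed to be this library's algorithm, and Vector.ShiftRightLogical requires .NET 7.0 or later.

```csharp
using System;
using System.Numerics;

public static class ShiftRightArithmeticSketch {
    // Emulates an arithmetic right shift of Int64 lanes, for 0 <= shiftAmount <= 63.
    public static Vector<long> Emulate(Vector<long> value, int shiftAmount) {
        // The logical shift fills the vacated high bits with zeros.
        Vector<long> logical = Vector.ShiftRightLogical(value, shiftAmount);
        // After the shift, the original sign bit sits at position 63 - shiftAmount.
        Vector<long> signBit = new Vector<long>(1L << (63 - shiftAmount));
        // XOR then subtract propagates that bit into all higher positions.
        return (logical ^ signBit) - signBit;
    }
}
```

For -8 shifted right by 1, the logical shift alone gives a large positive number, while the emulation gives -4, the same as the scalar expression -8L >> 1.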

ShiftRightArithmetic - X86 - AMD Ryzen 7 7840H

Type Method .NET Framework .NET Core 2.1 .NET Core 3.1 .NET 5.0 .NET 6.0 .NET 7.0 .NET 8.0
Int16 SumSRAScalar 1085.176 1043.731 1227.822 1215.729 1209.230 1310.645 1397.378
Int16 SumSRANetBcl 31888.645 35102.079
Int16 SumSRANetBcl_Const 39751.018 36630.458
Int16 SumSRATraits 1829.405 1861.938 25643.096 26584.675 26634.093 31578.602 37184.123
Int16 SumSRATraits_Core 1837.663 1874.262 33248.481 36967.972 36890.508 37648.798 37673.670
Int16 SumSRAConstTraits 1836.653 1880.351 28724.613 36985.528 39429.041 32925.588 37356.009
Int16 SumSRAConstTraits_Core 1830.444 1879.354 33935.625 37498.165 38127.794 33120.549 35752.947
Int32 SumSRAScalar 1362.876 1321.507 1508.831 1508.378 2226.648 2555.622 2327.611
Int32 SumSRANetBcl - - - - - 16806.958 15967.982
Int32 SumSRANetBcl_Const - - - - - 18365.861 16092.208
Int32 SumSRATraits 883.925 895.137 12901.507 12508.762 11931.480 17609.103 16282.512
Int32 SumSRATraits_Core 919.507 931.419 15956.786 15252.829 17412.025 18296.493 16230.128
Int32 SumSRAConstTraits 911.750 942.523 13450.043 17314.816 14198.095 16799.445 16393.351
Int32 SumSRAConstTraits_Core 917.228 938.789 15344.136 15470.629 17084.816 18274.411 16054.229
Int32 SumSRAFastTraits 915.754 946.521 13266.168 15337.171 14562.129 17003.224 16124.004
Int64 SumSRAScalar 1393.540 1331.963 1532.719 1544.306 1513.245 1801.859 2560.284
Int64 SumSRANetBcl - - - - - 524.702 8652.579
Int64 SumSRANetBcl_Const - - - - - 557.152 8870.207
Int64 SumSRATraits 482.604 490.804 4949.328 4970.328 4932.277 4902.239 7541.726
Int64 SumSRATraits_Core 509.432 521.769 5941.547 6050.322 6104.433 6043.337 8537.297
Int64 SumSRAConstTraits 510.778 529.298 5526.893 5360.460 5834.075 6217.509 7562.071
Int64 SumSRAConstTraits_Core 509.597 531.344 5899.752 5978.398 6049.756 6171.211 7720.979
SByte SumSRAScalar 997.067 974.147 1278.049 1350.082 1227.788 1328.380 1387.993
SByte SumSRANetBcl - - - - - 1135.177 1113.944
SByte SumSRANetBcl_Const - - - - - 1165.780 1061.118
SByte SumSRATraits 3635.592 3696.780 24686.302 22906.323 22437.129 24879.962 44225.353
SByte SumSRATraits_Core 3652.670 3743.427 41915.608 45147.925 45375.300 46792.941 45642.076
SByte SumSRAConstTraits 3651.109 3753.761 29819.076 42019.515 43095.169 44048.300 47091.982
SByte SumSRAConstTraits_Core 3662.694 3753.270 39588.701 46397.665 47507.648 43046.477 46878.753

Description.

  • SumSRAScalar: Use scalar algorithm.
  • SumSRANetBcl: Use the BCL method (Vector.ShiftRightArithmetic) with variable arguments. Note that this method is available only in .NET 7.0 and later.
  • SumSRANetBcl_Const: Use the BCL method (Vector.ShiftRightArithmetic) with constant arguments. Note that this method is available only in .NET 7.0 and later.
  • SumSRATraits: Use this library's normal method (Vectors.ShiftRightArithmetic) with variable arguments.
  • SumSRATraits_Core: Use this library's Core suffixed methods (Vectors.ShiftRightArithmetic_Args, Vectors.ShiftRightArithmetic_Core) with variable arguments.
  • SumSRAConstTraits: Use this library's Const suffixed method (Vectors.ShiftRightArithmetic_Const) with constant arguments.
  • SumSRAConstTraits_Core: Use this library's ConstCore suffixed methods (Vectors.ShiftRightArithmetic_Args, Vectors.ShiftRightArithmetic_ConstCore) with constant arguments.

The BCL method (Vector.ShiftRightArithmetic) is hardware accelerated on X86 platforms only for Int16/Int32, not for SByte/Int64. This is probably because the Avx2 instruction set only has arithmetic right shift instructions for 16 to 32 bit integers; the Avx512 instruction set adds one for 64 bit integers. For the types that lack a dedicated instruction, this library uses efficient algorithms built from combinations of other instructions. This library's hardware acceleration is available from .NET Core 3.0 onward.
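To make "a combination of other instructions" concrete, here is a minimal scalar sketch of one well-known way to build a 64 bit arithmetic right shift from a logical shift, an XOR and a subtraction. It is only an illustration; the helper name is hypothetical and this is not necessarily the algorithm this library uses.

```csharp
using System;

static class SraSketch
{
    // Hypothetical helper (not part of VectorTraits): arithmetic right shift
    // of a 64-bit value using only a logical shift, an XOR and a subtraction.
    public static long ShiftRightArithmeticViaLogical(long value, int shiftAmount)
    {
        unchecked
        {
            int n = shiftAmount & 63;            // keep the shift amount in 0..63
            ulong logical = (ulong)value >> n;   // zero-filling (logical) shift
            ulong signBit = 1UL << (63 - n);     // position the original sign bit moved to
            // XOR followed by subtraction sign-extends that bit into the upper bits.
            return (long)((logical ^ signBit) - signBit);
        }
    }
}
```

The same three operations exist as element-wise vector instructions, which is why this kind of substitution can still be fast when no dedicated instruction is available.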

ShiftRightArithmetic - Arm - AWS Arm t4g.small

Type Method .NET Core 3.1 .NET 5.0 .NET 6.0 .NET 7.0 .NET 8.0
Int16 SumSRAScalar 604.429 602.027 606.297 818.740 830.302
Int16 SumSRANetBcl - - - 9941.412 9837.372
Int16 SumSRANetBcl_Const - - - 9931.397 9838.530
Int16 SumSRATraits 1713.818 5611.316 4949.502 9932.269 9837.893
Int16 SumSRATraits_Core 1928.197 7881.850 8435.043 9930.918 9707.757
Int16 SumSRAConstTraits 1936.057 7776.346 8432.064 9926.348 9834.469
Int16 SumSRAConstTraits_Core 1895.291 7825.036 8426.085 9923.414 9834.395
Int32 SumSRAScalar 745.287 749.467 747.486 1181.651 1244.019
Int32 SumSRANetBcl - - - 4929.438 4848.848
Int32 SumSRANetBcl_Const - - - 4937.824 4854.964
Int32 SumSRATraits 859.173 2815.113 2819.116 4937.562 4813.108
Int32 SumSRATraits_Core 945.694 3917.314 3916.943 4933.939 4787.843
Int32 SumSRAConstTraits 967.576 3904.750 4188.713 4901.680 4849.051
Int32 SumSRAConstTraits_Core 947.955 3906.471 4192.951 4908.354 4853.184
Int64 SumSRAScalar 738.902 734.754 741.343 1185.217 1243.954
Int64 SumSRANetBcl - - - 2474.620 2433.159
Int64 SumSRANetBcl_Const - - - 2478.519 2438.677
Int64 SumSRATraits 467.838 1233.506 1233.401 1418.970 2424.896
Int64 SumSRATraits_Core 468.470 1952.967 1971.453 2478.229 2424.819
Int64 SumSRAConstTraits 467.182 1939.969 1970.321 2474.340 2413.790
Int64 SumSRAConstTraits_Core 468.634 2095.352 2102.958 2474.473 2432.455
SByte SumSRAScalar 608.671 609.771 652.251 889.935 830.400
SByte SumSRANetBcl - - - 19779.972 19615.987
SByte SumSRANetBcl_Const - - - 19803.799 19613.758
SByte SumSRATraits 3482.537 11212.340 9894.245 11352.199 19512.654
SByte SumSRATraits_Core 3857.464 16756.195 15733.712 19816.163 19419.454
SByte SumSRAConstTraits 3905.027 15518.199 15732.344 19791.972 19617.529
SByte SumSRAConstTraits_Core 3796.018 16708.142 16787.090 19791.891 19619.300

Description.

  • SumSRAScalar: Use scalar algorithm.
  • SumSRANetBcl: Use the BCL method (Vector.ShiftRightArithmetic) with variable arguments. Note that this method is available only in .NET 7.0 and later.
  • SumSRANetBcl_Const: Use the BCL method (Vector.ShiftRightArithmetic) with constant arguments. Note that this method is available only in .NET 7.0 and later.
  • SumSRATraits: Use this library's normal method (Vectors.ShiftRightArithmetic) with variable arguments.
  • SumSRATraits_Core: Use this library's Core suffixed methods (Vectors.ShiftRightArithmetic_Args, Vectors.ShiftRightArithmetic_Core) with variable arguments.
  • SumSRAConstTraits: Use this library's Const suffixed method (Vectors.ShiftRightArithmetic_Const) with constant arguments.
  • SumSRAConstTraits_Core: Use this library's ConstCore suffixed methods (Vectors.ShiftRightArithmetic_Args, Vectors.ShiftRightArithmetic_ConstCore) with constant arguments.

The BCL method (Vector.ShiftRightArithmetic) is hardware accelerated for integer types on Arm platforms. The AdvSimd instruction set provides dedicated instructions for arithmetic right shifting of 8 to 64 bit integers. This library uses the same instructions on Arm, so the performance is similar. This library's hardware acceleration on Arm is available from .NET 5.0 onward.

Shuffle

Shuffle: Shuffle and clear. Creates a new vector by selecting values from an input vector using a set of indices. It is a new vector method in .NET 7.0. Since .NET 7.0, the Shuffle method has been provided in Vector128/Vector256, but the Shuffle method has not yet been provided in Vector.

Shuffle allows an index to exceed the valid range, and sets the corresponding element to 0 when it does. This check costs some performance, so this library also provides the YShuffleKernel method (shuffle only). If you can guarantee that every index is within the valid range, YShuffleKernel is faster.
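The difference between the two behaviours can be sketched with plain arrays. The helper names below are hypothetical scalar stand-ins, not the library's vector methods; the kernel variant simply omits the range check and relies on the caller to supply valid indices.

```csharp
using System;

static class ShuffleSketch
{
    // Shuffle semantics: an out-of-range index produces 0.
    public static short[] ShuffleReference(short[] source, short[] indices)
    {
        var result = new short[indices.Length];
        for (int i = 0; i < indices.Length; i++)
        {
            int index = indices[i];
            // The unsigned comparison rejects negative and too-large indices in one test.
            result[i] = (uint)index < (uint)source.Length ? source[index] : (short)0;
        }
        return result;
    }

    // Kernel semantics (assumed for this sketch): no range check at all;
    // the caller must guarantee that every index is valid.
    public static short[] ShuffleKernelReference(short[] source, short[] indices)
    {
        var result = new short[indices.Length];
        for (int i = 0; i < indices.Length; i++)
        {
            result[i] = source[indices[i]];
        }
        return result;
    }
}
```

For a source of { 10, 20, 30, 40 }, the indices { 3, 0, 9, -1 } give { 40, 10, 0, 0 } with the checked version, while the kernel version is only meaningful for indices such as { 3, 0, 1, 2 }.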

Shuffle - X86 - AMD Ryzen 7 7840H

Type Method .NET Framework .NET Core 2.1 .NET Core 3.1 .NET 5.0 .NET 6.0 .NET 7.0 .NET 8.0
Int16 SumScalar 1236.944 1263.908 1214.484 1278.657 1195.188 1408.179 1235.365
Int16 Sum256_Bcl - - - - - 1074.656 938.447
Int16 Sum512_Bcl - - - - - - 918.911
Int16 SumTraits 1221.046 1255.341 8067.493 10943.134 10421.696 14194.280 32579.746
Int16 SumTraits_Args0 1278.650 1211.361 22661.648 25363.988 24123.555 26722.243 34671.910
Int16 SumTraits_Args 1255.109 1154.801 22911.649 26138.766 24804.170 26585.684 33172.777
Int16 SumKernelTraits 1269.733 1192.079 8698.117 12377.326 11972.407 17610.477 35632.301
Int16 SumKernelTraits_Args0 1297.765 1199.697 23028.564 25852.122 25176.482 24261.582 36741.022
Int16 SumKernelTraits_Args 1270.852 1142.885 23265.595 25960.405 21744.418 23156.078 37227.607
Int32 SumScalar 850.057 829.782 816.013 859.672 817.223 853.140 837.720
Int32 Sum256_Bcl - - - - - 755.314 770.558
Int32 Sum512_Bcl - - - - - - 930.330
Int32 SumTraits 821.394 844.388 10852.534 10832.760 10943.342 12695.692 15067.794
Int32 SumTraits_Args0 864.447 818.042 12704.591 15953.127 15574.554 14391.785 15559.766
Int32 SumTraits_Args 810.166 762.183 12531.310 14746.991 14125.335 13524.193 15368.528
Int32 SumKernelTraits 825.747 841.229 14515.308 14407.190 14545.131 16276.648 15999.993
Int32 SumKernelTraits_Args0 856.015 814.055 14754.810 14880.916 17262.390 14319.199 16261.174
Int32 SumKernelTraits_Args 806.479 765.218 15073.768 14604.621 16999.007 16367.119 16422.220
Int64 SumScalar 425.474 430.216 457.179 497.203 465.105 432.348 425.921
Int64 Sum256_Bcl - - - - - 506.686 515.520
Int64 Sum512_Bcl - - - - - - 688.892
Int64 SumTraits 474.906 431.296 3789.327 4192.951 4280.568 4155.819 8171.028
Int64 SumTraits_Args0 423.703 461.664 6979.885 7855.241 8501.271 7846.303 8198.449
Int64 SumTraits_Args 446.260 420.925 6704.874 8599.441 8317.550 7312.362 8378.340
Int64 SumKernelTraits 473.823 426.081 4854.793 5862.440 5735.074 5938.699 8560.856
Int64 SumKernelTraits_Args0 424.508 458.248 7804.575 8108.408 9181.086 8364.106 8701.155
Int64 SumKernelTraits_Args 446.097 428.538 8386.279 9239.331 9198.798 8344.952 8673.715
SByte SumScalar 1496.783 1403.348 1448.660 1239.277 1468.827 1415.139 1213.582
SByte Sum256_Bcl - - - - - 901.114 1022.223
SByte Sum512_Bcl - - - - - - 989.131
SByte SumTraits 1476.771 1494.144 17086.314 24231.464 24097.622 30243.434 60885.250
SByte SumTraits_Args0 1392.158 1331.083 45038.802 50540.409 49090.081 46979.783 60672.985
SByte SumTraits_Args 1389.074 1295.641 46794.997 51069.265 50078.249 46518.750 65261.554
SByte SumKernelTraits 1476.637 1242.198 27650.933 32894.218 32711.664 39630.939 72350.167
SByte SumKernelTraits_Args0 1523.543 1440.011 44451.891 49973.813 51540.236 48754.502 72615.251
SByte SumKernelTraits_Args 1395.106 1274.943 41001.996 50067.099 49654.805 45904.504 71412.964

Description.

  • SumScalar: Use the scalar algorithm.
  • Sum256_Bcl: Use BCL's 256-bit vector methods (Vector256.Shuffle).
  • Sum512_Bcl: Use BCL's 512-bit vector methods (Vector512.Shuffle).
  • SumTraits: Use the normal methods of this library (Vectors.Shuffle).
  • SumTraits_Args0: Use this library's Core suffixed methods (Vectors.Shuffle_Args, Vectors.Shuffle_Core), without ValueTuple, using the "out" keyword to return multiple values.
  • SumTraits_Args: Use this library's Core suffixed methods (Vectors.Shuffle_Args, Vectors.Shuffle_Core), using ValueTuple.
  • SumKernelTraits: Use the normal methods of this library's YShuffleKernel (Vectors.YShuffleKernel).
  • SumKernelTraits_Args0: Use the Core suffixed methods of this library's YShuffleKernel (Vectors.YShuffleKernel_Args, Vectors.YShuffleKernel_Core), without ValueTuple, using the "out" keyword to return multiple values.
  • SumKernelTraits_Args: Use the Core suffixed methods of this library's YShuffleKernel (Vectors.YShuffleKernel_Args, Vectors.YShuffleKernel_Core), using ValueTuple.

BCL's methods (Vector256.Shuffle, Vector512.Shuffle) run on X86 platforms without hardware acceleration for all of these number types. This library instead uses efficient algorithms implemented by combinations of other instructions, with hardware acceleration available from .NET Core 3.0 onward. The Core suffixed methods improve performance by moving some operations out of the loop so that they are processed ahead of time; the gain is especially large for Shuffle. If you can ensure that every index is within the valid range, YShuffleKernel can be used instead of Shuffle and is faster. For the Args suffixed methods, besides returning multiple values through "out" parameters, a ValueTuple can receive them, which simplifies the code. Be aware, however, that ValueTuple can sometimes reduce performance.
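The Args/Core split described above can be illustrated with a small scalar analogue. Everything in this sketch (class, method and parameter names) is hypothetical and only mirrors the pattern: the loop-invariant work on the indices is done once, either through "out" parameters or a ValueTuple, and only the cheap per-block step stays inside the loop.

```csharp
using System;

static class ArgsCoreSketch
{
    // Hypothetical "_Args" step using "out" parameters: derive loop-invariant
    // data from the indices once.
    public static void ShuffleArgs(int[] indices, int length, out int[] safeIndices, out int[] keepMask)
    {
        safeIndices = new int[indices.Length];
        keepMask = new int[indices.Length];
        for (int i = 0; i < indices.Length; i++)
        {
            bool inRange = (uint)indices[i] < (uint)length;
            safeIndices[i] = inRange ? indices[i] : 0; // any valid slot; masked out below
            keepMask[i] = inRange ? -1 : 0;            // all ones keeps the value, zero clears it
        }
    }

    // The same step returning a ValueTuple instead of "out" parameters.
    public static (int[] safeIndices, int[] keepMask) ShuffleArgs(int[] indices, int length)
    {
        ShuffleArgs(indices, length, out int[] safeIndices, out int[] keepMask);
        return (safeIndices, keepMask);
    }

    // Hypothetical "_Core" step: the per-block work that remains inside the loop.
    public static void ShuffleCore(int[] block, int[] safeIndices, int[] keepMask, int[] destination)
    {
        for (int i = 0; i < safeIndices.Length; i++)
        {
            destination[i] = block[safeIndices[i]] & keepMask[i];
        }
    }

    // Usage: the Args step runs once, the Core step runs for every block.
    public static int[][] ShuffleBlocks(int[][] blocks, int[] indices, int length)
    {
        var (safeIndices, keepMask) = ShuffleArgs(indices, length);
        var results = new int[blocks.Length][];
        for (int b = 0; b < blocks.Length; b++)
        {
            results[b] = new int[indices.Length];
            ShuffleCore(blocks[b], safeIndices, keepMask, results[b]);
        }
        return results;
    }
}
```

Whether the tuple or the "out" form is faster depends on the runtime, which is what the Args0 and Args rows in the tables compare.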

Shuffle - Arm - AWS Arm t4g.small

Type Method .NET Core 3.1 .NET 5.0 .NET 6.0 .NET 7.0 .NET 8.0
Int16 SumScalar 427.276 421.887 421.454 526.589 516.294
Int16 Sum128_Bcl - - - 482.907 468.383
Int16 SumTraits 428.281 4922.876 5555.655 5864.193 9711.569
Int16 SumTraits_Args0 428.928 7902.420 8416.624 9925.441 9709.555
Int16 SumTraits_Args 405.537 2809.483 2798.925 9880.804 9707.490
Int16 SumKernelTraits 427.637 5650.913 6540.446 7957.175 9833.813
Int16 SumKernelTraits_Args0 427.578 7897.224 7891.894 9929.863 9819.774
Int16 SumKernelTraits_Args 405.223 2811.195 2797.170 9861.330 9829.822
Int32 SumScalar 286.900 281.167 281.838 317.876 309.427
Int32 Sum128_Bcl - - - 304.320 301.222
Int32 SumTraits 286.596 2311.209 2472.592 2917.343 4801.979
Int32 SumTraits_Args0 288.066 4185.430 3928.604 4934.590 4821.784
Int32 SumTraits_Args 270.249 1396.323 1401.742 4886.669 4806.886
Int32 SumKernelTraits 287.386 2677.394 3247.692 3953.573 4846.437
Int32 SumKernelTraits_Args0 286.724 3919.619 4182.617 4930.469 4852.808
Int32 SumKernelTraits_Args 270.724 1399.968 1395.953 4899.359 4853.093
Int64 SumScalar 448.592 440.758 444.884 552.061 534.531
Int64 Sum128_Bcl - - - 708.356 692.663
Int64 SumTraits 190.913 1005.614 1064.650 1255.025 2448.365
Int64 SumTraits_Args0 426.809 2090.887 2100.527 2479.821 2451.574
Int64 SumTraits_Args 179.534 698.013 699.200 2457.898 2451.414
Int64 SumKernelTraits 448.065 1237.258 1412.876 1753.457 2434.096
Int64 SumKernelTraits_Args0 449.857 2101.411 1967.152 2469.054 2443.626
Int64 SumKernelTraits_Args 345.877 701.805 698.753 2456.761 2451.680
SByte SumScalar 665.739 664.224 658.168 834.224 803.566
SByte Sum128_Bcl - - - 647.757 610.244
SByte SumTraits 680.590 13176.730 16739.161 19723.567 19531.685
SByte SumTraits_Args0 660.595 15704.393 15724.340 19723.852 19530.241
SByte SumTraits_Args 637.568 5597.644 5602.803 19605.289 19527.338
SByte SumKernelTraits 672.784 15604.597 16732.629 19692.571 19533.892
SByte SumKernelTraits_Args0 675.236 16718.959 15715.512 19729.144 19534.508
SByte SumKernelTraits_Args 642.795 5573.999 5598.168 19588.655 19538.006

Description.

  • SumScalar: Use the scalar algorithm.
  • Sum128_Bcl: Use BCL methods (Vector128.Shuffle).
  • SumTraits: Use the normal methods of this library (Vectors.Shuffle).
  • SumTraits_Args0: Use this library's Core suffixed methods (Vectors.Shuffle_Args, Vectors.Shuffle_Core), without ValueTuple, using the "out" keyword to return multiple values.
  • SumTraits_Args: Use this library's Core suffixed methods (Vectors.Shuffle_Args, Vectors.Shuffle_Core), using ValueTuple.
  • SumKernelTraits: Use the normal methods of this library's YShuffleKernel (Vectors.YShuffleKernel).
  • SumKernelTraits_Args0: Use the Core suffixed methods of this library's YShuffleKernel (Vectors.YShuffleKernel_Args, Vectors.YShuffleKernel_Core), without ValueTuple, using the "out" keyword to return multiple values.
  • SumKernelTraits_Args: Use the Core suffixed methods of this library's YShuffleKernel (Vectors.YShuffleKernel_Args, Vectors.YShuffleKernel_Core), using ValueTuple.

BCL's method (Vector128.Shuffle) runs on the Arm platform without hardware acceleration for all of these number types. This library instead uses efficient algorithms implemented by combinations of other instructions, with hardware acceleration available from .NET 5.0 onward. Note that prior to .NET 7.0, SumTraits_Args was sometimes much slower than SumTraits_Args0, because ValueTuple incurs a large performance loss on Arm.

YNarrowSaturate

YNarrowSaturate: Narrows two Vector instances into one Vector with saturation.
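As a rough scalar reference for what saturating narrowing means, the following sketch clamps 16 bit values into the SByte range with Math.Min/Math.Max and packs two inputs into one output. The helper names are hypothetical, and the element order (lower input first) is an assumption that mirrors how Vector.Narrow is documented.

```csharp
using System;

static class NarrowSaturateSketch
{
    // Clamp each 16-bit value into the SByte range, then store the narrowed
    // "lower" elements first and the narrowed "upper" elements after them.
    public static sbyte[] NarrowSaturate(short[] lower, short[] upper)
    {
        var result = new sbyte[lower.Length + upper.Length];
        for (int i = 0; i < lower.Length; i++)
        {
            result[i] = Saturate(lower[i]);
        }
        for (int i = 0; i < upper.Length; i++)
        {
            result[lower.Length + i] = Saturate(upper[i]);
        }
        return result;
    }

    // Values outside [-128, 127] are pinned to the nearest bound instead of wrapping.
    static sbyte Saturate(short value)
    {
        return (sbyte)Math.Max(sbyte.MinValue, Math.Min(sbyte.MaxValue, value));
    }
}
```

A plain narrowing cast would wrap 1000 around to -24, whereas the saturating version pins it to 127.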

YNarrowSaturate - X86 - AMD Ryzen 7 7840H

Type Method .NET Framework .NET Core 2.1 .NET Core 3.1 .NET 5.0 .NET 6.0 .NET 7.0 .NET 8.0
Int16 SumNarrow_If 208.976 197.924 195.466 200.430 197.261 205.623 221.224
Int16 SumNarrow_MinMax 200.034 201.184 197.505 208.715 199.736 222.635 208.102
Int16 SumNarrowVectorBase 21160.119 19565.035 19063.346 19960.925 19532.398 19258.689 24197.090
Int16 SumNarrowVectorTraits 20477.038 18251.731 44050.630 45196.128 43674.654 44677.389 47325.429
Int32 SumNarrow_If 211.070 218.235 225.479 211.761 207.353 223.740 232.860
Int32 SumNarrow_MinMax 221.396 206.735 214.815 214.341 211.238 210.944 223.415
Int32 SumNarrowVectorBase 9753.258 9549.313 9743.042 9519.188 9577.993 10513.071 12059.829
Int32 SumNarrowVectorTraits 9117.869 9253.891 20503.088 20225.447 19198.947 19012.815 19398.087
Int64 SumNarrow_If 207.654 206.920 215.020 207.405 207.239 220.198 227.592
Int64 SumNarrow_MinMax 205.724 201.036 203.815 200.292 213.422 213.819 231.741
Int64 SumNarrowVectorBase 2951.264 2720.663 2835.882 2949.423 2915.473 4372.612 5917.536
Int64 SumNarrowVectorTraits 2941.336 2696.543 4690.391 4875.851 4917.149 3808.744 9411.507
UInt16 SumNarrow_If 1263.960 1205.876 1247.409 1184.537 1124.520 1175.733 1387.128
UInt16 SumNarrow_MinMax 1363.298 1283.027 1336.103 1178.860 1344.978 761.908 1487.848
UInt16 SumNarrowVectorBase 25617.831 25358.182 25019.795 25056.656 26527.170 25337.769 30941.796
UInt16 SumNarrowVectorTraits 24795.433 24950.279 33163.801 41303.846 40678.067 29966.481 45560.104
UInt32 SumNarrow_If 1446.297 1396.148 1364.953 1339.805 1382.470 1240.158 1507.078
UInt32 SumNarrow_MinMax 1461.884 1346.542 1363.853 1376.390 1373.016 960.104 1383.498
UInt32 SumNarrowVectorBase 12509.780 11160.711 11971.259 11511.978 11080.158 11897.237 15997.508
UInt32 SumNarrowVectorTraits 12962.030 11581.014 14895.009 16343.372 17051.602 14727.107 19760.603
UInt64 SumNarrow_If 1003.570 1326.642 913.881 912.071 878.848 1312.352 1874.180
UInt64 SumNarrow_MinMax 1455.402 1404.391 1392.157 891.629 902.245 937.792 895.795
UInt64 SumNarrowVectorBase 3340.377 3102.954 3033.044 3449.113 3649.422 5104.550 7693.314
UInt64 SumNarrowVectorTraits 3306.018 3050.492 4497.385 5401.914 5969.621 4527.588 9530.757

Description.

  • SumNarrow_If: Use scalar algorithm based on if statements.
  • SumNarrow_MinMax: Use scalar algorithm based on the Min/Max methods of the Math class.
  • SumNarrowVectorBase: Use this library's base method (VectorTraitsBase.Statics.YNarrowSaturate). It is implemented by combining vector methods using BCL, and can take advantage of hardware acceleration.
  • SumNarrowVectorTraits: Use this library's traits method (Vectors.YNarrowSaturate). It is implemented as an intrinsic function, allowing for better hardware acceleration.

For 16 to 32 bit integers, SumNarrowVectorTraits is much faster than SumNarrowVectorBase from .NET Core 3.1 onward, because X86 provides specialized instructions. For 64 bit integers (Int64/UInt64), X86 does not provide an equivalent instruction. However, the SumNarrowVectorTraits version uses a better algorithm built on intrinsic functions, so it still outperforms SumNarrowVectorBase in many cases.

YNarrowSaturate - Arm - AWS Arm t4g.small

Type Method .NET Core 3.1 .NET 5.0 .NET 6.0 .NET 7.0 .NET 8.0
Int16 SumNarrow_If 157.270 154.692 157.383 181.610 193.265
Int16 SumNarrow_MinMax 160.909 165.733 108.425 184.240 189.973
Int16 SumNarrowVectorBase 6100.275 6193.938 6308.118 7201.735 8261.974
Int16 SumNarrowVectorTraits 6102.238 13460.358 13445.824 15514.261 13674.647
Int32 SumNarrow_If 163.854 165.352 165.160 190.240 213.807
Int32 SumNarrow_MinMax 154.976 162.019 161.884 195.349 194.881
Int32 SumNarrowVectorBase 3047.923 3268.933 3253.378 3532.128 4034.752
Int32 SumNarrowVectorTraits 3125.498 6121.553 6162.533 7914.641 6782.358
Int64 SumNarrow_If 161.788 160.690 161.656 203.670 190.163
Int64 SumNarrow_MinMax 160.836 157.655 164.693 194.496 201.793
Int64 SumNarrowVectorBase 728.629 1157.104 1139.372 1231.877 1326.584
Int64 SumNarrowVectorTraits 727.603 3114.720 3307.205 4088.677 3409.341
UInt16 SumNarrow_If 527.761 515.076 531.818 608.056 832.441
UInt16 SumNarrow_MinMax 573.087 525.410 576.628 608.744 893.594
UInt16 SumNarrowVectorBase 8361.120 8439.577 7945.486 8853.731 11829.808
UInt16 SumNarrowVectorTraits 8307.680 13106.613 14179.297 13964.213 16532.648
UInt32 SumNarrow_If 537.550 534.718 539.467 620.874 989.646
UInt32 SumNarrow_MinMax 539.997 537.029 545.333 620.923 827.472
UInt32 SumNarrowVectorBase 4099.703 4021.154 3963.463 4356.804 5896.924
UInt32 SumNarrowVectorTraits 4024.310 6340.994 6977.151 6619.009 7993.300
UInt64 SumNarrow_If 619.788 621.120 620.256 827.649 995.113
UInt64 SumNarrow_MinMax 619.494 620.151 620.119 818.259 994.695
UInt64 SumNarrowVectorBase 1229.723 1821.232 1848.632 1805.499 2169.309
UInt64 SumNarrowVectorTraits 1228.911 3489.303 3526.548 3480.212 4100.727

Description.

  • SumNarrow_If: Use scalar algorithm based on if statements.
  • SumNarrow_MinMax: Use scalar algorithm based on the Min/Max methods of the Math class.
  • SumNarrowVectorBase: Use this library's base method (VectorTraitsBase.Statics.YNarrowSaturate). It is implemented by combining vector methods using BCL, and can take advantage of hardware acceleration.
  • SumNarrowVectorTraits: Use this library's traits method (Vectors.YNarrowSaturate). It is implemented as an intrinsic function, allowing for better hardware acceleration.

Arm intrinsic functions have been provided since .NET 5.0. Therefore, starting from .NET 5.0, SumNarrowVectorTraits is much faster than SumNarrowVectorBase.

YGroup3Unzip

YGroup3Unzip: De-interleave 3-element groups into 3 vectors. It converts 3-element groups from AoS (array of structures) to SoA (structure of arrays). For example, it can de-interleave packed RGB pixel data into R, G, B planar data.

YGroup3UnzipX2: De-Interleave 3-element groups into 3 vectors and process 2x data.
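The AoS to SoA conversion can be pictured with a minimal scalar sketch over a byte array. The helper below is hypothetical and works on arrays rather than vectors; it only shows which element ends up in which plane.

```csharp
using System;

static class Group3UnzipSketch
{
    // Splits interleaved x0 y0 z0 x1 y1 z1 ... data into three separate planes.
    public static void Group3Unzip(byte[] interleaved, out byte[] x, out byte[] y, out byte[] z)
    {
        if (interleaved.Length % 3 != 0)
        {
            throw new ArgumentException("Length must be a multiple of 3.", nameof(interleaved));
        }
        int count = interleaved.Length / 3;
        x = new byte[count];
        y = new byte[count];
        z = new byte[count];
        for (int i = 0; i < count; i++)
        {
            x[i] = interleaved[3 * i];      // first member of each group
            y[i] = interleaved[3 * i + 1];  // second member
            z[i] = interleaved[3 * i + 2];  // third member
        }
    }
}
```

For packed RGB pixels { R0, G0, B0, R1, G1, B1 } this yields { R0, R1 }, { G0, G1 } and { B0, B1 }.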

YGroup3Unzip - X86 - AMD Ryzen 7 7840H

Type Method .NET Framework .NET Core 2.1 .NET Core 3.1 .NET 5.0 .NET 6.0 .NET 7.0 .NET 8.0
Byte SumBase_Basic 255.172 496.713 501.725 499.601 566.925 505.052 670.702
Byte SumBase 1140.616 1053.352 1089.103 1138.235 1111.114 1478.675 1463.708
Byte SumTraits 1121.904 1086.799 7468.216 11280.246 11541.671 12438.171 21865.365
Byte SumX2Base 2169.025 2088.353 2171.143 2111.332 2179.099 2812.575 2973.122
Byte SumX2Traits 2229.977 2160.516 10419.951 10989.673 10985.330 11472.251 22393.695
Int16 SumBase_Basic 213.465 389.617 439.760 352.833 453.870 404.842 533.252
Int16 SumBase 738.972 723.809 686.669 739.079 728.061 1015.709 1008.942
Int16 SumTraits 759.109 691.273 3767.055 5383.595 5638.094 6270.971 10452.168
Int16 SumX2Base 1327.217 1262.400 1260.547 1312.866 1288.727 1723.543 1761.102
Int16 SumX2Traits 1320.545 1227.530 6120.175 6190.444 6208.993 5798.718 10909.299
Int32 SumBase_Basic 186.128 276.261 295.992 219.993 323.416 280.863 391.511
Int32 SumBase 184.001 273.403 306.846 224.431 320.332 551.148 555.068
Int32 SumTraits 189.108 277.059 6262.687 6454.641 6392.289 6488.127 6951.683
Int32 SumX2Base 155.218 257.316 284.894 247.659 318.492 1072.598 1093.091
Int32 SumX2Traits 160.252 253.319 5049.720 6341.390 6285.681 6215.097 7422.183
Int64 SumBase_Basic 136.976 170.057 187.362 131.130 193.633 175.953 240.232
Int64 SumBase 135.652 170.323 187.933 125.485 192.634 168.300 238.422
Int64 SumTraits 135.704 167.900 4095.410 3868.199 4015.411 4061.920 4385.505
Int64 SumX2Base 108.319 151.252 178.444 137.145 182.990 155.501 243.663
Int64 SumX2Traits 109.441 151.243 2684.613 3883.237 3978.648 3893.358 4785.675

Description.

  • SumBase_Basic: Use scalar algorithm.
  • SumBase: Use this library's base method (VectorTraitsBase.Statics.YGroup3Unzip). It is implemented by combining BCL vector methods, and can take advantage of hardware acceleration.
  • SumTraits: Use this library's traits method (Vectors.YGroup3Unzip). It is implemented with intrinsic functions, allowing for better hardware acceleration.
  • SumX2Base: Use VectorTraitsBase.Statics.YGroup3UnzipX2. For 8 to 16 bit integers, YGroup3UnzipX2 is generally faster than YGroup3Unzip, more so under earlier versions of .NET.
  • SumX2Traits: Use Vectors.YGroup3UnzipX2.

YGroup3Unzip - Arm - AWS Arm t4g.small

Type Method .NET Core 3.1 .NET 6.0 .NET 7.0 .NET 8.0
Byte SumBase_Basic 263.957 265.524 327.819 381.159
Byte SumBase 380.369 406.259 430.545 443.813
Byte SumTraits 378.710 4381.575 4113.304 6510.157
Byte SumX2Base 702.851 728.691 740.690 767.491
Byte SumX2Traits 700.539 4412.785 4273.763 5294.112
Int16 SumBase_Basic 188.885 189.823 222.856 279.398
Int16 SumBase 213.360 228.410 235.157 242.377
Int16 SumTraits 213.356 1926.559 2134.925 3037.124
Int16 SumX2Base 419.434 448.638 466.043 475.565
Int16 SumX2Traits 419.442 2413.794 2650.031 2638.161
Int32 SumBase_Basic 138.088 143.089 154.241 196.818
Int32 SumBase 141.071 143.390 186.784 198.177
Int32 SumTraits 144.696 1033.899 1069.974 1494.205
Int32 SumX2Base 121.726 138.986 275.479 310.983
Int32 SumX2Traits 119.468 1598.185 1547.795 1618.239
Int64 SumBase_Basic 109.766 100.523 84.039 189.270
Int64 SumBase 109.531 102.084 81.358 185.056
Int64 SumTraits 107.335 1153.333 1176.315 1191.362
Int64 SumX2Base 97.857 96.111 79.729 203.008
Int64 SumX2Traits 98.162 1216.716 1155.302 1374.619

More results

See: BenchmarkResults

Documentation

  • Traits method list: TraitsMethodList
  • Online document: https://zyl910.github.io/VectorTraits_doc/
  • DocFX: Run docfx_serve.bat. Then browse http://localhost:8080/ .
  • Doxygen: Run Doxywizard and click File -> Open on the menu bar. Select the Doxyfile file and click "OK". Open the "Run" tab and click the "Run doxygen" button. It will generate documents in the "doc_gen" folder.

ChangeLog

Full list: ChangeLog