pyvkfft-benchmark

Run pyvkfft benchmark tests. This is relatively slow, as each test runs in a separate process (including the GPU initialisation); this avoids any context or memory issues when performing a large number of tests. This can also be used to compare results with cuFFT (via scikit-cuda or cupy) and gpyfft.

usage: pyvkfft-benchmark [-h] [--backend {cuda,opencl,gpyfft,skcuda,cupy}]
                         [--precision {single,double}] [--gpu GPU]
                         [--opencl_platform OPENCL_PLATFORM] [--serial]
                         [--save] [--compare COMPARE] [--systematic]
                         [--dry-run] [--plot PLOT [PLOT ...]]
                         [--radix [{2,3,5,7,11,13} ...]] [--bluestein]
                         [--ndim {1,2,3} [{1,2,3} ...]] [--range RANGE RANGE]
                         [--range-mb RANGE_MB RANGE_MB]
                         [--minsize-mb MINSIZE_MB] [--nbatch NBATCH] [--r2c]
                         [--dct {1,2,3,4}] [--dst {1,2,3,4}] [--inplace]
                         [--disableReorderFourStep {-1,0,1}]
                         [--coalescedMemory {-1,16,32,64,128} [{-1,16,32,64,128} ...]]
                         [--numSharedBanks {-1,16,20,24,28,32,36,40,44,48,52,56,60,64}]
                         [--aimThreads {-1,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,100,104,108,112,116,120,124,128,132,136,140,144,148,152,156,160,164,168,172,176,180,184,188,192,196,200,204,208,212,216,220,224,228,232,236,240,244,248,252,256} [{-1,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,100,104,108,112,116,120,124,128,132,136,140,144,148,152,156,160,164,168,172,176,180,184,188,192,196,200,204,208,212,216,220,224,228,232,236,240,244,248,252,256} ...]]
                         [--performBandwidthBoost {-1,0,1,2,4}]
                         [--registerBoost {-1,1,2,4}]
                         [--registerBoostNonPow2 {-1,0,1}]
                         [--registerBoost4Step {-1,1,2,4}]
                         [--warpSize {-1,1,2,4,8,16,32,64,128,256} [{-1,1,2,4,8,16,32,64,128,256} ...]]
                         [--batchedGroup BATCHEDGROUP BATCHEDGROUP BATCHEDGROUP]
                         [--useLUT {-1,0,1}]
                         [--forceCallbackVersionRealTransforms {-1,0,1}]

Named Arguments

--backend

Possible choices: cuda, opencl, gpyfft, skcuda, cupy

FFT backend to use, 'cuda' and 'opencl' will use pyvkfft with the corresponding language.

Default: "opencl"

--precision

Possible choices: single, double

Precision for the benchmark

Default: "single"

--gpu

GPU name (or sub-string)

--opencl_platform

Name (or sub-string) of the opencl platform to use (case-insensitive). Note that by default the PoCL platform is skipped, unless it is specifically requested or it is the only one available (PoCL has some issues with VkFFT for some transforms)

--serial

Use this to perform all tests in a single process. This is mostly useful for testing, and can lead to GPU memory issues, especially with skcuda.

Default: False

--save

Save results to an sql file

Default: False

--compare

Name of database file to compare to.

--systematic

Perform a systematic benchmark over a range of array sizes. Without this argument only a small number of array sizes is tested.

Default: False

--dry-run

Perform a dry-run, printing the number of array shapes to test

Default: False

--plot

Plot results stored in *.sql files. Separate plots are given for different dimensions. Multiple *.sql files can be given for comparison. This parameter supersedes all others (no tests are run if --plot is given)

systematic

Options for --systematic:

--radix

Possible choices: 2, 3, 5, 7, 11, 13

Perform only radix transforms. Without --radix, all integer sizes are tested. With '--radix', all radix transforms allowed by the backend are used. Alternatively a list can be given: '--radix 2' (only 2**n array sizes), '--radix 2 3 5' (only 2**N1 * 3**N2 * 5**N3)
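The set of lengths selected by a radix list such as '--radix 2 3 5' can be sketched as below. This is an illustrative helper only (`radix_sizes` is hypothetical, not part of pyvkfft): it enumerates the lengths in a range whose only prime factors belong to the given radices.

```python
# Sketch: enumerate the array lengths selected by e.g. '--radix 2 3 5'
# within a '--range 2 128' interval (hypothetical helper, not the
# actual pyvkfft implementation).

def radix_sizes(radices, nmin, nmax):
    """Return the sorted lengths in [nmin, nmax] whose only prime
    factors belong to `radices`."""
    sizes = set()

    def grow(n):
        if n > nmax:
            return
        if n >= nmin:
            sizes.add(n)
        for r in radices:
            grow(n * r)

    grow(1)
    return sorted(sizes)

print(radix_sizes([2], 2, 128))       # powers of two: 2, 4, ..., 128
print(radix_sizes([2, 3, 5], 2, 16))  # 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16
```

Without '--radix', every integer length in the range would be kept instead (non-radix sizes then use the Bluestein or Rader algorithms).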

--bluestein, --rader

Test only non-radix sizes, using the Bluestein or Rader transforms. Not compatible with --radix

Default: False

--ndim

Possible choices: 1, 2, 3

Number of dimensions for the transform. The arrays will be stacked so that each batch transform is at least 1GB.

Default: [2]

--range

Range of array lengths [min, max] along each transform dimension, '--range 2 128'. This is combined with --range-mb to determine the actual range, so you can put large values here and let the maximum total size limit the actual memory used.

Default: [2, 256]

--range-mb

Range of array sizes in MBytes. This is combined with --range to find the actual range to use.

Default: [0, 128]
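The way '--range' and '--range-mb' combine can be sketched as a simple filter. The byte-size formula below (a single-precision c2c n**ndim array taking n**ndim * 8 bytes) is an assumption for illustration; the actual pyvkfft logic may differ in details.

```python
# Sketch: keep only the lengths from '--range' whose array size
# falls inside '--range-mb' (assumed size formula, illustration only).

def lengths_to_test(nmin, nmax, mb_min, mb_max, ndim=2, itemsize=8):
    keep = []
    for n in range(nmin, nmax + 1):
        mb = n**ndim * itemsize / 1024**2
        if mb_min <= mb <= mb_max:
            keep.append(n)
    return keep

# With the defaults (--range 2 256, --range-mb 0 128, ndim=2), every
# length passes, since 256**2 * 8 bytes = 0.5 MB < 128 MB.
print(len(lengths_to_test(2, 256, 0, 128)))  # 255
```

This is why large '--range' values are harmless: the '--range-mb' upper bound caps the memory actually used.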

--minsize-mb

Minimal size (in MB) of the transformed array to ensure a precise enough timing, as the FT is tested on a stacked array using a batch transform. Larger values take more time. Ignored if --nbatch is not -1 (the default)

Default: 100

--nbatch

Specify the batch size for the array transforms. By default (-1), this number is automatically adjusted for each length so that the total size is equal to 'minsize-mb' (100MB by default), e.g. for 2D R2C test of 512x512, the batch number is 100. Use 1 to disable batch, or any other number to use a fixed batch size.

Default: -1
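The automatic batch-size choice can be reproduced with the following arithmetic sketch (assumed formula, not the actual pyvkfft code): the batch is the number of arrays needed for the stacked array to reach roughly 'minsize-mb' megabytes.

```python
# Sketch of the automatic batch size used when --nbatch is -1
# (assumed arithmetic, illustration only).

def auto_nbatch(shape, minsize_mb=100, itemsize=4):
    """itemsize=4 for a single-precision real (R2C) array,
    8 for single-precision complex (c2c)."""
    nbytes = itemsize
    for n in shape:
        nbytes *= n
    return max(1, minsize_mb * 1024**2 // nbytes)

# The 2D R2C example from the help text: a 512x512 float32 array
# is exactly 1 MB, so 100 MB gives a batch of 100.
print(auto_nbatch((512, 512)))  # 100
```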

--r2c

Test real-to-complex transform (default is c2c)

Default: False

--dct

Possible choices: 1, 2, 3, 4

Test the discrete cosine transform (DCT) of the given type (default is c2c)

Default: False

--dst

Possible choices: 1, 2, 3, 4

Test the discrete sine transform (DST) of the given type (default is c2c)

Default: False

--inplace

Test inplace transforms

Default: False

advanced

Advanced options for VkFFT. Do NOT use unless you really know what these mean. -1 will always defer the choice to VkFFT. For some parameters (coalescedMemory, aimThreads and warpSize), if multiple values are used, this will trigger the automatic tuning of the transform by testing each possible configuration of parameters, before using the optimal parameter for the actual transform.
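When several values are given for the tunable parameters, one configuration per combination is benchmarked before the best one is kept. The cost therefore grows multiplicatively, as this small sketch shows (illustration only; the actual tuning is performed inside pyvkfft):

```python
# Sketch: the number of configurations tried during automatic tuning
# is the product of the number of values given for each parameter.
from itertools import product

coalescedMemory = [32, 64]
aimThreads = [16, 32, 64]
warpSize = [-1]

configs = list(product(coalescedMemory, aimThreads, warpSize))
print(len(configs))  # 2 * 3 * 1 = 6 configurations tested
for cm, at, ws in configs:
    print(f"coalescedMemory={cm} aimThreads={at} warpSize={ws}")
```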

--disableReorderFourStep

Possible choices: -1, 0, 1

Disable the unshuffling of the four-step algorithm. Requires a temporary buffer allocation.

Default: -1

--coalescedMemory

Possible choices: -1, 16, 32, 64, 128

Number of bytes to coalesce per transaction: defaults to 32 for Nvidia and AMD, 64 for others. Should be a power of two.

Default: [-1]

--numSharedBanks

Possible choices: -1, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64

Number of shared banks on the target GPU. Default is 32.

Default: -1

--aimThreads

Possible choices: -1, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 68, 72, 76, 80, 84, 88, 92, 96, 100, 104, 108, 112, 116, 120, 124, 128, 132, 136, 140, 144, 148, 152, 156, 160, 164, 168, 172, 176, 180, 184, 188, 192, 196, 200, 204, 208, 212, 216, 220, 224, 228, 232, 236, 240, 244, 248, 252, 256

Try to aim all kernels at this number of threads.

Default: [-1]

--performBandwidthBoost

Possible choices: -1, 0, 1, 2, 4

Try to reduce the coalesced number by a factor of X to get a bigger sequence in one upload for strided axes.

Default: -1

--registerBoost

Possible choices: -1, 1, 2, 4

Specify if the register file size is bigger than shared memory and can be used to extend it X times (on Nvidia 256KB register file can be used instead of 32KB of shared memory, set this constant to 4 to emulate 128KB of shared memory).

Default: -1

--registerBoostNonPow2

Possible choices: -1, 0, 1

Specify if register over-utilization should be used on non-power-of-2 sequences.

Default: -1

--registerBoost4Step

Possible choices: -1, 1, 2, 4

Specify if register file over-utilization should be used for big sequences (>2^14); same definition as registerBoost.

Default: -1

--warpSize

Possible choices: -1, 1, 2, 4, 8, 16, 32, 64, 128, 256

Number of threads per warp/wavefront. Normally automatically derived from the driver. Must be a power of two

Default: [-1]

--batchedGroup

How many FFTs are done per single kernel by a dedicated thread block, for each dimension.

Default: [-1, -1, -1]

--useLUT

Possible choices: -1, 0, 1

Use a look-up table to bypass the native sincos functions.

Default: -1

--forceCallbackVersionRealTransforms

Possible choices: -1, 0, 1

Force the callback version of the R2C and R2R (DCT/DST) algorithms for all use cases. This is normally activated automatically by VkFFT for odd sizes.

Default: -1

Examples:

  • Simple benchmark for radix transforms:

    pyvkfft-benchmark --backend cuda --gpu titan

  • Systematic benchmark for 1D radix transforms over a given range:

    pyvkfft-benchmark --backend cuda --gpu titan --systematic --ndim 1 --range 2 256

  • Same but only for powers of 2 and 3 sizes, in 2D, and save the results to an SQL file for later plotting:

    pyvkfft-benchmark --backend cuda --gpu titan --systematic --radix 2 3 --ndim 2 --range 2 256 --save

  • Plot the result of a benchmark:

    pyvkfft-benchmark --plot pyvkfft-version-gpu-date-etc.sql

  • Plot and compare the results of multiple benchmarks (grouped by 1D/2D/3D transforms):

    pyvkfft-benchmark --plot *.sql

  • Systematic test in OpenCL for an M1 GPU, tuning the VkFFT algorithm by testing several values of the low-level 'aimThreads' parameter to maximise throughput:

    pyvkfft-benchmark --backend opencl --gpu m1 --systematic --radix --ndim 2 --range 2 256 --inplace --aimThreads 16 32 64 --r2c

When testing VkFFT, each line also indicates at the end the type of algorithm used: (r)adix, (R)ader or (B)luestein, the size of the temporary buffer (if any) and the number of uploads (number of read and writes) for each axis.

Note 1: the indicated throughput is always computed assuming a single read and write for each axis (by convention), even if we know the number of uploads is actually larger.

Note 2: in the case of DCT1 and DST1 the throughput will be worse, as these are computed as complex systems of size 2N-2, i.e. with 4x the original size.