Counter-based Random Number Generators in Intel® oneAPI Math Kernel Library

Published: 05/25/2021

By Pavel Dyakov, Alina Elizarova

Intel® oneAPI Math Kernel Library (oneMKL) provides ARS-5 and Philox4x32-10 counter-based basic random number generators, introduced by John K. Salmon, Mark A. Moraes, Ron O. Dror, and David E. Shaw in “Parallel random numbers: as easy as 1, 2, 3” [1].

ARS-5 is an Advanced Encryption Standard (AES)-based basic random number generator that applies five algorithm iterations (referred to below as rounds) to the generator’s state. Philox4x32-10 relies on a substitution-permutation network (SP network) that produces a highly diffusive bijection by applying 10 rounds to four 32-bit inputs (see [1] for details). Both generators produce random number sequences with a period of 2^130 ≈ 1.4 * 10^39.
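
To make the round structure concrete, here is a minimal sketch of a Philox4x32 block function in standard C++. It follows the construction described in [1]: two 32x32-bit multiplications, an XOR with the key, and a word permutation per round, with the key advanced by a fixed constant between rounds. The multiplier and key-schedule constants are assumed to match the reference code published by the authors of [1]. The names Counter, Key, philox_round, philox4x32, and increment are illustrative only. This is not the oneMKL implementation, and oneMKL may map a seed to the key and counter differently. With a 128-bit counter and four 32-bit outputs per counter value, such a generator yields 2^128 * 4 = 2^130 outputs, which is consistent with the period quoted above.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Counter = std::array<std::uint32_t, 4>;  // 128-bit counter (the "state")
using Key     = std::array<std::uint32_t, 2>;  // 64-bit key (derived from the seed)

// One round: multiply two counter words by fixed constants, XOR the high
// halves with the remaining counter words and the key, then permute.
inline Counter philox_round(const Counter& c, const Key& k) {
    constexpr std::uint64_t M0 = 0xD2511F53u;  // assumed multiplier constants
    constexpr std::uint64_t M1 = 0xCD9E8D57u;
    const std::uint64_t p0 = M0 * c[0];
    const std::uint64_t p1 = M1 * c[2];
    const auto hi0 = static_cast<std::uint32_t>(p0 >> 32);
    const auto lo0 = static_cast<std::uint32_t>(p0);
    const auto hi1 = static_cast<std::uint32_t>(p1 >> 32);
    const auto lo1 = static_cast<std::uint32_t>(p1);
    return {hi1 ^ c[1] ^ k[0], lo1, hi0 ^ c[3] ^ k[1], lo0};
}

// Apply `rounds` rounds (10 for Philox4x32-10) to one counter value.
// The key is bumped by a fixed constant before every round except the first.
inline Counter philox4x32(Counter c, Key k, int rounds = 10) {
    assert(rounds > 0);
    constexpr std::uint32_t W0 = 0x9E3779B9u;  // assumed key-schedule constants
    constexpr std::uint32_t W1 = 0xBB67AE85u;
    for (int r = 0; r < rounds; ++r) {
        if (r > 0) { k[0] += W0; k[1] += W1; }
        c = philox_round(c, k);
    }
    return c;
}

// Advance the 128-bit counter by one, carrying across the 32-bit words.
inline void increment(Counter& c) {
    for (auto& word : c) {
        if (++word != 0) break;  // stop once a word does not wrap around
    }
}
```

A sequence is produced by calling philox4x32 on successive counter values, for example by calling increment between blocks. Each call returns four 32-bit outputs and depends only on the key and the counter value passed in.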

ARS-5 and Philox4x32-10 provide good statistical properties: according to the authors of [1], counter-based random number generators pass the “TestU01 BigCrush” test battery, and oneMKL independently verified them with the “Diehard” test battery [2]. They also generate random number sequences with good performance (details can be found in oneMKL Vector Statistics Performance Data [3]). In addition, their small state size makes them easy to vectorize and parallelize on different hardware. This point is also discussed in the KB article “New counter-based Random Number Generators in Intel® Math Kernel Library” [4].
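
The following sketch illustrates why a small, counter-based state parallelizes easily: element i of the sequence is a function of the key and the index i only, so any index range can be generated independently and in any order. The mixing function here is a stand-in (the SplitMix64 finalizer), not ARS-5 or Philox, and the names mix and generate_range are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in keyed mixing function (SplitMix64 finalizer). It is NOT ARS-5 or
// Philox; it only plays the role of "output = f(key, counter)".
inline std::uint64_t mix(std::uint64_t key, std::uint64_t counter) {
    std::uint64_t z = counter + key * 0x9E3779B97F4A7C15ull;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ull;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBull;
    return z ^ (z >> 31);
}

// Fill out[first, first + count). Element i depends only on (key, i), so
// disjoint ranges can be assigned to different threads or work-items without
// sharing any generator state.
inline void generate_range(std::uint64_t key, std::size_t first,
                           std::size_t count, std::vector<std::uint64_t>& out) {
    assert(first + count <= out.size());
    for (std::size_t i = first; i < first + count; ++i) {
        out[i] = mix(key, i);
    }
}
```

Filling a buffer with one call over the whole range, or with several calls over disjoint chunks in any order, gives the same result. The only state a worker needs is the key and its starting index.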

Recently, the oneMKL ARS-5 engine was additionally optimized with the Vector Advanced Encryption Standard (VAES) instruction set, the vector AES encryption and decryption instructions introduced in Ice Lake (more information can be found in the Intel® Intrinsics Guide [5]). The performance comparison for the ARS-5 and Philox4x32-10 engines on Intel® Xeon® Platinum 8280L (Cascade Lake Server) and Intel® Xeon® Platinum 8380 (Ice Lake Server) is presented below:

Assumptions: sequential (single thread) generation mode; measured region – generation of single precision random numbers uniformly distributed with a = 0, b = 1.
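
For reference, one common way to turn the 32-bit integer output of a basic generator into a single precision number uniformly distributed on [a, b) is sketched below. This is an illustration of the general technique only. The function name to_uniform is hypothetical, and the conversion that oneMKL actually uses may differ.

```cpp
#include <cassert>
#include <cstdint>

// Map a 32-bit integer to a float in [a, b).
// Only the top 24 bits are kept because a float has a 24-bit significand,
// so u is represented exactly and is always strictly less than 1.
inline float to_uniform(std::uint32_t x, float a = 0.0f, float b = 1.0f) {
    assert(a < b);
    const float u = static_cast<float>(x >> 8) * (1.0f / 16777216.0f);  // 2^-24
    return a + (b - a) * u;
}
```

With a = 0 and b = 1, as in the measurement above, the result lies in [0, 1). For other bounds, floating-point rounding in the final scaling step can in rare cases produce b itself.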

By utilizing the VAES instruction set, ARS-5 shows an impressive speed-up on Ice Lake Server hardware (up to 3.9X). The Philox4x32-10 engine also shows a speed-up of about 1.12X, which is due to other hardware characteristics (cache size and number of execution ports).

Starting from the oneMKL 2021.1 release, the ARS-5 and Philox4x32-10 generators are also available with Data Parallel C++ (DPC++) APIs. Both engines support a CPU device, and the Philox4x32-10 engine also supports Intel GPU devices:

#include <vector>
#include <CL/sycl.hpp>
#include "oneapi/mkl.hpp"

int main() {
    sycl::queue queue;
    const size_t n = 10000;
    // create USM allocator
    sycl::usm_allocator<double, sycl::usm::alloc::shared> allocator(queue);
    // create vector with USM allocator
    std::vector<double, decltype(allocator)> r(n, allocator);
    // create basic random number generator object
    // for the ARS-5 engine, use instead: oneapi::mkl::rng::ars5 engine(queue);
    oneapi::mkl::rng::philox4x32x10 engine(queue);
    // create distribution object; its result type must match the element type of r
    oneapi::mkl::rng::uniform<double> distr;
    // perform generation
    auto event = oneapi::mkl::rng::generate(distr, engine, n, r.data());
    // the generate function returns a sycl::event object for synchronization
    event.wait(); // synchronization can also be done with queue.wait()
    return 0;
}

You can also execute the Philox4x32-10 engine on GPUs through OpenMP offload APIs and DPC++ device APIs (which can be called from DPC++ kernels [6]). Code examples are presented below.

OpenMP offload APIs usage example:

#include "mkl.h"
#include "mkl_omp_offload.h"

int main() {
    int dnum = 0; // target device number
    const MKL_INT n = 10000;
    float* r_dev = (float*)mkl_malloc(n * sizeof(float), 64);
    VSLStreamStatePtr stream_dev;
    float a = 0.0f, b = 1.0f; // bounds of the uniform distribution
    // initialize Basic Random Number Generator
    vslNewStream(&stream_dev, VSL_BRNG_PHILOX4X32X10, 1);
    // map the output buffer to the device (n elements)
    #pragma omp target data map(tofrom:r_dev[0:n]) device(dnum)
    {
        // run RNG on gpu, use standard oneMKL interface within a variant dispatch construct
        #pragma omp target variant dispatch device(dnum) use_device_ptr(r_dev)
        {
            vsRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream_dev, n, r_dev, a, b);
        }
    }
    mkl_free(r_dev);
    // deinitialize
    vslDeleteStream(&stream_dev);
    return 0;
}

Device DPC++ APIs usage example:

#include <iostream>
#include <vector>
#include <CL/sycl.hpp>
#include "oneapi/mkl/rng/device.hpp"

int main() {
    sycl::queue queue;
    const size_t n = 10000;
    std::vector<float> r_dev(n);
    // submit a kernel to generate on device
    {
        sycl::buffer<float, 1> r_buf(r_dev.data(), r_dev.size());
        try {
            queue.submit([&](sycl::handler& cgh) {
                auto r_acc = r_buf.get_access<sycl::access::mode::write>(cgh);
                cgh.parallel_for(sycl::range<1>(n), [=](sycl::item<1> item) {
                    // each work-item creates an engine with seed 1 and its own index as the offset
                    oneapi::mkl::rng::device::philox4x32x10<> engine(1, item.get_id(0));
                    oneapi::mkl::rng::device::uniform<> distr;
                    float res = oneapi::mkl::rng::device::generate(distr, engine);
                    r_acc[item.get_id(0)] = res;
                });
            });
            queue.wait_and_throw();
        }
        catch (sycl::exception const& e) {
            std::cout << "\t\tSYCL exception\n" << e.what() << std::endl;
        }
    } // buffer life-time ends, results are copied back to r_dev
    return 0;
}

References:

  1. John K. Salmon, Mark A. Moraes, Ron O. Dror, and David E. Shaw. Parallel random numbers: as easy as 1, 2, 3. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’11, pages 16:1–16:12, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0771-0
  2. Intel® oneAPI Math Kernel Library Vector Statistics Notes
    https://software.intel.com/content/www/us/en/develop/documentation/onemkl-vsnotes/top.html
  3. Intel® oneAPI Math Kernel Library Vector Statistics Performance Data
    https://software.intel.com/content/www/us/en/develop/documentation/onemkl-vsperfdata/top.html
  4. New counter-based Random Number Generators in Intel® Math Kernel Library
    https://software.intel.com/content/www/us/en/develop/articles/new-counter-based-random-number-generators-in-intel-math-kernel-library.html
  5. Intel® Intrinsics Guide
    https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=vaes&expand=641
  6. Intel® oneAPI Math Kernel Library (oneMKL) - Data Parallel C++ Developer Reference
    https://software.intel.com/content/www/us/en/develop/documentation/oneapi-mkl-dpcpp-developer-reference/top/random-number-generators/intel-onemkl-rng-device-usage-model.html

Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
