Significant performance gains can come from combining specialized hardware with the right software, letting developers optimally match a computing task to an available hardware device. Because there is a large and growing variety of hardware available today, developers want not only to target different devices, but also to express work in source code that can be used portably across different kinds of devices, e.g., CPUs, GPUs, and FPGAs. Naturally, they also want to do all of this in a single, familiar programming language.
TornadoVM* is an exciting tool that lets Java* developers explore such heterogeneous programming in a very approachable way. This article gives an overview of TornadoVM and some of the libraries it leverages to implement a heterogeneous programming stack. All software referenced in this article is open source and available via the links listed at the end.
Overview of TornadoVM
TornadoVM accepts Java code and, in conjunction with a Java Virtual Machine and system runtime software, generates high-performance native code that specifically targets a variety of hardware devices, including CPUs, GPUs, and FPGAs. While primarily offering Java language support, other languages (e.g., Python*, Ruby*, JavaScript, R) can be used with TornadoVM via GraalVM’s polyglot runtime.
Figure 1. High-level stack
The TornadoVM Toolchain – Front End
Use of TornadoVM starts with a developer expressing performance-critical tasks in the form of one or more Java methods. There are two different styles in which these task methods may be written:
- Using @Parallel Java annotations on loops; this approach is very easy to use.
- Using the TornadoVM Kernel API; this approach offers more control and is similar in style to the APIs of some existing GPU programming frameworks.
NOTE: TornadoVM is being actively developed. There may be some incompatible API changes going forward (e.g., name changes). The examples described in this article were created with TornadoVM version 0.14, available on GitHub.
Figure 2 shows a matrix multiply task, expressed as a Java method and marked up with @Parallel annotations.
static void matrixMultA(final float[] A, final float[] B, final float[] C, final int size) {
    for (@Parallel int i = 0; i < size; i++) {
        for (@Parallel int j = 0; j < size; j++) {
            float sum = 0.0f;
            for (int k = 0; k < size; k++) {
                sum += A[(i * size) + k] * B[(k * size) + j];
            }
            C[(i * size) + j] = sum;
        }
    }
}
Figure 2. Example task method
The code inside task methods is legal Java, but TornadoVM currently restricts it to a subset of the language. Specifically, if any of the following features are used inside a task method, TornadoVM will generate a message and revert to executing the task on the host JVM (see the sketch after this list):
- object instances other than those from a predefined list of supported classes
- object construction
- recursive calls
- exceptions, either throws or catches
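For illustration, here is a hypothetical task method (not from the TornadoVM sources) that breaks two of these rules, so TornadoVM would decline to offload it and run it on the host JVM instead:
// Hypothetical example: this task method constructs an object and
// throws an exception, so TornadoVM would fall back to the host JVM.
static void notDeviceFriendly(float[] data, int size) {
    for (@Parallel int i = 0; i < size; i++) {
        StringBuilder sb = new StringBuilder();   // object construction
        sb.append(data[i]);
        if (data[i] < 0.0f) {
            throw new IllegalArgumentException(sb.toString()); // exception
        }
        data[i] = data[i] * 2.0f;
    }
}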
In addition to authoring task methods, the developer constructs a TaskSchedule object to collect one or more task methods, supply their arguments, and provide other information about executing the tasks. Figure 3 shows code that creates and executes a TaskSchedule for the matrix multiplication task in Figure 2. The TaskSchedule::execute method blocks until the task is finished and, as instructed by the ::streamOut call shown, the contents of the specified output matrix are synchronized with the host before returning.
public static void main(String[] args) {
    // a, b, and c are size x size float[] matrices
    // . . .
    TaskSchedule schedule = new TaskSchedule("schedule0")
            .task("task0", MatrixMulBlog::matrixMultA, a, b, c, size)
            .streamOut(c);
    schedule.execute();
}
Figure 3. Running a task
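By default, input arrays are copied to the device when the schedule first runs. If the host updates inputs between executions, they can be marked for transfer on every execution. A minimal sketch of this variant, assuming TornadoVM 0.14's TaskSchedule::streamIn for per-execution input transfer:
// Sketch: stream inputs a and b to the device on every execution so
// host-side updates are visible; copy result c back after each run.
TaskSchedule schedule = new TaskSchedule("schedule0")
        .streamIn(a, b)
        .task("task0", MatrixMulBlog::matrixMultA, a, b, c, size)
        .streamOut(c);
schedule.execute();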
In Figure 4, the same matrix-multiply work is expressed using the TornadoVM Kernel API. The innermost loop remains, and the outer loops are replaced with index parameters made available via the KernelContext object; these indices are supplied to each running instance of the matrixMultK method. The inner loop follows the same pattern as the one in Figure 2, though note that this version indexes the matrices in column-major rather than row-major order.
static void matrixMultK(KernelContext ctx, float[] a, float[] b, float[] c, int size) {
    int globalRow = ctx.globalIdx;
    int globalCol = ctx.globalIdy;
    float sum = 0;
    for (int k = 0; k < size; k++) {
        sum += a[(k * size) + globalRow] * b[(globalCol * size) + k];
    }
    c[(globalCol * size) + globalRow] = sum;
}
Figure 4. Example Kernel API task method
When preparing the TaskSchedule for a task written using the Kernel API, additional information about the task is required. Figure 5 shows code that creates and executes a TaskSchedule for the Kernel API task method above.
Note the construction of a WorkerGrid2D object that describes the global size of the task. This is required since the outer loops of the matrix multiplication are now implicit.
Note also that a local work size, 32 x 32 in this case, is set on the WorkerGrid2D. This describes the size of the work-groups into which the global work is partitioned; for example, if size were 512, the 512 x 512 global range would be divided into a 16 x 16 grid of work-groups, each containing 32 x 32 work items. The GridScheduler aggregates one or more WorkerGrid objects, one for each task. This additional control over local size lets a developer apply specialized knowledge about task parallelization and target hardware to further optimize performance.
public static void main(String[] args) {
    // a, b, and c are size x size float[] matrices
    // . . .
    WorkerGrid2D workerGrid = new WorkerGrid2D(size, size);
    workerGrid.setLocalWork(32, 32, 1);
    GridScheduler gridScheduler = new GridScheduler("schedule0.task0", workerGrid);
    KernelContext context = new KernelContext();
    TaskSchedule schedule = new TaskSchedule("schedule0")
            .task("task0", MatrixMulBlog::matrixMultK, context, a, b, c, size)
            .streamOut(c);
    schedule.execute(gridScheduler);
}
Figure 5. Running a Kernel-style task
The TornadoVM Toolchain – Compilation and Backends
The task methods and TaskSchedules, along with the rest of a developer’s Java program, are compiled in conjunction with a supported Java VM. TornadoVM participates in the compilation via a Java Virtual Machine Compiler Interface (JVMCI) connection to the JVM. At compile time, TornadoVM treats these task methods and TaskSchedule objects specially, generating optimized, extended Java bytecode. When the bytecode for these tasks is JIT-compiled to native code, TornadoVM generates intermediate code suitable for targeting one or more of the specialized devices that TornadoVM supports.
Figure 6. TornadoVM execution flow
TornadoVM Level Zero / SPIR-V* Backend
It is a daunting task to generate specialized native code for different kinds and models of hardware. TornadoVM tackles this by making the code-generating backends modular; it currently implements several backends, each of which can target different sets of hardware devices.
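Because the backends expose devices through a common runtime API, a program can discover what is available and pin a schedule to a particular device. The sketch below is an assumption-laden illustration based on the 0.14-era API (TornadoRuntime, TornadoDriver, TornadoDevice, and TaskSchedule::mapAllTo); exact class and method names may differ in other releases, and device selection can also be done without code changes via a property such as -Dschedule0.task0.device=0:0.
// Sketch (0.14-era API, names assumed): enumerate the devices of the
// first backend driver and pin the schedule's tasks to one of them.
static void pinToDevice(TaskSchedule schedule) {
    TornadoDriver driver = TornadoRuntime.getTornadoRuntime().getDriver(0);
    for (int i = 0; i < driver.getDeviceCount(); i++) {
        System.out.println("device " + i + ": " + driver.getDevice(i));
    }
    schedule.mapAllTo(driver.getDevice(0)); // override the default device
}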
To simplify the implementation of these backends and to get the best possible performance, TornadoVM leverages low-level runtime libraries such as Intel's implementation of the oneAPI Level Zero specification. The Level Zero library accepts work expressed in an abstract assembly language, together with task execution information, and generates and executes native code on supported devices. Intel's current Level Zero implementation supports GPUs, with support for other devices planned.
Figure 7. Level Zero Interface
The abstract assembly language accepted by Level Zero is the Standard Portable Intermediate Representation (SPIR-V*), developed by The Khronos Group. TornadoVM generates the SPIR-V code using a Java library created for that purpose and submits the code and task information to Level Zero via a Java JNI binding library. Both the SPIR-V code construction library and the Level Zero Java binding library are available separately as open source projects (SPIR-V Toolkit, Level Zero JNI). Figure 8 shows how TornadoVM targets an Intel GPU.
Figure 8. Targeting an Intel GPU
Compared to a framework like OpenCL™, the Level Zero API supports additional features such as virtual functions, function pointers, unified shared memory, device partitioning, instrumentation, debugging, and hardware diagnostics. It also offers control over things like power management and operating frequency. This level of control is very appealing for systems programming and for implementing runtime systems and compilers, and it makes heterogeneous hardware more accessible and more programmable.
Example Use Cases
Demonstrations of TornadoVM's use in graphics, image recognition, natural language processing, and healthcare can be found in the TornadoVM examples (see More Resources below). The ray tracing example uses TornadoVM to implement a JavaFX-based ray tracing application that endeavors to render object dynamics in real time.
Figure 9a & 9b. Ray tracing example - Unaccelerated Java (9a) & Java accelerated with a GPU (9b)
The application uses a GPU to accelerate Java code, and the performance gain changes the application from impractical for real-time use (about 1 fps with sequential Java) to very practical (about 60 fps with an Intel integrated GPU).
Conclusion
As hardware and software evolve in search of better and faster computing, it's challenging to make new functionality readily available to developers. Among the challenges: new features and improved performance often come at the cost of increased complexity, both in application programming and in the implementation of computing stacks.
TornadoVM and the oneAPI specifications and implementations are great examples of how the collective work of multiple open source projects can be brought together to help tame complexity and put very approachable heterogeneous programming in the hands of Java developers.
More Resources
- TornadoVM
- TornadoVM examples
- Other language examples: TornadoVM Polyglot Examples
- oneAPI
- Level Zero intro
- Level Zero (Intel): Programming with the Intel oneAPI Level Zero Backend
- Level Zero Java binding
- SPIR-V Toolkit
- SPIR-V definition