TornadoVM: Running your Java programs on heterogeneous hardware

October 17, 2019

Heterogeneous hardware is present in almost every computing system: our smartphones contain a Central Processing Unit (CPU) and a multi-core Graphics Processing Unit (GPU); our laptops most likely contain a multi-core CPU with an integrated GPU plus a dedicated GPU; data centers are adding Field Programmable Gate Arrays (FPGAs) to their systems to accelerate specialized tasks while reducing energy consumption. Moreover, companies are building their own hardware to accelerate specialized programs. For instance, Google has developed the Tensor Processing Unit (TPU), a processor for faster execution of TensorFlow computations. This hardware specialization, and the recent popularity of hardware accelerators, is a consequence of the end of Moore’s law: due to physical constraints, the number of transistors per processor no longer doubles every two years with each new CPU generation. Therefore, the way to obtain faster hardware for accelerating applications is through hardware specialization.

The main challenge of hardware specialization is programmability. Each heterogeneous device typically has its own programming model and its own parallel programming language. Standards such as OpenCL and SYCL, as well as map-reduce frameworks, facilitate programming for new and/or parallel hardware. However, many of these parallel programming frameworks were created for low-level programming languages such as Fortran, C, and C++.

Although these programming languages are still widely used, the reality is that industry and academia increasingly use higher-level programming languages such as Java, Python, Ruby, R, and JavaScript. The question, therefore, is: how can we use new heterogeneous hardware from these high-level programming languages?

There are currently two main answers to this question: a) via external libraries, in which users might be limited to a set of well-known functions; and b) via a wrapper that exposes low-level parallel hardware details to the high-level programs (e.g., JOCL is a wrapper for programming OpenCL from Java, in which developers need to know the OpenCL programming model, data management, thread scheduling, etc.). However, many potential users of this new parallel and heterogeneous hardware are not necessarily experts in parallel computing, and a much easier solution is perhaps required.

In this article, we discuss TornadoVM, a plug-in to OpenJDK that allows developers to automatically and transparently run Java programs on heterogeneous hardware, without any required knowledge of parallel computing or heterogeneous programming models. TornadoVM currently supports hardware acceleration on multi-core CPUs, GPUs, and FPGAs, and it can dynamically adapt its execution to the best target device by migrating code between devices (e.g., from a multi-core system to a GPU) at runtime. TornadoVM is a research project developed at the University of Manchester (UK), and it is fully open-source and available on GitHub. In this article, we present an overview of TornadoVM and show how programmers can automatically accelerate a photography filter on multi-core CPUs and GPUs.

How does TornadoVM work?

The general idea of TornadoVM is that developers write or modify as few lines of code as possible, and that code is then automatically executed on accelerators (e.g., on a GPU). TornadoVM transparently manages the execution, memory management, and synchronization, without the developer specifying any details about the actual hardware to run on.

TornadoVM combines a traditional layered architecture with a microkernel architecture, in which the core component is its runtime system. The following figure shows a high-level overview of all the TornadoVM components and how they interact with each other.

[Figure: High-level overview of the TornadoVM components and their interactions]

TornadoVM-API

At the top level, TornadoVM exposes an API to Java developers. This API allows users to identify which methods they want to accelerate by running them on heterogeneous hardware. An important aspect of this programming framework is that it does not automatically detect parallelism. Instead, it exploits parallelism at the task level, where each task corresponds to an existing Java method.

The TornadoVM-API can also create a group of tasks, called a task-schedule. All tasks within the same task-schedule (i.e., all Java methods associated with it) are compiled and executed on the same device (e.g., on the same GPU). By having multiple tasks (methods) in a task-schedule, TornadoVM can further optimize data movement between the main host (the CPU) and the target device (e.g., the GPU). This matters because the host and the target devices do not share memory: data must be copied from the CPU’s main memory to the accelerator’s memory (typically via a PCIe bus). These data transfers are expensive and can hurt the end-to-end performance of our applications. By grouping tasks, TornadoVM can keep data on the target device whenever it detects that no synchronization with the host is needed between the kernels (Java methods) being executed.


The following code snippet shows an example of how to program a typical map-reduce computation using TornadoVM. The class Sample contains three methods: one that performs the vector addition (map), another that computes the reduction (reduce), and a last one that creates the task-schedule and executes it (compute). The methods to be accelerated are map and reduce. Note that the user augments the sequential code with annotations such as @Parallel and @Reduce, which are used as hints for the TornadoVM compiler to parallelize the code. The last method (compute) creates an instance of the TaskSchedule Java class and specifies which methods to accelerate. We will go into the details of the API with a full example in the next section.

// TornadoVM API imports
import uk.ac.manchester.tornado.api.TaskSchedule;
import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.annotations.Reduce;

public class Sample {

    public static void map(float[] a, float[] b, float[] c) {
        for (@Parallel int i = 0; i < c.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    public static void reduce(float[] input, @Reduce float[] out) {
        for (@Parallel int i = 0; i < input.length; i++) {
            out[0] += input[i];
        }
    }

    public void compute(float[] a, float[] b, float[] c, float[] output) {
        TaskSchedule ts = new TaskSchedule("s0")
            .task("map", Sample::map, a, b, c)
            .task("reduce", Sample::reduce, c, output)
            .streamOut(output)
            .execute();
    }
}
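For reference, here is a minimal sketch of how compute might be invoked; the class name, array sizes, and values are our own illustration, not part of the original example:

import java.util.Arrays;

public class SampleMain {
    public static void main(String[] args) {
        final int size = 1024;
        float[] a = new float[size];
        float[] b = new float[size];
        float[] c = new float[size];    // holds the intermediate map result
        float[] output = new float[1];  // holds the final reduction

        Arrays.fill(a, 1.0f);
        Arrays.fill(b, 2.0f);

        new Sample().compute(a, b, c, output);
        // Each c[i] = 1.0 + 2.0 = 3.0, so the reduction is 1024 * 3.0 = 3072.0
        System.out.println("Reduction result: " + output[0]);
    }
}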

TornadoVM Runtime

The TornadoVM runtime layer is split into two subcomponents: a task optimizer and a bytecode generator. The task optimizer takes all tasks within the task-schedules and analyzes the data dependencies among them (dataflow runtime analysis). The goal, as mentioned before, is to optimize data movement across tasks.

Once the TornadoVM runtime system has optimized the data transfers, it generates internal TornadoVM-specific bytecodes. These bytecodes are not visible to developers; their role is to orchestrate the execution on heterogeneous devices. We will show an example of the internal TornadoVM bytecodes below.

Execution-engine

Once the TornadoVM bytecodes have been generated, the execution engine runs them in a bytecode interpreter. The bytecodes are simple instructions that can be reordered internally to perform optimizations, for example, overlapping computation with communication.

The following code snippet shows the list of bytecodes generated for the map-reduce example shown above. Every task-schedule is enclosed between BEGIN and END bytecodes. The number that follows each bytecode identifies the device on which all tasks within the task-schedule will execute. However, the device can be changed at any point at runtime. Recall that we are running two tasks in this particular task-schedule (a map method and a reduce method). For each method (or task), TornadoVM needs to pre-allocate the data and perform the corresponding data transfers. Therefore, TornadoVM executes COPY_IN, which allocates and copies the read-only data (such as arrays a and b from the example), and allocates space in the device buffer for the output (write-only) variables via the ALLOC bytecode. Every bytecode has a bytecode-index (bi) that other bytecodes can refer to. For example, since the execution of many of the bytecodes is non-blocking, TornadoVM adds a barrier by running the ADD_DEP bytecode with a list of bytecode-indexes to wait for.


Then, to run a kernel (Java method), TornadoVM executes the LAUNCH bytecode. The first time this bytecode is executed, TornadoVM compiles the referenced method (in our example, the map and reduce methods) from Java bytecode to OpenCL C. Since this compiler is a source-to-source compiler (Java bytecode to OpenCL C), a second compiler is needed: the one that ships with the driver of each target device (e.g., the NVIDIA GPU driver, or the Intel driver for an Intel FPGA), which compiles the OpenCL C to binary. TornadoVM then stores the final binary in its code cache. If the task-schedule is reused and executed again, TornadoVM obtains the optimized binary from the code cache, saving the time of re-compilation. Once all tasks have been executed, TornadoVM copies the final result into host memory by running the COPY_OUT_BLOCK bytecode.

BEGIN <0>
COPY_IN <0, bi1, a>
COPY_IN <0, bi2, b>
ALLOC <0, bi3, c>
ADD_DEP <0, bi1, bi2, bi3>
LAUNCH <0, bi4, @map, a, b, c>
ALLOC <0, bi5, output>
ADD_DEP <0, bi4, bi5>
LAUNCH <0, bi7, @reduce, c, output>
COPY_OUT_BLOCK <0, bi8, output>
END <0>
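The effect of the code cache can be observed by simply timing repeated executions of the same task-schedule: the first execute call pays the JIT-compilation cost, while subsequent calls reuse the cached binary. The following is a minimal sketch of such a measurement; the timing code is ours, and the arrays are those of the Sample example above:

TaskSchedule ts = new TaskSchedule("s0")
    .task("map", Sample::map, a, b, c)
    .task("reduce", Sample::reduce, c, output)
    .streamOut(output);

for (int i = 0; i < 5; i++) {
    long start = System.nanoTime();
    ts.execute(); // blocking call: copy-in, launch kernels, copy-out
    long elapsed = System.nanoTime() - start;
    // The first iteration includes the Java-bytecode-to-OpenCL-C compilation;
    // later iterations should hit the code cache and run noticeably faster.
    System.out.printf("run %d: %.3f ms%n", i, elapsed / 1e6);
}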

The following figure shows a high-level representation of how TornadoVM compiles and executes code from Java to OpenCL. The JIT compiler is an extension of the Graal JIT compiler for OpenCL, developed at the University of Manchester. Internally, the JIT compiler builds a control flow graph (CFG) and a data flow graph (DFG) for the input program, which are optimized during different tiers of compilation. The TornadoVM JIT compiler currently has three tiers of optimization: a) architecture-independent optimizations (HIR), such as loop unrolling, constant propagation, parallel loop exploration, and parallel pattern detection; b) memory optimizations, such as alignment, in the MIR; and c) architecture-dependent optimizations. Once the code is optimized, TornadoVM traverses the optimized graph and generates OpenCL C code, as shown on the right side of the figure.

[Figure: TornadoVM JIT compilation flow, from Java bytecode through the Graal IR tiers to the generated OpenCL C code]

Additionally, the execution engine automatically handles memory and keeps consistency between the device buffers (allocated on the target device) and the host buffers (allocated on the Java heap). Since compilation and execution are managed automatically by TornadoVM, end-users do not have to worry about these internal details.

Testing TornadoVM

This section shows some examples of how to program and run TornadoVM. As an example, we show a simple program that transforms an input color JPEG image into a grayscale image. We then show how to run it on different devices and measure its performance. All examples presented in this article are available online on GitHub.

Grayscale transformation Java code

The Java method that transforms a color JPEG image into grayscale is the following:

class Image {
  private static void grayScale(int[] image, final int w, final int s) {
    for (int i = 0; i < w; i++) {
        for (int j = 0; j < s; j++) {
            int rgb = image[i * s + j]; // get the pixel
            int alpha = (rgb >> 24) & 0xFF;
            int red = (rgb >> 16) & 0xFF;
            int green = (rgb >> 8) & 0xFF;
            int blue = (rgb & 0xFF);
            int grayLevel = (red + green + blue) / 3;
            int gray = (alpha << 24) | (grayLevel << 16) | (grayLevel << 8) | grayLevel;
            image[i * s + j] = gray;
        }
    }
  }
}

For every pixel in the image, the alpha, red, green, and blue channels are extracted. The color channels are then averaged to form the corresponding gray pixel, which is finally stored back into the image array.
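For example, a pixel with the value 0xFF336699 (alpha 0xFF, red 0x33, green 0x66, blue 0x99) yields grayLevel = (0x33 + 0x66 + 0x99) / 3 = 0x66, so the stored gray pixel is 0xFF666666.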

Since this algorithm can be executed in parallel, it is an ideal candidate for hardware acceleration with TornadoVM. To program the same algorithm with TornadoVM, we first use the @Parallel annotation to mark the loops that can potentially run in parallel. TornadoVM will inspect these loops and analyze whether there are data dependencies between iterations. If there are none, TornadoVM will specialize the code to use 2D indexing in OpenCL. For this example, the code looks as follows:

class Image {
  private static void grayScale(int[] image, final int w, final int s) {
    for (@Parallel int i = 0; i < w; i++) {
        for (@Parallel int j = 0; j < s; j++) {
            int rgb = image[i * s + j]; // get the pixel
            int alpha = (rgb >> 24) & 0xFF;
            int red = (rgb >> 16) & 0xFF;
            int green = (rgb >> 8) & 0xFF;
            int blue = (rgb & 0xFF);
            int grayLevel = (red + green + blue) / 3;
            int gray = (alpha << 24) | (grayLevel << 16) | (grayLevel << 8) | grayLevel;
            image[i * s + j] = gray;
        }
    }
  }
}

Note that we introduce @Parallel for the two loops. After this, we need to instruct TornadoVM to accelerate this method. To do so, we create a task-schedule as follows:

TaskSchedule ts = new TaskSchedule("s0")
    .streamIn(imageRGB)
    .task("t0", Image::grayScale, imageRGB, w, s)
    .streamOut(imageRGB);

// Execute the task-schedule (blocking call)
ts.execute();

The task-schedule is an object that describes all tasks to be accelerated. First, we pass a name that identifies the task-schedule (“s0” in our case, but it could be any name). Then, we define which Java arrays we want to stream into the tasks. This call indicates to TornadoVM that we want to copy the contents of the array every time we invoke the execute method. Otherwise, if no variables are specified in streamIn, TornadoVM creates a cached read-only copy of all the variables needed for the tasks’ execution.

The next call is the task definition. We can create as many tasks as we want within the same task-schedule. As described in the previous section, each task references an existing Java method. The arguments to the task are as follows: first, we pass a name (in our case “t0”, but it could be any other name); then, we pass either a lambda expression or a reference to a Java method (in our case, the method grayScale from the Java class Image); finally, we pass all parameters of the method, as in any other method call.

After that, we indicate to TornadoVM which variables we want to synchronize back with the host (the main CPU). In our case, we want the input JPEG image to be updated with the accelerated grayscale one. Internally, this call forces a data movement in OpenCL from device to host, copying the data from the device’s global memory to the Java heap in the host’s memory. These four lines only declare the tasks and the variables to be used; nothing is executed until the programmer invokes the execute method.
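To make this concrete, the following sketch (our own, with a hypothetical loadNextFrame helper) shows how the same task-schedule could be declared once and executed many times, e.g., to convert a sequence of frames. Because imageRGB is listed in streamIn, its fresh contents are copied to the device on every execute call:

TaskSchedule ts = new TaskSchedule("s0")
    .streamIn(imageRGB)
    .task("t0", Image::grayScale, imageRGB, w, s)
    .streamOut(imageRGB);

for (int frame = 0; frame < numFrames; frame++) {
    loadNextFrame(imageRGB); // hypothetical helper that refills the pixel array
    ts.execute();            // blocking: copy-in, run the kernel, copy-out
}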

Once we have created the program, we compile it with standard javac. The TornadoVM SDK (available once TornadoVM is installed on the machine) provides utility commands that invoke javac with all classpaths and libraries already set:

$ javac.py Image.java

At runtime, we use the tornado command, which is, in fact, an alias for java with all the classpaths and flags required to run TornadoVM on top of the Java Virtual Machine (JVM). But before running with TornadoVM, let’s check which parallel and heterogeneous hardware is available on our machine. We can query this with the following command from the TornadoVM SDK:

$  tornadoDeviceInfo
Number of Tornado drivers: 1
Total number of devices  : 4
Tornado device=0:0
    NVIDIA CUDA -- GeForce GTX 1050
Tornado device=0:1
    Intel(R) OpenCL -- Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Tornado device=0:2
    AMD Accelerated Parallel Processing -- Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Tornado device=0:3
    Intel(R) OpenCL HD Graphics -- Intel(R) Gen9 HD Graphics NEO

On this laptop (a Dell XPS 15, 2018), we have an NVIDIA GTX 1050 GPU, an Intel CPU, and Intel integrated graphics. As shown, all of these devices are OpenCL-compatible and all drivers are already installed; therefore, TornadoVM considers all of them available for execution. Note that, on this laptop, there are two devices targeting the Intel CPU: one using the Intel OpenCL driver and another using the AMD OpenCL driver for CPUs. If no device is specified, TornadoVM uses the default one (0:0). To run our application with TornadoVM, we simply type:

$ tornado Image


To discover on which device our program is running, we can enable basic debug information in TornadoVM by using the `--debug` flag as follows:

$ tornado --debug Image
task info: s0.t0
    platform      	: NVIDIA CUDA
    device        	: GeForce GTX 1050 CL_DEVICE_TYPE_GPU (available)
    dims          	: 2
    global work offset: [0, 0]
    global work size  : [3456, 4608]
    local  work size  : [864, 768]

This means that TornadoVM used the NVIDIA GTX 1050 GPU (the one available in our laptop) to run this Java program: TornadoVM compiled the Java method grayScale to OpenCL at runtime and ran it on the default OpenCL-supported device, in this case the NVIDIA GPU. The debug output also shows how many threads were used (the global work size) and their block size (the local work size). These are decided automatically by the TornadoVM runtime and depend on the input size of the application. In our case, the input image is 3456×4608 pixels, so the global work size matches the image dimensions, with one thread per pixel.

So far, we have managed to run a Java program automatically and transparently on a GPU. This is great, but what about performance? We are using an Intel i7-7700HQ CPU in our testbed laptop. Running the sequential code on this input image takes 1.32 seconds; on the GTX 1050 NVIDIA GPU it takes 0.017 seconds. This is 81x faster processing of the same image.


We can also change the target device at runtime by passing the flag -D<taskScheduleName>.<taskName>.device=0:X. For example, the following command runs TornadoVM on the Intel integrated GPU:

$ tornado --debug -Ds0.t0.device=0:3 Image
task info: s0.t0
    platform      	: Intel(R) OpenCL HD Graphics
    device        	: Intel(R) Gen9 HD Graphics NEO CL_DEVICE_TYPE_GPU (available)
    dims          	: 2
    global work offset: [0, 0]
    global work size  : [3456, 4608]
    local  work size  : [216, 256]

By running on all the devices, we obtain the following speedup graph. The first bar shows the baseline (sequential Java with no acceleration), which is 1. The second bar shows the speedup of TornadoVM against the baseline when running on a multi-core (4-core) CPU. The last bars correspond to the speedups on an integrated GPU and a dedicated GPU. With TornadoVM, this application runs up to 81x faster (NVIDIA GPU) than sequential Java, and up to 62x faster on the Intel integrated graphics. Notice that on the multi-core configuration TornadoVM is super-linear (27x on a 4-core CPU). This is because the generated OpenCL C code can exploit the vector instructions of the CPU, such as the AVX and SSE registers available per core.

[Figure: Speedup of TornadoVM over sequential Java on a multi-core CPU, an integrated GPU, and a dedicated GPU]

Use cases

The previous section showed an example of a simple application, in which a quite common photography filter is accelerated. However, TornadoVM’s functionality extends beyond simple programs. For example, TornadoVM can currently accelerate machine learning and deep learning applications, computer vision, physics simulations, and financial applications.

SLAM Applications

TornadoVM has been used to accelerate a complex computer vision application (Kinect Fusion) on GPUs. The application is written in pure Java and contains around 7k lines of code. It captures a room with a Microsoft Kinect camera, with the goal of reconstructing the 3D space in real time. To achieve real-time performance, the room must be rendered at a minimum of 30 frames per second (fps). The original Java version achieves 1.7 fps, while the TornadoVM version running on a GTX 1050 NVIDIA GPU achieves up to 90 fps. The TornadoVM version of the Kinect Fusion application is open-source and available on GitHub.

[Figure: Kinect Fusion 3D reconstruction running with TornadoVM]

Machine Learning for the UK National Health Service (NHS)

Exus Ltd. is a company based in London that is currently improving the UK NHS system by predicting patients’ hospital readmissions. To do so, Exus correlates patient data containing profiles, characteristics, and medical conditions. The algorithm used for prediction is a typical logistic regression over datasets with millions of elements. So far, Exus has accelerated the training phase of the algorithm with TornadoVM for 100K patients, from 70 seconds (pure Java) down to only 7 seconds (a 10x performance improvement). Furthermore, they have demonstrated that, with a dataset of 2 million patients, execution with TornadoVM improves by 14x.

Physics Simulation

We have also experimented with synthetic benchmarks and computations commonly used in physics simulation and signal processing, such as NBody and DFT. In these cases we have measured speedups of up to 4500x using an NVIDIA GP100 GPU (Pascal microarchitecture) and up to 240x using an Intel Nallatech 385a FPGA. These applications are computationally intensive, and their bottleneck is the kernel processing time; thus, a powerful parallel device specialized for these types of computation increases the overall performance.

Present and future of TornadoVM

TornadoVM is currently a research project at the University of Manchester. It is also part of the European Horizon 2020 E2Data project, in which TornadoVM is being integrated with Apache Flink (a Java framework for batch and stream data processing) to accelerate typical map-reduce operations on heterogeneous and distributed-memory clusters.

TornadoVM currently supports compilation and execution on a wide variety of devices, including Intel and AMD CPUs, NVIDIA and AMD GPUs, and Intel FPGAs. Work is ongoing to also support Xilinx FPGAs; with this, we aim to cover all current offerings of cloud providers. Additionally, we are integrating more compiler and runtime optimizations, such as the use of device memory tiers and virtual shared memory, to reduce the total execution time and increase overall performance.

Summary

In this article, we discussed TornadoVM, a plug-in for OpenJDK that accelerates Java programs on heterogeneous devices. We first described how TornadoVM compiles and executes code on heterogeneous hardware such as a GPU. We then presented an example of programming and running TornadoVM on different devices, including a multi-core CPU, an integrated GPU, and a dedicated NVIDIA GPU. Finally, we showed that, with TornadoVM, developers can achieve high performance while keeping their applications totally hardware-agnostic. We believe that TornadoVM offers an interesting approach in which the code to be added is easy to read and maintain, while offering high performance when parallel hardware is available in the system.

More information regarding the technical aspects of TornadoVM can be found below:

References

Juan Fumero, Michail Papadimitriou, Foivos S. Zakkak, Maria Xekalaki, James Clarkson, and Christos Kotselidis. 2019. Dynamic application reconfiguration on heterogeneous hardware. In Proceedings of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE 2019). ACM, New York, NY, USA, 165-178. DOI: https://doi.org/10.1145/3313808.3313819

Juan Fumero and Christos Kotselidis. 2018. Using compiler snippets to exploit parallelism on heterogeneous hardware: a Java reduction case study. In Proceedings of the 10th ACM SIGPLAN International Workshop on Virtual Machines and Intermediate Languages (VMIL 2018). ACM, New York, NY, USA, 16-25. DOI: https://doi.org/10.1145/3281287.3281292

James Clarkson, Juan Fumero, Michail Papadimitriou, Foivos S. Zakkak, Maria Xekalaki, Christos Kotselidis, and Mikel Luján. 2018. Exploiting high-performance heterogeneous hardware for Java programs using Graal. In Proceedings of the 15th International Conference on Managed Languages & Runtimes (ManLang ’18). ACM, New York, NY, USA, Article 4, 13 pages. DOI: https://doi.org/10.1145/3237009.3237016

TornadoVM with Juan Fumero. https://www.youtube.com/watch?v=nPlacnadR6k

Christos Kotselidis, James Clarkson, Andrey Rodchenko, Andy Nisbet, John Mawer, and Mikel Luján. 2017. Heterogeneous Managed Runtime Systems: A Computer Vision Case Study. SIGPLAN Not. 52, 7 (April 2017), 74-82. DOI: https://doi.org/10.1145/3140607.3050764

Acknowledgments

This work is partially supported by the European Union’s Horizon 2020 E2Data (780245) and ACTiCLOUD (732366) grants. Special thanks to Gerald Mema from Exus for reporting the NHS use case.
