Intel Optimizations in the Android* Marshmallow Compiler

ID 659696
Updated 7/7/2016

By Rahul Kandu and Razvan Lupusoru

Introduction

The Android* Marshmallow (6.0) release incorporates new features and capabilities and a number of enhancements to the Android Runtime* (ART) that result in improved application performance. Starting with Marshmallow, Intel shifted its virtual machine (VM) engineering focus toward improving the user experience (UX) by optimizing for representative end-user activities in the applications and scenarios that developers and users care about most.

This article describes a few of these optimizations and the corresponding infrastructure that Intel implemented in the Marshmallow Optimizing compiler. It also describes a set of optimizations that are available only in Intel's version of Android Marshmallow and talks about Intel's enhancements of a few optimizations available in the Android Open Source Project (AOSP). These optimizations improve application performance and battery life on mobile devices. For a similar paper for the Lollipop release, see Intel Optimizations in the Android Compiler.

In Android, most or all of an application is usually written in the Java* programming language. Java source code is transformed into Dalvik* Bytecode during application creation (most commonly via Android Studio). Both Lollipop (5.0) and Marshmallow use Ahead-Of-Time (AOT) compilers to compile bytecode into native binary code during application install. As an enhancement, Marshmallow uses a more capable compiler (the Optimizing compiler) to generate optimized native code that is an improvement over Lollipop's legacy Quick compiler. Android System Framework methods are, however, compiled using the Quick compiler to provide better “debuggability” and reduce generated code size.

Marshmallow's Optimizing compiler has advanced optimizations commonly seen in mature Java compilers. For tips on writing Java code to take advantage of Optimizing compiler capabilities, see How to Optimize Java Code in Android Marshmallow.

Compiler Optimizations

Through the KitKat (4.4) release, the Dalvik VM was the primary execution engine. In Dalvik, Intel's focus was on loop optimizations to improve performance on key industry-standard benchmarks, especially optimizations that reduced the number of memory accesses within loops. For Lollipop, the Dalvik VM was replaced by ART, which introduced the Quick compiler. The Quick compiler is an AOT compiler derived from the Dalvik Just-In-Time (JIT) compiler. In order to maintain the same level of performance, the majority of Intel's loop optimizations were ported from the Dalvik VM to the Quick compiler. For Marshmallow, the same optimizations were ported from the Quick compiler to the Optimizing compiler.

The Quick compiler is designed for fast code generation and lacks the infrastructure for complex optimizations and high-quality generated code. The Optimizing compiler in Marshmallow is capable of more extensive optimizations and more efficient native code generation than the Quick compiler. In addition, Intel has added several optimizations that cater to scenarios representative of real Java applications and popular Google Play* use cases.

Method Inlining and Use of Intrinsics

Method Inlining eliminates function call overhead and can provide significant speedup in programs in which small routines are called frequently. Inlining can also increase code size significantly and thus increase pressure on the instruction cache. All classic optimizing compilers have an inlining capability of one form or another.

The following example uses java.lang.String, which is a final class in Java, and its length() method, which is a public member that returns the number of characters in a String object.

String's length() method has a simple implementation.

public int length() {
    return count;
}

Here, count is a private final field of each String object that holds the number of characters in the string.

Given this code

String mystr = new String("Password");
char[] mychar = new char[20];
for (int i = 0; i < mystr.length(); i++) {
    mychar[i] = mystr.charAt(i);
}

the compiler can inline the body of String.length() at the call site that uses the return value as the loop trip count. The transformed loop after inlining would look like:

for (int i = 0; i < mystr.count; i++) {
    mychar[i] = mystr.charAt(i);
}

However, virtual method calls in Java can be a significant performance bottleneck. A public virtual method can be overridden by any derived type, so each call requires a vtable lookup. And because Marshmallow compiles ahead of time, the compiler lacks the runtime type information available to mature JIT compilers; only statically computable types can be used for inlining.

In the Marshmallow AOSP Optimizing compiler, any method marked final, or declared within a class marked final, is considered inline-capable. The Optimizing compiler takes the final keyword on a method declaration as a guarantee that the method can't be overridden by any subclass, so it knows the precise class type during compilation. In the previous example, the compiler knows that String is a final class and thus that mystr.length() cannot be overridden. Further discussion of the final keyword in Java can be found in Writing Final Classes and Methods.
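
To illustrate, compare a virtual method with one whose class is final (a hypothetical sketch, not taken from the Android sources):

class Shape {
    // Virtual: any subclass may override this method, so a call made
    // through a Shape reference cannot be statically inlined by an
    // AOT compiler.
    int area() {
        return 0;
    }
}

final class Square extends Shape {
    private final int side;

    Square(int side) {
        this.side = side;
    }

    // The class is final, so the exact target of square.area() is
    // known at compile time and the call can be inlined.
    @Override
    int area() {
        return side * side;
    }
}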

In Marshmallow, the AOSP Optimizing compiler is able to inline small methods declared static or virtual (and final) as long as they don't contain loops, don't call other methods, and don't throw Java exceptions. String.length() in the example above qualifies on all counts. The AOSP Optimizing compiler can inline final methods of up to roughly 76 Dalvik bytecode units each, and inlining can happen up to three invocation levels deep. The compiler could use higher limits, but these bounds keep generated code size and compile time in check. The sketch below illustrates the criteria.
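
These are hypothetical methods written for illustration, not from the Android sources:

// Qualifies for inlining: small, static, no loops, no calls to other
// methods, and no Java exceptions thrown.
static int clamp255(int v) {
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

// Does not qualify under these rules: the body contains a loop.
static int sum(int[] a) {
    int s = 0;
    for (int v : a) {
        s += v;
    }
    return s;
}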

Java code commonly has the potential to throw exceptions, which can prevent small leaf methods (those that modify arrays or fields) from being inlined. This is a limitation of the AOSP-based compiler. Intel's enhancement of the ART Optimizing compiler allows inlining of methods that may throw NullPointerException, ArrayIndexOutOfBoundsException, or ArithmeticException (integer divide by zero). The first two are fairly common in Java programs, since object field accesses require null checks and array accesses require both null and bounds checks. Inlining such methods can enable null check elimination, and potentially bounds check elimination, in the calling method.

Here is an example of the bubble sort algorithm that uses a swap method to interchange values between array indices.

public static final void swap(int[] array, int i, int j) {
    int temp = array[i];
    array[i] = array[j];
    array[j] = temp;
}

public static void bubbleSort(int[] myArray) {
    for (int i = 0; i < myArray.length - 1; i++) {
        for (int j = 1; j < myArray.length - i; j++) {
            if (myArray[j - 1] > myArray[j]) {
                swap(myArray, j - 1, j);
            }
        }
    }
}

In this example, the compiler inlines the swap() method at its call site inside bubbleSort(), even though swap() might throw NullPointerException or ArrayIndexOutOfBoundsException at runtime. Intel's enhancement of Method Inlining in Marshmallow ART provides a 6 percent improvement on the AnTuTu* benchmark (v4.4) runtime test and a 10 percent speed improvement on the Icy Rocks Android gaming workload.

The Android java.lang.Math class has been optimized to use the libm math functions specialized for the Intel® architecture. For example, a common Java method that returns the square root of the cosine of a float value is:

public float myFunc(float x) {
    float a = (float) Math.sqrt(Math.cos(x));
    return a;
}
The code generated by Intel's version of the ART Optimizing compiler calls directly into the math library functions in the Bionic runtime, optimizing away the Java Native Interface (JNI) call overhead. This optimization is specific to mobile CPUs based on Intel architecture. In Marshmallow, the Math intrinsic optimization provides around a 15 percent speedup on the AnTuTu benchmark (v4.4) runtime test.

Loop Peeling

To understand Loop Peeling, you must be familiar with another optimization performed by most classic optimizing compilers: Loop Invariant Code Motion (LICM). LICM moves (hoists) instructions that compute the same value in every loop iteration (that is, instructions that are loop invariant) out of the loop and into the loop preheader. However, Java language semantics dictate that instructions that could potentially modify Java heap memory can't be safely moved out of a loop. The example below uses newString.charAt() to copy the characters at the string's indices into a character array.

final int SIZE = 64;
String newString = new String("PASSWORD");
char[] mycharArray = new char[SIZE];
int j = 0;
while (j < newString.length()) {
    mycharArray[j] = newString.charAt(j);
    j++;
}

Classic optimizing compilers would inline calls to newString.length() to produce the following:

int j = 0;
while (j < newString.count) {
    mycharArray[j] = newString.charAt(j);
    j++;
}

After an LICM transformation to hoist the loop invariant expression newString.count out of the loop, the code would look like this:

int j = 0;
int len = newString.count; // hoist loop invariant expression
                           // and newString null check
while (j < len) {
    mycharArray[j] = newString.charAt(j); // no newString null check
    j++;
}

Intel implemented Loop Peeling in Marshmallow's ART Optimizing compiler to enable a more aggressive form of LICM. Loop Peeling hoists an entire loop iteration out of the loop so that it executes once before the main loop is entered. Loop-invariant Java exception checks then occur only in that first (peeled) iteration, allowing more invariant code to be hoisted out of the main loop. Removing null checks on arrays and local objects not only reduces runtime overhead during loop execution, it also widens the scope for further optimization.
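
For illustration, here is roughly what the character-copy loop above could look like after peeling one iteration (a hand-written sketch of the compiler's internal transformation; as in the earlier examples, the newString.count access reflects the compiler's internal view, not legal Java source):

int j = 0;
int len = newString.count;
if (j < len) {
    // Peeled first iteration: the null and bounds checks execute here once.
    mycharArray[0] = newString.charAt(0);
    j = 1;
    while (j < len) {
        // Main loop: the loop-invariant checks have been proven
        // redundant and removed.
        mycharArray[j] = newString.charAt(j);
        j++;
    }
}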

Marshmallow's Optimizing compiler uses a redundancy elimination optimization called Global Value Numbering (GVN), together with Alias Analysis (discussed below), to eliminate redundant instructions from a loop.
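
As a simple illustration of GVN (a hypothetical snippet, not compiler output), an expression computed twice receives the same value number, so the second computation can be replaced with a reuse of the first:

static int beforeGVN(int x, int y, int z) {
    int a = x * y + z;
    int b = x * y + z; // redundant: same value number as the line above
    return a + b;
}

static int afterGVN(int x, int y, int z) {
    int t = x * y + z; // computed once
    int a = t;
    int b = t;
    return a + b;
}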

In the next example, mySharpenImage is a filter function used in image enhancement (modeled on open source libraries). The input parameter is an unsigned 8-bit image object. The method sharpens the input image and returns the sharpened result in the output image.

public static void mySharpenImage(ImageUInt8 input, ImageUInt8 output) {
    for (int y = 1; y < input.height - 1; y++) {
        int indexIn = input.startIndex + 1;
        int indexOut = output.startIndex + 1;
        for (int x = 1; x < input.width - 1; x++, indexIn++, indexOut++) {
            output.data[indexOut] = (byte)
                (3 * input.data[indexIn] - input.data[indexIn - 1] +
                 input.data[indexIn + 1]);
        }
    }
}

Loop Peeling works well here. The inner loop is peeled, which allows the compiler to eliminate the null checks on both data arrays, input.data[] and output.data[], in the main loop. The result would be:

public static void mySharpenImage(ImageUInt8 input, ImageUInt8 output) {
    for (int y = 1; y < input.height - 1; y++) {
        int indexIn = input.startIndex + 1;
        int indexOut = output.startIndex + 1;
        int len = input.width - 1;
        if (1 < len) {
            // Peeled iteration: null checks on output, input,
            // and the data arrays execute here once.
            output.data[indexOut] = (byte)
                (3 * input.data[indexIn] - input.data[indexIn - 1] +
                 input.data[indexIn + 1]);
            indexIn++;
            indexOut++;
            for (int x = 2; x < len; x++, indexIn++, indexOut++) {
                // Main loop: no null checks.
                output.data[indexOut] = (byte)
                    (3 * input.data[indexIn] - input.data[indexIn - 1] +
                     input.data[indexIn + 1]);
            }
        }
    }
}

Intel's implementation of Loop Peeling, with help from the store sinking optimization, improves the SmartBench* benchmark by 33 percent (around a 3x speedup on the Pi subtest) on various platforms based on Intel architecture.

Loop Unrolling

In the ART Marshmallow Optimizing compiler, Intel has also implemented Full Loop Unrolling for small loops. Loop Unrolling speeds up loop execution by significantly reducing the number of loop iterations. Unrolling also enables simple forms of parallelization and vectorization and can provide significant speedup. However, it can increase code size, and thus pressure on the instruction cache and instruction TLB, possibly causing performance issues, particularly on devices with smaller caches. Consider the array access in this loop, which sums the elements of an array.

final int ArraySize = 256;
int[] score = new int[ArraySize]; // score[] is populated elsewhere
int totalScore = 0;
for (int i = 0; i < ArraySize; i++) {
    totalScore += score[i];
}

The array length (score.length) and the loop trip count (ArraySize) are the same compile-time constant, so the compiler can eliminate array bounds checks from the loop body. The compiler may also decide to unroll the loop by a factor of two for better performance. After partial unrolling, the loop would be:

for (int i = 0; i < 256; i += 2) {
    totalScore += score[i];
    totalScore += score[i + 1];
}

Keeping these trade-offs in mind, in Marshmallow Intel has implemented Full Loop Unrolling only for small loops, based on the loop size and trip count (which must be a compile-time constant) available to the compiler. The ART Optimizing compiler applies Full Loop Unrolling only in cases where the fully unrolled loop amounts to no more than 50 Dalvik bytecode units. This enables faster execution of small loop bodies and eliminates iteration overhead.
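
For illustration, a small constant-trip-count loop such as the following could be fully unrolled (a hand-written sketch of the transformation, not compiler output):

static int sumPixel(int[] rgba) {
    int sum = 0;
    for (int i = 0; i < 4; i++) { // constant trip count of 4
        sum += rgba[i];
    }
    return sum;
}

// After full unrolling, no loop control remains:
static int sumPixelUnrolled(int[] rgba) {
    int sum = 0;
    sum += rgba[0];
    sum += rgba[1];
    sum += rgba[2];
    sum += rgba[3];
    return sum;
}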

Removing all loop-control variables also exposes the now straight-line body to additional optimizations. Intel's implementation of Loop Unrolling in Marshmallow provides an 8 percent speed improvement on the CaffeineMark* Float Android benchmark.

Both of the loop transformations discussed above (peeling and unrolling) tend to increase generated code size. Therefore, Intel's version of the ART Optimizing compiler checks for possible redundancy elimination before optimizing a loop body by peeling or unrolling. Additionally, Intel has enhanced the AOSP implementation of Global Value Numbering to use Alias Analysis to identify more redundant operations.

Alias Analysis

Intel has implemented Alias Analysis in Marshmallow's ART Optimizing compiler to enhance loop transformations. Alias Analysis determines whether two references can point to the same memory location. For example, in the Java language, an int field cannot alias an object field, since they have different types. The Alias Analysis framework provides three kinds of answers: must alias, may alias, and no alias. In the may alias case, the compiler must make the most conservative assumption. In the other two cases, it can either eliminate a redundant load (must alias) or move a load across a store (no alias). The compiler also uses type information to disambiguate reference types: as long as two objects belong to different class hierarchies, they cannot alias.
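
As a simple illustration (hypothetical code, not from the compiler), type information alone can prove two accesses disjoint:

// An int[] and a float[] can never refer to the same array object,
// so the load of scale[0] cannot be affected by the stores to out[i].
// Alias Analysis reports "no alias," and the compiler may hoist the
// load of scale[0] out of the loop.
static void scaleAll(int[] out, int[] in, float[] scale) {
    for (int i = 0; i < out.length; i++) {
        out[i] = (int) (in[i] * scale[0]);
    }
}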

Conclusion

In the Android Marshmallow Optimizing compiler, Intel has implemented Loop Peeling, Loop Unrolling, and Alias Analysis, and made enhancements to Method Inlining and Global Value Numbering.


Figure 1: Intel's version of ART Marshmallow performance.

As shown in Figure 1, these and several other optimizations result in roughly 35 percent better performance on the Intel® Atom™ x5-Z8000 processor, compared to the stock AOSP Marshmallow Optimizing compiler, across many Android workloads, including CaffeineMark, AnTuTu, SmartBench, and Icy Rocks. Intel has also reduced compilation time and improved GC throughput as measured by GCW for Android.

About the Authors

Rahul Kandu is a software engineer in the Intel Software and Solutions Group (SSG), Systems Technologies & Optimizations (STO), Client Software Optimization (CSO). He focuses on Android performance and finds optimization opportunities to help Intel's performance in the Android ecosystem.

Razvan Lupusoru is a software engineer in the Intel Software and Solutions Group (SSG), Systems Technologies & Optimizations (STO), Client Software Optimization (CSO). He focuses on compiler development for the Android Runtime (ART) and develops optimizations that improve Android application performance on platforms based on Intel architecture.

Acknowledgements (alphabetical)

Johnnie L Birch Jr., Jean Christophe Beyler, Dong-Yuan Chen, Chris Elford, Jean Philippe Halimi, Paul Hohensee, Aleksey Ignatenko, Abhay Kanhere, Serguei Katkov, Maxim Kazantsev, Mark Mendell, Dmitry Petrochenko, Desikan Saravanan, Nikolay Serdjuk, Anton Shamin, Kumar Shiv, Alexei Zavjalov