Contents

Altera SDK for OpenCL Best Practices Guide................................................................. 1-1

Introduction............................................................................................................................... 1-1
  FPGA Overview.................................................................................................................... 1-1
  Pipelines............................................................................................................................... 1-2
  Single Work-Item Kernel versus NDRange Kernel.............................................................. 1-4

Multi-Threaded Host Application.......................................................................................... 1-6

Good OpenCL Kernel Design Practices................................................................................ 1-7
  Transfer Data Via AOCL Channels or OpenCL Pipes......................................................... 1-8
  Unroll Loops....................................................................................................................... 1-12
  Optimize Floating-Point Operations.................................................................................. 1-14
  Allocate Aligned Memory................................................................................................... 1-17
  Align a Struct with or without Padding............................................................................ 1-17
  Maintain Similar Structures for Vector Type Elements.................................................... 1-19
  Avoid Pointer Aliasing....................................................................................................... 1-20
  Avoid Expensive Functions.............................................................................................. 1-20
  Avoid Work-Item ID-Dependent Backward Branching..................................................... 1-21

Review Your Kernel’s Area Report to Identify Inefficiencies in Resource Usage............... 1-22
  Area Report Message for Board Interface......................................................................... 1-23
  Area Report Message for Function Overhead.................................................................... 1-23
  Area Report Message for State.......................................................................................... 1-24
  Area report Message for Feedback.................................................................................... 1-25
  Area Report Messages for Global Memory and Global Memory Interconnect................. 1-26
  Area Report Message for Constant Memory..................................................................... 1-28
  Area Report Messages for Local Memory.......................................................................... 1-29
  Area Report Messages for Private Variable Storage......................................................... 1-36
  Area Report Message for Channels................................................................................. 1-43

Profile Your Kernel to Identify Performance Bottlenecks............................................... 1-45
  Altera SDK for OpenCL Profiler GUI................................................................................ 1-46
  Interpreting the Profiling Information............................................................................... 1-52
  AOCL Profiler Limitations................................................................................................. 1-55

Strategies for Improving Single Work-Item Kernel Performance........................................ 1-55
  Optimization Report.......................................................................................................... 1-56
  Addressing Single Work-Item Kernel Dependencies Based on Optimization Report
    Feedback.............................................................................................................................. 1-66
  Removing Loop-Carried Dependencies Caused by Accesses to Memory Arrays............. 1-78
  Good Design Practices for Single Work-Item Kernel....................................................... 1-80

Strategies for Improving NDRange Kernel Data Processing Efficiency............................ 1-82
  Specify a Maximum Work-Group Size or a Required Work-Group Size......................... 1-83
  Kernel Vectorization.......................................................................................................... 1-85
  Multiple Compute Units..................................................................................................... 1-87
  Combination of Compute Unit Replication and Kernel SIMD Vectorization.................... 1-89
  Resource-Driven Optimization.......................................................................................... 1-91
Review Kernel Properties and Loop Unroll Status in the Optimization Report

Strategies for Improving Memory Access Efficiency

General Guidelines on Optimizing Memory Accesses

Optimize Global Memory Accesses

Perform Kernel Computations Using Constant, Local or Private Memory

Improve Kernel Performance by Banking the Local Memory

Optimize Accesses to Local Memory by Controlling the Memory Replication Factor

Strategies for Optimizing FPGA Area Usage

Compilation Considerations

Board Variant Selection Considerations

Memory Access Considerations

Arithmetic Operation Considerations

Data Type Selection Considerations

Additional Information

Document Revision History
The Altera SDK for OpenCL Best Practices Guide provides guidance on leveraging the functionalities of the Altera® Software Development Kit (SDK) for OpenCL™ to optimize your OpenCL(1) applications for Altera FPGAs.

This document assumes that you are familiar with OpenCL concepts and application programming interfaces (APIs), as described in the OpenCL Specification version 1.0 by the Khronos Group. It also assumes that you have experience in creating OpenCL applications.

For more information on the OpenCL Specification version 1.0, refer to the OpenCL Reference Pages on the Khronos Group website. For detailed information on the OpenCL APIs and programming language, refer to the OpenCL Specification version 1.0.

Related Information

- OpenCL Reference Pages
- OpenCL Specification version 1.0

Introduction

To achieve the highest performance of your OpenCL application for FPGAs, familiarize yourself with details of the underlying hardware. In addition, understand the compiler optimizations that convert and map your OpenCL application to FPGAs.

FPGA Overview

FPGAs are integrated circuits that you can configure repeatedly to perform an infinite number of functions. With FPGAs, low-level operations like bit masking, shifting, and addition are all configurable. Also, you can assemble these operations in any order. To implement computation pipelines, FPGAs integrate combinations of lookup tables (LUTs), registers, on-chip memories, and arithmetic hardware (for example, digital signal processor (DSP) blocks) through a network of reconfigurable connections. As a result, FPGAs achieve a high level of programmability. LUTs are responsible for implementing various functions.

(1) The Altera SDK for OpenCL (AOCL) is based on a published Khronos Specification, and has passed the Khronos Conformance Testing Process. Current conformance status can be found at www.khronos.org/conformance.

(2) OpenCL and the OpenCL logo are trademarks of Apple Inc. and used by permission of the Khronos Group™.
logic functions. For example, reprogramming a LUT can change an operation from a bit-wise AND logic function to a bit-wise XOR logic function.

The key benefit of using FPGAs for algorithm acceleration is that they support wide, heterogeneous and unique pipeline implementations. This characteristic is in contrast to many different types of processing units such as symmetric multiprocessors, DSPs, and graphics processing units (GPUs). In these types of devices, parallelism is achieved by replicating the same generic computation hardware multiple times. In FPGAs, however, you can achieve parallelism by duplicating only the logic that your algorithm exercises.

A processor implements an instruction set that limits the amount of work it can perform each clock cycle. For example, most processors do not have a dedicated instruction that can execute the following C code:

\[ E = (((A + B) \oplus C) \& D) >> 2; \]

Without a dedicated instruction for this C code example, a CPU, DSP, or GPU must execute multiple instructions to perform the operation. In contrast, you may think of an FPGA as a hardware platform that can implement any instruction set that your software algorithm requires. You can configure an FPGA to perform a sequence of operations that implements the code example above in a single clock cycle. An FPGA implementation connects specialized addition hardware with a LUT that performs the bit-wise XOR and AND operations. The device then leverages its programmable connections to perform a right shift by two bits without consuming any hardware resources. The result of this operation then becomes a part of subsequent operations to form complex pipelines.

**Pipelines**

The designs of microprocessors, digital signal processors (DSPs), hardware accelerators, and other high performance implementations of digital hardware often contain pipeline architectures. In a pipelined architecture, input data passes through a sequence of stages. Each stage performs an operation that contributes to the final result, such as memory operation, instruction decoding, or calculation.

For example, the diagram below represents the following example code fragment as a multistage pipeline:

```c
for(i = 0; i < 1024; i++)
{
    y[i] = (a[i] + b[i] + c[i] + d[i] + e[i] + f[i] + g[i] + h[i]) >> 3;
}
```

![Figure 1-1: Example Multistage Pipeline Diagram](image)

With a pipeline architecture, each arithmetic operation passes into the pipeline one at a time. Therefore, as shown in the diagram above, a saturated pipeline consists of eight stages that calculate the arithmetic operations simultaneously and in parallel. In addition, because of the large number of loop iterations, the
pipeline stages continue to perform these arithmetic instructions concurrently for each subsequent loop iteration.

**Altera SDK for OpenCL Pipeline Approach**

The AOCL pipeline does not have a set of predefined pipeline stages or instruction set. As a result, it can accommodate for the highly configurable nature of FPGAs.

Consider the following OpenCL code fragment:

```c
size_t index = get_global_id(0);
C[index] = (A[index] >> 5) + B[index];
F[index] = (D[index] - E[index]) << 3;
G[index] = C[index] + F[index];
```

You can configure an FPGA to instantiate a complex pipeline structure that executes the entire code in one iteration. In this case, the AOCL implements the code as two independent pipelined entities that feed into a pipelined adder, as shown in the figure below.

**Figure 1-2: Example of the AOCL Pipeline Approach**

The Altera Offline Compiler (AOC) provides a custom pipeline structure that speeds up computation by allowing operations within a large number of work-items to occur concurrently. The AOC can create a custom pipeline that calculates the values for variables $C$, $F$ and $G$ every clock cycle, as shown below. After a ramp-up phase, the pipeline sustains a throughput of one work-item per cycle.
Figure 1-3: An FPGA Pipeline with Five Instructions Per Clock Cycle

<table>
<thead>
<tr>
<th></th>
<th>C</th>
<th>F</th>
<th>G</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>(A[0] \gg 5 + B[0])</td>
<td>((D[0] - E[0]) \ll 3)</td>
<td>(C[0] + F[0])</td>
</tr>
</tbody>
</table>

Time in Clock Cycles

A traditional processor has a limited set of shared registers. Eventually, a processor must write the stored data out to memory to allow more data to occupy the registers. The AOC keeps data "live" by generating enough registers to store the data for all the active work-items within the pipeline. The following code example and figure illustrate a live variable \(C\) in the OpenCL pipeline:

```c
size_t index = get_global_id(0);
C = A[index] + B[index];
E[index] = C - D[index];
```

Figure 1-4: An FPGA Pipeline with a Live Variable C

Single Work-Item Kernel versus NDRange Kernel

In the early stages of design planning, consider whether constructing your OpenCL kernel as a single work-item might improve performance. Altera recommends that you structure your OpenCL kernel as a
single work-item, if possible. However, if your kernel program does not have loop and memory dependencies, you may structure your application as an NDRange kernel because the kernel can execute multiple work-items in parallel efficiently.

Create single work-item kernels for your design if it satisfies the following criteria:

- You organize your OpenCL application in multiple kernels, use channels to transfer data among the kernels, and data processing sequence is critical to your application.

  **Attention:** In this scenario, if you create NDRange kernels, ensure that you understand the way the Altera SDK for OpenCL defines the order in which multiple work-items access a channel. For more information, refer to the *Multiple Work-Item Ordering* section of the *Altera SDK for OpenCL Programming Guide*.

- You cannot break down an algorithm into separate work-items easily because of data dependencies that arise when multiple work-items are in flight.

**Related Information**

*Multiple Work-Item Ordering for Channels*

### Single Work-Item Execution

The Altera SDK for OpenCL host can execute a kernel as a single work-item, which is equivalent to launching a kernel with an NDRange size of (1, 1, 1). The OpenCL Specification version 1.0 describes this mode of operation as *task parallel programming*. A *task* refers to a kernel executed on a compute unit consisting of one work-group that contains one work-item.

Generally, the host executes multiple work-items in parallel to compute OpenCL kernel instructions. However, the *data parallel programming model* is not suitable for situations where the compiler must share fine-grained data among parallel work-items. In these cases, you can maximize throughput by expressing your kernel as a single work-item. Unlike NDRange kernels, single work-item kernels follow a natural sequential model similar to C programming. Particularly, you do not have to partition the data across work-items.

To ensure high-throughput single work-item-based kernel execution on the FPGA, the Altera Offline Compiler must process multiple pipeline stages in parallel at any given time. The mode of operation is particularly challenging in loops because by default, the AOC executes loop iterations sequentially through the pipeline.

Consider the following code example:

```c
float ai[32];
for (int i=0; i < stream_size; i++)
{
    float fir = 0;
    ai[0] = input[i];

    #pragma unroll 31
    for (int k=31; k > 0; k--)
    {
        fir += ai[k] * coeff[k];
        ai[k] = ai[k-1];
    }

    fir += ai[0] * coeff[0];
    if (i >= 31)
    {
        output[i-31] = fir;
    }
```


During each loop iteration, data values shift into the array $a_i$, and then a reduction occurs for each element of the array. An NDRange kernel requires an intricate mechanism involving local buffers and barrier constructs to pass the shifted $a_i$ values to each work-item. Implementing this intricate mechanism leads to suboptimal kernel performance. However, rewriting the kernel as a single work-item kernel allows the AOC to extract the parallelism between each loop iteration. Pipeline parallel execution saturates the kernel pipeline with multiple loop iterations, allowing the AOC to process the loop in a high-throughput fashion. This loop execution model is known as loop pipelining.

To extract the parallelism between loop iterations, optimize your kernel for loop pipelining manually. In the example above, optimize the code to enable the AOC to leverage hardware shift registers on the FPGA to generate a pipeline that shifts $a_i[k]$ into $a_i[k-1]$. This optimization allows the loop to execute in a true pipeline fashion.

If your kernel code does not access any global or local ID by calling `get_global_id()` or `get_local_id()`, respectively, the AOC assumes your kernel is a single work-item kernel and attempts to implement loop pipelining automatically.

### Limitations

The OpenCL task parallel programming model does not support the notion of a barrier in single-work-item execution. Replace barriers (`barrier`) with memory fences (`mem_fence`) in your kernel. Barriers in a pipelined loop of an OpenCL single work-item kernel cause the AOC to error out.

### Multi-Threaded Host Application

When there are parallel, independent data paths and the host processes the data between kernel executions, consider using a multi-threaded host application. The figure below illustrates how a single-threaded host application processes parallel, independent data paths between kernel executions:

**Figure 1-5: Kernel Execution by a Single-Threaded Host Application**

With a single-threaded host application, you need to build an external synchronization mechanism around the OpenCL host function calls. Using a multi-threaded host application in a thread-safe runtime environment allows you to simplify the host code. In addition, processing multiple sets of data in the host simultaneously can speed up kernel execution.
The figure below illustrates how a multi-threaded host application processes parallel, independent data paths between kernel executions:

**Figure 1-6: Kernel Execution by a Multi-Threaded Host Application in a Thread-Safe Runtime Environment**

![Diagram of kernel execution](image)

**Related Information**

**Multiple Host Threads**

**Good OpenCL Kernel Design Practices**

With the Altera Offline Compiler technology, you do not need to change your kernel to fit it optimally into a fixed hardware architecture. Instead, the AOC customizes the hardware architecture automatically to accommodate your kernel requirements.

In general, you should optimize a kernel that targets a single compute unit first. After you optimize this compute unit, increase the performance by scaling the hardware to fill the remainder of the FPGA. The hardware footprint of the kernel correlates with the time it takes for hardware compilation. Therefore, the more optimizations you can perform with a smaller footprint (that is, a single compute unit), the more hardware compilations you can perform in a given amount of time.

In addition to data processing and memory access optimizations, consider implementing the following design practices, if applicable, when you create your kernels.

**Transfer Data Via AOCL Channels or OpenCL Pipes** on page 1-8
To increase data transfer efficiency between kernels, implement the Altera SDK for OpenCL channels extension in your kernel programs. If you want to leverage the capabilities of channels but have the ability to run your kernel program using other SDKs, implement OpenCL pipes.

**Unroll Loops** on page 1-12
If your OpenCL kernel contains loop iterations, increase performance by unrolling the loop.

**Optimize Floating-Point Operations** on page 1-14
For floating-point operations, you can manually direct the Altera Offline Compiler to perform optimizations that create more efficient pipeline structures in hardware and reduce the overall hardware usage.

**Allocate Aligned Memory** on page 1-17
Allocate host-side buffers to be at least 64-byte aligned.
**Align a Struct with or without Padding** on page 1-17
A properly aligned struct helps the Altera Offline Compiler generate the most efficient hardware.

**Maintain Similar Structures for Vector Type Elements** on page 1-19
If you update one element of a vector type, update all the elements of the vector.

**Avoid Pointer Aliasing** on page 1-20
Insert the `restrict` keyword in pointer arguments whenever possible.

**Avoid Expensive Functions** on page 1-20
Some functions are expensive to implement in FPGAs. Expensive functions might decrease kernel performance or require a large amount of hardware to implement.

**Avoid Work-Item ID-Dependent Backward Branching** on page 1-21
Avoid including any work-item ID-dependent backward branching (that is, branching that occurs in a loop) in your kernel because it degrades performance.

**Transfer Data Via AOCL Channels or OpenCL Pipes**
To increase data transfer efficiency between kernels, implement the Altera SDK for OpenCL channels extension in your kernel programs. If you want to leverage the capabilities of channels but have the ability to run your kernel program using other SDKs, implement OpenCL pipes.

**Attention:** The implementation of OpenCL pipes is a beta feature for the current version of the AOCL.

Sometimes, FPGA-to-global memory bandwidth constrains the data transfer efficiency between kernels. The theoretical maximum FPGA-to-global memory bandwidth varies depending on the number of global memory banks available in the targeted Custom Platform and board. To determine the theoretical maximum bandwidth for your board, refer to your board vendor's documentation.

In practice, a kernel does not achieve 100% utilization of the maximum global memory bandwidth available. The level of utilization depends on the access pattern of the algorithm.

If global memory bandwidth is a performance constraint for your OpenCL kernel, first try to break down the algorithm into multiple smaller kernels. Secondly, as shown in the figure below, eliminate some of the global memory accesses by implementing the AOCL channels or OpenCL pipes for data transfer between kernels.
Figure 1-7: Difference in Global Memory Access Pattern as a Result of Channels or Pipes Implementation

For more information on the usage of channels, refer to the Implementing AOCL Channels Extension section of the Altera SDK for OpenCL Programming Guide.

For more information on the usage of pipes, refer to the Implementing OpenCL Pipes section of the Altera SDK for OpenCL Programming Guide.

Related Information
- Implementing AOCL Channels Extension
- Implementing OpenCL Pipes

Channels and Pipes Characteristics
To implement channels or pipes in your OpenCL kernel program, keep in mind their respective characteristics that are specific to the Altera SDK for OpenCL.

Default Behavior
The default behavior of channels is blocking. The default behavior of pipes is nonblocking.

Concurrent Execution of Multiple OpenCL Kernels
You can execute multiple OpenCL kernels concurrently. To enable concurrent execution, modify the host code to instantiate multiple command queues. Each concurrently executing kernel is associated with a separate command queue.

Important: Pipe-specific considerations:
The OpenCL pipe modifications outlined in Ensuring Compatibility with Other OpenCL SDKs in the Altera SDK for OpenCL Programming Guide allow you to run your kernel on the AOCL. However, they do not maximize the kernel throughput. The OpenCL Specification version 2.0 requires that pipe writes occur before pipe reads so that the kernel is not reading...
from an empty pipe. As a result, the kernels cannot execute concurrently. Because the AOCL supports concurrent execution, you can modify your host application and kernel program to take advantage of this capability. The modifications increase the throughput of your application; however, you can no longer port your kernel to another SDK. Despite this limitation, the modifications are minimal, and it does not require much effort to maintain both types of code.

To enable concurrent execution of kernels containing pipes, replace the depth attribute in your kernel code with the blocking attribute (that is, __attribute__((blocking))). The blocking attribute introduces a blocking behavior in the read_pipe and write_pipe function calls. The call site blocks kernel execution until the other end of the pipe is ready.

If you add both the blocking attribute and the depth attribute to your kernel, the read_pipe calls will only block when the pipe is full, and the write_pipe calls will only block when the pipe is empty. Blocking behavior causes an implicit synchronization between the kernels, which forces the kernels to run in lock step with each other.

### Implicit Kernel Synchronization

Synchronize the kernels implicitly via blocking channel calls or blocking pipe calls. Consider the following examples:

**Table 1-1: Blocking Channel and Pipe Calls for Kernel Synchronization**

<table>
<thead>
<tr>
<th>Kernel with Blocking Channel Call</th>
<th>Kernel with Blocking Pipe Call</th>
</tr>
</thead>
<tbody>
<tr>
<td>#pragma OPENCL EXTENSION</td>
<td>__kernel void producer (__global int * in_buf, write_only pipe int <strong>attribute</strong>((blocking)) c0) {</td>
</tr>
<tr>
<td>cl_altera_channels : enable</td>
<td>for (int i=0; i&lt;10; i++)</td>
</tr>
<tr>
<td>channel int c0;</td>
<td>{ write_pipe(c0, &amp;in_buf[i]);</td>
</tr>
<tr>
<td>__kernel void producer (__global int * in_buf)</td>
<td>}</td>
</tr>
<tr>
<td>{</td>
<td>__kernel void consumer (__global int * ret_buf, read_only pipe int <strong>attribute</strong>((blocking)) c0) {</td>
</tr>
<tr>
<td>for (int i=0; i&lt;10; i++)</td>
<td>for (int i=0; i&lt;10; i++)</td>
</tr>
<tr>
<td>{</td>
<td>{ int x;</td>
</tr>
<tr>
<td>write_channel_altera(c0, in_buf[i]);</td>
<td>read_pipe(c0, &amp;x);</td>
</tr>
<tr>
<td>}</td>
<td>ret_buf[i]=x;</td>
</tr>
<tr>
<td>}</td>
<td>}</td>
</tr>
</tbody>
</table>

You can synchronize the kernels such that a producer kernel writes data and a consumer kernel reads the data during each loop iteration. If the write_channel_altera or write_pipe call in producer does not write any data, consumer blocks and waits at the read_channel_altera or read_pipe call until producer sends valid data, and vice versa.

### Data Persistence Across Invocations

After the write_channel_altera call writes data to a channel or the write_pipe call writes data to a pipe, the data is persistent across work-groups and NDRange invocations. Data that a work-item writes to
a channel or a pipe remains in that channel or pipe until another work-item reads from it. In addition, the 
order of data in a channel or a pipe is equivalent to the sequence of write operations to that channel or 
pipe, and the order is independent of the work-item that performs the write operation.

For example, if multiple work-items try to access a channel or a pipe simultaneously, only one work-item 
can access it. The \texttt{write\_channel\_altera} call or \texttt{write\_pipe} call writes the particular work-item data, 
called \texttt{DATAX}, to the channel or pipe, respectively. Similarly, the first work-item to access the channel or 
pipe reads \texttt{DATAX} from it. This sequential order of read and write operations makes channels and pipes 
an effective way to share data between kernels.

**Imposed Work-Item Order**

The AOCL imposes a work-item order to maintain the consistency of the read and write operations for a 
channel or a pipe.

**Related Information**

- **Ensuring Compatibility with Other OpenCL SDKs**

**Execution Order for Channels and Pipes**

Each channel or pipe call in a kernel program translates into an instruction executed in the FPGA 
pipeline. The execution of a channel call or a pipe call occurs if a valid work-item executes through the 
pipeline. However, even if there is no control or data dependence between channel or pipe calls, their 
execution might not achieve perfect instruction-level parallelism in the kernel pipeline.

Consider the following code examples:

**Table 1-2: Kernel with Two Read Channel or Pipe Calls**

<table>
<thead>
<tr>
<th>Kernel with Two Read Channel Calls</th>
<th>Kernel with Two Read Pipe Calls</th>
</tr>
</thead>
<tbody>
<tr>
<td>_ker_nel _void _consumer _ ( _global _uint_restrict _dst) {</td>
<td>_for_ (int _i=0; _i&lt;5; _i++) {</td>
</tr>
</tbody>
</table>

The code example on the left makes two read channel calls. The code example on the right makes two 
read pipe calls. In most cases, the kernel executes these channel or pipe calls in parallel; however, channel 
and pipe call executions might occur out of sequence. Out-of-sequence execution means that the read 
operation from \texttt{c1} can occur and complete before the read operation from \texttt{c0}.

**Optimize Buffer Inference for Channels or Pipes**

In addition to the manual addition of buffered channels or pipes, the Altera Offline Compiler improves 
kernel throughput by adjusting buffer sizes whenever possible.
During compilation, the AOC computes scheduling mismatches between interacting channels or pipes. These mismatches might cause imbalances between read and write operations. The AOC performs buffer inference optimization automatically to correct the imbalance.

Consider the following examples:

### Table 1-3: Buffer Inference Optimization for Channels and Pipes

<table>
<thead>
<tr>
<th>Kernel with Channels</th>
<th>Kernel with Pipes</th>
</tr>
</thead>
</table>
| `__kernel void producer (  
  __global const uint * restrict src,  
  const uint iterations)  
{  
  for(int i=0; i < iteration; i++)  
  {  
    write_channel_altera(c0,src[2*i]);  
    write_channel_altera(c1,src[2*i+1]);  
  }  
)` | `__kernel void producer (  
  __global const uint * restrict src,  
  const uint iterations,  
  write_only pipe uint  
  __attribute__((blocking)) c0,  
  write_only pipe uint  
  __attribute__((blocking)) c1  
{  
  for(int i=0; i < iteration; i++)  
  {  
    write_pipe(c0,&src[2*i]);  
    write_pipe(c1,&src[2*i+1]);  
  }  
)` |
| `__kernel void consumer (  
  __global uint * restrict dst,  
  const uint iterations)  
{  
  for(int i=0; i < iterations; i++)  
  {  
    dst[2*i]=read_channel_altera(c0);  
    dst[2*i+1]=read_channel_altera(c1);  
  }  
)` | `__kernel void consumer (  
  __global uint * restrict dst,  
  const uint iterations,  
  read_only pipe uint  
  __attribute__((blocking)) c0,  
  read_only pipe uint  
  __attribute__((blocking)) c1  
{  
  for(int i=0; i < iterations; i++)  
  {  
    read_pipe(c0,&dst[2*i]);  
    read_pipe(c1,&dst[2*i+1]);  
  }  
)` |

The AOC performs buffer inference optimization if channels or pipes between kernels cannot form a cycle. A cycle between kernels is a path that originates from a kernel, through a write channel or a write pipe call, and returns to the original kernel. For the example, assume that the write channel or write pipe calls in the kernel `producer` are scheduled 10 cycles apart and the read channel or read pipe calls are scheduled 15 cycles apart. There exists a temporary mismatch in the read and write operations to `c1` because five extra write operations might occur before a read operation to `c1` occurs. To correct this imbalance, the AOC assigns a buffer size of five cycles to `c1` to avoid stalls. The extra buffer capacity decouples the `c1` write operations in the `producer` kernel and the `c1` read operations in the `consumer` kernel.

### Unroll Loops

You can control the way the Altera Offline Compiler translates OpenCL kernel descriptions to hardware resources. If your OpenCL kernel contains loop iterations, increase performance by unrolling the loop. Loop unrolling decreases the number of iterations that the AOC executes at the expense of increased hardware resource consumption.
Consider the OpenCL code for a parallel application in which each work-item is responsible for computing the accumulation of four elements in an array:

```opencl
__kernel void example ( __global const int * restrict x,
    __global int * restrict sum )
{
    int accum = 0;
    for (size_t i=0; i < 4; i++)
    {
        accum += x[i + get_global_id(0) * 4];
    }
    sum[get_global_id(0)] = accum;
}
```

Notice the three main operations that occur in this kernel:

- Load operations from input `x`
- Accumulation
- Store operations to output `sum`

The AOC arranges these operations in a pipeline according to the data flow semantics of the OpenCL kernel code. For example, the AOC implements loops by forwarding the results from the end of the pipeline to the top of the pipeline, depending on the loop exit condition.

The OpenCL kernel performs one loop iteration of each work-item per clock cycle. With sufficient hardware resources, you can increase kernel performance by unrolling the loop, which decreases the number of iterations that the kernel executes. To unroll a loop, add a `#pragma unroll` directive to the main loop, as shown in the code example below. Keep in mind loop unrolling significantly changes the structure of the compute unit that the AOC creates.

```opencl
__kernel void example ( __global const int * restrict x,
    __global int * restrict sum )
{
    int accum = 0;
    #pragma unroll
    for (size_t i=0; i < 4; i++)
    {
        accum += x[i + get_global_id(0) * 4];
    }
    sum[get_global_id(0)] = accum;
}
```

In this example, the `#pragma unroll` directive causes the AOC to unroll the four iterations of the loop completely. To accomplish the unrolling, the AOC expands the pipeline by tripling the number of addition operations and loading four times more data. With the removal of the loop, the compute unit assumes a feed-forward structure. As a result, the compute unit can store the `sum` elements every clock cycle after the completion of the initial load operations and additions. The AOC further optimizes this kernel by coalescing the four load operations so that the compute unit can load all the necessary input data to calculate a result in one load operation.

**Caution:** Avoid nested looping structures. Instead, implement a large single loop or unroll inner loops by adding the `#pragma unroll` directive whenever possible.
Unrolling the loop and coalescing the load operations from global memory allow the hardware implementation of the kernel to perform more operations per clock cycle. In general, the methods you use to improve the performance of your OpenCL kernels should achieve the following results:

- Increase the number of parallel operations
- Increase the memory bandwidth of the implementation
- Increase the number of operations per clock cycle that the kernels can perform in hardware

The AOC might not be able to unroll a loop completely under the following circumstances:

- You specify complete unrolling of a data-dependent loop with a very large number of iterations. Consequently, the hardware implementation of your kernel might not fit into the FPGA.
- You specify complete unrolling and the loop bounds are not constants.
- The loop consists of complex control flows (for example, a loop containing complex array indexes or exit conditions that are unknown at compilation time).

For the last two cases listed above, the AOC issues the following warning:

Full unrolling of the loop is requested but the loop bounds cannot be determined. The loop is not unrolled.

To enable loop unrolling in these situations, specify the #pragma unroll <N> directive, where <N> is the unroll factor. The unroll factor limits the number of iterations that the AOC unrolls. For example, to prevent a loop in your kernel from unrolling, add the directive #pragma unroll 1 to that loop.

Refer to Good Design Practices for Single Work-Item Kernel for tips on constructing well-structured loops.

Optimize Floating-Point Operations

For floating-point operations, you can manually direct the Altera Offline Compiler to perform optimizations that create more efficient pipeline structures in hardware and reduce the overall hardware usage. These optimizations can cause small differences in floating-point results.

Tree Balancing

Order of operation rules apply in the OpenCL language. In the following example, the AOC performs multiplications and additions in a strict order, beginning with operations within the innermost parentheses:

\[
\text{result} = (((A \times B) + C) + (D \times E)) + (F \times G);
\]
By default, the AOC creates an implementation that resembles a long vine for such computations:

**Figure 1-8: Default Floating-Point Implementation**

![Diagram of Default Floating-Point Implementation]

Long, unbalanced operations lead to more expensive hardware. A more efficient hardware implementation is a *balanced tree*, as shown below:

**Figure 1-9: Balanced Tree Floating-Point Implementation**

![Diagram of Balanced Tree Floating-Point Implementation]

In a balanced tree implementation, the AOC converts the long vine of floating-point adders into a tree pipeline structure. The AOC does not perform tree balancing of floating-point operations automatically because the outcomes of the floating-point operations might differ. As a result, this optimization is inconsistent with the IEEE Standard 754-2008.
If you want the AOC to optimize floating-point operations using balanced trees and your program can tolerate small differences in floating-point results, include the `--fp-relaxed` option in the `aoc` command, as shown below:

```
aoc --fp-relaxed <your_kernel_filename>.cl
```

### Rounding Operations

The balanced tree implementation of a floating-point operation includes multiple rounding operations. These rounding operations can require a significant amount of hardware resources in some applications. The AOC does not reduce the number of rounding operations automatically because doing so violates the results required by IEEE Standard 754-2008.

You can reduce the amount of hardware necessary to implement floating-point operations with the `--fpc` option of the `aoc` command. If your program can tolerate small differences in floating-point results, invoke the following command:

```
aoc --fpc <your_kernel_filename>.cl
```

The `--fpc` option directs the AOC to perform the following tasks:

- Remove floating-point rounding operations and conversions whenever possible.
  
  If possible, the `--fpc` argument directs the AOC to round a floating-point operation only once—at the end of the tree of the floating-point operations.

- Carry additional mantissa bits to maintain precision.
  
  The AOC carries additional precision bits through the floating-point calculations, and removes these precision bits at the end of the tree of floating-point operations.

This type of optimization results in hardware that performs a fused floating-point operation, and it is a feature of many new hardware processing systems. Fusing multiple floating-point operations minimizes the number of rounding steps, which leads to more accurate results. An example of this optimization is a fused multiply-accumulate (FMAC) instruction available in new processor architectures. The AOC can provide fused floating-point mathematical capabilities for many combinations of floating-point operators in your kernel.

### Floating-Point versus Fixed-Point Representations

An FPGA contains a substantial amount of logic for implementing floating-point operations. However, you can increase the amount of hardware resources available by using a fixed-point representation of the data whenever possible. The hardware necessary to implement a fixed-point operation is typically smaller than the equivalent floating-point operation. As a result, you can fit more fixed-point operations into an FPGA than the floating-point equivalent.

The OpenCL standard does not support fixed-point representation; you must implement fixed-point representations using integer data types. Hardware developers commonly achieve hardware savings by using fixed-point data representations and only retain a data resolution required for performing calculations. You must use an 8, 16, 32, or 64-bit scalar data type because the OpenCL standard supports only these data resolutions. However, you can incorporate the appropriate masking operations in your source code so that the hardware compilation tools can perform optimizations to conserve hardware resources.

For example, if an algorithm uses a fixed-point representation of 17-bit data, you must use a 32-bit data type to store the value. If you then direct the Altera Offline Compiler to add two 17-bit fixed-point values together, the AOC must create extra hardware to handle the addition of the excess upper 15 bits. To avoid having this additional hardware, you can use static bit masks to direct the hardware compilation tools to...
disregard the unnecessary bits during hardware compilation. The code below implements this masking operation:

```c
__kernel fixed_point_add (__global const unsigned int * restrict a,
                           __global const unsigned int * restrict b,
                           __global unsigned int * restrict result)
{
    size_t gid = get_global_id(0);
    unsigned int temp;
    temp = 0x3_FFFF & ((0x1_FFFF & a[gid]) + ((0x1_FFFF & b[gid]) + (0x1_FFFF & b[gid])));
    result[gid] = temp & 0x3_FFFF;
}
```

In this code example, the upper 15 bits of inputs `a` and `b` are masked away and added together. Because the result of adding two 17-bit values cannot exceed an 18-bit resolution, the AOC applies an additional mask to mask away the upper 14 bits of the result. The final hardware implementation is a 17-bit addition as opposed to a full 32-bit addition. The logic savings in this example are relatively minor compared to the sheer number of hardware resources available in the FPGA. However, these small savings, if applied often, can accumulate into a larger hardware saving across the entire FPGA.

**Allocate Aligned Memory**

Allocate host-side buffers to be at least 64-byte aligned. Allocating the host-side buffers allows direct memory access (DMA) transfers to occur to and from the FPGA, which improves buffer transfer efficiency.

To set up aligned memory allocations, add the following source code to your host program:

- For Windows:
  ```c
  #define AOCL_ALIGNMENT 64
  #include <malloc.h>
  void *ptr = _aligned_malloc (size, AOCL_ALIGNMENT);
  ```
  To free up an aligned memory block, include the function call `_aligned_free(ptr);`

- For Linux:
  ```c
  #define AOCL_ALIGNMENT 64
  #include <stdlib.h>
  void *ptr = NULL;
  posix_memalign (&ptr, AOCL_ALIGNMENT, size);
  ```
  To free up an aligned memory block, include the function call `free(ptr);`

**Align a Struct with or without Padding**

A properly aligned struct helps the Altera Offline Compiler generate the most efficient hardware. A proper struct alignment means that the alignment can be evenly divided by the struct size. Hardware efficiency increases with increasing alignment.
The AOC conforms with the ISO C standard which requires the alignment of a struct to satisfy all of the following criteria:

1. The alignment must be an integer multiple of the lowest common multiple between the alignments of all struct members.
2. The alignment must be a power of two.

You may set the struct alignment by including the `aligned(N)` attribute in your kernel code. Without an aligned attribute, the AOC determines the alignment of each struct in an array of struct based on the size of the struct. Consider the following example:

```c
__kernel void test (struct mystruct* A,
                   struct mystruct* B)
{
    A[get_global_id(0)] = B[get_global_id(0)];
}
```

If the size of `mystruct` is 101 bytes, each load or store access will be 1-byte aligned. If the size of `mystruct` is 128 bytes, each load or store access will be 128-byte aligned, which generates the most efficient hardware.

When the struct fields are not aligned within the struct, the AOC inserts padding to align them. Inserting padding between struct fields affects hardware efficiency in the following manner:

1. Increases the size of the struct
2. Might affect the alignment

To prevent the AOC from inserting padding, include the `packed` attribute in your kernel code. The aforementioned ISO C standard applies when determining the alignment of a packed or unpacked struct. Consider the following example:

```c
struct mystruct1
{
    char a;
    int b;
};
```

The size of `mystruct1` is 8 bytes. Therefore, the struct is 8-byte aligned, resulting in efficient accesses in the kernel. Now consider another example:

```c
struct mystruct2
{
    char a;
    int b;
    int c;
};
```

The size of `mystruct2` is 12 bytes and the struct is 4-byte aligned. Because the struct fields are padded and the struct is unaligned, accesses in the kernel are inefficient.

Below is an example of a struct that includes the `packed` attribute:

```c
__attribute__((packed))
struct mystruct3
{
    char a;
    int b;
};
```
The size of `mystruct3` is 9 bytes; therefore, the struct is 1-byte aligned. Because there is no padding between struct fields, accesses in this kernel are more efficient than accesses in `mystruct2`. However, `mystruct3` is unaligned.

Below is an example of a struct that is aligned and is not padded:

```c
__attribute__((packed))
struct mystruct4
{
    char a;
    int b;
    int c;
    char d[7];
};
```

The size of `mystruct4` is 16 bytes. Because `mystruct4` is aligned and there is no padding between struct fields, accesses in this kernel are more efficient than accesses in `mystruct3`.

To include both the `aligned(N)` and `packed` attributes in a struct, consider the following example:

```c
__attribute__((packed))
__attribute__((aligned(16)))
struct mystruct5
{
    char a;
    int b;
    int c;
};
```

The size of `mystruct5` is 9 bytes. Because of the `aligned(16)` attribute, the struct is stored at 16-byte aligned addresses in an array. Because `mystruct5` is 16-byte aligned and has no padding, accesses in this kernel will be efficient.

For more information on struct alignment and the `aligned(N)` and `packed` attributes, refer to the following documents:

- Section 6.11.1 of the OpenCL Specification version 1.2
- Disabling Insertion of Data Structure Padding section of the Altera SDK for OpenCL Programming Guide
- Specifying the Alignment of a Struct section of the Altera SDK for OpenCL Programming Guide

**Related Information**

- OpenCL Specification version 1.2
- Disabling Insertion of Data Structure Padding
- Specifying the Alignment of a Struct

**Maintain Similar Structures for Vector Type Elements**

If you update one element of a vector type, update all the elements of the vector.

The code example below illustrates a scenario where you should update a vector element:

```c
__kernel void update (__global const float4 * restrict in,
                    __global const float4 * restrict out)
```
Avoid Pointer Aliasing

Insert the restrict keyword in pointer arguments whenever possible. Including the restrict keyword in pointer arguments prevents the Altera Offline Compiler from creating unnecessary memory dependencies between non-conflicting load and store operations.

The restrict keyword informs the AOC that the pointer does not alias other pointers. For example, if your kernel has two pointers to global memory, A and B, that never overlap each other, declare the kernel in the following manner:

```c
__kernel void myKernel (__global int * restrict A,
                        __global int * restrict B)
```

**Warning:** Inserting the restrict keyword on a pointer that aliases other pointers might result in incorrect results.

Avoid Expensive Functions

Some functions are expensive to implement in FPGAs. Expensive functions might decrease kernel performance or require a large amount of hardware to implement.

The following functions are expensive:

- Integer division and modulo (remainder) operators
- Most floating-point operators except addition, multiplication, absolute value, and comparison

  **Note:** For more information on optimizing floating-point operations, refer to the Optimize Floating-Point Operations section.

- Atomic functions

In contrast, inexpensive functions have minimal effects on kernel performance, and their implementation consumes minimal hardware.

The following functions are inexpensive:

- Binary logic operations such as AND, NAND, OR, NOR, XOR, and XNOR
- Logical operations with one constant argument
- Shift by constant
- Integer multiplication and division by a constant that is a power of two

If an expensive function produces a new piece of data for every work-item in a work-group, it is beneficial to code it in a kernel. On the contrary, the code example below shows a case of an expensive floating-point operation (division) executed by every work-item in the NDRange:

```c
__kernel void myKernel (__global const float * restrict a,
                        __global float * restrict b,
                        const float c, const float d)
```
size_t gid = get_global_id(0);

//inefficient since each work-item must calculate c divided by d
b[gid] = a[gid] * (c / d);
}

The result of this calculation is always the same. To avoid this redundant and hardware resource-intensive operation, perform the calculation in the host application and then pass the result to the kernel as an argument for all work-items in the NDRange to use. The modified code is shown below:

```c
__kernel void myKernel (__global const float * restrict a,
                        __global float * restrict b,
                        const float c_divided_by_d)
{
    size_t gid = get_global_id(0);
    /*host calculates c divided by d once and passes it into
c kernel to avoid redundant expensive calculations*/
    b[gid] = a[gid] * c_divided_by_d;
}
```

The Altera Offline Compiler consolidates operations that are not work-item-dependent across the entire NDRange into a single operation. It then shares the result across all work-items. In the first code example, the AOC creates a single divider block shared by all work-items because division of $c$ by $d$ remains constant across all work-items. This optimization helps minimize the amount of redundant hardware. However, the implementation of an integer division requires a significant amount of hardware resources. Therefore, it is beneficial to off-load the division operation to the host processor and then pass the result as an argument to the kernel to conserve hardware resources.

**Related Information**

*Optimize Floating-Point Operations* on page 1-14

---

**Avoid Work-Item ID-Dependent Backward Branching**

The Altera Offline Compiler collapses conditional statements into single bits that indicate when a particular functional unit becomes active. The AOC completely eliminates simple control flows that do not involve looping structures, resulting in a flat control structure and more efficient hardware usage. The AOC compiles kernels that include forward branches, such as conditional statements, efficiently.

Avoid including any work-item ID-dependent backward branching (that is, branching that occurs in a loop) in your kernel because it degrades performance.

For example, the code fragment below illustrates branching that involves work-item ID such as `get_global_id` or `get_local_id`:

```c
for (size_t i = 0; i < get_global_id(0); i++)
{
    // statements
}
```
Review Your Kernel's Area Report to Identify Inefficiencies in Resource Usage

After you compile your OpenCL application, review the area report that the Altera Offline Compiler generates. In addition to summarizing the applications resource usage, the area report offers suggestions on how to modify your design to improve efficiency.

If you do not compile your kernel with the -g AOC command option, the resulting area report will not include source references. The table heading will include the following message:

Recompile with -g for detailed area breakdown by source line.

**Note:** Altera recommends that you always compile your kernels with the -g AOC command option. Compiling your kernel with the -g option has no effect on kernel performance and the final FPGA image.

Figure 1-10: Example Area Report for an OpenCL Application Compiled without the -g AOC Command Option

---

**Area Report Message for Board Interface** on page 1-23
The area report identifies the amount of logic that the Altera Offline Compiler generates for the Custom Platform, or board interface.

**Area Report Message for Function Overhead** on page 1-23
The area report identifies the amount of logic that the Altera Offline Compiler generates for tasks such as dispatching kernels.

**Area Report Message for State** on page 1-24
The area report identifies the amount of resources that your design uses for live values and control logic.

**Area report Message for Feedback** on page 1-25
The area report specifies the resources that your design uses for loop-carried dependencies.

**Area Report Messages for Global Memory and Global Memory Interconnect** on page 1-26
The area report specifies the size of the global memory interconnect. It also provides information on memory caching, if implemented.

**Area Report Message for Constant Memory** on page 1-28
The area report specifies the size of the constant cache memory. It also provides information such as data replication and the number of read operations.
Area Report Messages for Local Memory on page 1-29
The area report provides information on local memory to indicate whether the memory system is stallable, banked, replicated, or merged.

Area Report Messages for Private Variable Storage on page 1-36
The area report provides information on the implementation of private memory based on your OpenCL design.

Area Report Message for Channels on page 1-43
The area report specifies the size of the channels that Altera Offline Compiler implements based on your OpenCL design.

Area Report Message for Board Interface
The area report identifies the amount of logic that the Altera Offline Compiler generates for the Custom Platform, or board interface.

Table 1-4: Additional Information on Area Report Message

<table>
<thead>
<tr>
<th>Message</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Platform interface logic.</td>
<td></td>
</tr>
</tbody>
</table>

Figure 1-11: Example Area Report that Lists the Board Interface

Area Report Message for Function Overhead
The area report identifies the amount of logic that the Altera Offline Compiler generates for tasks such as dispatching kernels.

Table 1-5: Additional Information on Area Report Message

<table>
<thead>
<tr>
<th>Message</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Kernel dispatch logic.</td>
<td>A kernel that includes the <code>max_global_work_dim(0)</code> kernel attribute contains no overhead. As a result, this row is not present in the corresponding area report.</td>
</tr>
</tbody>
</table>

Figure 1-12: Area Report of an OpenCL Application with Function Overhead

```c
1 kernel void gl_test( global int * restrict in, global int * restrict out ) {
```
2     int i = get_global_id(0);
3     int idx = out[i];           // idx is data-dependent
4
5     int cached_value = in[idx]; // this is a cached LSU (burst-coalesced-cached).
6     out[i] = cached_value;
7 }

Figure 1-13: Area Report of an OpenCL Application that Includes a max_global_work_dim(0) Kernel Attribute

1 __attribute__((max_global_work_dim(0)))
2 kernel void dim0( global int *restrict in, global int *restrict out, int c ) {
3     *out = *in * c;
4 }

Area Report Message for State

The area report identifies the amount of resources that your design uses for live values and control logic.
Figure 1-14: Area Report of an OpenCL Application that Implements Global Memory Interconnect and Memory Cache

```c
1 kernel void gl_test( global int * restrict in, global int * restrict out ) {
2     int i = get_global_id(0);
3     int idx = out[i];           // idx is data-dependent
4
5     int cached_value = in[idx]; // this is a cached LSU (burst-coalesced-cached).
6
7     out[i] = cached_value;
8 }
```

To reduce the reported area consumption under State, modify your design as follows:

- Decrease the size of local variables
- Decrease the scope of local variables by localizing them whenever possible
- Decrease the number of nested loops in the kernel

Area report Message for Feedback

The area report specifies the resources that your design uses for loop-carried dependencies.

Figure 1-15: Area Report of an OpenCL Application with Loop-Carried Dependencies that Consume Resources

```c
1 #define FF_SIZE (64)
2 kernel void t( global int * restrict src, global int * restrict dst, int N ) {
3     int delay_fifo[FF_SIZE];
4
```
5     #pragma unroll
6     for (int k = 0; k < FF_SIZE; ++k) {
7         delay_fifo[k] = k;
8     }
9
10    for (int i = 0; i < N; ++i) {
11        dst[i] = delay_fifo[0];
12        #pragma unroll
13        for (int k = 0; k < FF_SIZE-1; ++k) {
14            delay_fifo[k] = delay_fifo[k + 1];
15        }
16        delay_fifo[FF_SIZE - 1] = src[i];
17    }
18 }

To reduce the reported area consumption under Feedback, decrease the number and size of loop-carried variables in your design.

**Area Report Messages for Global Memory and Global Memory Interconnect**

The area report specifies the size of the global memory interconnect. It also provides information on memory caching, if implemented.

**Table 1-6: Additional Information on Area Report Messages**

<table>
<thead>
<tr>
<th>Message</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Global interconnect for &lt;X&gt; global loads and &lt;Y&gt; global store. [Reduce number of global loads and stores to simplify global interconnect.]</td>
<td>Specifies the resources required to implement the global interconnect for the OpenCL system. The report also states the total number of global read and write accesses that are performed in all the kernels of the OpenCL system. The optional message appears if the OpenCL system has more than two global read and write accesses in total, in any combination.</td>
</tr>
</tbody>
</table>

Altera Corporation  
Altera SDK for OpenCL Best Practices Guide  
Send Feedback
Load with a private <N> kilobit cache. Cache is not shared with any other load. It is flushed on kernel start. Use Dynamic Profiler to verify cache effectiveness. Other kernels should not be updating the data in global memory while this kernel is using it. Cache is created when memory access pattern is data-dependent or appears to be repetitive. Simplify access pattern or mark pointer as ‘volatile’ to disable generation of this cache.

This message appears when the corresponding line of code contains a global load operation that the Altera Offline Compiler implements with a cache.

The caching size in kilobits specified in the message (that is, <N>) varies depending on the device family.

Figure 1-16: Area Report of an OpenCL Application that Implements Global Memory Interconnect and Memory Cache
Area Report Message for Constant Memory

The area report specifies the size of the constant cache memory. It also provides information such as data replication and the number of read operations.

Table 1-7: Additional Information on Area Report Message

<table>
<thead>
<tr>
<th>Message</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>(&lt;N&gt;) bytes constant cache accessible to all kernels and is persistent across kernel invocations. Data inside the cache is replicated (&lt;X&gt;) times to support (&lt;Y&gt;) reads. Cache optimized for hits, misses incur a large penalty. If amount of data in the cache is small, consider passing it by value as a kernel argument. Use Dynamic Profiler to check stalls on accesses to the cache to assess the cache’s effectiveness. Profiling actual cache hit rate is currently not supported.</td>
<td>—</td>
</tr>
</tbody>
</table>

Figure 1-17: Area Report of an OpenCL Application that Implements Constant Cache Memory

1 kernel void A( constant int *src, global int *dst ) {  
    int i = get_global_id(0);  
    dst[i] = src[i] + src[i + 1] + src[i >> 1];  
}  
5
6 kernel void B( constant int *src, global int *dst ) {  
    int i = get_global_id(0);  
    dst[i] = src[i] + src[i + 1] + src[i >> 1];  
}  
5

| Area Report  
(area utilization values are estimated) | LEs | FFs | RAMs | DSPs |
<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>System Total (Logic: 16%)</td>
<td>56902 (10%)</td>
<td>72936 (7%)</td>
<td>391 (15%)</td>
<td>0 (0%)</td>
</tr>
<tr>
<td>Board interface</td>
<td>38262</td>
<td>44528</td>
<td>257</td>
<td>0</td>
</tr>
<tr>
<td>Global interconnect</td>
<td>5034</td>
<td>9568</td>
<td>52</td>
<td>0</td>
</tr>
<tr>
<td>Constant cache interconnect</td>
<td>894</td>
<td>9500</td>
<td>44</td>
<td>0</td>
</tr>
<tr>
<td>[*] A (Logic: 1%)</td>
<td>3356 (1%)</td>
<td>4550 (0%)</td>
<td>19 (1%)</td>
<td>0 (0%)</td>
</tr>
<tr>
<td>[*] B (Logic: 1%)</td>
<td>3356 (1%)</td>
<td>4550 (0%)</td>
<td>19 (1%)</td>
<td>0 (0%)</td>
</tr>
</tbody>
</table>
# Area Report Messages for Local Memory

The area report provides information on local memory to indicate whether the memory system is stallable, banked, replicated, or merged.

## Table 1-8: Additional Information on Area Report Messages

<table>
<thead>
<tr>
<th>Message</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Local memory: <code>&lt;comment&gt;</code></td>
<td>Evaluates the efficiency of local memory accesses, where <code>&lt;comment&gt;</code> can be one of the following: Optimal—stall-free accesses with no replication, or the replication does not use extra block RAM Good but replicated—stall-free accesses with replication Possibly inefficient configuration—one or more accesses might be stallable.</td>
</tr>
</tbody>
</table>
| Requested size `<N>` bytes (rounded up to nearest power of 2), implemented size `<M>` bytes, replicated `<X>` times total, stall-free|stallable, `<Y>` reads and `<Z>` writes. | Reports the following:  
• The requested size of the local memory system (`<N>`)  
• The implemented size of the local memory system after replication (`<M>`)  
• The replication factor (`<X>`)  
• The possibility of memory stalls occurring  
• The number of read accesses (`<Y>`)  
• The number of write accesses (`<Z>`) |
| Merged with another memory system declared at `<kernel_filename>.cl:<line_number>`. | The AOC cannot split the local memory system from another local memory system that is declared earlier in the code. In this message, *merged* means that the local memory system in question is assigned to the same address space as the specified local memory system. |

## Additional Information:

- No additional details.

- Merged with memory systems declared at: `<kernel_filename>.cl:<A>, <kernel_filename>.cl:<B>`. The local memory system is merged with other local memory systems that are declared later on in the code. The message reports a comma-separated list of all other merged memory systems. `<A>` and `<B>` represent the line numbers in the corresponding kernel source code that declare local memory systems.

- Reduce the number of write accesses or fix banking to make this memory system stall-free. Banking may be improved by using compile-time known indexing on lowest array dimension. Banking only occurs based on the lowest array dimension. A banked local memory system might still stall. In this case, only use indexes that are known at compilation time when accessing elements in the lowest (rightmost) array dimension. Doing so might improve memory banking and fix stalls.
<table>
<thead>
<tr>
<th>Message</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Banked on lowest dimension into (&lt;N&gt;) separate banks (this is a good thing).</td>
<td>If the AOC implements memory banks based on the lowest array dimension, the report specifies the number of banks (that is, (&lt;N&gt;)).</td>
</tr>
<tr>
<td>Replicated (&lt;N&gt;) times to efficiently support multiple simultaneous workgroups. This replication resulted in (&lt;M&gt;) times no increase in actual block RAM usage. Reducing the number of barriers or increasing max_work_group_size may help reduce this replication factor.</td>
<td>Provides an estimate of the actual cost of local memory replication as a result of the kernel having to process simultaneous work-groups. (&lt;N&gt;) is the replication factor, and (&lt;M&gt;) represents how many times block RAM usage has increased. Number of simultaneous work-groups is the maximum number of work-groups that the kernel can process at the same time. To increase throughput, the kernel might execute threads from different work-groups simultaneously (that is, that kernel does not wait to fully complete one work-group before starting another work-group). If a kernel can process multiple simultaneous work-groups and has local memory, the size of the local memory must increase to store data from each simultaneous work-group. This local memory replication might increase the usage of block RAM. Currently, you do not have the ability to modify the number of simultaneous work-groups directly.</td>
</tr>
<tr>
<td>Replicated (&lt;N&gt;) times to efficiently support multiple accesses. To reduce this replication factor, reduce number of read and write accesses.</td>
<td>The local memory system is replicated to provide the number of read and write ports necessary to accommodate the read and write accesses to local memory.</td>
</tr>
<tr>
<td>Reducing accesses to exactly one read and one write for all local memory systems may increase overall system performance.</td>
<td>The local memory system is stall-free and has no replication, but has more than one read or write ports. If each local memory system in an OpenCL system has exactly one read port and one write port, the overall speed of all the kernels within the OpenCL system might increase. Note for advanced users: Single-pumped local memories with two ports in total run at the kernel clock frequency. Double-pumped local memories with more than two ports require a clock that runs at exactly twice the frequency of the kernel clock (that is, 2x clock). In some cases, this 2x clock might limit the performance of the overall OpenCL system.</td>
</tr>
<tr>
<td>Replicated (&lt;X&gt;) times to create private copies for simultaneous execution of (&lt;Y&gt;) threads in the loop containing accesses to the array.</td>
<td>The local memory system is replicated to support parallel execution of a loop. The AOC achieves this parallel execution via array privatization.</td>
</tr>
</tbody>
</table>

Figure 1-18: Area Report of an OpenCL Application that Implements Three Local Memory Systems that Are Merged

```
1  #define NUM_READS 2
```
```c
#define NUM_WRITES 1
#define NUM_BARRIERS 1
#define ALLOW_SPLITTING 0

__attribute__((reqd_work_group_size(1024,1,1)))
kernelpool void big_lmem( global int* restrict in, global int* restrict out ) {
    local int lmem0[1024];
    local int lmem1[1024];
    local int lmem2[1024];
    int gi = get_global_id(0);
    int gs = get_global_size(0);
    int li = get_local_id(0);

    int res = in[gi];
    #pragma unroll
    for (int i = 0; i < NUM_WRITES; ++i) {
        lmem0[li - i] = res;
        lmem1[li - i] = res;
        lmem2[li - i] = res;
        res >>= 1;
    }

    // successive barriers are not optimized away
    #pragma unroll
    for (int i = 0; i < NUM_BARRIERS; ++i) {
        barrier(CLK_GLOBAL_MEM_FENCE);
    }

    res = 0;
    #pragma unroll 1
    for (int i = 0; i < NUM_READS; ++i) {
        local int *10 = lmem0;
        local int *11 = lmem1;
        local int *12 = lmem2;
        #if ALLOW_SPLITTING==0
            if (i % 3 == 1) {
                10 = lmem1;
                11 = lmem2;
                12 = lmem0;
            }
        #endif
        res ^= 10[li - i];
        res ^= 11[li - i];
        res ^= 12[li - i];
    }

    out[gi] = res;
}
```
Figure 1-19: Area Report of an OpenCL Application with Memory Replication and No Increase in Resource Usage

```c
#define NUM_READS 1
#define NUM_WRITES 1
#define NUM_BARRIERS 1

__attribute__((reqd_work_group_size(128,1,1)))
kernel void big_lmem ( global int* restrict in, global int* restrict out ) {

    local int lmem[128];
    int gi = get_global_id(0);
    int gs = get_global_size(0);
    int li = get_local_id(0);

    int res = in[gi];
    #pragma unroll
    for (int i = 0; i < NUM_WRITES; ++i) {
        lmem[li - i] = res;
        res >>= 1;
    }

    // successive barriers are not optimized away
    #pragma unroll
```
for (int i = 0; i < NUM_BARRIERS; ++i) {
    barrier(CLK_GLOBAL_MEM_FENCE);
}

res = 0;
#pragma unroll
for (int i = 0; i < NUM_READS; ++i) {
    res ^= lmem[li - i];
}
out[gi] = res;

Figure 1-20: Area Report of an OpenCL Application that Implements a 2D Local Memory System with Suboptimal Memory Banks

#define NUM_READS 4
#define NUM_BARRIERS 1
#define ALLOW_SPLIT 0

__attribute__((reqd_work_group_size(1024,1,1)))
kern void big_lmem( global int* restrict in, global int* restrict out ) {
    // Banking only works on lowest dimension
    #if ALLOW_SPLIT
        local int lmem[1024][NUM_READS];
    #else
        local int lmem[NUM_READS][1024];
    #endif

    int gi = get_global_id(0);
    int gs = get_global_size(0);
    int li = get_local_id(0);
    int ls = get_local_size(0);

    int res = in[gi];
#pragma unroll
    for (int i = 0; i < NUM_READS; ++i) {
        #if ALLOW_SPLIT
            lmem[(li * i) % ls][i] = res;
        #else
            lmem[i][(li * i) % ls] = res;
        }
```c
#define NUM_READS 4
#define NUM_BARRIERS 1
#define ALLOW_SPLIT 1

__attribute__((reqd_work_group_size(1024,1,1)))
kernel void big_lmem( global int* restrict in, global int* restrict out ) {

    // Banking only works on lowest dimension
    #if ALLOW_SPLIT
        local int lmem[1024][NUM_READS];
    #else
        local int lmem[NUM_READS][1024];
    #endif

    res = 0;
    for (int i = 0; i < NUM_READS; ++i) {
        res ^= lmem[i][((ls - li) * i) % ls];
    }

    out[gi] = res;
```

---

### Area Report Messages for Local Memory

<table>
<thead>
<tr>
<th>Area Report (area utilization values are estimated)</th>
<th>LEs</th>
<th>FFs</th>
<th>RAMs</th>
<th>DSPs</th>
<th>Details</th>
</tr>
</thead>
<tbody>
<tr>
<td>System Total (Logic: 14%)</td>
<td>48538 (9%)</td>
<td>62768 (8%)</td>
<td>384 (15%)</td>
<td>0 (0%)</td>
<td>Platform interface logic.</td>
</tr>
<tr>
<td>Board interface</td>
<td>38262</td>
<td>44628</td>
<td>257</td>
<td>0</td>
<td>Global interconnect for 1 global load and 1 global store.</td>
</tr>
<tr>
<td>Global interconnect</td>
<td>5034</td>
<td>9598</td>
<td>52</td>
<td>0</td>
<td>Kernel dispatch logic.</td>
</tr>
<tr>
<td>[] big_lmem (Logic: 2%)</td>
<td>5242 (1%)</td>
<td>8672 (1%)</td>
<td>75 (3%)</td>
<td>0 (0%)</td>
<td>Local memory: Potentially inefficient configuration. Requested size 1024 bytes (rounded up to nearest power of 2). Implemented size 512 bytes, replicated 2 times total, stallable, 4 reads and 4 writes. Additional information: - Reduce the number of writes accesses or for banking to make this memory system stall-free. Banking may be improved by using compile-time known indexing on lowest array dimension. - Replicated 2 times to efficiently support multiple simultaneous workgroups. This replication resulted in 2 times increase in actual block RAM usage. Reducing the number of banks or increasing max_work_group_size may help reduce this replication factor. - Banked on lowest dimension into 2 separate banks (this is a good thing).</td>
</tr>
<tr>
<td>Function overhead</td>
<td>1600</td>
<td>1800</td>
<td>0</td>
<td>0</td>
<td></td>
</tr>
<tr>
<td>not_banked_2d:12 inmem</td>
<td>66</td>
<td>512</td>
<td>16</td>
<td>0</td>
<td></td>
</tr>
<tr>
<td>[] Block0 (Logic: 1%)</td>
<td>3548 (1%)</td>
<td>6360 (1%)</td>
<td>59 (2%)</td>
<td>0 (0%)</td>
<td></td>
</tr>
</tbody>
</table>

---

Figure 1-21: Area Report of an OpenCL Application that Implements a 2D Local Memory System with Optimal Memory Banks
12    local int lmem[NUM_READS][1024];
13    #endif
14
15    int gi = get_global_id(0);
16    int gs = get_global_size(0);
17    int li = get_local_id(0);
18    int ls = get_local_size(0);
19
20    int res = in[gi];
21    #pragma unroll
22    for (int i = 0; i < NUM_READS; ++i) {
23        #if ALLOW_SPLIT
24            lmem[(li * i) % ls][i] = res;
25        #else
26            lmem[i][(li * i) % ls] = res;
27        #endif
28        res >>= 1;
29    }
30
31    // successive barriers are not optimized away
32    #pragma unroll
33    for (int i = 0; i < NUM_BARRIERS; ++i) {
34        barrier(CLK_GLOBAL_MEM_FENCE);
35    }
36
37    res = 0;
38    #pragma unroll
39    for (int i = 0; i < NUM_READS; ++i) {
40        #if ALLOW_SPLIT
41            res ^= lmem[((ls - li) * i) % ls][i];
42        #else
43            res ^= lmem[i][((ls - li) * i) % ls];
44        #endif
45    }
46
47    out[gi] = res;
Figure 1-22: Area Report of an OpenCL Application with Array Privatization within Loops

```c
1 kernel void array_priv( global int* restrict A,
2                         global int* restrict B,
3                         uint N ) {
4     for(unsigned i = 1; i < N; i++) {
5         int T[20];
6         for (unsigned j = 0; j < N; ++j) {
7             T[j] = A[B[i * N + j]];
8         }
9         for (unsigned j = 0; j < N; ++j) {
10            B[A[i * N + j]] = T[j];
11        }
12    }
13 }
```

### Area Report

<table>
<thead>
<tr>
<th>Component</th>
<th>LEs</th>
<th>FFs</th>
<th>RAMs</th>
<th>DSPs</th>
<th>Details</th>
</tr>
</thead>
<tbody>
<tr>
<td>System Total (Logic: 18%)</td>
<td>59131</td>
<td>77288</td>
<td>468</td>
<td>2</td>
<td>- Platform interface logic.</td>
</tr>
<tr>
<td>Board interface</td>
<td>30206</td>
<td>44028</td>
<td>257</td>
<td>0</td>
<td>- Global interconnect for 3 global loads and 1 global store. Reduce number of global loads and stores to simplify global interconnect.</td>
</tr>
<tr>
<td>Global interconnect</td>
<td>12524</td>
<td>15502</td>
<td>104</td>
<td>0</td>
<td>- Global interconnect for 3 global loads and 1 global store. Reduce number of global loads and stores to simplify global interconnect.</td>
</tr>
<tr>
<td>arrayPriv (Logic: 3%)</td>
<td>8345</td>
<td>17238</td>
<td>107</td>
<td>2</td>
<td>- Kernel dispatch logic.</td>
</tr>
<tr>
<td>Function overhead</td>
<td>1570</td>
<td>1685</td>
<td>0</td>
<td>0</td>
<td>- Implemented using registers of the following size:</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>- 1 register of width 32 and depth 2 (depth was increased by a factor of 2 due to a loop initiation interval of 2)</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>- Reducing the scope of the variable may reduce its depth (e.g. moving declaration inside a loop or using it as soon as possible).</td>
</tr>
<tr>
<td>Private Variable: T (arrayPriv.0)</td>
<td>9</td>
<td>101</td>
<td>0</td>
<td>0</td>
<td>- Implemented using registers of the following size:</td>
</tr>
<tr>
<td>Private Variable: Y (arrayPriv.0)</td>
<td>8</td>
<td>69</td>
<td>0</td>
<td>0</td>
<td>- Implemented using registers of the following size:</td>
</tr>
<tr>
<td>arrayPriv.0 (T)</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>- Private memory implemented in on-chip block RAM.</td>
</tr>
<tr>
<td>Block1 (Logic: 0%)</td>
<td>275</td>
<td>682</td>
<td>2</td>
<td>0</td>
<td>- Private memory: Good but replicated.</td>
</tr>
<tr>
<td>Block2 (Logic: 2%)</td>
<td>4571</td>
<td>8399</td>
<td>63</td>
<td>2</td>
<td>- Requested size 128 bytes (rounded up to nearest power of 2).</td>
</tr>
<tr>
<td>Block3 (Logic: 1%)</td>
<td>1455</td>
<td>6511</td>
<td>39</td>
<td>0</td>
<td>- Implemented size 384 bytes, replicated 3 times total, stall-free, 1 read and 1 write. Additional information:</td>
</tr>
<tr>
<td>Block4 (Logic: 0%)</td>
<td>445</td>
<td>842</td>
<td>2</td>
<td>0</td>
<td>- Replicated 3 times to create private copies for concurrent execution of 3 threads in the loop containing accesses to the array.</td>
</tr>
</tbody>
</table>

### Area Report Messages for Private Variable Storage

The area report provides information on the implementation of private memory based on your OpenCL design. For single work-item kernels, the Altera Offline Compiler implements private memory differently, depending on the types of variable. The AOC implements scalars and small arrays in registers of various configurations (for example, plain registers, shift registers, and barrel shifter). The AOC implements larger arrays in block RAM.
<table>
<thead>
<tr>
<th>Message</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Implementation of Private Memory Using On-Chip Block RAM</strong></td>
<td>Private memory implemented in on-chip block RAM. The block RAM implementation creates a system that is similar to local memory for NDRange kernels.</td>
</tr>
<tr>
<td><strong>Implementation of Private Memory Using On-Chip Block ROM</strong></td>
<td>For each usage of an on-chip block ROM, the AOC creates another instance of the same ROM. There is no explicit annotation for private variables that the AOC implements in on-chip block ROM.</td>
</tr>
<tr>
<td><strong>Implementation of Private Memory Using Registers</strong></td>
<td>Implemented using registers of the following size: - (&lt;X&gt;) registers of width (&lt;Y&gt;) and depth (&lt;Z&gt;) [{depth was increased by a factor of (&lt;N&gt;) due to a loop initiation interval of (&lt;M&gt;)}] - ... Reports that the AOC implements a private variable in many registers. The AOC might implement a private variable in their specific widths and depths.</td>
</tr>
<tr>
<td><strong>Implementation of Private Memory Using Shift Registers</strong></td>
<td>Implemented as a shift register with (&lt;N&gt;) or fewer tap points. This is a very efficient storage type. Implemented using registers of the following sizes: - (&lt;X&gt;) register(s) of width (&lt;Y&gt;) and depth (&lt;Z&gt;) - ... Reports that the AOC implements a private variable in shift registers. This message provides a list of shift registers with their specific widths and depths. The AOC might break a single array into several smaller shift registers depending on its tap points. <strong>Note:</strong> The AOC might overestimate the number of tap points.</td>
</tr>
<tr>
<td><strong>Implementation of Private Memory Using Barrel Shifters with Registers</strong></td>
<td></td>
</tr>
</tbody>
</table>
## Area Report Messages for Private Variable Storage

<table>
<thead>
<tr>
<th>Message</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Implemented as a barrel shifter with registers due to dynamic indexing. This is a high overhead storage type. If possible, change to compile-time known indexing. The area cost of accessing this variable is shown on the lines where the accesses occur.</td>
<td>Reports that the AOC implements a private variable in a barrel shifter with registers because of dynamic indexing. This row in the report does not specify the full area usage of the private variable. The report shows additional area usage information on the lines where the variable is accessed.</td>
</tr>
<tr>
<td>Implemented using registers of the following size:</td>
<td></td>
</tr>
<tr>
<td>- &lt;X&gt; registers of width &lt;Y&gt; and depth &lt;Z&gt; [(depth was increased by a factor of &lt;N&gt; due to a loop initiation interval of &lt;M&gt;.)]</td>
<td></td>
</tr>
<tr>
<td>- ...</td>
<td></td>
</tr>
</tbody>
</table>

**Note:**
- The area report annotates memory information on the line of code that declares or uses private memory, depending on its implementation.
- When the AOC implements private memory in on-chip block RAM, the area report displays relevant local-memory-specific messages to private memory systems.

**Figure 1-23: Area Report of an OpenCL Application that Implements Private Variables in On-Chip Block RAM**

```c
#define N 128
define N 128

kernel void promote( global float * restrict output, const int num_times ) {

    private int A[N];

    for (int k = 0; k < num_times; ++k) {
        for (int i = 0; i < N / 2; ++i) {
            A[i + k] = A[k + num_times];
        }
    }

    for (int i = 0; i < N; ++i) {
        output[k + i] = A[i + num_times];
    }
}
```

1-38
Figure 1-24: Area Report of an OpenCL Application that Implements Private Variables in On-Chip Block ROM

1 constant int tbl[] = {0,1,2,3,4,5,6,7};
2
3 kernel void t(global int * dst) {
4     int i = get_global_id(0);
5     int res = tbl[i];
6     res += tbl[i + 2];
7     dst[i] = res;
8 }
Figure 1-25: Area Report of an OpenCL Application that Implements Private Variables Using Shift Registers

```
1 #define FF_SIZE (16*1024)
2 #define FF_SIZE_DECL (2*FF_SIZE)
3
4 kernel void t( global int * restrict src, global int * restrict dst, int N ) {
5     int sr[FF_SIZE_DECL];
6
7 #pragma unroll
8     for (int k = 0; k < FF_SIZE; ++k) {
9         sr[k] = N;
10     }
11
12     for (int i = 0; i < N; ++i) {
14 #pragma unroll
15         for (int k = 0; k < FF_SIZE - 1; ++k) {
16             sr[k] = sr[k + 1];
17         }
18         sr[FF_SIZE - 1] = src[i];
19     }
20 }
```
When there are multiple private variables in your OpenCL application, the AOC might coalesce the variables to form a shift register, as shown in the example below.

Figure 1-26: Area Report of an OpenCL Application with a Shift Register Containing Multiple Private Variables

```c
1 kernel void t( global int * restrict src, global int * restrict dst, int N ) {
  2     int a = N, b = N, c = N, d = N, e = N;
  3     for (int i = 0; i < N; ++i) {
  4         dst[i] = a;
  5         a = b;
  6         b = c;
  7         c = d;
  8         d = e;
  9         e = src[i];
 10     }
 11 }
```
Figure 1-27: Area Report of an OpenCL Application that Implements Private Variables Using Barrel Shifter with Registers

```
1 kernel void barrel_shifter( int n,
2     global int * restrict in,
3     global int * restrict out ) {
4     int x[3];
5     for (int i = 0; i < n; ++i) {
6         x[0] += in[0];
7         x[1] += in[1];
8         x[2] += in[2];
9         x[i % 3]++;
10     }
11 }
12 out[0] = x[0];
13 out[1] = x[1];
14 out[2] = x[2];
```
Area Report Message for Channels

The area report specifies the size of the channels that Altera Offline Compiler implements based on your OpenCL design.

Table 1-10: Additional Information on Area Report Message

<table>
<thead>
<tr>
<th>Message</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>[Channel array with (&lt;N&gt;) elements.] Channel is implemented (&lt;X&gt;) bits wide by (&lt;Y&gt;) deep. [Requested depth was (&lt;Z&gt;). Channel depth was changed for the following reasons: - instruction scheduling requirements - nature of underlying FIFO implementation]</td>
<td>Reports the channel area. The report can include one or more of the optional messages. If the AOC does not implement channels with the requested channel depth, the area report states the reasons behind the discrepancy.</td>
</tr>
</tbody>
</table>

Figure 1-28: Area Report of an OpenCL Application that Implements Channels with Zero Depth

1 #pragma OPENCL EXTENSION cl_altera_channels : enable
2 #define N 5
3
4 channel long16 ch[N] __attribute__((depth(0)));
5
6 kernel void producer( global long16 *src ) {
7     int i = get_global_id(0);
Area Report Message for Channels

1 #pragma unroll
2 for (int j = 0; j < N; ++j) {
3     write_channel_altera(ch[j], src[i + j]);
4 }
5
6 kernel void producer( global long16 *src ) {
7     int i = get_global_id(0);
8     write_channel_altera(ch, src[i]);
9 }
10
11 kernel void consumer ( global long16 *dst ) {
12     int i = get_global_id(0);
13     long16 val = read_channel_altera(ch);
14     dst[i] = val;
15 }
16
17

Figure 1-29: Area Report of an OpenCL Application that Specifies 10-Data-Value-Deep Channels

1 #pragma OPENCL EXTENSION cl_altera_channels : enable
2
3 channel long16 ch __attribute__((depth(10)));
4
5 kernel void producer( global long16 *src ) {
6     int i = get_global_id(0);
7     write_channel_altera(ch, src[i]);
8 }
9
10 kernel void consumer ( global long16 *dst ) {
11     int i = get_global_id(0);
12     long16 val = read_channel_altera(ch);
13     dst[i] = val;
14 }
Profile Your Kernel to Identify Performance Bottlenecks

The Altera SDK for OpenCL Profiler generates data that helps you assess OpenCL kernel performance. The Profiler instruments the kernel pipeline with performance counters. These counters collect kernel performance data, which you can review via the profiler GUI.

Consider the following OpenCL kernel program:

```c
__kernel void add (__global int * a,
        __global int * b,
        __global int * c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid]+b[gid];
}
```

As shown in the figure below, the profiler instruments and connects performance counters in a daisy chain throughout the pipeline generated for the kernel program. The host then reads the data collected by these counters. For example, in PCI Express® (PCIe®)-based systems, the host reads the data via the PCIe control register access (CRA) or control and status register (CSR) port.
Work-item execution stalls might occur at various stages of an AOCL pipeline. Applications with large amounts of memory accesses or load and store operations might stall frequently to enable the completion of memory transfers. The Profiler helps identify the load and store operations that cause the majority of stalls within a kernel pipeline.

For usage information on the AOCL Profiler, refer to the Profiling Your OpenCL Kernel section of the Altera SDK for OpenCL Programming Guide.

**Related Information**

Profiling Your OpenCL Kernel

**Altera SDK for OpenCL Profiler GUI**

The Altera SDK for OpenCL Profiler GUI displays statistical information collected from memory and channel or pipe accesses.

**Table 1-11: Summary Heading in the AOCL Profiler GUI**

<table>
<thead>
<tr>
<th>Heading</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Board</td>
<td>Name of the accelerator board that the Altera Offline Compiler uses during kernel emulation and execution.</td>
</tr>
<tr>
<td>Global Memory BW (DDR)</td>
<td>Maximum theoretical global memory bandwidth available for each memory type (for example, DDR).</td>
</tr>
</tbody>
</table>

Directly below the summary heading, you can view detailed profile information by clicking on the available tabs.

**Important:** In the following sections, information that relates to the AOCL channels also applies to OpenCL pipes.
**Source Code Tab** on page 1-47

The **Source Code** tab in the Altera SDK for OpenCL Profiler GUI contains source code information and detailed statistics about memory and channel accesses.

**Kernel Execution Tab** on page 1-51

The **Kernel Execution** tab in the Altera SDK for OpenCL Profiler GUI provides a graphical representation of the overall kernel program execution process.

**Source Code Tab**

The **Source Code** tab in the Altera SDK for OpenCL Profiler GUI contains source code information and detailed statistics about memory and channel accesses.

**Figure 1-31: The Source Code tab in the AOCL Profiler GUI**

![Image of the Source Code tab]

The **Source Code** tab provides detailed information on specific lines of kernel code.

**Table 1-12: Types of Information Available in the Source Code Tab**

<table>
<thead>
<tr>
<th>Column</th>
<th>Description</th>
<th>Access Type</th>
</tr>
</thead>
<tbody>
<tr>
<td>Attributes</td>
<td>Memory or channel attributes information such as memory type (local or global), corresponding memory system (DDR or quad data rate (QDR)), and read or write access.</td>
<td>All memory and channel accesses</td>
</tr>
<tr>
<td>Stall%</td>
<td>Percentage of time the memory or channel access is causing pipeline stalls.</td>
<td>All memory and channel accesses</td>
</tr>
<tr>
<td>Occupancy%</td>
<td>Percentage of the overall profiled time frame when a valid work-item executes the memory or channel instruction.</td>
<td>All memory and channel accesses</td>
</tr>
<tr>
<td>Column</td>
<td>Description</td>
<td>Access Type</td>
</tr>
<tr>
<td>--------------</td>
<td>------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------</td>
<td>-----------------------------</td>
</tr>
<tr>
<td>Bandwidth</td>
<td>Average memory bandwidth that the memory access uses and its overall efficiency. For each global memory access, FPGA resources are assigned to acquire data from the global memory system. However, the amount of data a kernel program uses might be less than the acquired data. The overall efficiency is the percentage of total bytes, acquired from the global memory system, that the kernel program uses.</td>
<td>Global memory accesses</td>
</tr>
</tbody>
</table>

If a line of source code instructs more than one memory or channel operations, the profile statistics appear in a drop-down list box and you may select to view the relevant information.

**Figure 1-32: Source Code Tab: Drop-Down List for Multiple Memory or Channel Operations**

<table>
<thead>
<tr>
<th>Line</th>
<th>Source Code</th>
<th>Attributes</th>
<th>Start%</th>
<th>Occupancy%</th>
<th>Bandwidth</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>#include &quot;ini&quot;</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>2</td>
<td>#include &quot;hostdiskcommon.h&quot;</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>3</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>4</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>5</td>
<td>#define DEPTH0 <em>attribute</em> (depth0())</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>6</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>7</td>
<td>channel TYPE DEPTH0 input_stream0, DEPTH0 input_stream1, DEPTH0</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>8</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>9</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>10</td>
<td>kernel void input_kernel(global TYPE *in) {</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>11</td>
<td>int i = get_global_U0();</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>12</td>
<td>if (i + 1 &gt;= (LOGN - 2)) = LOGN;</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>13</td>
<td>int offset = i * (N/4 - 1);</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>14</td>
<td>write_channel_input(input_stream0, srclarget base + offset);</td>
<td>0.312%</td>
<td>100%</td>
<td>100%</td>
<td>0: 1836.0MB/s, 0: 100.0% Efficiency</td>
</tr>
<tr>
<td>15</td>
<td>write_channel_input(input_stream1, srclarget base + N/2 * offset);</td>
<td>0.625%</td>
<td>100%</td>
<td>100%</td>
<td>0: 1836.0MB/s, 0: 100.0% Efficiency</td>
</tr>
<tr>
<td>16</td>
<td>write_channel_input(input_stream2, srclarget base + N/4 * offset);</td>
<td>0.938%</td>
<td>100%</td>
<td>100%</td>
<td>0: 1836.0MB/s, 0: 100.0% Efficiency</td>
</tr>
<tr>
<td>17</td>
<td>write_channel_input(input_stream3, srclarget base + 3 * N/4 * offset);</td>
<td>1.25%</td>
<td>100%</td>
<td>100%</td>
<td>0: 1836.0MB/s, 0: 100.0% Efficiency</td>
</tr>
<tr>
<td>18</td>
<td>}</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>19</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>20</td>
<td>kernel void output_kernel(global TYPE *out) {</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

**Tool Tip Options**

To obtain additional information about the kernel source code, hover your mouse over channel accesses in the code to activate the tool tip.
Attention: If your kernel undergoes memory optimization that consolidates hardware resources that implement multiple memory operations, statistical data might not be available for each memory operation. One set of statistical data will map to the point of consolidation in hardware.
<table>
<thead>
<tr>
<th>Column</th>
<th>Tool Tip</th>
<th>Description</th>
<th>Example Message</th>
<th>Access Type</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Cache Hits</td>
<td>The number of memory accesses using the cache.</td>
<td>Cache Hit%=30%</td>
<td>Global memory</td>
</tr>
<tr>
<td></td>
<td></td>
<td>A high cache hit rate reduces memory bandwidth utilization.</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Attributes</td>
<td>Unaligned Access</td>
<td>The percentage of unaligned memory accesses.</td>
<td>Unaligned Access % = 20%</td>
<td>Global memory</td>
</tr>
<tr>
<td></td>
<td></td>
<td>A high unaligned access percentage signifies inefficient memory accesses. Consider modifying the access patterns in the kernel code to improve efficiency.</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>Statically Coalesced</td>
<td>Indication of whether the load or store memory operation is statically coalesced. Generally, static memory coalescing merges multiple memory accesses that access consecutive memory addresses into a single wide access.</td>
<td>Coalesced</td>
<td>Global or local memory</td>
</tr>
<tr>
<td>Occupancy%</td>
<td>Activity</td>
<td>The percentage of time a predicated channel or memory instruction is enabled (that is, when conditional execution is true).</td>
<td>Activity = 20%</td>
<td>Global or local memory, and channels</td>
</tr>
<tr>
<td></td>
<td></td>
<td><strong>Note:</strong> The activity percentage might be less than the occupancy of the instruction.</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
Kernel Execution Tab

The **Kernel Execution** tab in the Altera SDK for OpenCL Profiler GUI provides a graphical representation of the overall kernel program execution process. It illustrates the execution time of each kernel and provides insight into the interactions between different kernel executions.

For example, if you run the host application from a networked directory with slow network disk accesses, the GUI can display the resulting delays between kernel launches while the runtime stores profile output data to disk.

**Attention:** To avoid potential delays between kernel executions and increases in the overall execution time of the host application, run your host application from a local disk.

**Figure 1-34: The Kernel Execution Tab in the Profiler GUI**

![Kernel Execution Tab](image)

The horizontal bar graph represents kernel execution through time. The combination of the two bars shown in the first entry (fft1d) represents the total time. The second and last entries show kernel executions that occupy the time span. These bars represent the concurrent execution of output_kernel and input_kernel, and indicate that the kernels share common resources such as memory bandwidth.

**Tip:** You can examine profile data for specific execution times. In the example above, when you double-click the bar on the left for fft1d, another window opens to display profile data for that specific kernel execution event.
The **Kernel Execution** tab also displays information on memory transfers between the host and your devices, shown below:

**Figure 1-35: Kernel Execution Tab: Host-Device Memory Transfer Information**

---

**Attention:** Adjusting the magnification by zooming in or out might cause subtle changes to the granularity of the time scale.

To enable the display of memory transfer information, set the environment variable `ACL_PROFILE_TIMER` to a value of 1 and then run your host application. Setting the `ACL_PROFILE_TIMER` environment variable enables the recording of memory transfers. The information is stored in the `profile.mon` file and is then parsed by the Profiler GUI.

### Interpreting the Profiling Information

The profiling information helps you identify poor memory or channel behaviors that lead to unsatisfactory kernel performance.

Below are examples of how you can interpret the statistics that the Profiler reports.

**Important:** In the following sections, information that relates to the Altera SDK for OpenCL channels also applies to OpenCL pipes.

- **Occupancy** on page 1-52
  Occanpncy measures the percentage of time a new work-item enters a memory instruction.

- **High Stall Percentage** on page 1-54
  A high stall percentage implies that the memory or channel instruction is unable to fulfill the access request because of contention for memory bandwidth or channel buffer space.

- **Low Bandwidth Efficiency** on page 1-54
  Low bandwidth efficiency occurs when excessive amount of bandwidth is necessary to obtain useful data.

- **Stalling Channels** on page 1-54
  If an I/O channel stalls, it implies that the I/O channel cannot keep up with the kernel.

**Occupancy**

Occupancy measures the percentage of time a new work-item enters a memory instruction.

The Altera Offline Compiler generates a pipeline architecture where work-items traverse through the pipeline stages sequentially (that is, in a pipeline-parallel manner). As soon as a pipeline stage becomes
empty, a work-item enters and occupies the stage. Pipeline parallelism also applies to iterations of pipelined loops, where iterations enter a pipelined loop sequentially.

- The profiler reports a high occupancy percentage if the AOC generates a highly efficient pipeline from your kernel, where work-items or iterations are moving through the pipeline stages without stalling.

- Occupancy percentage decreases with increasing amount of stalls in the pipeline.
  - If work-items cannot enter the pipeline consecutively, they insert bubbles into the pipeline.
  - In loop pipelining, loop-carried dependencies also form bubbles in the pipeline because of bubbles that exist between iterations.

Related Information
Source Code Tab on page 1-47

Low Occupancy Percentage
A low occupancy percentage implies that a work-item is accessing the load and store operations or the channel infrequently. This behavior is expected for load and store operations or channels that are in non-critical loops. However, if the memory or channel instruction is in critical portions of the kernel code and the occupancy or activity percentage is low, it implies that a performance bottleneck exists because work-items or loop iterations are not being issued in the hardware.

Consider the following code example:

```c
__kernel void proc (__global int * a, ...)
{
  for (int i=0; i < N; i++)
  {
    for (int j=0; j < 1000; j++)
    {
      write_channel_altera (c0, data0);
    }
    for (int k=0; k < 3; k++)
    {
      write_channel_altera (c1, data1);
    }
  }
}
```

Assuming all the loops are pipelined, the first inner loop with a trip count of 1000 is the critical loop. The second inner loop with a trip count of three will be executed infrequently. As a result, you can expect that the occupancy and activity percentages for channel `c0` are high and for channel `c1` are low.

Also, occupancy percentage might be low if you define a small work-group size, the kernel might not receive sufficient work-items. This is problematic because the pipeline is empty generally for the duration of kernel execution, which leads to poor performance.

Occupancy versus Activity
Activity measures the percentage of time that a predicated instruction is enabled. The primary difference between occupancy and activity relates to predication.

A work-item or loop iteration can occupy a memory instruction even if it is predicated. If the branch statements do not contain loops, the Altera Offline Compiler converts the branches to minimize control flow, which leads to more efficient hardware. As part of the conversion, memory and channel instructions must be predicated and the output results much be selected through multiplexer logic.
Activity percentages available in the tool tips do not account for predicated accesses. Therefore, you can identify predicated instructions based on low activity percentages. Despite having low activity percentages, these instructions might have high occupancies.

**Related Information**

*ToolTip Options* on page 1-48

### High Stall Percentage

A high stall percentage implies that the memory or channel instruction is unable to fulfill the access request because of contention for memory bandwidth or channel buffer space.

Memory instructions stall often whenever bandwidth usage is inefficient or if a large amount of data transfer is necessary during the execution of your application. Inefficient memory accesses lead to suboptimal bandwidth utilization. In such cases, analyze your kernel memory accesses for possible improvements.

Channel instructions stall whenever there is a strong imbalance between read and write accesses to the channel. Imbalances might be caused by channel reads or writes operating at different rates.

For example, if you find that the stall percentage of a write channel call is high, check to see if the occupancy and activity of the read channel call are low. If they are, the performing speed of the kernel controlling the read channel call is too slow for the kernel controlling the write channel call, leading to a performance bottleneck.

**Related Information**

- *Transfer Data Via AOCL Channels or OpenCL Pipes* on page 1-8
- *Source Code Tab* on page 1-47

### Low Bandwidth Efficiency

Low bandwidth efficiency occurs when excessive amount of bandwidth is necessary to obtain useful data. Excessive bandwidth usage generally occurs when memory accesses are poor (for example, random accesses), leading to unsatisfactory coalescing opportunities.

Review your memory accesses to see if you can rewrite them such that accesses to memory sites address consecutive memory regions.

**Related Information**

- *Strategies for Improving Memory Access Efficiency* on page 1-92
- *Source Code Tab* on page 1-47

### Stalling Channels

Channels provide a point-to-point communication link between either two kernels, or between a kernel and an I/O channel. If an I/O channel stalls, it implies that the I/O channel cannot keep up with the kernel.

For example, if a kernel has a read channel call to an Ethernet I/O and the Profiler identifies a stall, it implies that the write channel is not writing data to the Ethernet I/O at the same rate as the read rate of the kernel.

For kernel-to-kernel channels, stalls occur if there is an imbalance between the read and write sides of the channel, or if the read and write kernels are not running concurrently.
For example, if the kernel that reads is not launched concurrently with the kernel that writes, or if the read operations occur much slower than the write operations, the Profiler identifies a stall for the `write_channel_altera` call in the write kernel.

**Related Information**

Transfer Data Via AOCL Channels or OpenCL Pipes on page 1-8

**AOCL Profiler Limitations**

The Altera SDK for OpenCL Profiler has some limitations.

- The Profiler can only extract one set of profile data from a kernel while it is running.
  
  If the Profiler collects the profile data after kernel execution completes, you can call the host API to generate the `profile.mon` file multiple times.
  
  For more information on how to collect profile data during kernel execution, refer to the Collecting Profile Data During Kernel Execution section of the Altera SDK for OpenCL Programming Guide.

- Profile data is not persistent across OpenCL programs or multiple devices.
  
  You can request profile data from a single OpenCL program and on a single device only. If your host swaps a new kernel program in and out of the FPGA, the Profiler will not save the profile data.

- Instrumenting the Verilog code with performance counters increases hardware resource utilization (that is, FPGA area usage) and typically decreases performance.
  
  For information on instrumenting the Verilog code with performance counters, refer to the Instrumenting the Kernel Pipeline with Performance Counters section of the Altera SDK for OpenCL Programming Guide.

**Related Information**

- Collecting Profile Data During Kernel Execution
- Instrumenting the Kernel Pipeline with Performance Counters (--profile)

**Strategies for Improving Single Work-Item Kernel Performance**

**Optimization Report** on page 1-56

When you compile your OpenCL kernels with loops, the Altera Offline Compiler generates an optimization report that lists select kernel properties, loop unroll status, and problem areas such as loop-carried dependencies that might degrade performance.

**Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback** on page 1-66

In many cases, designing your OpenCL application as a single work-item kernel is sufficient to maximize performance without performing additional optimization steps.

**Removing Loop-Carried Dependencies Caused by Accesses to Memory Arrays** on page 1-78

Include the `ivdep` pragma in your single work-item kernel to assert that accesses to memory arrays will not cause loop-carried dependencies.
Good Design Practices for Single Work-Item Kernel on page 1-80

If your OpenCL kernels contain loop structures, follow the Altera-recommended guidelines to construct the kernels in a way that allows the Altera Offline Compiler to analyze them effectively.

Optimization Report

When you compile your OpenCL kernels with loops, the Altera Offline Compiler generates an optimization report that lists select kernel properties, loop unroll status, and problem areas such as loop-carried dependencies that might degrade performance.

By default, the optimization report does not contain source mapping information such as variable names, line numbers, and loop unroll status. To include the source mapping information in the optimization report, compile your kernel with the `-g` AOC command option. If you do not compile your kernel with the `-g` option, the optimization report displays the following warning message:

```
===================================================================================
*** Optimization Report ***
Warning: Compile with `-g` to get line number, variable name, and additional loop information
===================================================================================
```

Altera recommends that you always compile your OpenCL kernels with the `-g` option. Compiling your kernel with the `-g` option has no significant effect on compilation time and memory usage.
The optimization report provides the following information:

- Describes whether the AOC compiles the kernel as a single work-item kernel or an NDRange kernel.
- Explains how the AOC handles loops differently for single work-item kernels and NDRange kernels.

For example:

```
*** Optimization Report ***

Kernels that do not use any work-item built-in functions, such as get_global_id(), are compiled for single work-item execution (a task). Otherwise, a kernel is compiled as an ND-Range.

For tasks, the compiler will attempt to pipeline every loop in the kernel to allow multiple iterations of the loop to execute concurrently. If some loops are not pipelined, or not pipelined well, you may not get good performance.

For ND-Range kernels, the loops are not pipelined. Instead, they are built to accept multiple work-items simultaneously. Kernel throughput is usually reduced by the largest total number of iterations of nested loops. A large number of threads is usually required to efficiently utilize ND-Range kernels.
```

- For each kernel, lists the kernel properties.
  - Whether the kernel is a single work-item kernel or an NDRange kernel.
  - Any other properties or optimizations that the AOC performed on the kernel.

For example:

```
Kernel: nd

The kernel is compiled as an ND-Range.

The kernel does not use any work-group information (such as get_local_id() or get_group_id()). Local work-group size will be automatically modified to match global work-group size on launch. This is a hardware optimization.
```
• Provides a Loop Report with information about all the loops in the source code, including loop nesting relationship and loop unroll status.

• Loop nesting relationship is shown via indentation and ASCII art.

• Loop unroll status indicates the achieved unroll factor and the reason for loop unrolling.

A loop might be completely eliminated because its computation result is not used, it has exactly one iteration, or the computation does not require a loop. Note that the optimization report might not include information about the eliminated loop.

For example:

Loop Report:

+ Fully unrolled loop (file nd_full_nested.cl line 4)
  Loop was automatically and fully unrolled.
  Add "#pragma unroll 1" to prevent automatic unrolling.

  + Fully unrolled loop (file nd_full_nested.cl line 6)
    Loop was fully unrolled due to "#pragma unroll" annotation.

+ Loop "Block2" (file nd_full_nested.cl line 14)
  Loop unroll was requested with "#pragma unroll" but failed. Check warning messages for details.

+ Loop "Block4" (file nd_full_nested.cl line 18)
  Loop was partially unrolled 2 times due to "#pragma unroll" annotation.

Note: About loop locations:

• For the AOC to report on the loop unroll status, you must compile your kernel with the -g AOC command option.

• The loop location (that is, the file name and the line number) is the line that contains the loop latch. The loop latch is the condition that determines whether the loop continues or exits. For loops and while loops have latches at the top of the loop body. As a result, the loop location is the same as the location of the for or while statements. On the other hand, a do-while loop has the loop latch at the bottom of the loop body. Therefore, the loop location for a do-while loop is at the bottom of the loop.

• To distinguish between loops in the optimization report, in your kernel code, start and end each loop on a separate line. For multiple nested loops, ensure that the closing bracket (}) for each loop is on its own line.

• Indicates whether the AOC infers pipelined execution successfully.

• For each loop, the report identifies:
  
  • Operations that contribute the largest delay to the computation of the loop-carried dependency.
  
  • The launch frequency of a new loop iteration.
  
  • If a loop iteration launches every clock cycle, the kernel achieves maximum pipeline efficiency and yields the best performance.
  
  • If a loop iteration launches once every few clock cycles, it might degrade performance. The optimization report identifies the corresponding loop-carried dependency.
The optimization report addresses two types of loop-carried dependencies:

- **Data dependencies**
  
  The report identifies the variable name whose computation is dependent on other operations.

- **Memory dependencies**
  
  The report shows a memory dependency as a memory operation that is dependent on other operation(s).

**Important:** There might be other dependencies that do not affect the launch frequency of a new loop iteration but cause serial execution of certain regions of the kernel program. These dependencies also limit performance.

For more information on the \(-g\) option of the \(aoc\) command, refer to the *Adding Source References to Optimization Reports (-g)* section of the *Altera SDK for OpenCL Programming Guide*.

**Optimization Report Messages** on page 1-59

The single work-item kernel optimization report provides detailed information on the effectiveness of pipelined execution. It also identifies the nature of loop-carried dependencies to help you pinpoint the sources of performance bottleneck.

**Related Information**

*Adding Source References to Optimization Reports (-g)*

**Optimization Report Messages**

The single work-item kernel optimization report provides detailed information on the effectiveness of pipelined execution. It also identifies the nature of loop-carried dependencies to help you pinpoint the sources of performance bottleneck.

**Optimization Report Messages for Pipelined Execution Inference** on page 1-60

When the Altera Offline Compiler attempts to infer pipelined execution, it generates a message in the optimization report in the \(<your_kernel_filename>/<your_kernel_filename>.log\) file.

**Optimization Report Message Detailing Initiation Interval** on page 1-62

The launch frequency of a new loop iteration is called the *initiation interval* (II).

**Optimization Report Messages for Loop-Carried Dependencies Affecting Initiation Interval** on page 1-63

The optimization report messages provide details of data and memory dependencies that affect the II of a pipelined loop.

**Optimization Report Messages for Loop-Carried Dependencies Not Affecting the Initiation Interval** on page 1-63

The optimization report generates messages that identify loop-carried dependencies not affecting the II of a pipelined loop.

**Optimization Report Messages for Simplified Analysis of a Complex Design** on page 1-64

For complex designs that have long compilation times, the Altera Offline Compiler performs simplified analyses for determining II bottlenecks.

**Optimization Report Message for Speed-Limiting Constructs** on page 1-65

The Altera Offline Compiler might make a trade-off between circuit speed (Fmax) and II value. When this trade-off happens, the optimization report issues a warning.
Optimization Report Messages for Pipelined Execution Inference

When the Altera Offline Compiler attempts to infer pipelined execution, it generates a message in the optimization report in the <your_kernel_filename>/your_kernel_filename.log file.

Successful Pipelined Execution Inference

If the AOC successfully extracts the loop for pipelined execution, it prints the following message in the optimization report:

Pipelined well. Successive iterations are launched every cycle.

In the case where the AOC fails to infer pipelined execution, it specifies the reason of the failure in the optimization report message.

Unable to Resolve Loop Exit Condition at Iteration Initiation

The AOC normally evaluates the exit condition of loops at the end of the loop. During loop pipelining, the AOC moves these operations to the beginning of the loop where iteration initiation occurs. If the exit condition involves operations such as memory accesses, the AOC cannot move these operations safely.

Consider the following code example:

```c
#define N 128
__kernel void exitcond( __global unsigned* restrict input,
__global unsigned* restrict result )
{
  unsigned i = 0;
  unsigned sum = 0;
  while( input[ i++ ] < N )
  {
    for ( unsigned j = 0; j < N; j++ )
      sum += input[i+j];
  }
  *result = sum;
}
```

The exit condition of the outer loop (that is, `input[ i++ ] < N`) contains a memory instruction. As such, the AOC cannot move the exit condition to the beginning of the outer loop. Therefore, the AOC cannot infer pipelining, and it prints the following message in the optimization report:

```plaintext
===================================================================================
Kernel: exitcond
===================================================================================
The kernel is compiled for single work-item execution.
Loop Report:
+ Loop "for.cond.preheader" (file test.cl line 9)
  NOT pipelined due to:
  Loop exit condition unresolvable at iteration initiation.
  Simplify loop exit condition to fix this problem.
  See "Unable to Resolve Loop Exit Condition at Iteration Initiation" section of
  the Best Practices Guide for more information.
  Not pipelining this loop will most likely lead to poor performance.

-+ Loop "for.body" (file test.cl line 10)
  Pipelined well. Successive iterations are launched every cycle.
```
Loop Structure Does Not Support Linear Execution

In order for the AOC to infer pipelined execution, loops must execute in a linear fashion. Consider the following code example:

```c
__kernel void structure( __global unsigned* restrict output1,
                        __global unsigned* restrict output2,
                        int N )
{
    for ( unsigned i = 0; i < N; i++ )
    {
        if ( (i & 3) == 0 )
        {
            for ( unsigned j = 0; j < N; j++ )
            {
                output1[ i+j ] = i * j;
            }
        }
        else
        {
            for ( unsigned j = 0; j < N; j++ )
            {
                output2[ i+j ] = i * j;
            }
        }
    }
}
```

The outer loop (i) contains two divergent inner loops. Each iteration of the outer loop may execute one inner loop or the other, which is a nonlinear execution. Therefore, the AOC cannot infer pipelining and prints the following message in the optimization report:

```
Kernel: structure
The kernel is compiled for single work-item execution.
Loop Report:
+ Loop "Block1" (file b.cl line 4)
  NOT pipelined due to:
  Loop structure: loop contains divergent inner loops.
  Making all inner loops unconditional should fix this problem.
  Not pipelining this loop will most likely lead to poor performance.

-+ Loop "Block2" (file b.cl line 8)
  Pipelined well. Successive iterations are launched every cycle.

-+ Loop "Block3" (file b.cl line 15)
  Pipelined well. Successive iterations are launched every cycle.
```
Out-of-Order Loop Iterations

A loop-carried dependency occurs when the iterations of an outer loop become out of order with respect to the inner loop. In such cases, the AOC cannot infer pipelining because it might lead to functionally incorrect results. Consider the following code example:

```c
__kernel void order( __global unsigned* restrict input,
                      __global unsigned* restrict output
                      int N )
{
    unsigned sum = 0;
    for ( unsigned i = 0; i < N; i++ )
    {
        for ( unsigned j = 0; j < i; j++ )
        {
            sum += input[ i+j ];
        }
    }
    output[ 0 ] = sum;
}
```

The number of iterations of the inner loop is different for each iteration of the outer loop. For example, for i = 0, j iterates zero times; for i = i, j iterates once, and so on. As such, the AOC cannot infer pipelining, and it prints the following message in the optimization report:

```text
===================================================================================
Kernel: order
===================================================================================
The kernel is compiled for single work-item execution.
Loop Report:
+ Loop "for.cond1.preheader" (file test.cl line 38)
  NOT pipelined due to:
    Loop iteration ordering: iterations may get out of order with respect to the
    listed inner loop, as the number of iterations of the listed inner loop may be
    different for different iterations of this loop.
    Loop "for.body3" (file test.cl line 39)
    To fix this, make sure the listed inner loop has the same number of iterations
    for each iteration of this loop.
    See "Out-of-Order Loop Iterations" section of the Best Practices Guide for
    more information.
    Not pipelining this loop will most likely lead to poor performance.

-+ Loop "for.body3" (file test.cl line 39)
  Pipelined well. Successive iterations are launched every cycle.
```

Optimization Report Message Detailing Initiation Interval

The launch frequency of a new loop iteration is called the *initiation interval* (II).

The following optimization report message provides details of the II of a loop iteration:

```
Pipelined with successive iterations launched every <N> cycles due to:
where <N> is the number of hardware clock cycles for which the pipeline must wait before it can process
the next loop iteration. The ideal value for <N> is 1, which means II equals to 1.
```

**Note:** When II equals 1, the Altera Offline Compiler does not display this optimization report message.
Optimization Report Messages for Loop-Carried Dependencies Affecting Initiation Interval

The optimization report messages provide details of data and memory dependencies that affect the II of a pipelined loop.

Data Dependency

Kernel: test10

The kernel is compiled for single work-item execution.

Loop Report:

+ Loop "for.body" (file float.cl line 5)
  Pipelined with successive iterations launched every 9 cycles due to:

  Data dependency on variable sum  (file float.cl line 6)
  Largest Critical Path Contributor:
    96%: Fadd Operation  (file float.cl line 6)

The message Data dependency on variable <variable_name> identifies a loop-carried dependency where the computation of a variable is dependent on the result from a previous loop iteration.

Memory Dependency

+ Loop "for.body8" (file test.cl line 138)
  Pipelined with successive iterations launched every 7 cycles due to:

  Memory dependency on Load Operation from: (file test.cl line 140)
  Store Operation (file test.cl line 140)
  Largest Critical Path Contributors:
    73%: Load Operation  (file test.cl line 140)
    26%: Store Operation  (file test.cl line 140)

The message Memory dependency on <memory_operation> from:<other_memory_operation> identifies a loop-carried dependency where <memory_operation> cannot occur before the execution of <other_memory_operation> from a previous loop iteration completes.

The message Largest Critical Path Contributor(s): specifies the operations that contribute the largest delay to the computation of the loop-carried dependency.

Optimization Report Messages for Loop-Carried Dependencies Not Affecting the Initiation Interval

The optimization report generates messages that identify loop-carried dependencies not affecting the II of a pipelined loop.

Consider the following optimization report example:

Kernel: test

The kernel is compiled for single work-item execution.

Loop Report:

+ Loop "Block1"
  Pipelined with successive iterations launched every cycle.
Iterations executed serially across the region listed below.
Only a single loop iteration will execute inside the listed region.
This will cause performance degradation unless the region is pipelined well
(can process an iteration every cycle).

Loop "Block2" (file ii_1_serial.cl line 5)
due to:
Memory dependency on Load Operation from: (file ii_1_serial.cl line 6)
Store Operation (file ii_1_serial.cl line 8)

The following message indicates that the computation of the variable or memory operation spans across
the specified loop(s), resulting in the serial execution of the loop iterations over the specified region.

Iterations will be executed serially across the following region:
Loop <loop_name>
<Loop <loop_name_2>>
due to:
<data/memory_dependency_message>

where <Loop <loop_name_2>> is the name of the inner loop, if applicable, and <data/m
memory_dependency_message> is the data or memory dependency report message described in the
Optimization Report Messages for Loop-Carried Dependencies Affecting Initiation Interval section.

There might be memory dependencies in the pipelined loop that might become performance bottlenecks
after you address other performance bottlenecks in your optimizations. In this case, the optimization
report might issue the following message:

Additional memory dependency:
Memory dependency on Load Operation from: (file test.cl line 137)
Store Operation (file test.cl line 137)

Related Information
Optimization Report Messages for Loop-Carried Dependencies Affecting Initiation Interval on page
1-63

Optimization Report Messages for Simplified Analysis of a Complex Design
For complex designs that have long compilation times, the Altera Offline Compiler performs simplified
analyses for determining II bottlenecks.

If the AOC has to analyze a simpler design model to identify II bottlenecks, it prints the following
messages in the optimization report:

Kernel: test
The kernel is compiled for single work-item execution.
Loop Report:
+ Loop "for.body" (file simplecycle.cl line 5)
Pipelined with successive iterations launched every 300 cycles due to:
Memory dependency on Load Operation from: (file simplecycle.cl line 6)
Store Operation (file simplecycle.cl line 6)
Largest Critical Path Contributors:
53%: Load Operation  (file simplecycle.cl line 6)
45%: Store Operation  (file simplecycle.cl line 6)
(Simple model used for II bottleneck estimation due to design complexity.)

===================================================================================

If the AOC cannot find the II bottleneck, it prints the following messages in the optimization report:

+ Loop "Block1" (file test.cl line 12)
  Pipelined with successive iterations launched every 2 cycles due to:
    An undetermined cause.

Optimization Report Message for Speed-Limiting Constructs
The Altera Offline Compiler might make a trade-off between circuit speed (Fmax) and II value. When this trade-off happens, the optimization report issues a warning. Because the optimal II value is 1, the AOC might force a complex loop-carried dependency calculation to complete in one clock cycle. The resulting hardware will operate at a lower speed (Fmax) but still process one loop iteration each clock cycle. The final Fmax after a full compilation to hardware is available in the acl_quartus_report.txt file.

To minimize the impact on Fmax that is caused by a loop-carried dependency calculation, simplify the calculation.

In the following kernel code, the variable res is the loop-carried dependency. For the loop to have an II value of 1, res must increment nine times in one clock cycle.

```c
kernel void nd (global int *dst, int N) {
    int res = N;
    #pragma unroll 9
    for (int i = 0; i < N; i++) {
        res += i;
        res ^= i;
    }
    dst[0] = res;
}
```

The resulting optimization report identifies the critical path that is limiting Fmax.

===================================================================================

Kernel: nd
The kernel is compiled for single work-item execution.

Loop Report:
+ Loop "Block1" (file one_task_9.cl line 10)
  Loop was partially unrolled 9 times due to "#pragma unroll" annotation.
  Pipelined with successive iterations launched every cycle.

  The compiler had to sacrifice circuit frequency (fmax) to achieve initiation interval of one.
  Speed-limiting critical path is shown below. To increase fmax,
  simplify the code on the critical path to do fewer computations.

  Data dependency on variable
  Largest Critical Path Contributors:
  9%: Add Operation  (file one_task_9.cl line 11)
  9%: Add Operation  (file one_task_9.cl line 11)
Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback

In many cases, designing your OpenCL application as a single work-item kernel is sufficient to maximize performance without performing additional optimization steps. To further improve the performance of your single work-item kernel, you can optimize it by addressing dependencies that the optimization report identifies.

The following flowchart outlines the approach you can take to iterate on your design and optimize your single work-item kernel. For usage information on the Altera SDK for OpenCL Emulator and the Profiler, refer to the Emulating and Debugging Your OpenCL Kernel and Profiling Your OpenCL Kernel sections of the Altera SDK for OpenCL Programming Guide, respectively. For information on the Profiler GUI and profiling information, refer to the Profile Your Kernel to Identify Performance Bottlenecks section.

Altera recommends the following optimization options to address single work-item kernel loop-carried dependencies, in order of applicability: removal, relaxation, simplification, and transfer to local memory.
1. **Removing Loop-Carried Dependency** on page 1-68
   Based on the feedback from the optimization report, you can remove a loop-carried dependency by implementing a simpler memory access pattern.

2. **Relaxing Loop-Carried Dependency** on page 1-70
   Based on the feedback from the optimization report, you can relax a loop-carried dependency by increasing the dependence distance.

3. **Simplifying Loop-Carried Dependency** on page 1-72
   In cases where you cannot remove or relax the loop-carried dependency in your kernel, you might be able to simplify the dependency to improve single work-item kernel performance.
4. **Transferring Loop-Carried Dependency to Local Memory** on page 1-75
   For a loop-carried dependency that you cannot remove, improve the II by moving the array with the loop-carried dependency from global memory to local memory.

5. **Removing Loop-Carried Dependency by Inferring Shift Registers** on page 1-76
   To enable the Altera Offline Compiler to handle single work-item kernels that carry out double precision floating-point operations efficiently, remove loop-carried dependencies by inferring a shift register.

**Related Information**
- Emulating and Debugging Your OpenCL Kernel
- Profiling Your OpenCL Kernel
- Profile Your Kernel to Identify Performance Bottlenecks on page 1-45

**Removing Loop-Carried Dependency**
Based on the feedback from the optimization report, you can remove a loop-carried dependency by implementing a simpler memory access pattern.

Consider the following kernel:

```c
#define N 128

__kernel void unoptimized (__global int * restrict A,
                          __global int * restrict B,
                          __global int* restrict result)
{
    int sum = 0;
    for (unsigned i = 0; i < N; i++) {
        for (unsigned j = 0; j < N; j++) {
            sum += A[i*N+j];
        }
        sum += B[i];
    }
    * result = sum;
}
```

The optimization report for kernel `unoptimized` resembles the following:

```
Kernel: unoptimized
The kernel is compiled for single work-item execution.
Loop Report:
+ Loop "Block1" (file k.cl line 9)
  Pipelined with successive iterations launched every 2 cycles due to:
    Pipeline structure: every terminating loop with subloops has iterations launched at least 2 cycles apart.
    Having successive iterations launched every two cycles should still lead to good performance if the inner loop is pipelined well and has sufficiently high number of iterations.
  Iterations executed serially across the region listed below.
  Only a single loop iteration will execute inside the listed region.
  This will cause performance degradation unless the region is pipelined well
```
(can process an iteration every cycle).

Loop "Block2" (file k.cl line 10)
due to:
Data dependency on variable sum (file k.cl line 7)

-- Loop "Block2" (file k.cl line 10)
Pipelined well. Successive iterations are launched every cycle.

- The first row of the report indicates that the Altera Offline Compiler successfully infers pipelined execution for the outer loop, and a new loop iteration will launch every other cycle.
- The message **due to Pipeline structure** indicates that the AOC creates a pipeline structure that causes an outer loop iteration to launch every two cycles. The behavior is not a result of how you structure your kernel code.

**Note:** For recommendations on how to structure your single work-item kernel, refer to the *Good Design Practices for Single Work-Item Kernel* section.

- The remaining messages in the first row of report indicate that the loop executes a single iteration at a time across the subloop because of data dependency on the variable *sum*. This data dependency exists because each outer loop iteration requires the value of *sum* from the previous iteration to return before the inner loop can start executing.
- The second row of the report notifies you that the inner loop executes in a pipelined fashion with no performance-limiting loop-carried dependencies.

To optimize the performance of this kernel, remove the data dependency on variable *sum* so that the outer loop iterations do not execute serially across the subloop. Perform the following tasks to decouple the computations involving *sum* in the two loops:

1. Define a local variable (for example, *sum2*) for use in the inner loop only.
2. Use the local variable from Step 1 to store the cumulative values of *A[i*N + j]* as the inner loop iterates.
3. In the outer loop, store the variable *sum* to store the cumulative values of *B[i]* and the value stored in the local variable.

Below is the restructured kernel **optimized**:

```c
#define N 128

__kernel void optimized (__global int * restrict A,
                         __global int * restrict B,
                         __global int * restrict result)
{
    int sum = 0;

    for (unsigned i = 0; i < N; i++) {
        // Step 1: Definition
        int sum2 = 0;

        // Step 2: Accumulation of array A values for one outer loop iteration
        for (unsigned j = 0; j < N; j++) {
            sum2 += A[i*N+j];
        }

        // Step 3: Addition of array B value for an outer loop iteration
        sum += sum2;
        sum += B[i];
    }
```

An optimization report similar to the one below indicates the successful removal of the loop-carried dependency on the variable `sum`:

```
Kernel: optimized
The kernel is compiled for single work-item execution.
Loop Report:
+ Loop "Block1" (file optimized.cl line 9)
  Pipelined with successive iterations launched every 2 cycles due to:
  Pipeline structure: every terminating loop with subloops has iterations launched at least 2 cycles apart.
  Having successive iterations launched every two cycles should still lead to good performance if the inner loop is pipelined well and has sufficiently high number of iterations.

-> Loop "Block2" (file optimized.cl line 14)
  Pipelined well. Successive iterations are launched every cycle.
```

You have addressed all the loop-carried dependence issues successfully when you see only the following messages in the optimization report:

- Pipelined execution inferred for innermost loops.
- Pipelined execution inferred. Successive iterations launched every 2 cycles due to: Pipeline structure for all other loops.

Related Information

**Good Design Practices for Single Work-Item Kernel** on page 1-80

**Relaxing Loop-Carried Dependency**

Based on the feedback from the optimization report, you can relax a loop-carried dependency by increasing the dependence distance. Increase the dependence distance by increasing the number of loop iterations that occurs between the generation of a loop-carried value and its usage.

Consider the following code example:

```c
#define N 128
__kernel void unoptimized (__global float * restrict A,
                           __global float * restrict result)
{ 
  float mul = 1.0f;
  for (unsigned i = 0; i < N; i++)
    mul *= A[i];
  * result = mul;
}
```
The optimization report above shows that the Altera Offline Compiler infers pipelined execution for the loop successfully. However, the loop-carried dependency on the variable \( \text{mul} \) causes loop iterations to launch every six cycles. In this case, the floating-point multiplication operation on line 9 (that is, \( \text{mul} *= \text{A}[i] \)) contributes the largest delay to the computation of the variable \( \text{mul} \).

To relax the loop-carried data dependency, instead of using a single variable to store the multiplication results, operate on \( M \) copies of the variable and use one copy every \( M \) iterations:

1. Declare multiple copies of the variable \( \text{mul} \) (for example, in an array called \( \text{mul}_\text{copies} \)).
2. Initialize all the copies of \( \text{mul}_\text{copies} \).
3. Use the last copy in the array in the multiplication operation.
4. Perform a shift operation to pass the last value of the array back to the beginning of the shift register.
5. Reduce all the copies to \( \text{mul} \) and write the final value to \( \text{result} \).

Below is the restructured kernel:

```c
#define N 128
#define M 8

__kernel void optimized (__global float * restrict A,
                          __global float * restrict result)
{
    float mul = 1.0f;
    
    // Step 1: Declare multiple copies of variable mul
    float mul_copies[M];
    
    // Step 2: Initialize all copies
    for (unsigned i = 0; i < M; i++)
        mul_copies[i] = 1.0f;

    // Step 3: Perform multiplication on the last copy
    float cur = mul_copies[M-1] * A[i];

    // Step 4a: Shift copies
    #pragma unroll
    for (unsigned j = M-1; j > 0; j--)
        mul_copies[j] = mul_copies[j-1];

    // Step 4b: Insert updated copy at the beginning
    mul_copies[0] = cur;
```

Simplifying Loop-Carried Dependency

In cases where you cannot remove or relax the loop-carried dependency in your kernel, you might be able to simplify the dependency to improve single work-item kernel performance.

Consider the following kernel example:

```c
#define N 128
#define NUM_CH 3

channel uchar CH_DATA_IN[NUM_CH];
channel uchar CH_DATA_OUT;

__kernel void unoptimized()
{
  unsigned storage = 0;
  unsigned num_bytes = 0;
  for (unsigned i = 0; i < N; i++) {
    #pragma unroll
    for (unsigned j = 0; j < NUM_CH; j++) {
      if (num_bytes < NUM_CH) {
        bool valid = false;
        uchar data_in = read_channel_nb_altera(CH_DATA_IN[j], &valid);
        if (valid) {
          storage <<= 8;
          storage |= data_in;
      }
    }
  }
}
```

An optimization report similar to the one below indicates the successful relaxation of the loop-carried dependency on the variable `mul`:

===================================================================================
Kernel: optimized
===================================================================================
The kernel is compiled for single work-item execution.

Loop Report:
+ Fully unrolled loop (file optimized2.cl line 13)
  Loop was automatically and fully unrolled.
  Add "#pragma unroll 1" to prevent automatic unrolling.

+ Loop "Block1" (file optimized2.cl line 16)
  Pipelined well. Successive iterations are launched every cycle.

  + Fully unrolled loop (file optimized2.cl line 22)
    Loop was fully unrolled due to "#pragma unroll" annotation.

+ Fully unrolled loop (file optimized2.cl line 31)
  Loop was fully unrolled due to "#pragma unroll" annotation.
This kernel reads one byte of data from three input channels in a nonblocking fashion. It then writes the data one byte at a time to an output channel. It uses the variable `storage` to store up to 4 bytes of data, and uses the variable `num_bytes` to keep track of how many bytes are stored in `storage`. If `storage` has space available, then the kernel reads a byte of data from one of the channels and stores it in the least significant byte of `storage`.

The optimization report below indicates that there is a loop-carried dependency on the variable `num_bytes`:

```
num_bytes++;
if (num_bytes >= 1) {
    num_bytes -= 1;
    uchar data_out = storage >> (num_bytes*8);
    write_channel_altera(CH_DATA_OUT, data_out);
}
```

The computation path of `num_bytes` is as follows:
1. Comparison on line 16 (`if (num_bytes < NUM_CH)`).
2. Computation of variable `valid` by the nonblocking channel read operation on line 18 (`uchar data_in = read_channel_nb_altera(CH_DATA_IN[j], &valid)`) for the comparison on line 19.
3. Addition on line 22 (`num_bytes++`).
4. Comparison on line 27 (`if (num_bytes >= 1)`).
5. Subtraction on line 28 (`num_bytes -= 1`).

Because of the `unroll` pragma on line 14, the Altera Offline Compiler unrolls the loop, causing the comparisons and additions in the loop body to replicate three times. The optimization report shows that
the comparisons are the most expensive operations on the computation path of `num_bytes`, followed by the additions on line 22.

To simplify the loop-carried dependency on `num_bytes`, consider restructuring the application to perform the following tasks:

1. Ensure that the kernel reads from the channels only if there is enough space available in `storage`, in the event that all channel read operations return data (that is, there is at least 3 bytes of empty space in `storage`).
   Setting this condition simplifies the computation path of the variable `num_bytes` by reducing the number of comparisons.
2. Increase the size of `storage` from 4 bytes to 8 bytes to satisfy the 3-byte space threshold more easily.

Below is the restructured kernel `optimized`:

```c
#define N 128
#define NUM_CH 3

channel uchar CH_DATA_IN[NUM_CH];
channel uchar CH_DATA_OUT;

__kernel void optimized()
{
  // Change storage to 64 bits
  ulong storage = 0;
  unsigned num_bytes = 0;

  for (unsigned i = 0; i < N; i++) {
    // Ensure that we have enough space if we read from ALL channels
    if (num_bytes <= (8-NUM_CH)) {
      #pragma unroll
      for (unsigned j = 0; j < NUM_CH; j++) {
        bool valid = false;
        uchar data_in = read_channel_nb_altera(CH_DATA_IN[j], &valid);
        if (valid) {
          storage <<= 8;
          storage |= data_in;
          num_bytes++;
        }
      }
    }
    else {
      num_bytes -= 1;
      uchar data_out = storage >> (num_bytes*8);
      write_channel_altera(CH_DATA_OUT, data_out);
    }
  }
}
```

An optimization report similar to the one below indicates the successful simplification of the loop-carried dependency on the variable `num_bytes`:

```
==================================================================
Kernel: optimized
==================================================================
The kernel is compiled for single work-item execution.

Loop Report:
+ Loop "Block1" (file optimized3.cl line 13)
```
Pipelined well. Successive iterations are launched every cycle.

-+ Fully unrolled loop (file optimized3.cl line 18)
  Loop was fully unrolled due to "#pragma unroll" annotation.

Transferring Loop-Carried Dependency to Local Memory

For a loop-carried dependency that you cannot remove, improve the II by moving the array with the loop-carried dependency from global memory to local memory.

Consider the following kernel example:

```c
#define N 128

__kernel void unoptimized( __global int* restrict A )
{
    for (unsigned i = 0; i < N; i++)
        A[N-i] = A[i];
}
```

Global memory accesses have long latencies. In this example, the loop-carried dependency on the array `A[i]` causes the long latency. This latency is reflected by an II of 324 in the optimization report. To reduce the II value by transferring the loop-carried dependency from global memory to local memory, perform the following tasks:

1. Copy the array with the loop-carried dependency to local memory. In this example, array `A[i]` becomes array `B[i]` in local memory.
2. Execute the loop with the loop-carried dependence on array `B[i].`
3. Copy the array back to global memory.

When you transfer array `A[i]` to local memory and it becomes array `B[i]`, the loop-carried dependency is now on `B[i]`. Because local memory has a much lower latency than global memory, the II value improves.

Below is the restructured kernel optimized:

```c
#define N 128

__kernel void optimized( __global int* restrict A )
{
    int B[N];

    for (unsigned i = 0; i < N; i++)
        B[i] = A[i];
}
```
for (unsigned i = 0; i < N; i++)
    B[N-i] = B[i];

for (unsigned i = 0; i < N; i++)
    A[i] = B[i];
}

An optimization report similar to the one below indicates the successful reduction of II from 324 to 2:

===================================================================================
<table>
<thead>
<tr>
<th>Kernel: optimized</th>
</tr>
</thead>
<tbody>
<tr>
<td>The kernel is compiled for single work-item execution.</td>
</tr>
<tr>
<td>Loop Report:</td>
</tr>
<tr>
<td>+ Loop &quot;Block1&quot; (file optimized4.cl line 7)</td>
</tr>
<tr>
<td>Pipelined well. Successive iterations are launched every cycle.</td>
</tr>
<tr>
<td>+ Loop &quot;Block2&quot; (file optimized4.cl line 10)</td>
</tr>
<tr>
<td>Pipelined with successive iterations launched every 2 cycles due to:</td>
</tr>
<tr>
<td>Memory dependency on Load Operation from: (file optimized4.cl line 11)</td>
</tr>
<tr>
<td>Store Operation (file optimized4.cl line 11)</td>
</tr>
<tr>
<td>Largest Critical Path Contributors:</td>
</tr>
<tr>
<td>65%: Load Operation (file optimized4.cl line 11)</td>
</tr>
<tr>
<td>34%: Store Operation (file optimized4.cl line 11)</td>
</tr>
<tr>
<td>+ Loop &quot;Block3&quot; (file optimized4.cl line 13)</td>
</tr>
<tr>
<td>Pipelined well. Successive iterations are launched every cycle.</td>
</tr>
</tbody>
</table>

Removing Loop-Carried Dependency by Inferring Shift Registers

To enable the Altera Offline Compiler to handle single work-item kernels that carry out double precision floating-point operations efficiently, remove loop-carried dependencies by inferring a shift register.

Consider the following kernel:

```c
__kernel void double_add_1 (__global double *arr, 
                          int N, 
                          __global double *result)
{
    double temp_sum = 0;
    for (int i = 0; i < N; ++i)
    {
        temp_sum += arr[i];
    }
    *result = temp_sum;
}
```

The optimization report for kernel unoptimized resembles the following:

===================================================================================
<table>
<thead>
<tr>
<th>Kernel: double_add_1</th>
</tr>
</thead>
<tbody>
<tr>
<td>The kernel is compiled for single work-item execution.</td>
</tr>
<tr>
<td>Loop Report:</td>
</tr>
</tbody>
</table>
The kernel unoptimized is an accumulator that sums the elements of a double precision floating-point array \( arr[i] \). For each loop iteration, the AOC takes 11 cycles to compute the result of the addition and then stores it in the variable \( temp_sum \). Each loop iteration requires the value of \( temp_sum \) from the previous loop iteration, which creates a data dependency on \( temp_sum \).

- To remove the data dependency, infer the array \( arr[i] \) as a shift register.

Below is the restructured kernel optimized:

```c
1 //Shift register size must be statically determinable
2 #define II_CYCLES 12
3
4 __kernel void double_add_2 (__global double *arr,
5     int N,
6     __global double *result)
7 {
8     //Create shift register with II_CYCLE+1 elements
9     double shift_reg[II_CYCLES+1];
10
11     //Initialize all elements of the register to 0
12     for (int i = 0; i < II_CYCLES + 1; i++)
13     {
14         shift_reg[i] = 0;
15     }
16
17     //Iterate through every element of input array
18     for(int i = 0; i < N; ++i)
19     {
20         //Load ith element into end of shift register
21         //if N > II_CYCLE, add to shift_reg[0] to preserve values
22         shift_reg[II_CYCLES] = shift_reg[0] + arr[i];
23 
24 #pragma unroll
25         //Shift every element of shift register
26         for(int j = 0; j < II_CYCLES; ++j)
27         {
28             shift_reg[j] = shift_reg[j + 1];
29         }
30     }
31
32     //Sum every element of shift register
33     double temp_sum = 0;
34 
35 #pragma unroll
36     for(int i = 0; i < II_CYCLES; ++i)
37     {
38         temp_sum += shift_reg[i];
39     }
40
41     *result = temp_sum;
42 }
```

The following optimization report indicates that the inference of the shift register \( shift_reg[II_CYCLES] \) successfully removes the data dependency on the variable \( temp_sum \):
Kernel: double_add_2
==================================================================================================
The kernel is compiled for single work-item execution.

Loop Report:

+ Fully unrolled loop (file optimized5.cl line 12)
  Loop was automatically and fully unrolled.
  Add "#pragma unroll 1" to prevent automatic unrolling.

+ Loop "Block1" (file optimized5.cl line 18)
  Pipelined well. Successive iterations are launched every cycle.
  ...
  -- Fully unrolled loop (file optimized5.cl line 26)
  Loop was fully unrolled due to "#pragma unroll" annotation.

+ Fully unrolled loop (file optimized5.cl line 36)
  Loop was fully unrolled due to "#pragma unroll" annotation.

Removing Loop-Carried Dependencies Caused by Accesses to Memory Arrays

Include the ivdep pragma in your single work-item kernel to assert that accesses to memory arrays will not cause loop-carried dependencies.

During compilation, the Altera Offline Compiler creates hardware that ensures load and store instructions operate within dependency constraints. An example of a dependency constraint is that dependent load and store instructions must execute in order. The presence of the ivdep pragma instructs the AOC to remove this extra hardware between load and store instructions in the loop that immediately
follows the pragma declaration in the kernel code. Removing the extra hardware might reduce logic utilization and lower the II value in single work-item kernels.

- If all accesses to memory arrays that are inside a loop will not cause loop-carried dependencies, add the line `#pragma ivdep` before the loop in your kernel code.

Example kernel code:

```c
// no loop-carried dependencies for A and B array accesses
#pragma ivdep
for(int i = 0; i < N; i++) {
    A[i] = A[i - X[i]];
    B[i] = B[i - Y[i]];
}
```

- To specify that accesses to a particular memory array inside a loop will not cause loop-carried dependencies, add the line `#pragma ivdep array (array_name)` before the loop in your kernel code.

The array specified by the `ivdep` pragma must be a local or private memory array, or a pointer variable that points to a global, local, or private memory storage. If the specified array is a pointer, the `ivdep` pragma also applies to all arrays that may alias with specified pointer.

The array specified by the `ivdep` pragma can also be an array or a pointer member of a struct.

Example kernel code:

```c
// No loop-carried dependencies for A array accesses
// The AOC will insert hardware that reinforces dependency constraints for B
#pragma ivdep array(A)
for(int i = 0; i < N; i++) {
    A[i] = A[i - X[i]];
    B[i] = B[i - Y[i]];
}

// No loop-carried dependencies for array A inside struct
#pragma ivdep array(S.A)
for(int i = 0; i < N; i++) {
}

// No loop-carried dependencies for array A inside the struct pointed by S
#pragma ivdep array(S->X[2][3].A)
for(int i = 0; i < N; i++) {
}

// No loop-carried dependencies for A and B because ptr aliases
// with both arrays
int *ptr = select ? A : B;
#pragma ivdep array(ptr)
for(int i = 0; i < N; i++) {
    A[i] = A[i - X[i]];
    B[i] = B[i - Y[i]];
}

// No loop-carried dependencies for A because ptr only aliases with A
int *ptr = &A[10];
#pragma ivdep array(ptr)
for(int i = 0; i < N; i++) {
    A[i] = A[i - X[i]];
    B[i] = B[i - Y[i]];
}
```
Good Design Practices for Single Work-Item Kernel

If your OpenCL kernels contain loop structures, follow the Altera-recommended guidelines to construct the kernels in a way that allows the Altera Offline Compiler to analyze them effectively. Well-structured loops are particularly important when you direct the AOC to perform pipeline parallelism execution in loops.

Avoid Pointer Aliasing

Insert the `restrict` keyword in pointer arguments whenever possible. Including the `restrict` keyword in pointer arguments prevents the AOC from creating unnecessary memory dependencies between non-conflicting read and write operations. Consider a loop where each iteration reads data from one array, and then it writes data to another array in the same physical memory. Without including the `restrict` keyword in these pointer arguments, the AOC might assume dependence between the two arrays, and extracts less pipeline parallelism as a result.

Construct "Well-Formed" Loops

A "well-formed" loop has an exit condition that compares against an integer bound, and has a simple induction increment of one per iteration. Including "well-formed" loops in your kernel improves performance because the AOC can analyze these loops efficiently.

The following example is a "well-formed" loop:

```c
for(i=0; i < N; i++)
{
    //statements
}
```

Important: "Well-formed" nested loops also contribute to maximizing kernel performance.

The following example is a "well-formed" nested loop structure:

```c
for(i=0; i < N; i++)
{
    //statements
    for(j=0; j < M; j++)
    {
        //statements
    }
}
```

Minimize Loop-Carried Dependencies

The loop structure below creates a loop-carried dependence because each loop iteration reads data written by the previous iteration. As a result, each read operation cannot proceed until the write operation from the previous iteration completes. The presence of loop-carried dependencies decreases the extent of pipeline parallelism that the AOC can achieve, which reduces kernel performance.

```c
for(int i = 0; i < N; i++)
{
}
```

The AOC performs a static memory dependence analysis on loops to determine the extent of parallelism that it can achieve. In some cases, the AOC might assume dependence between two array accesses, and extracts less pipeline parallelism as a result. The AOC assumes loop-carried dependence if it cannot
resolve the dependencies at compilation time because of unknown variables, or if the array accesses involve complex addressing.

To minimize loop-carried dependencies, following the guidelines below whenever possible:

- **Avoid pointer arithmetic.**
  
  Compiler output is suboptimal when the kernel accesses arrays by dereferencing pointer values derived from arithmetic operations. For example, avoid accessing an array in the following manner:

```c
for(int i = 0; i < N; i++)
{
    int t = *(A++);
    *A = t;
}
```

- **Introduce simple array indexes.**
  
  Avoid the following types of complex array indexes because the AOC cannot analyze them effectively, which might lead to suboptimal compiler output:
  
  - Nonconstants in array indexes.
    
    For example, `A[K + i]`, where `i` is the loop index variable and `K` is an unknown variable.
  
  - Multiple index variables in the same subscript location.
    
    For example, `A[i + 2 * j]`, where `i` and `j` are loop index variables for a double nested loop.

  
  **Note:** The AOC can analyze the array index `A[i][j]` effectively because the index variables are in different subscripts.

  - Nonlinear indexing.
    
    For example, `A[i & C]`, where `i` is a loop index variable and `C` is a constant or a nonconstant variable.

  - **Use loops with constant bounds in your kernel whenever possible.**

    Loops with constant bounds allow the AOC to perform range analysis effectively.

**Avoid Complex Loop Exit Conditions**

The AOC evaluates exit conditions to determine if subsequent loop iterations can enter the loop pipeline. There are times when the AOC requires memory accesses or complex operations to evaluate the exit condition. In these cases, subsequent iterations cannot launch until the evaluation completes, decreasing overall loop performance.

**Convert Nested Loops into a Single Loop**

To maximize performance, combine nested loops into a single form whenever possible. Restructuring nested loops into a single loop reduces hardware footprint and computational overhead between loop iterations.
The following code examples illustrate the conversion of a nested loop into a single loop:

<table>
<thead>
<tr>
<th>Nested Loop</th>
<th>Converted Single Loop</th>
</tr>
</thead>
<tbody>
<tr>
<td>for (i = 0; i &lt; N; i++)</td>
<td></td>
</tr>
<tr>
<td>{ //statements</td>
<td></td>
</tr>
<tr>
<td>for (j = 0; j &lt; M; j++)</td>
<td></td>
</tr>
<tr>
<td>{ //statements</td>
<td></td>
</tr>
<tr>
<td>} //statements</td>
<td></td>
</tr>
<tr>
<td>} //statements</td>
<td></td>
</tr>
<tr>
<td>for (i = 0; i &lt; N*M; i++)</td>
<td></td>
</tr>
<tr>
<td>{ //statements</td>
<td></td>
</tr>
<tr>
<td>}</td>
<td></td>
</tr>
</tbody>
</table>

### Declare Variables in the Deepest Scope Possible

To reduce the hardware resources necessary for implementing a variable, declare the variable prior to its use in a loop. Declaring variables in the deepest scope possible minimizes data dependencies and hardware usage because the AOC does not need to preserve the variable data across loops that do not use the variables.

Consider the following example:

```c
int a[N];
for (int i = 0; i < m; ++i) {
    int b[N];
    for (int j = 0; j < n; ++j) {
        // statements
    }
}
```

The array `a` requires more resources to implement than the array `b`. To reduce hardware usage, declare array `a` outside the inner loop unless it is necessary to maintain the data through iterations of the outer loop.

**Tip:** Overwriting all values of a variable in the deepest scope possible also reduces the resources necessary to present the variable.

### Strategies for Improving NDRange Kernel Data Processing Efficiency

Consider the following kernel code:

```c
__kernel void sum (__global const float * restrict a,
                   __global const float * restrict b,
                   __global float * restrict answer)
{
    size_t gid = get_global_id(0);
    answer[gid] = a[gid] + b[gid];
}
```

This kernel adds arrays `a` and `b`, one element at a time. Each work-item is responsible for adding two elements, one from each array, and storing the sum into the array `answer`. Without optimization, the kernel performs one addition per work-item.
To maximize the performance of your OpenCL kernel, consider implementing the applicable optimization techniques to improve data processing efficiency.

1. **Specify a Maximum Work-Group Size or a Required Work-Group Size** on page 1-83
   Specify the `max_work_group_size` or `reqd_work_group_size` attribute for your kernels whenever possible. These attributes allow the Altera Offline Compiler to perform aggressive optimizations to match the kernel to hardware resources without any excess logic.

2. **Kernel Vectorization** on page 1-85
   Kernel vectorization allows multiple work-items to execute in a single instruction multiple data (SIMD) fashion.

3. **Multiple Compute Units** on page 1-87
   To achieve higher throughput, the Altera Offline Compiler can generate multiple compute units for each kernel.

4. **Combination of Compute Unit Replication and Kernel SIMD Vectorization** on page 1-89
   If your replicated or vectorized OpenCL kernel does not fit in the FPGA, you can modify the kernel by both replicating the compute unit and vectorizing the kernel.

5. **Resource-Driven Optimization** on page 1-91
   The Altera Offline Compiler analyzes automatically the effects of combining various values of kernel attributes and performs resource-driven optimizations.

   When you compile an NDRange kernel, the Altera Offline Compiler generates an optimization report that provides information on select kernel properties and loop unroll status.

### Specify a Maximum Work-Group Size or a Required Work-Group Size

Specify the `max_work_group_size` or `reqd_work_group_size` attribute for your kernels whenever possible. These attributes allow the Altera Offline Compiler to perform aggressive optimizations to match the kernel to hardware resources without any excess logic.

The AOC assumes a default work-group size for your kernel depending on certain constraints imposed during compilation time and runtime.

The AOC imposes the following constraints at compilation time:

- If you specify a value for the `reqd_work_group_size` attribute, the work-group size must match this value.
- If you specify a value for the `max_work_group_size` attribute, the work-group size must not exceed this value.
- If you do not specify values for `reqd_work_group_size` and `max_work_group_size`, and the kernel contains a barrier, the AOC defaults to a maximum work-group size of 256 work-items.
- If you do not specify values for both attributes and the kernel does not contain any barrier, the AOC does not impose any constraint on the work-group size at compilation time.

**Tip:** Use the `CL_KERNEL_WORK_GROUP_SIZE` and `CL_KERNEL_COMPILE_WORK_GROUP_SIZE` queries to the `clGetKernelWorkGroupInfo` API call to determine the work-group size constraints that the AOC imposes on a particular kernel at compilation time.
The OpenCL standard imposes the following constraints at runtime:

- The work-group size in each dimension must divide evenly into the requested NDRange size in each dimension.
- The work-group size must not exceed the device constraints specified by the 
  `CL_DEVICE_MAX_WORK_GROUP_SIZE` and `CL_DEVICE_MAX_WORK_ITEM_SIZES` queries to the `clGetDeviceInfo` API call.

Caution: If the work-group size you specify for a requested NDRange kernel execution does not satisfy all of the constraints listed above, the `clEnqueueNDRangeKernel` API call fails with the error `CL_INVALID_WORK_GROUP_SIZE`.

If you do not specify values for both the `reqd_work_group_size` and `max_work_group_size` attributes, the runtime determines a default work-group size as follows:

- If the kernel contains a barrier or refers to the local work-item ID, or if you use the `clGetKernelWorkGroupInfo` and `clGetDeviceInfo` API calls in your host code to query the work-group size, the runtime defaults the work-group size to one work-item.
- If the kernel does not contain a barrier or refer to the local work-item ID, or if your host code does not query the work-group size, the default work-group size is the global NDRange size.

When queuing an NDRange kernel (that is, not a single work-item kernel), specify an explicit work-group size under the following conditions:

- If your kernel uses memory barriers, local memory, or local work-item IDs.
- If your host program queries the work-group size.

If your kernel uses memory barriers, perform one of the following tasks to minimize hardware resources:

- Specify a value for the `reqd_work_group_size` attribute.
- Assign to the `max_work_group_size` attribute the smallest work-group size that accommodates all your runtime work-group size requests.

Caution: Including a memory barrier at the end of your NDRange kernel causes compilation to fail.

Specifying a smaller work-group size than the default at runtime might lead to excessive hardware consumption. Therefore, if you require a work-group size other than the default, specify the `max_work_group_size` attribute to set a maximum work-group size. If the work-group size remains constant through all kernel invocations, specify a required work-group size by including the `reqd_work_group_size` attribute. The `reqd_work_group_size` attribute instructs the AOC to allocate exactly the correct amount of hardware to manage the number of work-items per work-group you specify. This allocation results in hardware resource savings and improved efficiency in the implementation of kernel compute units. By specifying the `reqd_work_group_size` attribute, you also prevent the AOC from implementing additional hardware to support work-groups of unknown sizes.

For example, the code fragment below assigns a fixed work-group size of 64 work-items to a kernel:

```c
__attribute__((reqd_work_group_size(64,1,1)))
__kernel void sum (__global const float * restrict a,
                   __global const float * restrict b,
                   __global float * restrict answer)
{
    size_t gid = get_global_id(0);
    answer[gid] = a[gid] + b[gid];
}
```
Kernel Vectorization

To achieve higher throughput, you can vectorize your kernel. Kernel vectorization allows multiple work-items to execute in a single instruction multiple data (SIMD) fashion. You can direct the Altera Offline Compiler to translate each scalar operation in the kernel, such as addition or multiplication, to an SIMD operation.

Include the `num_simd_work_items` attribute in your kernel code to direct the AOC to perform more additions per work-item without modifying the body of the kernel. The following code fragment applies a vectorization factor of four to the original kernel code:

```c
__attribute__((num_simd_work_items(4)))
__attribute__((reqd_work_group_size(64,1,1)))
__kernel void sum (__global const float * restrict a,
                  __global const float * restrict b,
                  __global float * restrict answer)
{
    size_t gid = get_global_id(0);
    answer[gid] = a[gid] + b[gid];
}
```

To use the `num_simd_work_items` attribute, you must also specify a required work-group size of the kernel using the `reqd_work_group_size` attribute. The work-group size you specify for `reqd_work_group_size` must be divisible by the value you assign to `num_simd_work_items`. In the code example above, the kernel has a fixed work-group size of 64 work-items. Within each work-group, the work-items are distributed evenly among the four SIMD vector lanes. After the AOC implements the four SIMD vector lanes, each work-item now performs four times more work.

The AOC vectorizes the code and might coalesce memory accesses. You do not need to change any kernel code or host code because the AOC applies these optimizations automatically.

You can vectorize your kernel code manually, but you must adjust the NDRange in your host application to reflect the amount of vectorization you implement. The following example shows the changes in the code when you duplicate operations in the kernel manually:

```c
__kernel void sum (__global const float * restrict a,
                  __global const float * restrict b,
                  __global float * restrict answer)
{
    size_t gid = get_global_id(0);

    answer[gid * 4 + 0] = a[gid * 4 + 0] + b[gid * 4 + 0];
    answer[gid * 4 + 1] = a[gid * 4 + 1] + b[gid * 4 + 1];
    answer[gid * 4 + 2] = a[gid * 4 + 2] + b[gid * 4 + 2];
    answer[gid * 4 + 3] = a[gid * 4 + 3] + b[gid * 4 + 3];
}
```

In this form, the kernel loads four elements from arrays `a` and `b`, calculates the sums, and stores the results into the array `answer`. Because the FPGA pipeline loads and stores data to neighboring locations in memory, you can manually direct the AOC to coalesce each group of four load and store operations.

**Attention:** Each work-item handles four times as much work after you implement the manual optimizations. As a result, the host application must use an NDRange that is four times smaller than in the original example. On the contrary, you do not need to adjust the NDRange size when you exploit the automatic vectorization capabilities of the AOC. You can adjust the vector width with minimal code changes by using the `num_simd_work_items` attribute.
Static Memory Coalescing

Static memory coalescing is an Altera Offline Compiler optimization step that attempts to reduce the number of times a kernel accesses non-private memory.

The figure below shows a common case where kernel performance might benefit from static memory coalescing:

Figure 1-37: Static Memory Coalescing

Consider the following vectorized kernel:

```c
__attribute__((num_simd_work_items(4)))
__attribute__((reqd_work_group_size(64,1,1)))
__kernel void sum (__global const float * restrict a,
    __global const float * restrict b,
    __global float * restrict answer)
{
    size_t gid = get_global_id(0);
    answer[gid] = a[gid] + b[gid];
}
```

The OpenCL kernel performs four load operations that access consecutive locations in memory. Instead of performing four memory accesses to competing locations, the AOC coalesces the four loads into a single wider vector load. This optimization reduces the number of accesses to a memory system and potentially leads to better memory access patterns.
Although the AOC performs static memory coalescing automatically when it vectorizes the kernel, you should use wide vector loads and stores in your OpenCL code whenever possible to ensure efficient memory accesses. To implement static memory coalescing manually, you must write your code in such a way that a sequential access pattern can be identified at compilation time. The original kernel code shown in the figure above can benefit from static memory coalescing because all the indexes into buffers \( a \) and \( b \) increment with offsets that are known at compilation time. In contrast, the following code does not allow static memory coalescing to occur:

```c
__kernel void test (__global float * restrict a,
    __global float * restrict b,
    __global float * restrict answer,
    __global int * restrict offsets)
{
    size_t gid = get_global_id(0);

    answer[gid*4 + 0] = a[gid*4 + 0 + offsets[gid]] + b[gid*4 + 0];
    answer[gid*4 + 1] = a[gid*4 + 1 + offsets[gid]] + b[gid*4 + 1];
    answer[gid*4 + 2] = a[gid*4 + 2 + offsets[gid]] + b[gid*4 + 2];
    answer[gid*4 + 3] = a[gid*4 + 3 + offsets[gid]] + b[gid*4 + 3];
}
```

The value \( offsets[gid] \) is unknown at compilation time. As a result, the AOC cannot statically coalesce the read accesses to buffer \( a \).

**Multiple Compute Units**

To achieve higher throughput, the Altera Offline Compiler can generate multiple compute units for each kernel. The AOC implements each compute unit as a unique pipeline. Generally, each kernel compute unit can execute multiple work-groups simultaneously.

To increase overall kernel throughput, the hardware scheduler in the FPGA dispatches work-groups to additional available compute units. A compute unit is available for work-group assignments as long as it has not reached its full capacity.

Assume each work-group takes the same amount of time to complete its execution. If the AOC implements two compute units, each compute unit executes half of the work-groups. Because the hardware scheduler dispatches the work-groups, you do not need to manage this process in your own code.

The AOC does not automatically determine the optimal number of compute units for a kernel. To increase the number of compute units for your kernel implementation, you must specify the number of compute units that the AOC should create using the `num_compute_units` attribute, as shown in the code sample below.

```c
__attribute__((num_compute_units(2)))
__kernel void sum (__global const float * restrict a,
    __global const float * restrict b,
    __global float * restrict answer)
{
    size_t gid = get_global_id(0);

    answer[gid] = a[gid] + b[gid];
}
```

Increasing the number of compute units achieves higher throughput. However, as shown in the figure below, you do so at the expense of increasing global memory bandwidth among the compute units. You also increase hardware resource utilization.
Compute Unit Replication versus Kernel SIMD Vectorization

In most cases, you should implement the `num_simd_work_items` attribute to increase data processing efficiency before using the `num_compute_units` attribute.

Both the `num_compute_units` and `num_simd_work_items` attributes increase throughput by increasing the amount of hardware that the Altera Offline Compiler uses to implement your kernel. The `num_compute_units` attribute modifies the number of compute units to which work-groups can be scheduled, which also modifies the number of times a kernel accesses global memory. In contrast, the `num_simd_work_items` attribute modifies the amount of work a compute unit can perform in parallel on a single work-group. The `num_simd_work_items` attribute duplicates only the datapath of the compute unit by sharing the control logic across each SIMD vector lane.

Generally, using the `num_simd_work_items` attribute leads to more efficient hardware than using the `num_compute_units` attribute to achieve the same goal. The `num_simd_work_items` attribute also allows the AOC to coalesce your memory accesses.
Multiple compute units competing for global memory might lead to undesired memory access patterns. You can alter the undesired memory access pattern by introducing the `num_simd_work_items` attribute instead of the `num_compute_units` attribute. In addition, the `num_simd_work_items` attribute potentially offers the same computational throughput as the equivalent kernel compute unit duplication that the `num_compute_units` attribute offers.

You cannot implement the `num_simd_work_items` attribute in your kernel under the following circumstances:

- The value you specify for `num_simd_work_items` is not 2, 4, 8 or 16.
- The value of `reqd_work_group_size` is not divisible by `num_simd_work_items`.

For example, the following declaration is incorrect because 50 is not divisible by 4:

    __attribute__((num_simd_work_items(4)))
    __attribute__((reqd_work_group_size(50,0,0)))

- Kernels with complex control flows. You cannot vectorize kernels in which different work-items follow different control paths (for example, the control paths depend on `get_global_ID` or `get_local_ID`).

During kernel compilation, the AOC issues messages informing you whether the implementation of vectorization optimizations is successful. Kernel vectorization is successful if the reported vectorization factor matches the value you specify for the `num_simd_work_items` attribute.

**Combination of Compute Unit Replication and Kernel SIMD Vectorization**

If your replicated or vectorized OpenCL kernel does not fit in the FPGA, you can modify the kernel by both replicating the compute unit and vectorizing the kernel. Include the `num_compute_units` attribute to
modify the number of compute units for the kernel, and include the num_simd_work_items attribute to take advantage of kernel vectorization.

Consider a case where a kernel with a num_simd_work_items attribute set to 16 does not fit in the FPGA. The kernel might fit if you modify it by duplicating a narrower SIMD kernel compute unit. Determining the optimal balance between the number of compute units and the SIMD width might require some experimentation. For example, duplicating a four lane-wide SIMD kernel compute unit three times might achieve better throughput than duplicating an eight lane-wide SIMD kernel compute unit twice.

The following example code shows how you can combine the num_compute_units and num_simd_work_items attributes in your OpenCL code:

```c
__attribute__((num_simd_work_items(4)))
__attribute__((num_compute_units(3)))
__attribute__((reqd_work_group_size(8,8,1)))
__kernel void matrixMult(__global float * restrict C, __global float * restrict A, ...
```

The figure below illustrates the data flow of the kernel described above. The num_compute_units implements three replicated compute units. The num_simd_work_items implements four SIMD vector lanes.

**Figure 1-40: Optimizing Throughput by Combining Compute Unit Replication and Kernel SIMD Vectorization**

**Attention:** You can also enable the resource-driven optimizer to determine automatically the best combination of num_compute_units and num_simd_work_items.

**Important:** It is more time-consuming to compile a hardware design that fills the entire FPGA than smaller designs. When you adjust your kernel optimizations, remove the increased number of SIMD vector lanes and compute units prior to recompiling the kernel.
Resource-Driven Optimization

The Altera Offline Compiler analyzes automatically the effects of combining various values of kernel attributes and performs resource-driven optimizations.

During compilation, the AOC examines multiple values of the `num_compute_units` and `num_simd_work_items` kernel attributes in various combinations, and applies a set of heuristics to improve a base design incrementally. The AOC implements this set of values to maximize kernel performance in terms of work-items executed per second.

Based on the result of its analysis, the AOC optimizes code blocks that work-items execute frequently. For these code blocks, the AOC uses additional hardware resources to achieve an implementation with higher throughput. For code blocks that work-items execute infrequently, the AOC attempts to reuse the same hardware to implement multiple operations.

The amount of hardware sharing that occurs is called the *sharing degree*. It is the number of times an operation is shared by work-items executing within the same compute unit. Code blocks that work-items execute infrequently might lead to a higher sharing degree.

The AOC does not modify values of kernel attributes or pragmas that you specify in kernel declarations. The AOC modifies only unspecified attributes and pragmas.

Optimization Behavior

The following are examples of resource-driven optimization:

- Attempts resource sharing of infrequently-executed code blocks only if the kernel does not fit the FPGA.

  After the AOC identifies an optimized kernel that fits within the FPGA, it applies optimizations that increase performance.

- In a multi-kernel design, improves the kernel(s) with minimum performance first.

  The order in which kernel optimization occurs is based on the work-items per second metric. When these kernels cannot be optimized any further, subsequent kernels are improved in order of their throughput estimates. During resource-driven optimization, the AOC maintains a set of high-performance candidates and attempts to apply incremental optimizations to each of them. Loop unrolling and SIMD vectorization are the preferred optimization strategies over compute unit replication because these optimizations generally result in more efficient hardware implementations.

- During resource-driven optimization, the AOC iterates on a predetermined set of optimization steps.

  In many cases, the AOC infers optimization ranges ahead of time. For example, it determines the maximum number of compute units based on the available memory bandwidth. Anytime the AOC fails to perform an optimization, it skips that step and attempts other optimizations.

Limitations

Static optimizations are subjected to some inherent limitations. The control flow analyses assume values of kernel arguments, passed from the host, that are unknown at compilation time. For example, the AOC assumes that loops with unknown bounds iterate 1024 times. Based on these assumptions, the AOC might guide the optimizations towards code blocks that work-items execute less often than estimated. In the case of loops with unknown bounds, you can override the amount of unrolling by specifying an unroll factor in the code using the `unroll` pragma. If you do not want to unroll a loop, you can specify an unroll factor of 1 to indicate no loop unrolling.
Another limiting factor is that all optimizations take place before hardware compilation occurs. The performance estimation might not accurately capture the maximum operating frequency that the hardware compiler achieves. Similarly, the estimated resource usage used in resource-driven optimization might not reflect the actual hardware resource usage.

There are also range limitations on the amount of sharing and vectorization. Currently, the maximum sharing degree is 8, and the maximum number of SIMD vector lanes is 16.

**Review Kernel Properties and Loop Unroll Status in the Optimization Report**

When you compile an NDRange kernel, the Altera Offline Compiler generates an optimization report that provides information on select kernel properties and loop unroll status.

**Related Information**

*Optimization Report* on page 1-56

**Strategies for Improving Memory Access Efficiency**

Memory access efficiency often dictates the overall performance of your OpenCL kernel. When developing your OpenCL code, it is advantageous to minimize the number of global memory accesses. The *OpenCL Specification version 1.0* describes four memory types: *global*, *constant*, *local*, and *private* memories.

An interconnect topology connects shared global, constant, and local memory systems to their underlying memory.

Memory accesses compete for shared memory resources (that is, global, local, and constant memories). If your OpenCL kernel performs a large number of memory accesses, the Altera Offline Compiler must generate complex arbitration logic to handle the memory access requests. The complex arbitration logic might cause a drop in the maximum operating frequency (Fmax), which degrades kernel performance.

The following sections discuss memory access optimizations in detail. In summary, minimizing global memory accesses is beneficial for the following reasons:

- Typically, increases in OpenCL kernel performance lead to increases in global memory bandwidth requirements.
- The maximum global memory bandwidth is much smaller than the maximum local memory bandwidth.
- The maximum computational bandwidth of the FPGA is much larger than the global memory bandwidth.

**Attention:** Use local, private or constant memory whenever possible to increase the memory bandwidth of the kernel.

1. **General Guidelines on Optimizing Memory Accesses** on page 1-93
   Optimizing the memory accesses in your OpenCL kernels can improve overall kernel performance.

2. **Optimize Global Memory Accesses** on page 1-93
   The AOC interleaves global memory across each of the external memory banks.

3. **Perform Kernel Computations Using Constant, Local or Private Memory** on page 1-96
   To optimize memory access efficiency, minimize the number for global memory accesses by performing your OpenCL kernel computations in constant, local, or private memory.
4. **Improve Kernel Performance by Banking the Local Memory** on page 1-99
   Specifying the `numbanks(N)` and `bankwidth(M)` advanced kernel attributes allows you to configure the local memory banks for parallel memory accesses.

5. **Optimize Accesses to Local Memory by Controlling the Memory Replication Factor** on page 1-102
   To control the memory replication factor, include the `singlepump` or `doublepump` kernel attribute in your OpenCL kernel.

**General Guidelines on Optimizing Memory Accesses**

Optimizing the memory accesses in your OpenCL kernels can improve overall kernel performance.

Consider implementing the following techniques for optimizing memory accesses, whenever possible:

- If your OpenCL program has a pair of kernels—one produces data and the other one consumes that data—convert them into a single kernel that performs both functions. Also, implement helper functions to logically separate the functions of the two original kernels. FPGA implementations favor one large kernel over separate smaller kernels. Kernel unification removes the need to write the results from one kernel into global memory temporarily before fetching the same data in the other kernel.

- The Altera Offline Compiler implements local memory in FPGAs very differently than in GPUs. If your OpenCL kernel contains code to avoid GPU-specific local memory bank conflicts, remove that code because the AOC generates hardware that avoids local memory bank conflicts automatically whenever possible.

**Optimize Global Memory Accesses**

The Altera Offline Compiler uses SDRAM as global memory. By default, the AOC configures global memory in a burst-interleaved configuration. The AOC interleaves global memory across each of the external memory banks.

In most circumstances, the default burst-interleaved configuration leads to the best load balancing between the memory banks. However, in some cases, you might want to partition the banks manually as two non-interleaved (and contiguous) memory regions to achieve better load balancing.

The figure below illustrates the differences in memory mapping patterns between burst-interleaved and non-interleaved memory partitions.
Contiguous Memory Accesses

Contiguous memory access optimizations analyze statically the access patterns of global load and store operations in a kernel. For sequential load or store operations that occur for the entire kernel invocation, the Altera Offline Compiler directs the kernel to access consecutive locations in global memory.

Consider the following code example:

```c
__kernel void sum ( __global const float * restrict a,
                   __global const float * restrict b,
                   __global float * restrict c )
{
    size_t gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}
```

The load operation from array `a` uses an index that is a direct function of the work-item global ID. By basing the array index on the work-item global ID, the AOC can direct contiguous load operations. These load operations retrieve the data sequentially from the input array, and sends the read data to the pipeline as required. Contiguous store operations then store elements of the result that exits the computation pipeline in sequential locations within global memory.
Tip: Use the `const` qualifier for any read-only global buffer so that the AOC can perform more aggressive optimizations on the load operation.

The following figure illustrates an example of the contiguous memory access optimization:

**Figure 1-42: Contiguous Memory Access**

Contiguous load and store operations improve memory access efficiency because they lead to increased access speeds and reduced hardware resource needs. The data travels in and out of the computational portion of the pipeline concurrently, allowing overlaps between computation and memory accesses. If possible, use work-item IDs that index consecutive memory locations for load and store operations that access global memory. Sequential accesses to global memory increase memory efficiency because they provide an ideal access pattern.

**Manual Partitioning of Global Memory**

You can partition the memory manually so that each buffer occupies a different memory bank.
The default burst-interleaved configuration of the global memory prevents load imbalance by ensuring that memory accesses do not favor one external memory bank over another. However, you have the option to control the memory bandwidth across a group of buffers by partitioning your data manually.

- The Altera Offline Compiler cannot burst-interleave across different memory types. To manually partition a specific type of global memory, compile your OpenCL kernels with the `--no-interleaving <global_memory_type>` flag to configure each bank of a certain memory type as non-interleaved banks.

If your kernel accesses two buffers of equal size in memory, you can distribute your data to both memory banks simultaneously regardless of dynamic scheduling between the loads. This optimization step might increase your apparent memory bandwidth.

If your kernel accesses heterogeneous global memory types, include the `--no-interleaving <global_memory_type>` option in the `aoc` command for each memory type that you want to partition manually.

For more information on the usage of the `--no-interleaving <global_memory_type>` option, refer to the Disabling Burst-Interleaving of Global Memory (`--no-interleaving <global_memory_type>`) section of the Altera SDK for OpenCL Programming Guide.

Related Information

Disabling Burst-Interleaving of Global Memory (`--no-interleaving <global_memory_type>`)
To minimize global memory accesses, you must first preload data from a group of computations from global memory to constant, local, or private memory. You perform the kernel computations on the preloaded data, and then write the results back to global memory.

**Constant Cache Memory**

Constant memory resides in global memory, but the kernel loads it into an on-chip cache shared by all work-groups at runtime. For example, if you have read-only data that all work-groups use, and the data size of the constant buffer fits into the constant cache, allocate the data to the constant memory. The constant cache is most appropriate for high-bandwidth table lookups that are constant across several invocations of a kernel. The constant cache is optimized for high cache hit performance.

By default, the constant cache size is 16 kB. You can specify the constant cache size by including the `--const-cache-bytes <N>` option in your `aoc` command, where `<N>` is the constant cache size in bytes.

Unlike global memory accesses that have extra hardware for tolerating long memory latencies, the constant cache suffers large performance penalties for cache misses. If the `__constant` arguments in your OpenCL kernel code cannot fit in the cache, you might achieve better performance with `__global const` arguments instead. If the host application writes to constant memory that is already loaded into the constant cache, the cached data is discarded (that is, invalidated) from the constant cache.

For more information on the `--const-cache-bytes <N>` option, refer to the Configuring Constant Memory Cache Size (`--const-cache-bytes <N>`) section of the Altera SDK for OpenCL Programming Guide.

**Related Information**

*Configuring Constant Memory Cache Size (`--const-cache-bytes <N>`)*

**Preloading Data to Local Memory**

Local memory is considerably smaller than global memory, but it has significantly higher throughput and much lower latency. Unlike global memory accesses, the kernel can access local memory randomly without any performance penalty. When you structure your kernel code, attempt to access the global memory sequentially, and buffer that data in on-chip local memory before your kernel uses the data for calculation purposes.

The Altera Offline Compiler implements OpenCL local memory in on-chip memory blocks in the FPGA. On-chip memory blocks have two read and write ports, and they can be clocked at an operating frequency that is double the operating frequency of the OpenCL kernels. This doubling of the clock frequency allows the memory to be “double pumped,” resulting in twice the bandwidth from the same memory. As a result, each on-chip memory block supports up to four simultaneous accesses.

Ideally, the accesses to each bank are distributed uniformly across the on-chip memory blocks of the bank. Because only four simultaneous accesses to an on-chip memory block are possible in a single clock cycle, distributing the accesses helps avoid bank contention.

This banking configuration is usually effective; however, the AOC must create a complex memory system to accommodate a large number of banks. A large number of banks might complicate the arbitration network and can reduce the overall system performance.

Because the AOC implements local memory that resides in on-chip memory blocks in the FPGA, the AOC must choose the size of local memory systems at compilation time. The method the AOC uses to determine the size of a local memory system depends on the local data types used in your OpenCL code.
Optimizing Local Memory Accesses

To optimize local memory access efficiency, consider the following guidelines:

- Implementing certain optimizations techniques, such as loop unrolling, might lead to more concurrent memory accesses.
  
  **Caution:** Increasing the number of memory accesses can complicate the memory systems and degrade performance.

- Simplify the local memory subsystem by limiting the number of unique local memory accesses in your kernel to four or less, whenever possible.

  You achieve maximum local memory performance when there are four or less memory accesses to a local memory system. If the number of accesses to a particular memory system is greater than four, the AOC arranges the on-chip memory blocks of the memory system into a banked configuration.

- If you have function scope local data, the AOC statically sizes the local data that you define within a function body at compilation time. You should define local memories by directing the AOC to set the memory to the required size, rounded up to the closest value that is a power of two.

- For pointers to __local kernel arguments, the host assigns their memory sizes dynamically at runtime through clSetKernelArg calls. However, the AOC must set these physical memory sizes at compilation time.

  By default, pointers to __local kernel arguments are 16 kB in size. You can specify an allocation size by including the local_mem_size attribute in your pointer declaration.

  **Note:** clSetKernelArg calls can request a smaller data size than has been physically allocated at compilation time, but never a larger size.

- When accessing local memory, use the simplest address calculations possible and avoid pointer math operations that are not mandatory.

  Altera recommends this coding style to reduce FPGA resource utilization and increase local memory efficiency by allowing the AOC to make better guarantees about access patterns through static code analysis. Complex address calculations and pointer math operations can prevent the AOC from creating independent memory systems representing different portions of your data, leading to increased area usage and decreased runtime performance.

- Avoid storing pointers to memory whenever possible. Stored pointers often prevent static compiler analysis from determining the data sets accessed, when the pointers are subsequently retrieved from memory. Storing pointers to memory almost always leads to suboptimal area and performance results.

For usage information on the local_mem_size attribute, refer to the Specifying Pointer Size in Local Memory section of the Altera SDK for OpenCL Programming Guide.

Related Information

**Specifying Pointer Size in Local Memory**

Storing Variables and Arrays in Private Memory

The Altera Offline Compiler implements private memory using FPGA registers. Typically, private memory is useful for storing single variables or small arrays. Registers are plentiful hardware resources in FPGAs, and it is almost always better to use private memory instead of other memory types whenever possible. The kernel can access private memories in parallel, allowing them to provide more bandwidth than any other memory type (that is, global, local, and constant memories).
For more information on the implementation of private memory using registers, refer to the *Inferring a Register* section of the *Altera SDK for OpenCL Programming Guide*.

**Related Information**

*Inferring a Register*

**Improve Kernel Performance by Banking the Local Memory**

Specifying the `numbanks(M)` and `bankwidth(M)` advanced kernel attributes allows you to configure the local memory banks for parallel memory accesses. The banking geometry described by these advanced kernel attributes determines which elements of the local memory system your kernel can access in parallel.

The following code example depicts an 8 x 4 local memory system that is implemented in a single bank. As a result, no two elements in the system can be accessed in parallel.

```c
local int lmem[8][4];
#pragma unroll
for(int i = 0; i<4; i+=2)
{
    lmem[i][x] = ...;
}
```

*Figure 1-43: Serial Accesses to an 8 x 4 Local Memory System*

To improve performance, you can add `numbanks(N)` and `bankwidth(M)` in your code to define the number of memory banks and the bank widths in bytes. The following code implements eight memory banks, each 16-bytes wide. This memory bank configuration enables parallel memory accesses down the 8 x 4 array.

```c
local int __attribute__((numbanks(8),
    bankwidth(16)))
```

Altera SDK for OpenCL Best Practices Guide

Altera Corporation

Send Feedback
By specifying different values for the `numbanks(N)` and `bankwidth(M)` kernel attributes, you can change the parallel access pattern. The following code implements four memory banks, each 4-bytes wide. This memory bank configuration enables parallel memory accesses across the 8 x 4 array.

```
local int _attribute_( (numbanks(4),
                     bankwidth(4)))
    lmem[8][4];

#pragma unroll
for(int i = 0; i<4; i+=2)
{
    lmem[i][x & 0x3] = ...;
}
```
Optimize the Geometric Configuration of Local Memory Banks Based on Array Index

By default, the Altera Offline Compiler might attempt to improve performance by automatically banking a local memory system. The Altera SDK for OpenCL includes advanced features that allow you to customize the banking geometry of your local memory system. To configure the geometry of local memory banks, include the `numbanks(N)` and `bankwidth(M)` kernel attributes in your OpenCL kernel.

The following code examples illustrate how the bank geometry changes based on the values you assign to `numbanks` and `bankwidth`.

Table 1-14: Effects of numbanks and bankwidth on the Bank Geometry of 2 x 4 Local Memory System

The first and last rows of this table illustrate how to bank memory on the upper and lower indexes of a 2D array, respectively.
### Code Example

```c
local int __attribute__((
  numbanks(2),
  bankwidth(8))
) lmem[2][4];
```

**Bank Geometry**

<p>| | | | |</p>
<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>0,0</td>
<td>0,1</td>
<td>0,2</td>
<td>0,3</td>
</tr>
<tr>
<td>1,0</td>
<td>1,1</td>
<td>1,2</td>
<td>1,3</td>
</tr>
</tbody>
</table>

- **Bank 0**
- **Bank 1**

```c
local int __attribute__((
  numbanks(2),
  bankwidth(4))
) lmem[2][4];
```

**Bank Geometry**

<p>| | | | |</p>
<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>0,0</td>
<td>0,1</td>
<td>0,2</td>
<td>0,3</td>
</tr>
<tr>
<td>1,0</td>
<td>1,1</td>
<td>1,2</td>
<td>1,3</td>
</tr>
</tbody>
</table>

- **Bank 0**
- **Bank 1**

```c
local int __attribute__((
  numbanks(4),
  bankwidth(8))
) lmem[2][4];
```

**Bank Geometry**

<p>| | | | |</p>
<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>0,0</td>
<td>0,1</td>
<td>0,2</td>
<td>0,3</td>
</tr>
<tr>
<td>1,0</td>
<td>1,1</td>
<td>1,2</td>
<td>1,3</td>
</tr>
</tbody>
</table>

- **Bank 0**
- **Bank 1**
- **Bank 2**
- **Bank 3**

```c
local int __attribute__((
  numbanks(4),
  bankwidth(4))
) lmem[2][4];
```

**Bank Geometry**

<p>| | | | |</p>
<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>0,0</td>
<td>0,1</td>
<td>0,2</td>
<td>0,3</td>
</tr>
<tr>
<td>1,0</td>
<td>1,1</td>
<td>1,2</td>
<td>1,3</td>
</tr>
</tbody>
</table>

- **Bank 0**
- **Bank 1**
- **Bank 2**
- **Bank 3**

---

### Related Information

**Kernel Attributes for Configuring Local Memory System**

**Optimize Accesses to Local Memory by Controlling the Memory Replication Factor**

The memory replication factor is the number of M20K memory blocks that your design uses to implement the local memory system. To control the memory replication factor, include the `singlepump` or `doublepump` kernel attribute in your OpenCL kernel. The `singlepump` and `doublepump` kernel attributes are part of Altera SDK for OpenCL’s advanced features.

Altera's M20K memory blocks have two physical ports. The number of logical ports that are available in each M20K block depends on the degree of pumping. Pumping is a measure of the clock frequency of the M20K blocks relative to the rest of the design.

Consider an example design where the kernel specifies three read ports and one write port for the local memory system, `lmem`. As shown in the code example below, including the `singlepump` kernel attribute in the local variable declaration indicates that the M20K blocks will run at the same frequency as the rest of the design.

```c
int __attribute__((memory,
  numbanks(1),
  bankwidth(64),
  singlepump,
  ...));
```
Each single-pumped M20K block will have two logical ports available. Each write port in the local memory system must be connected to all the M20K blocks that your design uses to implement the memory system. Each read port in the local memory system must be connected to one M20K block. Because of these connection constraints, there needs to be three M20K blocks to implement the specified number of ports in \texttt{lmem}.

If you include the \texttt{doublepump} kernel attribute in your local variable declaration, you specify that the M20K memory blocks will run at double the frequency as the rest of the design.

Each double-pumped M20K block will have four logical ports available. As such, there only needs to be one M20K block to implement all three read ports and one write port in \texttt{lmem}.

**Attention:** Double pumping the memory increases resource overhead. Use the \texttt{doublepump} kernel attribute only if it results in actual M20K savings or improves performance, or both.
Strategies for Optimizing FPGA Area Usage

Area usage is an important design consideration if your OpenCL kernels are executable on FPGAs of different sizes. When you design your OpenCL application, Altera recommends that you follow certain design strategies for optimizing hardware area usage.

Optimizing kernel performance generally requires additional FPGA resources. In contrast, area optimization often results in performance decreases. During kernel optimization, Altera recommends that you run multiple versions of the kernel on the FPGA board to determine the kernel programming strategy that generates the best size versus performance trade-off.

Compilation Considerations

You can direct the Altera Offline Compiler to perform area usage analysis during kernel compilation.

1. To review the estimated resource usage summary on-screen, compile your kernel by including the --report flag in your aoc command. To review kernel-specific area usage information, refer to the `<your_kernel_filename>.aoco-area-report.html` or the `<your_kernel_filename>.aoctx-area-report.html` file in the current working directory.

2. If possible, perform floating-point computations by compiling your OpenCL kernel with the --fpc or --fp-relaxed option of the aoc command.

For more usage information on the --report, --fp-relaxed and --fpc options, refer to the Displaying Estimated Resource Usage Summary (--report), Relaxing Order of Floating-Point Operations (--fp-relaxed), and Reducing Floating-Point Operations (--fpc) sections of the Altera SDK for OpenCL Programming Guide.

For more information on floating-point operations, refer to Optimize Floating-Point Operations.

Related Information

- Review Your Kernel’s Area Report to Identify Inefficiencies in Resource Usage on page 1-22
- Displaying the Estimated Resource Usage Summary On-Screen (--report)
- Relaxing the Order of Floating-Point Operations (--fp-relaxed)
- Reducing Floating-Point Rounding Operations (--fpc)
- Optimize Floating-Point Operations on page 1-14
- Accessing the Area Report

Board Variant Selection Considerations

Target a board variant in your Custom Platform that provides only the external connectivity resources you require.

For example, if your kernel requires one external memory bank, target a board variant that only supports a single external memory bank. Targeting a board with multiple external memory banks increases the area usage of your kernel unnecessarily.
If your Custom Platform does not provide a board variant that meets your needs, consider creating a board variant. Consult the *Altera SDK for OpenCL Custom Platform Toolkit User Guide* for more information.

**Related Information**

*Altera SDK for OpenCL Custom Platform Toolkit User Guide*

### Memory Access Considerations

Altera recommends kernel programming strategies that can improve memory access efficiency and reduce area usage of your OpenCL kernel.

1. Minimize the number of access points to external memory.
   
   If possible, structure your kernel such that it reads its input from one location, processes the data internally, and then writes the output to another location.

2. Instead of relying on local or global memory accesses, structure your kernel as a single work-item with shift register inference whenever possible.

3. Instead of creating a kernel that writes data to external memory and a kernel that reads data from external memory, implement the Altera SDK for OpenCL channels extension between the kernels for direct data transfer.

4. If your OpenCL application includes many separate constant data accesses, declare the corresponding pointers using `__constant` instead of `__global const`. Declaration using `__global const` creates a private cache for each load or store operation. On the other hand, declaration using `__constant` creates a single constant cache on the chip only.

   **Caution:** If your kernel targets a Cyclone® V device (for example, Cyclone V SoC), declaring `__constant` pointer kernel arguments might degrade FPGA performance.

5. If your kernel passes a small number of constant arguments, pass them as values instead of pointers to global memory.

   For example, instead of passing `__constant int * coef` and then dereferencing `coef` with index 0 to 10, pass `coef` as a value (`int16 coef`). If `coef` was the only `__constant` pointer argument, passing it as a value eliminates the constant cache and the corresponding load and store operations completely.

6. Conditionally shifting large shift registers inside pipelined loops leads to the creation of inefficient hardware. For example, the following kernel consumes more resources when the `if (K > 5)` condition is present:

   ```c
   #define SHIFT_REG_LEN 1024
   __kernel void bad_shift_reg (__global int * restrict src,
                               __global int * restrict dst,
                               int K)
   {
     float shift_reg[SHIFT_REG_LEN];
     int sum = 0;

     for (unsigned i = 0; i < K; i++)
     {
       sum += shift_reg[0];
       shift_reg[SHIFT_REG_LEN-1] = src[i];

       // This condition will cause severe area bloat.
       if (K > 5)
   ```
\begin{verbatim}
#pragma unroll
for (int m = 0; m < SHIFT_REG_LEN-1 ; m++)
{
    shift_reg[m] = shift_reg[m + 1];
}
dst[i] = sum;
\end{verbatim}

**Attention:** Conditionally accessing a shift register does not degrade hardware efficiency. If it is necessary to implement conditional shifting of a large shift register in your kernel, consider modifying your code so that it uses local memory.

**Arithmetic Operation Considerations**

Select the appropriate arithmetic operation for your OpenCL application to avoid excessive FPGA area usage.

1. Introduce floating-point arithmetic operations only when necessary.

2. The Altera Offline Compiler defaults floating-point constants to double data type. Add an \( \varepsilon \) designation to the constant to make it a single precision floating-point operation.

   For example, the arithmetic operation \( \sin(1.0) \) represents a double precision floating-point sine function. The arithmetic operation \( \sin(1.0\varepsilon) \) represents a single precision floating-point sine function.

3. If you do not require full precision result for a complex function, compute simpler arithmetic operations to approximate the result. Consider the following example scenarios:
   a. Instead of computing the function \( \text{pow}(x,n) \) where \( n \) is a small value, approximate the result by performing repeated squaring operations because they require much less hardware resources and area.
   b. Ensure you are aware of the original and approximated area usages because in some cases, computing a result via approximation might result in excess area usage. For example, the \( \sqrt{} \) function is not resource-intensive. Other than a rough approximation, replacing the \( \sqrt{} \) function with arithmetic operations that the host has to compute at runtime might result in larger area usage.
   c. If you work with a small set of input values, consider using a LUT instead.

4. If your kernel performs a complex arithmetic operation with a constant that the AOC computes at compilation time (for example, \( \log(\pi/2.0) \)), perform the arithmetic operation on the host instead and pass the result as an argument to the kernel at runtime.

**Data Type Selection Considerations**

Select the appropriate data type to optimize the FPGA area usage by your OpenCL application.

1. Select the most appropriate data type for your application.

   For example, do not define your variable as `float` if the data type `short` is sufficient.

2. Ensure that both sides of an arithmetic expression belong to the same data type.
Consider an example where one side of an arithmetic expression is a floating-point value and the other side is an integer. The mismatched data types cause the Altera Offline Compiler to create implicit conversion operators, which can become expensive if they are present in large numbers.

3. Take advantage of padding if it exists in your data structures.

For example, if you only need float3 data type, which has the same size as float4, you may change the data type to float4 to make use of the extra dimension to carry an unrelated value.

**Additional Information**

For additional information, demonstrations and training options, visit the Altera SDK for OpenCL product page on the Altera website.

**Related Information**

Altera SDK for OpenCL product page on the Altera website

**Document Revision History**
<table>
<thead>
<tr>
<th>Date</th>
<th>Version</th>
<th>Changes</th>
</tr>
</thead>
</table>
| May 2016   | 2016.05.02 | • Added the topic *Removing Loop-Carried Dependencies Caused by Accesses to Memory Arrays* to introduce the `ivdep` pragma.  
• Under *Strategies for Improving Memory Access Efficiency*, added the following topics to explain how to use the `numbanks` and `bankwidth` kernel attributes to configure the geometry of local memory system:  
  • *Improve Kernel Performance by Banking the Local Memory*  
  • *Optimize the Geometric Configuration of Local Memory Banks Based on Array Index*  
• Under *Strategies for Improving Memory Access Efficiency*, added the topic *Optimize Accesses to Local Memory by Controlling the Memory Replication Factor* to explain the usage of the `singlepump` and `doublepump` kernel attributes.  
• Added information on the area report messages. Refer to the *Review Your Kernel’s Area Report to Identify Inefficiencies in Resource Usage* section for more information.  
• Removed the *Kernel-Specific Area Report* section because it is replaced by the enhanced area report. Refer to the *Review Your Kernel’s Area Report to Identify Inefficiencies in Resource Usage* section for more information.  
• Updated the subsections under *Optimization Report* to include the enhanced optimization report messages.  
  • Added the *Optimization Report Message for Speed-Limiting Constructs*  
• Updated the subsections under *Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback* to include the enhanced optimization report messages.  
• Updated the figure *Optimization Work Flow for a Single Work-Item Kernel* to include steps on accessing the enhanced area report to review resource usage.  
• Under *Strategies for Improving NDRange Kernel Data Processing Efficiency*, added the *Review Kernel Properties and Loop Unroll Status in the Optimization Report* section. |
| November 2015 | 2015.11.02 | • Added the topic *Multi-Threaded Host Application*.  
• Added Caution note regarding memory barrier in *Specify a Maximum Work-Group Size or a Required Work-Group Size*. |
<table>
<thead>
<tr>
<th>Date</th>
<th>Version</th>
<th>Changes</th>
</tr>
</thead>
</table>
| May 2015   | 15.0.0  | - In *Memory Access Considerations*, added Caution note regarding performance degradation that might occur when declaring `__constant` pointer arguments in kernels targeting Cyclone® V devices.  
- In *Good Design Practices for Single Work-Item Kernel*, removed the *Initialize Data Prior to Usage in a Loop* section and added a *Declare Variables in the Deepest Scope Possible* section.  
- Added *Removing Loop-Carried Dependency by Inferring Shift Registers*. The topic discusses how, in single work-item kernels, inferring double precision floating-point array as a shift register can remove loop-carried dependencies.  
- Added *Kernel-Specific Area Reports* to show examples of kernel-specific `.area` files that the Altera Offline Compiler generates during compilation.  
- Renamed *Transfer Data Via AOCL Channels* to *Transfer Data Via AOCL Channels or OpenCL Pipes* and added the following:  
  - More information on how channels can help improve kernel performance.  
  - Information on OpenCL pipes.  
- Renamed *Data Type Considerations* to *Data Type Selection Considerations*.  |
- Included new optimization report messages detailing the reasons for unsuccessful and suboptimal pipelined executions.  
- Added the *Transferring Loop-Carried Dependency to Local Memory* subsection under *Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback* to describe new strategy for resolving loop-carried dependency.  
- Updated the Resource-Driven Optimization and Compilation Considerations sections to reflect the deprecation of the `-O3` and `--util <N>` Altera Offline Compiler (AOC) command options.  
- Consolidated and simplified the *Heterogeneous Memory Buffers* and *Host Application Modifications for Heterogeneous Memory Accesses* sections.  
- Added the section *Align a Struct and Remove Padding between Struct Fields*.  
- Removed the section *Ensure 4-Byte Alignment to All Data Structures*.  
- Modified the figure *Single Work-Item Optimization Work Flow* to include emulation and profiling. |
<table>
<thead>
<tr>
<th>Date</th>
<th>Version</th>
<th>Changes</th>
</tr>
</thead>
<tbody>
<tr>
<td>June 2014</td>
<td>14.0.0</td>
<td>• Renamed document as the <em>Altera SDK for OpenCL Best Practices Guide</em>.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Reorganized information flow.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Renamed <em>Good Design Practices</em> to <em>Good OpenCL Kernel Design Practices</em>.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Added channels information in <em>Transfer data via AOCL Channels</em>.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Added profiler information in <em>Profile Your Kernel to Identify Performance Bottlenecks</em>.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Added the section <em>Single Work-Item Kernel Versus NDRange Kernel</em>.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Updated <em>Single Work-Item Execution</em> section.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Removed <em>Performance Warning Messages</em> section.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Added the section <em>Strategies for Improving Single Work-Item Kernel Performance</em>.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Renamed <em>Optimization of Data Processing Efficiency</em> to <em>Strategies for Improving NDRange Kernel Data Processing Efficiency</em>.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Removed <em>Resource Sharing</em> section.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Renamed <em>Floating-Point Operations</em> to <em>Optimize Floating-Point Operations</em>.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Renamed <em>Optimization of Memory Access Efficiency</em> to <em>Strategies for Improving Memory Access Efficiency</em>.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Updated <em>Manual Partitioning of Global Memory</em> section.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Added the section <em>Strategies for Optimizing FPGA Area Usage</em>.</td>
</tr>
<tr>
<td>December 2013</td>
<td>13.1.1</td>
<td>• Updated the section <em>Specify a Maximum Work-Group Size or a Required Work-Group Size</em>.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Added the section <em>Heterogeneous Memory Buffers</em>.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Updated the section <em>Single Work-Item Execution</em>.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Added the section <em>Performance Warning Messages</em>.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Updated the section <em>Single Work-Item Kernel Programming Considerations</em>.</td>
</tr>
</tbody>
</table>
• Reorganized information flow.
• Updated the section Altera SDK for OpenCL Compilation Flow.
• Updated the section Pipelines; inserted the figure Example Multistage Pipeline Diagram.
• Removed the following figures:
  • Instruction Flow through a Five-Stage Pipeline Processor.
  • Vector Addition Kernel Compiled to an FPGA.
  • Effect of Kernel Vectorization on Array Summation.
  • Data Flow Implementation of a Four-Element Accumulation Kernel.
  • Data Flow Implementation of a Four-Element Accumulation Kernel with Loop Unrolled.
  • Complete Loop Unrolling.
  • Unrolling Two Loop Iterations.
  • Memory Master Interconnect.
  • Local Memory Read and Write Ports.
  • Local Memory Configuration.
• Updated the section Good Design Practices.
• Removed the following sections:
  • Predicated Execution.
  • Throughput Analysis.
  • Case Studies.
• Updated and renamed Optimizing Data Processing Efficiency to Optimization of Data Processing Efficiency.
• Renamed Replicating Compute Units versus Kernel SIMD Vectorization to Compute Unit Replication versus Kernel SIMD Vectorization.
• Renamed Using num_compute_units and num_simd_work_items Together to Combination of Compute Unit Replication and Kernel SIMD Vectorization.
• Updated and renamed Memory Streaming to Contiguous Memory Accesses.
• Updated and renamed Optimizing Memory Access to General Guidelines on Optimizing Memory Accesses.
• Updated and renamed Optimizing Memory Efficiency to Optimization of Memory Access Efficiency.
• Inserted the subsection Single Work-Item Execution under Optimization of Memory Access Efficiency.
<table>
<thead>
<tr>
<th>Date</th>
<th>Version</th>
<th>Changes</th>
</tr>
</thead>
<tbody>
<tr>
<td>June 2013</td>
<td>13.0 SP1.0</td>
<td>• Updated support status of OpenCL kernel source code containing complex exit paths.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Updated the figure <em>Effect of Kernel Vectorization on Array Summation</em> to correct the data flow between Store and Global Memory.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Updated content for the <code>unroll</code> pragma directive in the section <em>Loop Unrolling</em>.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Updated content of the <em>Local Memory</em> section.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Updated the figure <em>Local Memories Transferring Data Blocks within Matrices A and B</em> to correct the data transfer pattern in Matrix B.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Removed the figure <em>Loop Unrolling with Vectorization</em>.</td>
</tr>
<tr>
<td></td>
<td></td>
<td>• Removed the section <em>Optimizing Local Memory Bandwidth</em>.</td>
</tr>
<tr>
<td>May 2013</td>
<td>13.0.1</td>
<td>• Updated terminology. For example, pipeline is replaced with compute unit; vector lane is replaced with SIMD vector lane.</td>
</tr>
</tbody>
</table>
|            |         | • Added the following sections under *Good Design Practices*:
|            |         |   • *Preprocessor Macros*.                                                                  |
|            |         |   • *Floating-Point versus Fixed-Point Representations*.                                    |
|            |         |   • *Recommended Optimization Methodology*.                                                   |
|            |         |   • *Sequence of Optimization Techniques*.                                                    |
|            |         | • Updated code fragments.                                                                   |
|            |         | • Updated the figure *Data Flow with Multiple Compute Units*.                               |
|            |         | • Updated the figure *Compute Unit Replication versus Kernel SIMD Vectorization*.           |
|            |         | • Updated the figure *Optimizing Throughput Using Compute Unit Replication and SIMD Vectorization*. |
|            |         | • Updated the figure *Memory Streaming*.                                                     |
|            |         | • Inserted the figure *Local Memories Transferring Data Blocks within Matrices A and B*.    |
|            |         | • Reorganized the flow of information. Number of figures, tables, and examples have been updated. |
|            |         | • Included information on new kernel attributes: `max_share_resources` and `num_share_resources`. |
| May 2013   | 13.0.0  | • Updated pipeline discussion.                                                               |
|            |         | • Updated case study code examples and results tables.                                       |
|            |         | • Updated figures.                                                                          |
| November 2012 | 12.1.0 | Initial release.                                                                            |