- Added the following new topics:
- Minor editorial modification.
- Rebranded the Altera SDK for OpenCL to .
- Rebranded the Altera Offline Compiler to .
- In Align a Struct with or without Padding, modified code snippets to correct the placement of attributes with respect to the struct declaration.
- Added the topic Review Your Kernel's report.html File, with subtopics describing the HTML GUI, the various reports the GUI provides, and a walkthrough on how to leverage the information in the HTML report to optimize an OpenCL design example.
- Changed Review Your Kernel's Area Report to Identify Inefficiencies in Resource Usage to HTML Report: Area Report Messages, and removed the following subsections:
  - Area Report Messages for Global Memory and Global Memory Interconnect
  - Area Report Messages for Local Memory
  - Area Report Messages for Channels
- Added the topic HTML Report: Kernel Design Concepts, which includes subtopics on kernels, global memory interconnect, local memory, nested loops, loops in single work-item kernels, and channels.
- In Interpreting the Profiling Information, reorganized the content and added the following:
  - Additional explanations on stall, occupancy, bandwidth, activity, and cache hit.
  - Suggestions on addressing unsatisfactory Profiler metrics.
- In Addressing Single Work-Item Kernel Dependencies Based On Optimization Report Feedback, modified the figure Optimization Work Flow of a Single Work-Item Kernel to replace area report with HTML report.
- Removed the Optimization Report section along with the associated subsections because the information is now part of the HTML report.
- Changed Review Kernel Properties and Loop Unroll Status in the Optimization Report to Review Kernel Properties and Loop Unroll Status in the HTML Report because the optimization report is now part of the report.html file.
- Added the topic Removing Loop-Carried Dependencies Caused by Accesses to Memory Arrays to introduce the ivdep pragma.
- Under Strategies for Improving Memory Access Efficiency, added the following topics to explain how to use the numbanks and bankwidth kernel attributes to configure the geometry of the local memory system:
  - Improve Kernel Performance by Banking the Local Memory
  - Optimize the Geometric Configuration of Local Memory Banks Based on Array Index
- Under Strategies for Improving Memory Access Efficiency, added the topic Optimize Accesses to Local Memory by Controlling the Memory Replication Factor to explain the usage of the singlepump and doublepump kernel attributes.
- Added information on the area report messages. Refer to the Review Your Kernel's Area Report to Identify Inefficiencies in Resource Usage section for more information.
- Removed the Kernel-Specific Area Report section because it is replaced by the enhanced area report. Refer to the Review Your Kernel's Area Report to Identify Inefficiencies in Resource Usage section for more information.
- Updated the subsections under Optimization Report to include the enhanced optimization report messages.
- Added the Optimization Report Message for Speed-Limiting Constructs.
- Updated the subsections under Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback to include the enhanced optimization report messages.
- Updated the figure Optimization Work Flow for a Single Work-Item Kernel to include steps on accessing the enhanced area report to review resource usage.
- Under Strategies for Improving NDRange Kernel Data Processing Efficiency, added the Review Kernel Properties and Loop Unroll Status in the Optimization Report section.
- Added the topic Multi-Threaded Host Application.
- Added a Caution note regarding memory barriers in Specify a Maximum Work-Group Size or a Required Work-Group Size.
- In Memory Access Considerations, added a Caution note regarding performance degradation that might occur when declaring __constant pointer arguments in kernels targeting Cyclone® V devices.
- In Good Design Practices for Single Work-Item Kernel, removed the Initialize Data Prior to Usage in a Loop section and added a Declare Variables in the Deepest Scope Possible section.
- Added Removing Loop-Carried Dependency by Inferring Shift Registers. The topic discusses how, in single work-item kernels, inferring a double-precision floating-point array as a shift register can remove loop-carried dependencies.
- Added Kernel-Specific Area Reports to show examples of kernel-specific .area files that the Altera Offline Compiler generates during compilation.
- Renamed Transfer Data Via offline compiler Channels to Transfer Data Via offline compiler Channels or OpenCL Pipes and added the following:
  - More information on how channels can help improve kernel performance.
  - Information on OpenCL pipes.
- Renamed Data Type Considerations to Data Type Selection Considerations.
- Reorganized the information flow in the Optimization Report Messages section to update report messages and the layout of the optimization report.
- Included new optimization report messages detailing the reasons for unsuccessful and suboptimal pipelined executions.
- Added the Optimization Report Messages for Simplified Analysis of a Complex Design subsection under Optimization Report Messages to describe the new report message for simplified kernel analysis.
- Renamed Using Feedback from the Optimization Report to Address Single Work-Item Kernels Dependencies to Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback.
- Added the Transferring Loop-Carried Dependency to Local Memory subsection under Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback to describe a new strategy for resolving loop-carried dependencies.
- Updated the Resource-Driven Optimization and Compilation Considerations sections to reflect the deprecation of the -O3 and --util <N> Altera® Offline Compiler (offline compiler) command options.
- Consolidated and simplified the Heterogeneous Memory Buffers and Host Application Modifications for Heterogeneous Memory Accesses sections.
- Added the section Align a Struct and Remove Padding between Struct Fields.
- Removed the section Ensure 4-Byte Alignment to All Data Structures.
- Modified the figure Single Work-Item Optimization Work Flow to include emulation and profiling.
- Renamed document as the Best Practices Guide.
- Reorganized information flow.
- Renamed Good Design Practices to Good OpenCL Kernel Design Practices.
- Added channels information in Transfer Data Via offline compiler Channels.
- Added profiler information in Profile Your Kernel to Identify Performance Bottlenecks.
- Added the section Single Work-Item Kernel Versus NDRange Kernel.
- Updated Single Work-Item Execution section.
- Removed Performance Warning Messages section.
- Renamed Single Work-Item Kernel Programming Considerations to Good Design Practices for Single Work-Item Kernel.
- Added the section Strategies for Improving Single Work-Item Kernel Performance.
- Renamed Optimization of Data Processing Efficiency to Strategies for Improving NDRange Kernel Data Processing Efficiency.
- Removed Resource Sharing section.
- Renamed Floating-Point Operations to Optimize Floating-Point Operations.
- Renamed Optimization of Memory Access Efficiency to Strategies for Improving Memory Access Efficiency.
- Updated Manual Partitioning of Global Memory section.
- Added the section Strategies for Optimizing FPGA Area Usage.
- Updated the section Specify a Maximum Work-Group Size or a Required Work-Group Size.
- Added the section Heterogeneous Memory Buffers.
- Updated the section Single Work-Item Execution.
- Added the section Performance Warning Messages.
- Updated the section Single Work-Item Kernel Programming Considerations.
November 2013
- Reorganized information flow.
- Updated the section Compilation Flow.
- Updated the section Pipelines; inserted the figure Example Multistage Pipeline Diagram.
- Removed the following figures:
  - Instruction Flow through a Five-Stage Pipeline Processor.
  - Vector Addition Kernel Compiled to an FPGA.
  - Effect of Kernel Vectorization on Array Summation.
  - Data Flow Implementation of a Four-Element Accumulation Kernel.
  - Data Flow Implementation of a Four-Element Accumulation Kernel with Loop Unrolled.
  - Complete Loop Unrolling.
  - Unrolling Two Loop Iterations.
  - Memory Master Interconnect.
  - Local Memory Read and Write Ports.
  - Local Memory Configuration.
- Updated the section Good Design Practices.
- Removed the following sections:
  - Predicated Execution.
  - Throughput Analysis.
  - Case Studies.
- Updated and renamed Optimizing Data Processing Efficiency to Optimization of Data Processing Efficiency.
- Renamed Replicating Compute Units versus Kernel SIMD Vectorization to Compute Unit Replication versus Kernel SIMD Vectorization.
- Renamed Using num_compute_units and num_simd_work_items Together to Combination of Compute Unit Replication and Kernel SIMD Vectorization.
- Updated and renamed Memory Streaming to Contiguous Memory Accesses.
- Updated and renamed Optimizing Memory Access to General Guidelines on Optimizing Memory Accesses.
- Updated and renamed Optimizing Memory Efficiency to Optimization of Memory Access Efficiency.
- Inserted the subsection Single Work-Item Execution under Optimization of Memory Access Efficiency.
- Updated support status of OpenCL kernel source code containing complex exit paths.
- Updated the figure Effect of Kernel Vectorization on Array Summation to correct the data flow between Store and Global Memory.
- Updated content for the unroll pragma directive in the section Loop Unrolling.
- Updated content of the Local Memory section.
- Updated the figure Local Memories Transferring Data Blocks within Matrices A and B to correct the data transfer pattern in Matrix B.
- Removed the figure Loop Unrolling with Vectorization.
- Removed the section Optimizing Local Memory Bandwidth.
- Updated terminology. For example, pipeline is replaced with compute unit; vector lane is replaced with SIMD vector lane.
- Added the following sections under Good Design Practices:
  - Preprocessor Macros.
  - Floating-Point versus Fixed-Point Representations.
  - Recommended Optimization Methodology.
  - Sequence of Optimization Techniques.
- Updated code fragments.
- Updated the figure Data Flow with Multiple Compute Units.
- Updated the figure Compute Unit Replication versus Kernel SIMD Vectorization.
- Updated the figure Optimizing Throughput Using Compute Unit Replication and SIMD Vectorization.
- Updated the figure Memory Streaming.
- Inserted the figure Local Memories Transferring Data Blocks within Matrices A and B.
- Reorganized the flow of information. A number of figures, tables, and examples have been updated.
- Included information on new kernel attributes: max_share_resources and num_share_resources.
- Updated pipeline discussion.
- Updated case study code examples and results tables.
- Updated figures.