Intel® FPGA SDK for OpenCL™ Standard Edition: Best Practices Guide

ID 683176

Date 9/24/2018

Version 18.1

Public

A.1. Document Revision History for the Standard Edition Best Practices Guide

Document Version	Intel® Quartus® Prime Version	Changes
2018.09.24	18.1	Maintenance release
2018.05.04	18.0	Removed Pro Edition information. In Preloading Data to Local Memory, added a recommendation that local array elements be a power of 2 bytes in size. Removed the topic Resource-Driven Optimization because it described an obsolete optimization behavior.

Table 12. Best Practices Guide Document Revision History
Date	Version	Changes
December 2017	2017.12.08	Added the following new topics: Autorun Captures Tab; Autorun Profiler Data.
November 2017	2017.11.06	Moved all topics into individual chapters. Changed some of the topic titles to task-based titles. Changed all occurrences of Fmax to f_max. Rebranded Dynamic Profiler to . Added a new short description to Stall, Occupancy, Bandwidth. Added a new image showing a comparison between parallel threads and loop pipelining, along with an explanation, to Multi-Threaded Host Application. Added an FPGA architecture along with some explanation in FPGA Overview. Added OpenCL Design Components image to HTML Report: Kernel Design Concepts. Added an important note to Aligning a Struct with or without Padding about 4-byte alignment and removed information related to a struct that is aligned and not padded. Added two bullet points to the last Attention section in Optimizing Accesses to Local Memory by Controlling the Memory Replication Factor. Added Minimizing the Memory Dependencies for Loop Pipelining. Added area report hierarchy details to Reviewing Area Information. Added Best Practices for Channels and Pipes. Updated Allocating Aligned Memory. Added Reducing the Area Consumed by Nested Loops Using loop_coalesce. Added Changing the Memory Access Pattern Example. Updated the image Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback. In the following topics, implemented the single dash and `-option=<value>` conventions for the aoc command: Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback; Optimizing Floating-Point Operations; Manual Partitioning of Global Memory; Constant Cache Memory; Compilation Considerations; High Stall and High Occupancy Percentages. In Source Code Tab and Tool Tip Options, updated the images to reflect Intel. In High Stall Percentage, added a screenshot for high stall percentage identification along with relevant explanation. In Local Memory, added a sentence about the overall state of the local memory as observed in the HTML report. In Load-Store Units, updated the description of semi-streaming LSU to describe how data travels throughout the block. Added new example code and relevant explanation to Nested Loops. Updated the code fragment in the Pipelines section by removing the `index` keyword. Updated Figure 4. In Single Work-Item Kernel versus NDRange Kernel: removed the criteria for creating single work-item kernels for your design; added new example code and relevant explanation; removed the subtopic on Single Work-Item Execution and merged its content with this topic.
May 2017	2017.05.08	Rebranded some functions in code examples as follows: Rebranded read_channel_altera to read_channel_intel. Rebranded write_channel_altera to write_channel_intel. Rebranded read_channel_nb_altera to read_channel_nb_intel. Rebranded write_channel_nb_altera to write_channel_nb_intel. Added Load-Store Units. Added Reviewing the Report Summary. Added Features of the Kernel Memory Viewer. Revised the Local Memory Banks section of Local Memory to include information about the `bank_bits` attribute. Revised flowchart in Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback to reflect changes to the profiling commands.
December 2016	2016.12.02	Minor editorial modification.
October 2016	2016.10.31	Rebranded the Altera SDK for OpenCL to . Rebranded the Altera Offline Compiler to . In Align a Struct with or without Padding, modified code snippets to correct the placement of attributes with respect to the struct declaration. Added the topic Review Your Kernel's report.html File, with subtopics describing the HTML GUI, the various reports the GUI provides, and a walkthrough on how to leverage the information in the HTML report to optimize an OpenCL design example. Changed Review Your Kernel's Area Report to Identify Inefficiencies in Resource Usage to HTML Report: Area Report Messages, and removed the following subsections: Area Report Messages for Global Memory and Global Memory Interconnect; Area Report Messages for Local Memory; Area Report Messages for Channels. Added the topic HTML Report: Kernel Design Concepts, which includes subtopics on kernels, global memory interconnect, local memory, nested loops, loops in single work-item kernels, and channels. In Interpreting the Profiling Information, reorganized the content and added the following: additional explanations on stall, occupancy, bandwidth, activity, and cache hit; suggestions on addressing unsatisfactory Profiler metrics. In Addressing Single Work-Item Kernel Dependencies Based On Optimization Report Feedback, modified the figure Optimization Work Flow of a Single Work-Item Kernel to replace area report with HTML report. Removed the Optimization Report section along with the associated subsections because the information is now part of the HTML report. Changed Review Kernel Properties and Loop Unroll Status in the Optimization Report to Review Kernel Properties and Loop Unroll Status in the HTML Report because the optimization report is now part of the report.html file.
May 2016	2016.05.02	Added the topic Removing Loop-Carried Dependencies Caused by Accesses to Memory Arrays to introduce the `ivdep` pragma. Under Strategies for Improving Memory Access Efficiency, added the following topics to explain how to use the `numbanks` and `bankwidth` kernel attributes to configure the geometry of the local memory system: Improve Kernel Performance by Banking the Local Memory; Optimize the Geometric Configuration of Local Memory Banks Based on Array Index. Under Strategies for Improving Memory Access Efficiency, added the topic Optimize Accesses to Local Memory by Controlling the Memory Replication Factor to explain the usage of the `singlepump` and `doublepump` kernel attributes. Added information on the area report messages. Refer to the Review Your Kernel's Area Report to Identify Inefficiencies in Resource Usage section for more information. Removed the Kernel-Specific Area Report section because it is replaced by the enhanced area report. Refer to the Review Your Kernel's Area Report to Identify Inefficiencies in Resource Usage section for more information. Updated the subsections under Optimization Report to include the enhanced optimization report messages. Added the Optimization Report Message for Speed-Limiting Constructs. Updated the subsections under Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback to include the enhanced optimization report messages. Updated the figure Optimization Work Flow for a Single Work-Item Kernel to include steps on accessing the enhanced area report to review resource usage. Under Strategies for Improving NDRange Kernel Data Processing Efficiency, added the Review Kernel Properties and Loop Unroll Status in the Optimization Report section.
November 2015	2015.11.02	Added the topic Multi-Threaded Host Application. Added Caution note regarding memory barrier in Specify a Maximum Work-Group Size or a Required Work-Group Size.
May 2015	15.0.0	In Memory Access Considerations, added Caution note regarding performance degradation that might occur when declaring __constant pointer arguments in kernels targeting Cyclone® V devices. In Good Design Practices for Single Work-Item Kernel, removed the Initialize Data Prior to Usage in a Loop section and added a Declare Variables in the Deepest Scope Possible section. Added Removing Loop-Carried Dependency by Inferring Shift Registers. The topic discusses how, in single work-item kernels, inferring double precision floating-point array as a shift register can remove loop-carried dependencies. Added Kernel-Specific Area Reports to show examples of kernel-specific .area files that the Altera Offline Compiler generates during compilation. Renamed Transfer Data Via offline compiler Channels to Transfer Data Via offline compiler Channels or OpenCL Pipes and added the following: More information on how channels can help improve kernel performance. Information on OpenCL pipes. Renamed Data Type Considerations to Data Type Selection Considerations.
December 2014	14.1.0	Reorganized the information flow in the Optimization Report Messages section to update report messages and the layout of the optimization report. Included new optimization report messages detailing the reasons for unsuccessful and suboptimal pipelined executions. Added the Optimization Report Messages for Simplified Analysis of a Complex Design subsection under Optimization Report Messages to describe new report message for simplified kernel analysis. Renamed Using Feedback from the Optimization Report to Address Single Work-Item Kernels Dependencies to Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback. Added the Transferring Loop-Carried Dependency to Local Memory subsection under Addressing Single Work-Item Kernel Dependencies Based on Optimization Report Feedback to describe new strategy for resolving loop-carried dependency. Updated the Resource-Driven Optimization and Compilation Considerations sections to reflect the deprecation of the -O3 and `--util <N>` Altera® Offline Compiler (offline compiler) command options. Consolidated and simplified the Heterogeneous Memory Buffers and Host Application Modifications for Heterogeneous Memory Accesses sections. Added the section Align a Struct and Remove Padding between Struct Fields. Removed the section Ensure 4-Byte Alignment to All Data Structures. Modified the figure Single Work-Item Optimization Work Flow to include emulation and profiling.
June 2014	14.0.0	Renamed document as the Best Practices Guide. Reorganized information flow. Renamed Good Design Practices to Good OpenCL Kernel Design Practices. Added channels information in Transfer data via offline compiler Channels. Added profiler information in Profile Your Kernel to Identify Performance Bottlenecks. Added the section Single Work-Item Kernel Versus NDRange Kernel. Updated Single Work-Item Execution section. Removed Performance Warning Messages section. Renamed Single Work-Item Kernel Programming Considerations to Good Design Practices for Single Work-Item Kernel. Added the section Strategies for Improving Single Work-Item Kernel Performance. Renamed Optimization of Data Processing Efficiency to Strategies for Improving NDRange Kernel Data Processing Efficiency. Removed Resource Sharing section. Renamed Floating-Point Operations to Optimize Floating-Point Operations. Renamed Optimization of Memory Access Efficiency to Strategies for Improving Memory Access Efficiency. Updated Manual Partitioning of Global Memory section. Added the section Strategies for Optimizing FPGA Area Usage.
December 2013	13.1.1	Updated the section Specify a Maximum Work-Group Size or a Required Work-Group Size. Added the section Heterogeneous Memory Buffers. Updated the section Single Work-Item Execution. Added the section Performance Warning Messages. Updated the section Single Work-Item Kernel Programming Considerations.
November 2013	13.1.0	Reorganized information flow. Updated the section Compilation Flow. Updated the section Pipelines; inserted the figure Example Multistage Pipeline Diagram. Removed the following figures: Instruction Flow through a Five-Stage Pipeline Processor. Vector Addition Kernel Compiled to an FPGA. Effect of Kernel Vectorization on Array Summation. Data Flow Implementation of a Four-Element Accumulation Kernel. Data Flow Implementation of a Four-Element Accumulation Kernel with Loop Unrolled. Complete Loop Unrolling. Unrolling Two Loop Iterations. Memory Master Interconnect. Local Memory Read and Write Ports. Local Memory Configuration. Updated the section Good Design Practices. Removed the following sections: Predicated Execution. Throughput Analysis. Case Studies. Updated and renamed Optimizing Data Processing Efficiency to Optimization of Data Processing Efficiency. Renamed Replicating Compute Units versus Kernel SIMD Vectorization to Compute Unit Replication versus Kernel SIMD Vectorization. Renamed Using num_compute_units and num_simd_work_items Together to Combination of Compute Unit Replication and Kernel SIMD Vectorization. Updated and renamed Memory Streaming to Contiguous Memory Accesses. Updated and renamed Optimizing Memory Access to General Guidelines on Optimizing Memory Accesses. Updated and renamed Optimizing Memory Efficiency to Optimization of Memory Access Efficiency. Inserted the subsection Single Work-Item Execution under Optimization of Memory Access Efficiency.
June 2013	13.0 SP1.0	Updated support status of OpenCL kernel source code containing complex exit paths. Updated the figure Effect of Kernel Vectorization on Array Summation to correct the data flow between Store and Global Memory. Updated content for the `unroll` pragma directive in the section Loop Unrolling. Updated content of the Local Memory section. Updated the figure Local Memories Transferring Data Blocks within Matrices A and B to correct the data transfer pattern in Matrix B. Removed the figure Loop Unrolling with Vectorization. Removed the section Optimizing Local Memory Bandwidth.
May 2013	13.0.1	Updated terminology. For example, pipeline is replaced with compute unit; vector lane is replaced with SIMD vector lane. Added the following sections under Good Design Practices: Preprocessor Macros. Floating-Point versus Fixed-Point Representations. Recommended Optimization Methodology. Sequence of Optimization Techniques. Updated code fragments. Updated the figure Data Flow with Multiple Compute Units. Updated the figure Compute Unit Replication versus Kernel SIMD Vectorization. Updated the figure Optimizing Throughput Using Compute Unit Replication and SIMD Vectorization. Updated the figure Memory Streaming. Inserted the figure Local Memories Transferring Data Blocks within Matrices A and B. Reorganized the flow of information. Number of figures, tables, and examples have been updated. Included information on new kernel attributes: `max_share_resources` and `num_share_resources`.
May 2013	13.0.0	Updated pipeline discussion. Updated case study code examples and results tables. Updated figures.
November 2012	12.1.0	Initial release.