User Guide

Examine Kernel Details

After you identify hotspots, the GPU Roofline Insights perspective enables you to analyze their performance in more depth. Select a dot on the chart and use the GPU Details and Recommendations tabs in the right-side pane to examine code analytics for a specific kernel in more detail and to view actionable recommendations for code optimization.

Get Recommendations

Select a kernel on the Roofline chart and switch to the Recommendations tab to view actionable recommendations that help you optimize compute-bound and memory-bound applications running on the GPU. Expand a recommendation to access a full description and a code sample with a possible solution to the problem.

Review Compute and Memory Bandwidth Utilization

Review how well your kernel uses the compute and memory bandwidth of your hardware in the OP/S and Bandwidth pane. It reports the following metrics:
  • The total number of floating-point and integer operations executed by the kernel per second as a percentage of the maximum compute capacity of your hardware. The red bar represents the dominant operation data type used in the kernel.
  • The amount of data transferred by the kernel at each cache memory level per second as a percentage of that memory level's bandwidth. Cache memory level bandwidth utilization (in percent) is the ratio of the effective bandwidth to the maximum bandwidth of a given memory level. This metric shows how well the kernel uses the capability of your hardware and can help you identify bottlenecks for your kernel.
For example, in the screenshot below, the dominant data type is FLOP and the kernel utilizes 19% of the L3 bandwidth. Based on this value and on the utilization metrics for other memory levels and compute capacity, the Roofline chart displays the L3 bandwidth as the main factor limiting the performance of the kernel.
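The utilization ratio described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how Intel Advisor computes or ranks the metric: the function name, the memory-level names, and all bandwidth numbers are hypothetical, chosen so that L3 comes out at 19% as in the example.

```python
def bandwidth_utilization(effective_gb_s: float, peak_gb_s: float) -> float:
    """Return utilization in percent: effective bandwidth / maximum bandwidth."""
    if peak_gb_s <= 0:
        raise ValueError("peak bandwidth must be positive")
    return 100.0 * effective_gb_s / peak_gb_s


# Hypothetical (effective GB/s, peak GB/s) pairs per memory level.
levels = {
    "L3": (38.0, 200.0),
    "SLM": (5.0, 100.0),
    "DRAM": (4.0, 50.0),
}

utilization = {
    name: bandwidth_utilization(effective, peak)
    for name, (effective, peak) in levels.items()
}
# The level with the highest utilization is a candidate bottleneck.
most_utilized = max(utilization, key=utilization.get)

for name, percent in utilization.items():
    print(f"{name}: {percent:.0f}% of peak bandwidth")
print("Highest utilization:", most_utilized)
```

The sketch only ranks memory levels by utilization. The tool also takes compute capacity into account when it decides which roof limits the kernel.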
Review how your application uses memory levels in the Memory Metrics pane:
  • Review how much time the kernel spends processing requests for each memory level relative to the total time, as reported in the Impacts histogram.
    A large value indicates a memory level that bounds the selected kernel. Examine the difference between the two largest bars to see how much throughput you can gain if you reduce the impact of your main bottleneck. The histogram also gives you a long-term plan for reducing memory-bound limitations: once you solve the problems behind the largest bar, your next issue comes from the second-largest bar, and so on.
    Ideally, L3 or SLM should be the most impactful memory level.
  • Review the amount of data that passes through each memory level, as reported in the Shares histogram.
Data in the Memory Metrics pane is based on the dominant type of operations in your code (FLOAT or INT).
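Reading the Impacts histogram as described above (rank the bars, then compare the two largest) can be sketched as follows. The helper name `top_two_gap` and the impact percentages are hypothetical. The sketch assumes you have copied the per-level impact values from the pane by hand.

```python
from typing import Dict, Tuple


def top_two_gap(impacts: Dict[str, float]) -> Tuple[str, str, float]:
    """Return the two most impactful memory levels and the gap between them."""
    if len(impacts) < 2:
        raise ValueError("need at least two memory levels")
    # Sort levels from the largest impact to the smallest.
    ranked = sorted(impacts.items(), key=lambda item: item[1], reverse=True)
    (first, first_pct), (second, second_pct) = ranked[0], ranked[1]
    return first, second, first_pct - second_pct


# Hypothetical share of total time spent on each memory level, in percent.
impacts = {"L3": 55.0, "SLM": 10.0, "DRAM": 30.0}

first, second, gap = top_two_gap(impacts)
print(f"Main bottleneck: {first}; next in line: {second}; gap: {gap:.0f} points")
```

A wide gap suggests that work on the first level has the most room to pay off before the second level becomes the limiting one.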

Explore Operation Types Used During Application Execution

Examine the types of instructions that the kernel executes in the Instruction Mix pane. For example, in the screenshot below, the kernel mostly executes compute instructions with integer operations, which means that the kernel is mostly compute bound.
Intel Advisor automatically determines the data type used in operations and groups the instructions collected during Characterization analysis into the following categories:
Category: Compute (FLOP and INTOP)
Instruction types:
  • BASIC COMPUTE: add, addc, mul, rndu, rndd, rnde, rndz, subb, avg, frc, lzd, fbh, fbl, cbit
  • BIT: and, not, or, xor, asr, shr, shl, bfrev, bfe, bfi1, bfi2, ror, rol
  • FMA: mac, mach, mad, madm (weight 2)
  • DIV: INT_DIV_BOTH, INT_DIV_QUOTIENT, INT_DIV_REMAINDER, and FDIV types of extended math function
  • POW: extended math function
  • MATH: other function types performed by the math instruction
  • VECTOR: add3 (weight 2), line (weight 2), sad2 (weight 3), dp2 (weight 3), sada2 (weight 4), lrp (weight 4), pln (weight 4), dp3 (weight 5), dph (weight 6), dp4 (weight 7), dp4a (weight 8)
Category: Memory
Instruction types:
  • LOAD, STORE, SLM_LOAD, SLM_STORE types depending on the argument: send, sendc, sends, sendsc
Category: Other
Instruction types:
  • MOVE: mov, sel, movi, smov, csel
  • CONTROL FLOW: if, else, endif, while, break, cont, call, calla, ret, goto, jmpi, brd, brc, join, halt
  • SYNC: wait, sync
  • OTHER: cmp, cmpn, nop, f32to16, f16to32, dim
Category: Atomic
Instruction types:
  • SEND
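The weights listed for compute instructions can be illustrated with a short Python sketch. It rests on two assumptions that the text does not spell out: that a weight is the number of operations one executed instruction contributes, and that "(weight 2)" applies to every FMA instruction rather than only to madm. Instructions without a listed weight are assumed to count as 1. The names `WEIGHTS` and `weighted_ops` and the instruction counts are hypothetical.

```python
from typing import Dict

# Assumption: "(weight 2)" covers all four FMA instructions.
FMA_WEIGHT = 2

# Weights copied from the VECTOR and FMA rows of the table.
WEIGHTS: Dict[str, int] = {
    **{name: FMA_WEIGHT for name in ("mac", "mach", "mad", "madm")},
    "add3": 2, "line": 2, "sad2": 3, "dp2": 3, "sada2": 4, "lrp": 4,
    "pln": 4, "dp3": 5, "dph": 6, "dp4": 7, "dp4a": 8,
}


def weighted_ops(counts: Dict[str, int]) -> int:
    """Sum instruction counts, scaling each by its weight (default 1)."""
    return sum(n * WEIGHTS.get(name, 1) for name, n in counts.items())


# Hypothetical executed-instruction counts for one kernel.
counts = {"add": 100, "mad": 50, "dp4a": 10}
print(weighted_ops(counts))  # 100*1 + 50*2 + 10*8 = 280
```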
Get more insights about the instructions used in your kernel in the Instruction Mix Details pane:
  • Examine the instruction count for each category, as well as its percentage of the overall instruction mix, to determine the dominant category of instructions in the kernel.
  • Examine the instruction count for each type of compute, memory, atomic, and other instructions.
  • For compute instructions, view the dominant data type for each type of instruction.
    The data type that dominates the entire kernel is highlighted in blue.
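The per-category percentages described above amount to a simple share-of-total calculation. The following Python sketch shows that arithmetic. The function name `instruction_mix` and the counts are hypothetical, and the category names are taken from the table of instruction categories.

```python
from typing import Dict


def instruction_mix(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each category's share of all executed instructions, in percent."""
    total = sum(counts.values())
    if total == 0:
        return {category: 0.0 for category in counts}
    return {category: 100.0 * n / total for category, n in counts.items()}


# Hypothetical instruction counts per category for one kernel.
counts = {"Compute": 600, "Memory": 250, "Atomic": 50, "Other": 100}

mix = instruction_mix(counts)
dominant = max(mix, key=mix.get)

for category, percent in mix.items():
    print(f"{category}: {percent:.1f}%")
print("Dominant category:", dominant)
```

With these made-up counts, compute instructions dominate, which matches the compute-bound reading given for the Instruction Mix pane.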
In the Performance Characteristics pane, review how effectively the kernel uses the GPU resources: activity of all execution units, percentage of time when both FPUs are used, and percentage of cycles with a thread scheduled. Ideally, you should see high percentages for active execution units and the other effectiveness metrics, which indicate that the kernel uses more of the GPU resources.

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.