Codee Achieves 18x Performance Boost for HPC Workloads with oneAPI

Summary

Codee, a Spain-based technology company, provides a software development platform for optimizing C/C++/Fortran application performance across modern heterogeneous hardware. Using Intel® oneAPI tools and the latest Intel® Xeon® Scalable processors, Codee achieved up to an 18x performance improvement on mature, open source mathematical and cryptographic algorithms.

The Challenge of Delivering Time-Critical Code

Software can be time-critical in many industries, such as automotive, consumer electronics, aerospace/military, renewable energy, medical devices, and high-performance computing. For these and other industries, developing time-critical software requires adapting C/C++/Fortran source code to the characteristics of the target environment: the operating system, the compiler toolchain, and the target hardware, from microcontrollers and microprocessors to accelerators. Additionally, software development and testing teams must deliver code that is cost-effective and maintainable, and that meets ever-increasing business requirements for higher speed, smaller size, lower energy consumption, and compliance with industry regulations.

As a result, software developers face a lengthy, expensive, manual, trial-and-error development process. This creates demand for new tools that automate performance optimization tasks and make better use of the skills of expert performance software engineers.

For Codee, there had to be a better way.

And there was.

Empowering Developers with Shift-Left Performance

Codee is a software development platform that provides tools to help improve the performance of C/C++/Fortran applications. Its Static Code Analyzer provides a systematic, predictable approach to enforcing C/C++/Fortran performance optimization best practices for the target environment: hardware, compiler, and operating system. Its Coding Assistant capabilities enable semi-automatic source code rewriting throughout the software development lifecycle.

Codee is a world-first solution providing a systematic, predictable approach to enforce C/C++/Fortran performance optimization best practices for CPUs and GPUs. Notably, it is the perfect complement to the best-in-class Intel® oneAPI DPC++/C++ Compilers and runtimes.

Manuel Arenaz, Founder & CEO/CTO, Codee

By combining Codee’s tools platform with the complementary Intel® oneAPI Base Toolkit (analysis and profiling tools, debuggers, compilers, a code-migration tool, and more), the company enabled shift-left performance. The combination automates source code inspection tasks that expert performance software engineers otherwise carry out manually. It also adds coding assistance capabilities that implement performance optimizations in the C/C++/Fortran source code, tailored to the Intel® oneAPI DPC++/C++ Compiler (part of the Base Toolkit) and Intel® Xeon® Scalable processors.

The following performance-optimization areas illustrate how Codee is a perfect complement to Intel oneAPI tools on the Intel Xeon Scalable processor*¹:

Memory efficiency: Codee enables 3.3x faster execution of matrix-matrix multiplication.
Multithreading: Codee enables up to 18x faster execution of well-known open-source mathematical, science and engineering codes.
Vectorization: Codee enables up to 1.7x faster execution of open-source AES encryption code.
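
As a rough illustration of what the multithreading category above refers to, a loop whose iterations are independent can be distributed across cores with an OpenMP directive. The sketch below is not taken from the Codee benchmarks: the function name `scale_add` and its parameters are hypothetical and chosen only for illustration.

```c
#include <stddef.h>

/* Hypothetical example (not from the benchmarked codes):
   computes y[i] = a * x[i] + y[i] for every element. */
void scale_add(size_t n, double a, const double *x, double *y) {
    /* Each iteration writes a distinct y[i], so iterations are independent
       and can be split across threads. The directive only takes effect when
       OpenMP is enabled at compile time; otherwise the compiler ignores it
       and the loop runs serially with the same result. */
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

How much speedup such a directive yields depends on the loop's work per iteration, the memory bandwidth available, and the number of cores, so measured results will vary from code to code.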

Codee’s 4-Step Optimization Approach

Codee provides a systematic, predictable approach to C/C++/Fortran performance optimization. It enables less experienced developers to write faster code at an expert level, which alleviates the scarcity of senior developers. For experts, it helps ensure that performance optimization best practices are applied before the final release.

With Codee, the development process starts by profiling the C/C++/Fortran code and quickly assessing the performance optimization opportunities it contains. Integrating Codee into the software development process is straightforward.

First, produce the Codee Screening Report. Invoke the “pwreport --screening” command with source code checks enabled for both CPU and GPU (“--include-tags all”), passing the target source file (main.c) and then its compilation flags after the special “--” separator (“-I include -fast”). In this case, the screening reports 8 checks in the source code: 1 related to scalar optimizations, 3 to memory issues, 2 to vectorization, 1 to multithreading, and 1 to offloading.

$ pwreport --screening main.c --include-tags all -- -I include -fast
Compiler flags: -I include -fast

[C] target compiler: <none> (Compiler Agnostic Mode)

LANGUAGE SUMMARY
C files: 1

SCREENING REPORT

Target Lines of code Optimizable lines Analysis time # checks Effort Cost    Profiling
------ ------------- ----------------- ------------- -------- ------ ------- ---------
main.c 55            14                22 ms         8        57 h   1865€   n/a
------ ------------- ----------------- ------------- -------- ------ ------- ---------
Total  55            14                22 ms         8        57 h   1865€   n/a

CHECKS PER STAGE OF THE PERFORMANCE OPTIMIZATION ROADMAP
Target Scalar Control Memory Vector Multi Offload Quality
------ ------ ------- ------ ------ ----- ------- -------
main.c 1      0       3      2      1     1       0
------ ------ ------- ------ ------ ----- ------- -------
Total  1      0       3      2      1     1       0

. . .

SUGGESTIONS
  Use --verbose to get more details, e.g:
        pwreport --verbose --screening main.c --include-tags all -- -I include -fast

  You can automatically vectorize every vectorizable loop of one function with:
        pwdirectives --auto --vector omp --in-place main.c -- -I include -fast

  Use --checks to find out details about the detected checks:
        pwreport --checks main.c --include-tags all -- -I include -fast

  You can focus on a specific optimization type by filtering by its tag (scalar, control, memory, quality, vector, multi, offload), eg.:
        pwreport --checks --only-tags scalar main.c -- -I include -fast

  Consider using Codee with a target compiler in order to filter out optimizations that are already applied by your compiler. For example, for GCC:
        pwreport --target-compiler-cc gcc --screening main.c --include-tags all -- -I include -fast

1 file successfully analyzed and 0 failures in 22 ms
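
The suggestions in the report mention vectorizing loops with the `--vector omp` option. The report does not show the rewritten source, so the following is only a generic sketch of an OpenMP `simd` directive, not output produced by pwdirectives. The function `dot` is a hypothetical example and does not come from main.c.

```c
#include <stddef.h>

/* Hypothetical example: dot product of two vectors of length n. */
double dot(size_t n, const double *x, const double *y) {
    double sum = 0.0;
    /* The simd directive asks the compiler to vectorize the loop; the
       reduction clause tells it that 'sum' is accumulated across
       iterations, so partial sums may be kept per SIMD lane and combined
       at the end. The directive is ignored if OpenMP SIMD support is not
       enabled (for example with -fopenmp-simd on GCC or Clang). */
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Because a vectorized reduction may add the terms in a different order than the serial loop, floating-point results can differ in the last bits for general inputs.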

Second, produce the Codee Checks Report. Dig deeper into the performance issues discovered by listing all the checks found in the code with the “pwreport --checks” command. The output format resembles that of other static code analyzers, which eases user uptake and integration into the development workflow.

$ pwreport --checks main.c:matmul --verbose --include-tags all -- -I include -fast
Compiler flags: -I include -fast

[C] target compiler: <none> (Compiler Agnostic Mode)

CHECKS REPORT

main.c:18:17 [PWR048]: Replace multiplication/addition combo with an explicit call to fused multiply-add
  Suggestion: Replace a combination of multiplication and addition `C[i][j] += A[i][k] * B[k][j];`, with a call to the `fma` function.
  Documentation: https://www.codee.com/knowledge/pwr048

main.c:9:9 [PWR053]: consider applying vectorization to forall loop
  Suggestion: use pwdirectives to automatically optimize the code
  Documentation: https://www.codee.com/knowledge/pwr053
  AutoFix:
    * Using OpenMP pragmas (recommended):
        pwdirectives --vector omp --in-place main.c:9:9 -- -I include -fast
    * Using Clang compiler pragmas:
        pwdirectives --vector clang --in-place main.c:9:9 -- -I include -fast
    * Using GCC pragmas
        pwdirectives --vector gcc --in-place main.c:9:9 -- -I include -fast
    * Using ICC pragmas:
        pwdirectives --vector icc --in-place main.c:9:9 -- -I include -fast

main.c:15:5 [PWR035]: avoid non-consecutive array access for variables 'A', 'B' and 'C' to improve performance
  Non-consecutive array access:
    18:                 C[i][j] += A[i][k] * B[k][j];
  Suggestion: consider using techniques like loop fusion, loop interchange, loop tiling or changing the data layout to avoid non-sequential access to variables 'A', 'B' and 'C'.
  Documentation: https://www.codee.com/knowledge/pwr035

main.c:15:5 [PWR050]: consider applying multithreading parallelism to forall loop
  Suggestion: use pwdirectives to automatically optimize the code
  Documentation: https://www.codee.com/knowledge/pwr050
  AutoFix (choose one option):
      * Using OpenMP 'for' (recommended):
        pwdirectives --multi omp-for --in-place main.c:15:5 -- -I include -fast
      * Using OpenMP 'taskwait':
        pwdirectives --multi omp-taskwait --in-place main.c:15:5 -- -I include -fast
      * Using OpenMP 'taskloop':
        pwdirectives --multi omp-taskloop --in-place main.c:15:5 -- -I include -fast

main.c:15:5 [PWR055]: consider applying offloading parallelism to forall loop
  Suggestion: use pwdirectives to automatically optimize the code
  Documentation: https://www.codee.com/knowledge/pwr055
  AutoFix:
    * Using OpenMP (recommended):
        pwdirectives --offload omp-teams --in-place main.c:15:5 -- -I include -fast
    * Using OpenAcc:
        pwdirectives --offload acc --in-place main.c:15:5 -- -I include -fast

main.c:16:9 [PWR039]: consider loop interchange to improve the locality of reference and enable vectorization
  Loops to interchange:
    16:         for (size_t j = 0; j < n; j++) {
    17:             for (size_t k = 0; k < p; k++) {
  Suggestion: loop interchange can be used to improve the performance of the loop nest.
  Documentation: https://www.codee.com/knowledge/pwr039
  AutoFix:
    pwdirectives --memory loop-interchange --in-place main.c:16:9 -- -I include -fast

main.c:17:13 [PWR010]: 'B' multi-dimensional array not accessed in row-major order
  Accesses:
    18:                 C[i][j] += A[i][k] * B[k][j];
  Suggestion: change the code to access the 'B' multi-dimensional array in a row-major order
  Documentation: https://www.codee.com/knowledge/pwr010

main.c:17:13 [RMK010]: the vectorization cost model states the loop is not a SIMD opportunity due to strided memory accesses in the loop body
  Documentation: https://www.codee.com/knowledge/rmk010

SUGGESTIONS

  More details on the defects, recommendations and more in the Knowledge Base:
    	https://www.codee.com/knowledge/

1 file successfully analyzed and 0 failures in 110 ms
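
To make the checks easier to follow, here is a sketch of the loop nest they refer to, reconstructed from the source lines quoted in the report (lines 16 to 18). The function signature, the `double **` array types, and the outer loop bound `m` are assumptions, since the report does not show them. The comments map each loop to the checks reported at that location.

```c
#include <stddef.h>

/* Assumed signature: A is m x p, B is p x n, C is m x n and must be
   zero-initialized by the caller. Only the j loop, the k loop, and the
   innermost statement appear verbatim in the report. */
void matmul(size_t m, size_t n, size_t p, double **A, double **B, double **C) {
    for (size_t i = 0; i < m; i++) {             /* line 15: PWR035, PWR050, PWR055 */
        for (size_t j = 0; j < n; j++) {         /* line 16: PWR039 */
            for (size_t k = 0; k < p; k++) {     /* line 17: PWR010, RMK010 */
                /* With k innermost, B[k][j] moves to a different row of B
                   on every iteration, so its accesses are strided rather
                   than consecutive in memory. */
                C[i][j] += A[i][k] * B[k][j];    /* line 18: PWR048 */
            }
        }
    }
}
```

The strided access to `B` is what PWR010 and RMK010 flag, and it is the reason PWR039 proposes interchanging the `j` and `k` loops.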

Third, fix the performance issue: use the open catalog documentation to understand its cause, then implement the appropriate fix. Here, the PWR039 check points out a memory inefficiency and provides an AutoFix, so copy and paste the pwdirectives invocation suggested by the tool. The invocation below writes the result to a new file (“-o main_li.c”) instead of modifying main.c in place.

$ pwdirectives --memory loop-interchange main.c:16:9 -o main_li.c -- -I include -fast
Compiler flags: -I include -fast

[C] target compiler: <none> (Compiler Agnostic Mode)

Results for file 'main.c':
  Successfully applied loop interchange between the following loops:
      - main.c:16:9
      - main.c:17:13
Successfully created main_li.c
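
The contents of main_li.c are not shown here. As a hedged sketch, interchanging the loops at main.c:16:9 (the `j` loop) and main.c:17:13 (the `k` loop) would give a nest of the following shape. The function name `matmul_interchanged`, the signature, and the outer loop bound `m` are assumptions made for illustration.

```c
#include <stddef.h>

/* Assumed signature: A is m x p, B is p x n, C is m x n and must be
   zero-initialized by the caller. Sketch of the i-k-j loop order that
   results from swapping the j and k loops. */
void matmul_interchanged(size_t m, size_t n, size_t p,
                         double **A, double **B, double **C) {
    for (size_t i = 0; i < m; i++) {
        for (size_t k = 0; k < p; k++) {
            /* A[i][k] does not change inside the j loop. */
            for (size_t j = 0; j < n; j++) {
                /* With j innermost, C[i][j] and B[k][j] both walk along a
                   row, so consecutive iterations touch consecutive memory. */
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
```

Each element `C[i][j]` still accumulates its terms in increasing `k` order, so the interchange changes only the memory access pattern and not the computed values.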

Want More?

Take your continuous integration/continuous delivery (CI/CD) pipeline to the next level by adding code performance to your Shift-Left strategy:

Explore Codee capabilities through the open catalog of performance optimization best practices for C/C++/Fortran.
Reproduce the Codee benchmarks on your own (GitHub).

Codee helps developers to speed up and improve performance, scalability, and time-to-market of the final product. Collaborating with Intel has helped us fine-tune our capabilities to streamline and improve Codee’s software solutions, filling the gap for code-performance checks in the existing CI/CD pipelines. We look forward to continuing our collaboration with Intel to add more value to our customers.

Manuel Arenaz, Founder & CEO/CTO, Codee

Performance Optimization Best Practices

Maximizing performance is essential across a wide range of industries and use cases, from embedded systems to high-performance computing. Performance can be improved from many angles, such as using the latest generation of hardware, the latest releases of compilers and system libraries, and high-speed storage. This case study addresses performance optimization from the perspective of fine-tuning C/C++/Fortran source code to make it friendly to the Intel oneAPI DPC++/C++ Compiler targeting Intel Xeon Scalable processors. For the first time, Codee enables shift-left performance in the Intel developer ecosystem by providing a systematic, predictable approach to enforcing performance optimization best practices on legacy and modernized C/C++/Fortran code.
