The Power of Software Optimization

ANSYS Scales Simulation Performance

by optimizing ANSYS® Mechanical™ for the increasing parallelism of Intel® architecture

The Need for Speed in Simulation-Based Design

Engineering simulation software has changed how companies design products, enabling them to explore and test more design options faster, while reducing the need for physical prototyping. ANSYS software has played a central role in this transition and is now used by 96 of the top 100 industrial companies on the FORTUNE 500® list.1

ANSYS customers have an insatiable need for higher computing performance, so they can model and test bigger assemblies with greater fidelity. Many of them also want to generate deeper insights with each simulation, by integrating additional physics variables, exploring nonlinear and composite materials, and evaluating more complex and dynamic environmental conditions. Yet, fast time to results remains a critical factor for most customers to meet aggressive time-to-market requirements. As the size and complexity of the simulations grow, the speed and capacity of the computing platform must also grow.

Intel® Many Integrated Core Architecture Offers a Path

To meet the continually growing compute demand, ANSYS has worked closely with Intel for several years to optimize ANSYS Mechanical software for increasing parallelism in each new generation of multicore Intel® Xeon® processors. More recently, ANSYS and Intel have extended these efforts to the many-core Intel® Xeon Phi™ coprocessor. These highly parallel coprocessors provide up to 61 cores, 244 threads, and 1.2 teraflops of double-precision peak performance per coprocessor.2 They can run the same code as Intel Xeon processors, so independent software vendors (ISVs), such as ANSYS, are not required to write and manage multiple code bases.

By optimizing ANSYS® Mechanical™ 16.0 for the latest multicore Intel® Xeon® processors and many-core Intel® Xeon Phi™ coprocessors, ANSYS and Intel deliver up to three times the performance of a previous-generation hardware and software platform.3

The performance benefits have been compelling. ANSYS Mechanical 16.0 running on the Intel® Xeon® processor E5 v3 family and Intel Xeon Phi coprocessors delivers up to three times the performance of a previous-generation hardware and software platform. The coprocessors can also be added to older workstation and server platforms to deliver compelling performance gains.4

**Optimized ANSYS Software Unleashes the Performance of the Platform**

Delivering performance gains of this magnitude on highly parallel computing platforms is not automatic. Software code must be optimized to make efficient use of parallel execution resources. ANSYS engineers focus on four key strategies to enable efficient parallel execution.

- **Thread-level parallelism** using OpenMP®, for example, to help ensure efficient parallel execution on shared memory systems, such as individual workstations and servers.
- **Process-level parallelism** using the Message Passing Interface (MPI) to deliver higher performance on distributed memory systems, such as clustered workstations or servers. Although more complicated to implement than thread-level parallelism, MPI tends to deliver better efficiency across most hardware topologies, including shared memory systems.
- **Vectorization** using the single instruction multiple data (SIMD) execution units in Intel® processors and coprocessors. By executing instructions simultaneously on multiple data points, more work is accomplished in each clock cycle.
- **Memory optimization** to increase data access efficiencies. This strategy is essential to help ensure that the right data is in the right location at the right time to minimize access latencies and to keep parallel execution resources operating at optimal capacity.
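
The partition-and-reduce pattern behind thread-level parallelism can be illustrated with a minimal Python sketch. This is not ANSYS code: the function names (`chunk_bounds`, `partial_dot`, `parallel_dot`) are hypothetical, and a dot product stands in for real solver work. It only shows how a loop is split into contiguous chunks, one per worker, with the partial results combined at the end, which is roughly what an OpenMP parallel reduction expresses.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_bounds(n, workers):
    """Split range(n) into `workers` contiguous, near-equal [start, stop) slices."""
    base, extra = divmod(n, workers)
    bounds, start = [], 0
    for w in range(workers):
        # The first `extra` workers take one additional element each.
        stop = start + base + (1 if w < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def partial_dot(x, y, start, stop):
    """Dot product restricted to the index slice [start, stop)."""
    return sum(x[i] * y[i] for i in range(start, stop))


def parallel_dot(x, y, workers=4):
    """Compute dot(x, y) by giving each worker thread one contiguous chunk."""
    bounds = chunk_bounds(len(x), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda b: partial_dot(x, y, b[0], b[1]), bounds)
        # Reduction step: combine the per-thread partial sums.
        return sum(partials)
```

In CPython, the global interpreter lock prevents pure-Python threads from speeding up CPU-bound loops, so this sketch demonstrates the structure of the technique rather than its performance benefit. Compiled code using OpenMP runs the chunks on separate cores.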

**Targeted Optimizations for Highest ROI**

The matrix solvers in ANSYS Mechanical account for 60 to 90 percent of solution time, so they are prime targets for optimization. Although ANSYS provides a number of solvers to support different simulation types, the sparse direct solver is the default solver for ANSYS Mechanical and is commonly used for all types of analyses. This solver also has a high potential for parallelization, since the factorization of sparse matrices can typically be decomposed to dense matrix operations that can be executed efficiently in parallel. To deliver the greatest benefits for most customers, ANSYS engineers give high priority to optimizing the sparse direct solver.
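
Why the solver share matters can be made concrete with Amdahl's law, which relates the speedup of one portion of a run to the speedup of the whole. The sketch below is illustrative only: the function name `overall_speedup` and the 4x solver speedup used in the usage note are assumptions, not figures from ANSYS or Intel. Only the 60 and 90 percent solver fractions come from the text above.

```python
def overall_speedup(solver_fraction, solver_speedup):
    """Amdahl's law: whole-run speedup when only the solver portion gets faster.

    solver_fraction: share of total solution time spent in the solver (0..1).
    solver_speedup:  factor by which the solver portion is accelerated (> 0).
    """
    if not 0.0 <= solver_fraction <= 1.0:
        raise ValueError("solver_fraction must be between 0 and 1")
    if solver_speedup <= 0:
        raise ValueError("solver_speedup must be positive")
    unaffected = 1.0 - solver_fraction             # time outside the solver
    accelerated = solver_fraction / solver_speedup  # solver time after speedup
    return 1.0 / (unaffected + accelerated)


def speedup_ceiling(solver_fraction):
    """Upper bound on whole-run speedup if the solver became infinitely fast."""
    if not 0.0 <= solver_fraction < 1.0:
        raise ValueError("solver_fraction must be in [0, 1)")
    return 1.0 / (1.0 - solver_fraction)
```

Under the assumed 4x solver speedup, `overall_speedup(0.6, 4)` is about 1.82 and `overall_speedup(0.9, 4)` is about 3.08. The ceilings are 2.5x and 10x respectively, so the larger the solver's share of solution time, the more an optimized solver improves total time to results.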

**Intel Resources Help Simplify Optimization**

For many years, ANSYS engineers have monitored solver performance to the level of individual functions to identify bottlenecks and hot spots that are good targets for optimization. Several years ago, ANSYS began using a number of Intel® software tools to enhance these efforts. In addition to Intel® compilers, which provide advanced performance optimizations for Intel architecture, ANSYS uses:

- **Intel® Math Kernel Library (Intel® MKL)**, which provides math processing routines that are highly optimized for multi-level parallelism on Intel architecture. These routines make efficient use of available resources in multicore, many-core, and clustered architectures, and automatically balance workloads across Intel Xeon processors and Intel Xeon Phi coprocessors.
- **Intel® Trace Analyzer and Collector**, which is a graphical tool for understanding MPI application behavior. ANSYS engineers use it to quickly find bottlenecks, improve correctness, and achieve higher performance.
- **Intel® VTune™ Amplifier XE**, which provides deep insights into hotspots, threading, locks and waits, bandwidth, and other issues. Built-in analytics help ANSYS engineers sort, filter, and visualize the results in their source code and on their execution timelines.

Intel also provides prerelease hardware platforms, along with knowledge and expertise on hardware-related development issues. This ongoing collaboration benefits Intel, as well as ANSYS, since ANSYS engineers provide valuable insights that help Intel developers plan future enhancements to Intel software development tools.

**Optimizing SMP ANSYS for Thread-Level Parallelism**

From its inception, ANSYS Mechanical was designed to deliver optimized performance on shared-memory, symmetric multiprocessing (SMP) systems using OpenMP. ANSYS developers continue to increase thread-level parallelism. One strategy is to integrate more Intel MKL routines in each software release. Intel devotes considerable resources to optimizing these routines for every new processor and coprocessor generation. By keeping the libraries up to date, ANSYS developers benefit directly from Intel’s efforts. Since the library calls typically remain unchanged, relatively little effort is required.

**Optimizing Distributed ANSYS for Process-Level Parallelism**

Although ANSYS initially developed Distributed ANSYS to enable efficient parallel execution on distributed memory systems, Distributed ANSYS parallelizes work across all computing architectures. Distributed ANSYS now performs better than SMP ANSYS for most simulations, and customers can choose to use this option for any simulation, whether it’s run on a shared memory or distributed memory system.

ANSYS software engineers use Intel MPI and other MPI distributions to manage and coordinate parallel processes in Distributed ANSYS. They use the integrated profiling capabilities within Intel MPI—and also Intel® Trace Analyzer and Collector—to help evaluate and optimize process execution and memory usage for higher performance. They also use the latest version of Intel MKL to provide highly optimized code for many solver functions with relatively little effort.

Tuning domain decomposition plays a key role in optimizing solver performance using MPI. During decomposition, the solution matrix is divided into multiple geometric subdomains, each of which is assigned to a process that executes on a single processor core.

ANSYS engineers tune decomposition for efficient workload balancing across all available cores. They also tune it to make efficient use of non-uniform memory access (NUMA), which is supported in Intel architecture. NUMA provides faster access to nearby data for each processor socket. By tuning subdomains so the data fits within fast, nearby memory, ANSYS engineers reduce the need for high-latency, socket-to-socket data transfers. If execution is scaled out to a sufficient number of cores, the data will even fit within L2 cache, which is an order of magnitude faster than main memory. This provides even faster data access.
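
The workload-balancing side of decomposition can be sketched in a few lines of Python. This is a simplified illustration, not the decomposition algorithm used in Distributed ANSYS: it assumes a one-dimensional ordering of elements with a known per-element cost, and the names `decompose` and `imbalance` are hypothetical. It cuts the ordered elements into contiguous subdomains of roughly equal total cost, one per process.

```python
def decompose(costs, n_subdomains):
    """Cut an ordered list of per-element costs into contiguous subdomains
    whose total costs are roughly equal. Returns lists of element indices."""
    if not 1 <= n_subdomains <= len(costs):
        raise ValueError("need between 1 and len(costs) subdomains")
    target = sum(costs) / n_subdomains  # ideal cost per subdomain
    subdomains, current, cumulative = [], [], 0
    for idx, cost in enumerate(costs):
        current.append(idx)
        cumulative += cost
        elements_left = len(costs) - idx - 1
        domains_left = n_subdomains - len(subdomains) - 1
        # Close the current subdomain once the running total reaches its
        # share, or when every remaining element is needed to keep the
        # remaining subdomains non-empty.
        reached_share = cumulative >= target * (len(subdomains) + 1)
        if domains_left > 0 and (reached_share or elements_left == domains_left):
            subdomains.append(current)
            current = []
    subdomains.append(current)
    return subdomains


def imbalance(costs, subdomains):
    """Ratio of the heaviest subdomain load to the mean load (1.0 is ideal)."""
    loads = [sum(costs[i] for i in sub) for sub in subdomains]
    return max(loads) / (sum(loads) / len(loads))
```

For example, `decompose([4, 1, 1, 2, 2, 2, 4], 3)` yields subdomains with loads 6, 6, and 4, giving an imbalance of 1.125. Because all processes wait for the slowest one, the heaviest subdomain determines the time per solver step, which is why balancing matters.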

**Vectorization for Additional Performance Gains**

In many software optimization scenarios, it is well worth the time and effort to explicitly vectorize code to improve parallelism. For the ANSYS sparse direct solver, however, the extensive use of Intel MKL helps to ensure that highly vectorized code is used for many of the most performance-critical components.

Here, again, ANSYS gains advantages by staying up to date with the most recent versions of Intel MKL, as well as Intel compilers and Intel VTune Amplifier XE, which help to identify and leverage additional vectorization opportunities. These tools are always tuned for the latest vector technologies in Intel processors and coprocessors, such as Intel® Advanced Vector Extensions 2.0 (Intel® AVX2) in the Intel Xeon processor E5 v3 family and the double-wide (512-bit) vector units in Intel Xeon Phi coprocessors.
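
The effect of vector width on throughput can be shown with a short back-of-the-envelope calculation. The sketch below rests on stated assumptions: the helper names are hypothetical, the clock frequency of about 1.238 GHz is not given in this document, and each core is assumed to issue one fused multiply-add (two floating-point operations per lane) per cycle. Under those assumptions, the result is consistent with the 1.2 teraflops peak figure quoted earlier for a 61-core coprocessor.

```python
DOUBLE_BITS = 64  # size of one double-precision value


def doubles_per_vector(vector_bits):
    """Number of double-precision lanes in one SIMD register."""
    return vector_bits // DOUBLE_BITS


def peak_dp_flops(cores, clock_hz, vector_bits, fma=True):
    """Theoretical peak double-precision flops per second.

    Assumes one vector instruction per core per cycle; with fma=True each
    lane performs a multiply and an add (2 flops) in that instruction.
    """
    flops_per_cycle = doubles_per_vector(vector_bits) * (2 if fma else 1)
    return cores * clock_hz * flops_per_cycle


# 256-bit Intel AVX2 registers hold 4 doubles; 512-bit registers hold 8.
avx2_lanes = doubles_per_vector(256)
phi_lanes = doubles_per_vector(512)

# Assumed clock of 1.238 GHz (not from this document), 61 cores, 512-bit vectors.
phi_peak = peak_dp_flops(cores=61, clock_hz=1.238e9, vector_bits=512)
```

Here `phi_peak` works out to roughly 1.21 × 10^12 flops per second. This is a theoretical ceiling. Real solver code reaches only a fraction of it, and that fraction depends heavily on how well the code is vectorized.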

Higher Performance to Come

Intel will continue to deliver increasing core densities in future Intel Xeon processors and Intel Xeon Phi coprocessors, along with many additional innovations to enhance parallel execution. For example, next-generation Intel Xeon Phi coprocessors, code-named Knights Landing and scheduled for release in 2016, are expected to provide up to 3 times the performance of the current generation of coprocessors,5 and will also be able to function as standalone processors. They will include not only more cores, but also an integrated high-speed fabric and integrated high-bandwidth memory, which will help to resolve the kinds of bottlenecks that can potentially impede performance in increasingly parallel computing environments.

ANSYS and Intel will continue working together to optimize ANSYS software for the increasing numbers of cores and threads provided by Intel architecture. This will help ANSYS developers deliver higher performance with less effort, so they can continue to provide the best possible user experience for their customers.

MORE INFORMATION

ANSYS® Mechanical™
www.ansys.com/Products/Simulation+Technology/Structural+Analysis/ANSYS+Mechanical

Intel® Xeon Phi™ Product Family

Software optimization for Intel architecture
http://software.intel.com/moderncode

Save money and maximize performance with ANSYS Mechanical 16.0 on Intel® architecture
http://www.ansys.com/Campaigns/intel-phi4fea

Intel and ANSYS partner website
http://www.ansys.com/About+ANSYS/Partner+Programs/HPC+Partners/Intel+Corporation

---

1 Source: ANSYS website. www.ansys.com/About+ANSYS

2 The claim of up to 1.2 teraflops of performance per coprocessor is based on Intel calculations of theoretical peak double-precision performance capability for a single coprocessor.

3 The claim of up to 1.2 Teraflops of performance per coprocessor is based on Intel calculations of theoretical peak double precision performance capability for a single coprocessor

4 For more information, read the ANSYS and Intel white paper, “Three Paths to Faster Simulations Using ANSYS® Mechanical™ 16.0 and Intel® Architecture.”

5 Next-generation Intel® Xeon Phi™ coprocessors (code-named Knights Landing) are expected to deliver more than 3 teraflops of double-precision performance based on internal and preliminary Intel projections of theoretical double-precision performance measured by Linpack® and on current expectations of Knights Landing’s cores, clock frequency, and floating point operations per cycle. They are also expected to deliver three times the single-threaded performance of the current generation, based on projected peak theoretical single-thread performance relative to first generation Intel® Xeon Phi™ coprocessor 7120P (formerly code-named Knights Corner).

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark® and MobileMark®, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at intel.com.

Copyright © 2015 Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, the Intel Inside logo, Intel Xeon Phi, VTune, and Xeon are trademarks of Intel Corporation in the U.S. and other countries.

*Other names and brands may be claimed as the property of others.