Intel® C++ Compiler Classic Developer Guide and Reference

ID 767249
Date 3/31/2023
Public


Intrinsics for FP Load and Store Operations

The prototypes for Intel® Advanced Vector Extensions 512 (Intel® AVX-512) intrinsics are located in the zmmintrin.h header file.

To use these intrinsics, include the immintrin.h file as follows:

#include <immintrin.h>


| Intrinsic Name | Operation | Corresponding Intel® AVX-512 Instruction |
| --- | --- | --- |
| _mm512_load_pd, _mm512_mask_load_pd, _mm512_maskz_load_pd, _mm512_store_pd, _mm512_mask_store_pd | Load/store aligned float64 values from memory. | VMOVAPD |
| _mm512_load_ps, _mm512_mask_load_ps, _mm512_maskz_load_ps, _mm512_store_ps, _mm512_mask_store_ps | Load/store aligned float32 values from memory. | VMOVAPS |
| _mm_mask_load_sd, _mm_maskz_load_sd, _mm_mask_store_sd | Load/store lower float64 values from memory. | VMOVSD |
| _mm_mask_load_ss, _mm_maskz_load_ss, _mm_mask_store_ss | Load/store lower float32 values from memory. | VMOVSS |
| _mm512_loadu_pd, _mm512_mask_loadu_pd, _mm512_maskz_loadu_pd, _mm512_storeu_pd, _mm512_mask_storeu_pd | Load/store unaligned float64 values from memory. | VMOVUPD |
| _mm512_loadu_ps, _mm512_mask_loadu_ps, _mm512_maskz_loadu_ps, _mm512_storeu_ps, _mm512_mask_storeu_ps | Load/store unaligned float32 values from memory. | VMOVUPS |
| _mm512_stream_pd | Store float64 values using non-temporal hint. | VMOVNTPD |
| _mm512_stream_ps | Store float32 values using non-temporal hint. | VMOVNTPS |

| Variable | Definition |
| --- | --- |
| k | writemask used as a selector |
| a | first source vector element |
| src | source element to use based on writemask result |
| mem_addr | pointer to base address in memory |


_mm512_load_pd

extern __m512d __cdecl _mm512_load_pd(void const* mem_addr);

Loads 512 bits (composed of eight packed float64 elements) from mem_addr into destination.

mem_addr must be aligned on a 64-byte boundary or a general-protection exception will be generated.
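
As a minimal sketch of the aligned load in use, the following example scales eight doubles. The names scale8 and scale8_avx512 are hypothetical. The sketch assumes a GCC- or Clang-compatible compiler (for the target attribute and __builtin_cpu_supports), and the scalar branch exists only so that the example still runs on processors without Intel® AVX-512 Foundation support.

```c
#include <assert.h>
#include <stdint.h>
#include <immintrin.h>

/* Assumption: GCC- or Clang-compatible compiler. The target attribute enables
   AVX-512F for this one function without a global compiler option. */
__attribute__((target("avx512f")))
static void scale8_avx512(const double *in, double *out, double factor)
{
    __m512d v = _mm512_load_pd(in);                /* aligned 512-bit load   */
    v = _mm512_mul_pd(v, _mm512_set1_pd(factor));  /* multiply all 8 lanes   */
    _mm512_store_pd(out, v);                       /* aligned 512-bit store  */
}

/* Hypothetical helper: out[i] = in[i] * factor for i = 0..7.
   Both pointers must be aligned on a 64-byte boundary. */
void scale8(const double *in, double *out, double factor)
{
    assert((uintptr_t)in % 64 == 0 && (uintptr_t)out % 64 == 0);
    if (__builtin_cpu_supports("avx512f")) {
        scale8_avx512(in, out, factor);
    } else {
        /* Scalar fallback for CPUs without AVX-512F. */
        for (int i = 0; i < 8; ++i)
            out[i] = in[i] * factor;
    }
}
```

A caller would typically declare its buffers with a 64-byte alignment specifier, for example _Alignas(64) double in[8], before passing them to scale8.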


_mm512_mask_load_pd

extern __m512d __cdecl _mm512_mask_load_pd(__m512d src, __mmask8 k, void const* mem_addr);

Loads packed float64 elements from mem_addr into destination using writemask k (elements are copied from src when the corresponding mask bit is not set).

mem_addr must be aligned on a 64-byte boundary or a general-protection exception will be generated.


_mm512_maskz_load_pd

extern __m512d __cdecl _mm512_maskz_load_pd(__mmask8 k, void const* mem_addr);

Loads packed float64 elements from mem_addr into destination using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

mem_addr must be aligned on a 64-byte boundary or a general-protection exception will be generated.
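
The difference between the writemask and zeromask forms can be illustrated with a short sketch. The helper names masked_loads and masked_loads_avx512 are hypothetical, a GCC- or Clang-compatible compiler is assumed, and the scalar branch only mirrors the documented behavior for processors without Intel® AVX-512 Foundation support. Mask bit i controls element i, with bit 0 corresponding to the lowest element.

```c
#include <assert.h>
#include <stdint.h>
#include <immintrin.h>

/* Assumption: GCC- or Clang-compatible compiler (target attribute). */
__attribute__((target("avx512f")))
static void masked_loads_avx512(const double *mem, unsigned char k, double fill,
                                double *merged, double *zeroed)
{
    __m512d src = _mm512_set1_pd(fill);
    /* Writemask: lanes whose mask bit is clear are copied from src. */
    _mm512_storeu_pd(merged, _mm512_mask_load_pd(src, (__mmask8)k, mem));
    /* Zeromask: lanes whose mask bit is clear are set to 0.0. */
    _mm512_storeu_pd(zeroed, _mm512_maskz_load_pd((__mmask8)k, mem));
}

/* Hypothetical helper: loads eight doubles from mem (64-byte aligned) under
   mask k, once with merge semantics and once with zeroing semantics. */
void masked_loads(const double *mem, unsigned char k, double fill,
                  double *merged, double *zeroed)
{
    assert((uintptr_t)mem % 64 == 0);
    if (__builtin_cpu_supports("avx512f")) {
        masked_loads_avx512(mem, k, fill, merged, zeroed);
    } else {
        /* Scalar fallback that follows the same element-wise rule. */
        for (int i = 0; i < 8; ++i) {
            int bit = (k >> i) & 1;
            merged[i] = bit ? mem[i] : fill;
            zeroed[i] = bit ? mem[i] : 0.0;
        }
    }
}
```

With k = 0x0F and fill = -1.0, the lower four elements come from memory in both results, while the upper four are -1.0 in merged and 0.0 in zeroed.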



_mm512_load_ps

extern __m512 __cdecl _mm512_load_ps(void const* mem_addr);

Loads 512 bits (composed of sixteen packed float32 elements) from mem_addr into destination.

mem_addr must be aligned on a 64-byte boundary or a general-protection exception will be generated.


_mm512_mask_load_ps

extern __m512 __cdecl _mm512_mask_load_ps(__m512 src, __mmask16 k, void const* mem_addr);

Loads packed float32 elements from mem_addr into destination using writemask k (elements are copied from src when the corresponding mask bit is not set).

mem_addr must be aligned on a 64-byte boundary or a general-protection exception will be generated.


_mm512_maskz_load_ps

extern __m512 __cdecl _mm512_maskz_load_ps(__mmask16 k, void const* mem_addr);

Loads packed float32 elements from mem_addr into destination using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

mem_addr must be aligned on a 64-byte boundary or a general-protection exception will be generated.
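
The float32 loads operate on sixteen elements at a time. The following sketch computes y = a * x + y for one 16-element block. The names saxpy16 and saxpy16_avx512 are hypothetical, a GCC- or Clang-compatible compiler is assumed, and the scalar branch is included only so the example runs without Intel® AVX-512 Foundation support.

```c
#include <assert.h>
#include <stdint.h>
#include <immintrin.h>

/* Assumption: GCC- or Clang-compatible compiler (target attribute). */
__attribute__((target("avx512f")))
static void saxpy16_avx512(float a, const float *x, float *y)
{
    __m512 vx = _mm512_load_ps(x);                    /* 16 aligned floats   */
    __m512 vy = _mm512_load_ps(y);
    vy = _mm512_fmadd_ps(_mm512_set1_ps(a), vx, vy);  /* a * x + y per lane  */
    _mm512_store_ps(y, vy);                           /* aligned store       */
}

/* Hypothetical helper: y[i] = a * x[i] + y[i] for i = 0..15.
   Both pointers must be aligned on a 64-byte boundary. */
void saxpy16(float a, const float *x, float *y)
{
    assert((uintptr_t)x % 64 == 0 && (uintptr_t)y % 64 == 0);
    if (__builtin_cpu_supports("avx512f")) {
        saxpy16_avx512(a, x, y);
    } else {
        /* Scalar fallback; rounding may differ from the fused multiply-add
           for inputs that are not exactly representable. */
        for (int i = 0; i < 16; ++i)
            y[i] = a * x[i] + y[i];
    }
}
```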



_mm_mask_load_sd

extern __m128d __cdecl _mm_mask_load_sd(__m128d src, __mmask8 k, const double* mem_addr);

Loads a float64 element from mem_addr into the lower element of destination using writemask k (the element is copied from src when mask bit 0 is not set), and sets the upper destination element to zero.

mem_addr must be aligned on a 16-byte boundary or a general-protection exception will be generated.


_mm_maskz_load_sd

extern __m128d __cdecl _mm_maskz_load_sd(__mmask8 k, const double* mem_addr);

Loads a float64 element from mem_addr into the lower element of destination using zeromask k (the element is zeroed out when mask bit 0 is not set), and sets the upper destination element to zero.

mem_addr must be aligned on a 16-byte boundary or a general-protection exception will be generated.


_mm_mask_load_ss

extern __m128 __cdecl _mm_mask_load_ss(__m128 src, __mmask8 k, const float* mem_addr);

Loads a float32 element from mem_addr into the lower element of destination using writemask k (the element is copied from src when mask bit 0 is not set), and sets the upper destination elements to zero.

mem_addr must be aligned on a 16-byte boundary or a general-protection exception will be generated.


_mm_maskz_load_ss

extern __m128 __cdecl _mm_maskz_load_ss(__mmask8 k, const float* mem_addr);

Loads a float32 element from mem_addr into the lower element of destination using zeromask k (the element is zeroed out when mask bit 0 is not set), and sets the upper destination elements to zero.

mem_addr must be aligned on a 16-byte boundary or a general-protection exception will be generated.
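
For the scalar forms only mask bit 0 matters. The sketch below wraps _mm_mask_load_sd in a hypothetical helper, load_or_default, that returns either the value in memory or a caller-supplied default. A GCC- or Clang-compatible compiler is assumed, and the scalar branch is there only for processors without Intel® AVX-512 Foundation support.

```c
#include <assert.h>
#include <stdint.h>
#include <immintrin.h>

/* Assumption: GCC- or Clang-compatible compiler (target attribute). */
__attribute__((target("avx512f")))
static double load_or_default_avx512(const double *p, int valid, double fallback)
{
    __m128d src = _mm_set_sd(fallback);       /* lower = fallback, upper = 0 */
    __mmask8 k = (__mmask8)(valid ? 1 : 0);   /* only mask bit 0 is used     */
    __m128d r = _mm_mask_load_sd(src, k, p);  /* lower = *p if bit 0 is set  */
    return _mm_cvtsd_f64(r);                  /* extract the lower element   */
}

/* Hypothetical helper: returns *p when valid is nonzero, else fallback.
   The pointer is checked against the 16-byte alignment stated above. */
double load_or_default(const double *p, int valid, double fallback)
{
    assert((uintptr_t)p % 16 == 0);
    if (__builtin_cpu_supports("avx512f"))
        return load_or_default_avx512(p, valid, fallback);
    return valid ? *p : fallback;             /* scalar fallback */
}
```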



_mm512_loadu_pd

extern __m512d __cdecl _mm512_loadu_pd(void const* mem_addr);

Loads 512 bits (composed of eight packed float64 elements) from mem_addr into destination.

mem_addr does not need to be aligned on any particular boundary.


_mm512_mask_loadu_pd

extern __m512d __cdecl _mm512_mask_loadu_pd(__m512d src, __mmask8 k, void const* mem_addr);

Loads packed float64 elements from mem_addr into destination using writemask k (elements are copied from src when the corresponding mask bit is not set).

mem_addr does not need to be aligned on any particular boundary.


_mm512_maskz_loadu_pd

extern __m512d __cdecl _mm512_maskz_loadu_pd(__mmask8 k, void const* mem_addr);

Loads packed float64 elements from mem_addr into destination using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

mem_addr does not need to be aligned on any particular boundary.
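
A common use of the unaligned and zeromask forms is processing an array whose length is not a multiple of eight. The following sketch sums such an array: full blocks use _mm512_loadu_pd and the remainder uses _mm512_maskz_loadu_pd so that no element past the end is read into the sum. The names sum_array and sum_avx512 are hypothetical, a GCC- or Clang-compatible compiler is assumed, and the scalar branch exists only for processors without Intel® AVX-512 Foundation support.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* Assumption: GCC- or Clang-compatible compiler (target attribute). */
__attribute__((target("avx512f")))
static double sum_avx512(const double *x, size_t n)
{
    __m512d acc = _mm512_setzero_pd();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm512_add_pd(acc, _mm512_loadu_pd(x + i));   /* full blocks */

    size_t rem = n - i;                                     /* 0..7 left over */
    if (rem > 0) {
        __mmask8 k = (__mmask8)((1u << rem) - 1u);          /* low rem bits set */
        acc = _mm512_add_pd(acc, _mm512_maskz_loadu_pd(k, x + i));
    }

    double lanes[8];
    _mm512_storeu_pd(lanes, acc);                           /* spill the 8 partial sums */
    double total = 0.0;
    for (int j = 0; j < 8; ++j)
        total += lanes[j];
    return total;
}

/* Hypothetical helper: returns the sum of x[0..n-1]; no alignment required. */
double sum_array(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    if (__builtin_cpu_supports("avx512f"))
        return sum_avx512(x, n);
    double total = 0.0;                                     /* scalar fallback */
    for (size_t i = 0; i < n; ++i)
        total += x[i];
    return total;
}
```

The vector path adds elements in a different order than the scalar loop, so results for general floating-point inputs may differ in the last bits.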



_mm512_loadu_ps

extern __m512 __cdecl _mm512_loadu_ps(void const* mem_addr);

Loads 512 bits (composed of sixteen packed float32 elements) from mem_addr into destination.

mem_addr does not need to be aligned on any particular boundary.


_mm512_mask_loadu_ps

extern __m512 __cdecl _mm512_mask_loadu_ps(__m512 src, __mmask16 k, void const* mem_addr);

Loads packed float32 elements from mem_addr into destination using writemask k (elements are copied from src when the corresponding mask bit is not set).

mem_addr does not need to be aligned on any particular boundary.


_mm512_maskz_loadu_ps

extern __m512 __cdecl _mm512_maskz_loadu_ps(__mmask16 k, void const* mem_addr);

Loads packed float32 elements from mem_addr into destination using zeromask k (elements are zeroed out when the corresponding mask bit is not set).

mem_addr does not need to be aligned on any particular boundary.



_mm512_store_pd

extern void __cdecl _mm512_store_pd(void* mem_addr, __m512d a);

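Stores 512 bits (composed of eight packed float64 elements) from a into mem_addr.

mem_addr must be aligned on a 64-byte boundary or a general-protection exception will be generated.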


_mm512_mask_store_pd

extern void __cdecl _mm512_mask_store_pd(void* mem_addr, __mmask8 k, __m512d a);

Stores packed float64 elements from a into mem_addr using writemask k.

mem_addr must be aligned on a 64-byte boundary or a general-protection exception will be generated.
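
A masked store writes only the elements whose mask bit is set and leaves the other memory locations unchanged. The sketch below uses this to replace negative values with zero in place. The names zero_negatives8 and zero_negatives8_avx512 are hypothetical, a GCC- or Clang-compatible compiler is assumed, and the scalar branch is included only for processors without Intel® AVX-512 Foundation support.

```c
#include <assert.h>
#include <stdint.h>
#include <immintrin.h>

/* Assumption: GCC- or Clang-compatible compiler (target attribute). */
__attribute__((target("avx512f")))
static void zero_negatives8_avx512(double *data)
{
    __m512d v = _mm512_load_pd(data);
    __m512d zero = _mm512_setzero_pd();
    /* Bit i of neg is set when data[i] < 0.0 (ordered, non-signaling compare). */
    __mmask8 neg = _mm512_cmp_pd_mask(v, zero, _CMP_LT_OQ);
    /* Only the selected lanes are written; the rest of memory is untouched. */
    _mm512_mask_store_pd(data, neg, zero);
}

/* Hypothetical helper: sets negative entries of data[0..7] to 0.0.
   data must be aligned on a 64-byte boundary. */
void zero_negatives8(double *data)
{
    assert((uintptr_t)data % 64 == 0);
    if (__builtin_cpu_supports("avx512f")) {
        zero_negatives8_avx512(data);
    } else {
        /* Scalar fallback with the same comparison. */
        for (int i = 0; i < 8; ++i)
            if (data[i] < 0.0)
                data[i] = 0.0;
    }
}
```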



_mm512_store_ps

extern void __cdecl _mm512_store_ps(void* mem_addr, __m512 a);

Stores 512 bits (composed of sixteen packed float32 elements) from a into mem_addr.

mem_addr must be aligned on a 64-byte boundary or a general-protection exception will be generated.



_mm512_mask_store_ps

extern void __cdecl _mm512_mask_store_ps(void* mem_addr, __mmask16 k, __m512 a);

Stores packed float32 elements from a into mem_addr using writemask k.

mem_addr must be aligned on a 64-byte boundary or a general-protection exception will be generated.



_mm512_stream_pd

extern void __cdecl _mm512_stream_pd(void* mem_addr, __m512d a);

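Stores 512 bits (composed of eight packed float64 elements) from a into mem_addr using a non-temporal memory hint.

mem_addr must be aligned on a 64-byte boundary or a general-protection exception will be generated.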


_mm512_stream_ps

extern void __cdecl _mm512_stream_ps(void* mem_addr, __m512 a);

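Stores 512 bits (composed of sixteen packed float32 elements) from a into mem_addr using a non-temporal memory hint.

mem_addr must be aligned on a 64-byte boundary or a general-protection exception will be generated.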
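
Non-temporal stores are typically used when filling a buffer that will not be read again soon. The following sketch fills a buffer in 64-byte blocks with _mm512_stream_pd and then issues _mm_sfence so that the streaming stores are ordered before later stores. The names fill_stream and fill_stream_avx512 are hypothetical, a GCC- or Clang-compatible compiler is assumed, and the 64-byte alignment check reflects the requirement of the underlying VMOVNTPD instruction. The scalar branch is included only for processors without Intel® AVX-512 Foundation support.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>

/* Assumption: GCC- or Clang-compatible compiler (target attribute). */
__attribute__((target("avx512f")))
static void fill_stream_avx512(double *dst, size_t blocks, double value)
{
    __m512d v = _mm512_set1_pd(value);
    for (size_t b = 0; b < blocks; ++b)
        _mm512_stream_pd(dst + 8 * b, v);   /* non-temporal 64-byte store */
    _mm_sfence();                           /* order streaming stores before later stores */
}

/* Hypothetical helper: writes value to dst[0 .. 8*blocks-1].
   dst must be aligned on a 64-byte boundary. */
void fill_stream(double *dst, size_t blocks, double value)
{
    assert((uintptr_t)dst % 64 == 0);
    if (__builtin_cpu_supports("avx512f")) {
        fill_stream_avx512(dst, blocks, value);
    } else {
        /* Scalar fallback using ordinary stores. */
        for (size_t i = 0; i < 8 * blocks; ++i)
            dst[i] = value;
    }
}
```

Whether streaming stores help depends on the buffer size and access pattern; for small buffers that are reused immediately, regular stores are usually the better choice.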



_mm_mask_store_sd

extern void __cdecl _mm_mask_store_sd(double* mem_addr, __mmask8 k, __m128d a);

Stores lower float64 element from a into mem_addr using writemask k.

mem_addr must be aligned on a 16-byte boundary or a general-protection exception will be generated.



_mm_mask_store_ss

extern void __cdecl _mm_mask_store_ss(float* mem_addr, __mmask8 k, __m128 a);

Stores lower float32 element from a into mem_addr using writemask k.

mem_addr must be aligned on a 16-byte boundary or a general-protection exception will be generated.



_mm512_storeu_pd

extern void __cdecl _mm512_storeu_pd(void* mem_addr, __m512d a);

Stores 512 bits (composed of eight packed float64 elements) from a into mem_addr.

mem_addr does not need to be aligned on any particular boundary.


_mm512_mask_storeu_pd

extern void __cdecl _mm512_mask_storeu_pd(void* mem_addr, __mmask8 k, __m512d a);

Stores packed float64 elements from a into mem_addr using writemask k.

mem_addr does not need to be aligned on any particular boundary.
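
The unaligned masked store pairs naturally with a masked load to handle the tail of an array in place. The sketch below scales an array of arbitrary length: full blocks use _mm512_loadu_pd and _mm512_storeu_pd, and the remainder uses _mm512_maskz_loadu_pd with _mm512_mask_storeu_pd so that nothing past the last element is written. The names scale_array and scale_avx512 are hypothetical, a GCC- or Clang-compatible compiler is assumed, and the scalar branch is included only for processors without Intel® AVX-512 Foundation support.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* Assumption: GCC- or Clang-compatible compiler (target attribute). */
__attribute__((target("avx512f")))
static void scale_avx512(double *x, size_t n, double factor)
{
    __m512d f = _mm512_set1_pd(factor);
    size_t i = 0;
    for (; i + 8 <= n; i += 8)                              /* full blocks */
        _mm512_storeu_pd(x + i, _mm512_mul_pd(_mm512_loadu_pd(x + i), f));

    size_t rem = n - i;                                     /* 0..7 left over */
    if (rem > 0) {
        __mmask8 k = (__mmask8)((1u << rem) - 1u);          /* low rem bits set */
        __m512d v = _mm512_maskz_loadu_pd(k, x + i);
        _mm512_mask_storeu_pd(x + i, k, _mm512_mul_pd(v, f)); /* writes rem elements only */
    }
}

/* Hypothetical helper: x[i] *= factor for i = 0..n-1; no alignment required. */
void scale_array(double *x, size_t n, double factor)
{
    assert(x != NULL || n == 0);
    if (__builtin_cpu_supports("avx512f")) {
        scale_avx512(x, n, factor);
    } else {
        for (size_t i = 0; i < n; ++i)                      /* scalar fallback */
            x[i] *= factor;
    }
}
```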



_mm512_storeu_ps

extern void __cdecl _mm512_storeu_ps(void* mem_addr, __m512 a);

Stores 512 bits (composed of sixteen packed float32 elements) from a into mem_addr.

mem_addr does not need to be aligned on any particular boundary.


_mm512_mask_storeu_ps

extern void __cdecl _mm512_mask_storeu_ps(void* mem_addr, __mmask16 k, __m512 a);

Stores packed float32 elements from a into mem_addr using writemask k.

mem_addr does not need to be aligned on any particular boundary.