The Streaming SIMD Extensions (SSE) intrinsics provide access to IA-64 instructions for Streaming SIMD Extensions. To provide source compatibility with the IA-32 architecture, these intrinsics are equivalent both in name and functionality to the set of IA-32 architecture-based SSE intrinsics.
To write programs with the intrinsics, you should be familiar with the hardware features provided by SSE. Keep the following issues in mind:
Certain intrinsics are provided only for compatibility with previously defined IA-32 architecture-based intrinsics. Using them on systems based on IA-64 architecture is likely to degrade performance.
Floating-point (FP) data loaded or stored as __m128 objects must be 16-byte-aligned.
Some intrinsics require that their arguments be immediates, that is, constant integers (literals), because of the nature of the instruction.
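As an illustration of the immediate-argument requirement, the following minimal sketch uses _mm_shuffle_ps, whose selector must be a compile-time constant and is conventionally built with the _MM_SHUFFLE macro. The function name reverse_ps is hypothetical and exists only for this example.

```c
#include <xmmintrin.h>

/* Hypothetical helper: reverse the order of the four floats in v.
   The third argument of _mm_shuffle_ps must be an immediate, so the
   selector is a constant expression built with _MM_SHUFFLE; passing
   a run-time variable here would not compile. */
__m128 reverse_ps(__m128 v)
{
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
}
```

Calling reverse_ps on a value holding (1, 2, 3, 4) in elements 0 through 3 should yield (4, 3, 2, 1).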
The new data type __m128 is used with the SSE intrinsics. It represents a 128-bit quantity composed of four single-precision FP values. This corresponds to the 128-bit IA-32 architecture-based Streaming SIMD Extensions register.
The compiler aligns __m128 local data to 16-byte boundaries on the stack. Global data of this type is also 16-byte-aligned. To align integer, float, or double arrays, you can use the __declspec(align(16)) specifier.
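The alignment rule can be sketched as follows. The ALIGN16 macro, the coeffs array, and the two functions are assumptions made for this example; the macro selects the __declspec form where it is available and falls back to the GNU-style attribute elsewhere.

```c
#include <stdint.h>
#include <xmmintrin.h>

/* Portability shim (an assumption of this sketch, not part of the
   intrinsics): pick the alignment syntax the compiler understands. */
#if defined(_MSC_VER)
#define ALIGN16 __declspec(align(16))
#else
#define ALIGN16 __attribute__((aligned(16)))
#endif

/* A global float array placed on a 16-byte boundary. */
ALIGN16 float coeffs[4] = { 1.0f, 2.0f, 3.0f, 4.0f };

/* Hypothetical helper: returns 1 if p lies on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Doubles four floats in place. p must be 16-byte-aligned because
   _mm_load_ps and _mm_store_ps are the aligned load/store forms. */
void scale_by_two(float *p)
{
    __m128 v = _mm_load_ps(p);
    v = _mm_add_ps(v, v);
    _mm_store_ps(p, v);
}
```

For data whose alignment cannot be guaranteed, the unaligned forms _mm_loadu_ps and _mm_storeu_ps are the safer choice.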
Because IA-64 instructions treat the SSE registers in the same way whether you are using packed or scalar data, there is no __m32 data type to represent scalar data. For scalar operations, use the __m128 objects and the "scalar" forms of the intrinsics; the compiler and the processor implement these operations with 32-bit memory references. For better performance, however, the packed form should be substituted for the scalar form whenever possible.
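The difference between the scalar and packed forms can be seen in a small sketch. The wrapper names add_low and add_all are hypothetical; the intrinsics they call are _mm_add_ss and _mm_add_ps.

```c
#include <xmmintrin.h>

/* Scalar form: only element 0 is added; elements 1 through 3 of the
   result are copied unchanged from a. */
__m128 add_low(__m128 a, __m128 b)
{
    return _mm_add_ss(a, b);
}

/* Packed form: all four elements are added in one operation. */
__m128 add_all(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);
}
```

Both functions take and return __m128 objects; the scalar form simply ignores three of the four elements, which is why restructuring a loop to process four values per packed operation is usually the better choice.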
The address of a __m128 object may be taken.
For more information, see Intel Architecture Software Developer's Manual, Volume 2: Instruction Set Reference Manual, Intel Corporation, doc. number 243191.
SSE intrinsics are defined for the __m128 data type, a 128-bit quantity consisting of four single-precision FP values. SIMD instructions for systems based on IA-64 architecture operate on 64-bit FP register quantities containing two single-precision floating-point values. Thus, each __m128 operand is actually a pair of FP registers and therefore each intrinsic corresponds to at least one pair of IA-64 instructions operating on the pair of FP register operands.
Many of the SSE intrinsics for systems based on IA-64 architecture were created for compatibility with existing IA-32 architecture-based intrinsics and not for performance. In some situations, intrinsic usage that improved performance on IA-32 architecture will not do so on systems based on IA-64 architecture. One reason for this is that some intrinsics map nicely into the IA-32 instruction set but not into the IA-64 instruction set. Thus, it is important to differentiate between intrinsics which were implemented for a performance advantage on systems based on IA-64 architecture, and those implemented simply to provide compatibility with existing IA-32 architecture-based code.
The following intrinsics are likely to reduce performance and should only be used to initially port legacy code or in non-critical code sections:
Any SSE scalar intrinsic (_ss variety) - use packed (_ps) version if possible
comi and ucomi SSE comparisons - these correspond to the IA-32 architecture-based COMISS and UCOMISS instructions only. A sequence of IA-64 instructions is required to implement them.
Conversions in general are multi-instruction operations. These are particularly expensive: _mm_cvtpi16_ps, _mm_cvtpu16_ps, _mm_cvtpi8_ps, _mm_cvtpu8_ps, _mm_cvtpi32x2_ps, _mm_cvtps_pi16, _mm_cvtps_pi8
SSE utility intrinsic _mm_movemask_ps
If the inaccuracy is acceptable, the SIMD reciprocal and reciprocal square root approximation intrinsics (rcp and rsqrt) are much faster than the true div and sqrt intrinsics.
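The trade-off can be sketched by computing the same quantity both ways and measuring the difference. All function names below are hypothetical. The size of the error depends on the processor: on IA-32 processors the approximation instructions are documented as accurate to roughly 12 bits, and this sketch assumes a similar order of magnitude elsewhere.

```c
#include <xmmintrin.h>

/* Full-precision reciprocal and reciprocal square root. */
__m128 recip_exact(__m128 x)
{
    return _mm_div_ps(_mm_set1_ps(1.0f), x);
}

__m128 rsqrt_exact(__m128 x)
{
    return _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(x));
}

/* Approximate versions: faster, but with reduced precision. */
__m128 recip_fast(__m128 x)
{
    return _mm_rcp_ps(x);
}

__m128 rsqrt_fast(__m128 x)
{
    return _mm_rsqrt_ps(x);
}

/* Largest relative difference between approx and exact over the
   four elements; exact is assumed to contain no zeros. */
float max_rel_diff(__m128 approx, __m128 exact)
{
    float a[4], e[4], worst = 0.0f;
    int i;
    _mm_storeu_ps(a, approx);
    _mm_storeu_ps(e, exact);
    for (i = 0; i < 4; i++) {
        float d = (a[i] - e[i]) / e[i];
        if (d < 0.0f) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}
```

A check such as max_rel_diff(recip_fast(x), recip_exact(x)) on representative inputs is one way to decide whether the approximation is accurate enough for a given computation.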