This topic provides general guidelines for coding practices and techniques for using:
IA-32 and Intel® 64 architectures supporting MMX™ technology and Intel® Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2 (SSE2), Streaming SIMD Extensions 3 (SSE3), and Streaming SIMD Extensions 4 (SSE4)
IA-64 architecture
This section describes practices, tools, coding rules, and recommendations associated with the architecture features that can improve performance on processors based on IA-32, Intel® 64, and IA-64 architectures.
If a guideline applies to a particular architecture only, that architecture is explicitly named. The default is IA-32 architectures.
Performance of compiler-generated code may vary from one compiler to another. Intel® Fortran Compiler generates code that is optimized for Intel architectures. You can significantly improve performance by using various compiler optimization options. In addition, you can help the compiler to optimize your Fortran program by following the guidelines described in this section.
To achieve optimum processor performance in your Fortran application, do the following:
avoid memory access stalls
ensure good floating-point performance
ensure good SIMD integer performance
use vectorization
The following sections summarize and describe coding practices, rules, and recommendations associated with the features that contribute to optimal performance on Intel architecture-based processors.
The Intel compiler lays out arrays in column-major order. For example, in a two-dimensional array, elements A(22,34) and A(23,34) are contiguous in memory. For best performance, code arrays so that inner loops access them in a contiguous manner.
Consider the following examples. The code in example 1 will likely have higher performance than the code in example 2.
Example 1

subroutine contiguous(a, b, N)
  integer :: i, j, N, a(N,N), b(N,N)
  do j = 1, N
    do i = 1, N
      b(i, j) = a(i, j) + 1
    end do
  end do
end subroutine contiguous
The code above accesses arrays A and B in the inner loop over I in a contiguous manner, which results in good performance. In contrast, the following example accesses arrays A and B in the inner loop over J in a non-contiguous manner, which results in poor performance.
Example 2

subroutine non_contiguous(a, b, N)
  integer :: i, j, N, a(N,N), b(N,N)
  do i = 1, N
    do j = 1, N
      b(i, j) = a(i, j) + 1
    end do
  end do
end subroutine non_contiguous
The compiler can transform the code so that inner loops access memory in a contiguous manner. To do that, you need to use advanced optimization options, such as -O3 (Linux* OS) or /O3 (Windows* OS) for IA-32, Intel® 64, and IA-64 architectures, and -O3 (Linux) or /O3 (Windows) and -ax (Linux) or /Qax (Windows) for IA-32 architecture only.
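For illustration only, a typical command line might look like the following; the source file name is hypothetical, and the value supplied to the -ax (Linux) or /Qax (Windows) option depends on the processor you are targeting (see the compiler option reference for the values supported by your compiler version).

Linux:    ifort -O3 -ax<processor> contiguous.f90
Windows:  ifort /O3 /Qax<processor> contiguous.f90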
Alignment is an increasingly important factor in ensuring good performance. Aligned memory accesses are faster than unaligned accesses. If you use interprocedural optimization across multiple files, the -ipo (Linux) or /Qipo (Windows) option, the compiler analyzes the code and decides whether it is beneficial to pad arrays so that they start on an aligned boundary. Multiple arrays specified in a single common block can impose extra constraints on the compiler.
For example, consider the following COMMON statement:
Example 3

COMMON /AREA1/ A(200), X, B(200)
If the compiler added padding to align A(1) on a 16-byte boundary, element B(1) would not be at a 16-byte aligned address: with default 4-byte elements, B(1) falls 804 bytes after A(1) (800 bytes for A plus 4 bytes for X), and 804 is not a multiple of 16. So it is better to split AREA1 as follows.
Example 4

COMMON /AREA1/ A(200)
COMMON /AREA2/ X
COMMON /AREA3/ B(200)
The above code provides the compiler maximum flexibility in determining the padding required for both A and B.
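As a sketch only (the routine name and the computation are hypothetical, not part of the original example), code that references the split common blocks might look like the following. Because A and B now live in separate common blocks, the compiler is free to pad each block so that A(1) and B(1) both start on aligned boundaries, and the contiguous inner loop can then benefit from aligned accesses.

subroutine scale_arrays()
  COMMON /AREA1/ A(200)
  COMMON /AREA2/ X
  COMMON /AREA3/ B(200)
  integer :: i
  ! Contiguous, stride-1 access over A and B; with the split common
  ! blocks, the compiler may align each array for faster accesses.
  do i = 1, 200
    B(i) = A(i) * X
  end do
end subroutine scale_arrays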