Coding Guidelines for Intel® Architectures

This topic describes practices, tools, coding rules, and recommendations associated with architecture features that can improve performance on processors based on IA-32, Intel® 64, and IA-64 architectures.

Note

If a guideline applies to a particular architecture only, that architecture is explicitly named. Otherwise, the guideline applies to IA-32 architectures.

Performance of compiler-generated code may vary from one compiler to another. Intel® Fortran Compiler generates code that is optimized for Intel architectures. You can significantly improve performance by using various compiler optimization options. In addition, you can help the compiler to optimize your Fortran program by following the guidelines described in this section.

To achieve optimum processor performance in your Fortran application, follow the coding practices, rules, and recommendations summarized in the following sections. They are associated with the features that contribute to optimizing performance on Intel architecture-based processors.

Memory Access

The Intel compiler lays out arrays in column-major order, so the leftmost subscript varies fastest in memory. For example, in a two-dimensional array A, elements A(22,34) and A(23,34) are contiguous in memory. For best performance, write loops so that the innermost loop steps through the leftmost subscript and therefore accesses the array contiguously.
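As a minimal sketch of what column-major order implies, the following module maps a pair of subscripts to its position in the array's storage sequence. The module name column_major_demo, the function storage_position, and its nrows argument are illustrative assumptions, not part of the compiler or its libraries.

```fortran
module column_major_demo
  implicit none
contains

  ! Returns the 1-based position of element (i, j) in the storage
  ! sequence of a two-dimensional array declared with nrows rows,
  ! assuming column-major layout and lower bounds of 1.
  pure integer function storage_position(i, j, nrows) result(pos)
    integer, intent(in) :: i, j, nrows
    ! Each complete column before column j contributes nrows elements.
    pos = i + (j - 1) * nrows
  end function storage_position

end module column_major_demo
```

For an array with 100 rows, storage_position(23, 34, 100) exceeds storage_position(22, 34, 100) by 1, whereas moving from A(22,34) to A(22,35) skips 100 positions.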

Consider the following examples. The code in Example 1 will likely perform better than the code in Example 2.

Example 1

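subroutine contiguous(a, b, N)
  integer :: i, j, N, a(N,N), b(N,N)
  ! The inner loop varies the first subscript i, so consecutive
  ! iterations touch adjacent memory locations.
  do j = 1, N
    do i = 1, N
      b(i, j) = a(i, j) + 1
    end do
  end do
end subroutine contiguous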

The code above accesses arrays A and B contiguously in the inner loop over I, which results in good performance. In contrast, the following example accesses arrays A and B in the inner loop over J in a non-contiguous manner, which results in poor performance.

Example 2

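subroutine non_contiguous(a, b, N)
  integer :: i, j, N, a(N,N), b(N,N)
  ! The inner loop varies the second subscript j, so consecutive
  ! iterations touch locations that are N elements apart.
  do i = 1, N
    do j = 1, N
      b(i, j) = a(i, j) + 1
    end do
  end do
end subroutine non_contiguous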

The compiler can transform the code so that inner loops access memory contiguously. To enable this, use advanced optimization options: -O3 (Linux* OS) or /O3 (Windows* OS) on IA-32, Intel® 64, and IA-64 architectures, or, on IA-32 architecture only, -O3 (Linux) or /O3 (Windows) combined with -ax (Linux) or /Qax (Windows).
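One way to observe the difference between Examples 1 and 2 on your own system is to time both loop orders. The module below is a sketch under stated assumptions: the names loop_order_demo, add_one_contiguous, add_one_strided, and time_both are hypothetical, and timing uses the standard SYSTEM_CLOCK intrinsic. Measured times depend on array size, hardware, and optimization level. If the compiler reorders the loops itself, the two times may be nearly identical.

```fortran
module loop_order_demo
  use iso_fortran_env, only: int64
  implicit none
contains

  ! Inner loop over the first subscript: unit-stride (contiguous) access.
  subroutine add_one_contiguous(a, b, n)
    integer, intent(in)  :: n
    integer, intent(in)  :: a(n, n)
    integer, intent(out) :: b(n, n)
    integer :: i, j
    do j = 1, n
      do i = 1, n
        b(i, j) = a(i, j) + 1
      end do
    end do
  end subroutine add_one_contiguous

  ! Inner loop over the second subscript: consecutive accesses are
  ! n elements apart in memory.
  subroutine add_one_strided(a, b, n)
    integer, intent(in)  :: n
    integer, intent(in)  :: a(n, n)
    integer, intent(out) :: b(n, n)
    integer :: i, j
    do i = 1, n
      do j = 1, n
        b(i, j) = a(i, j) + 1
      end do
    end do
  end subroutine add_one_strided

  ! Hypothetical helper: times one call of each variant on an
  ! n-by-n array and returns the elapsed wall-clock seconds.
  subroutine time_both(n, t_contiguous, t_strided)
    integer, intent(in) :: n
    real, intent(out)   :: t_contiguous, t_strided
    integer, allocatable :: a(:, :), b(:, :)
    integer(int64) :: t0, t1, rate

    allocate(a(n, n), b(n, n))
    a = 1

    call system_clock(t0, rate)
    call add_one_contiguous(a, b, n)
    call system_clock(t1)
    ! Checking the result also ensures that b is actually used.
    if (any(b /= 2)) error stop "unexpected result"
    t_contiguous = real(t1 - t0) / real(rate)

    call system_clock(t0)
    call add_one_strided(a, b, n)
    call system_clock(t1)
    if (any(b /= 2)) error stop "unexpected result"
    t_strided = real(t1 - t0) / real(rate)
  end subroutine time_both

end module loop_order_demo
```

Both subroutines compute the same result, so any timing difference comes from the access pattern alone. A single call is a coarse measurement. Repeating the calls and averaging gives steadier numbers.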

Memory Layout

Alignment is an increasingly important factor in ensuring good performance. Aligned memory accesses are faster than unaligned accesses. If you enable interprocedural optimization across multiple files with the -ipo (Linux) or /Qipo (Windows) option, the compiler analyzes the code and decides whether it is beneficial to pad arrays so that they start on an aligned boundary. Multiple arrays specified in a single common block can impose extra constraints on the compiler.

For example, consider the following COMMON statement:

Example 3

COMMON /AREA1/ A(200), X, B(200)

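If the compiler added padding to align A(1) at a 16-byte boundary, the element B(1) would not be at a 16-byte aligned address: assuming 4-byte default REAL elements, A and X together occupy 804 bytes, which is not a multiple of 16. It is therefore better to split AREA1 as follows.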

Example 4

COMMON /AREA1/ A(200)
COMMON /AREA2/ X
COMMON /AREA3/ B(200)

The above code provides the compiler maximum flexibility in determining the padding required for both A and B.
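The alignment problem in Example 3 can be checked with simple offset arithmetic. The following sketch assumes that common block members are stored back to back with no padding between them and that A, X, and B are all default REAL. The names common_offset_demo, byte_offset, and is_aligned are hypothetical helpers introduced for illustration.

```fortran
module common_offset_demo
  implicit none
contains

  ! Byte offset, from the start of a common block, of the member that
  ! follows elements_before default REAL elements (no padding assumed).
  pure integer function byte_offset(elements_before) result(offset)
    integer, intent(in) :: elements_before
    ! storage_size returns the element size in bits.
    offset = elements_before * (storage_size(1.0) / 8)
  end function byte_offset

  ! True if offset is a multiple of boundary (both in bytes).
  pure logical function is_aligned(offset, boundary) result(ok)
    integer, intent(in) :: offset, boundary
    ok = mod(offset, boundary) == 0
  end function is_aligned

end module common_offset_demo
```

In Example 3, B(1) follows the 200 elements of A and the single element X, so its offset is byte_offset(201). With a 4-byte default REAL that is 804 bytes, and is_aligned(804, 16) is false. In Example 4, each array starts its own common block, so each has offset 0 within its block.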