Introduction

To exploit parallelism both across nodes and within a node, the usual approach is hybrid MPI+OpenMP programming, but that style is fairly complex. The MPI 3.0 standard adds remote memory access (RMA) between processes on the same node, so data no longer has to be moved with send/receive: the communication overhead is lower, a single unified communication model can be used throughout, and the programming effort is smaller.

The main references are Intel's introduction to the SHM model, An Introduction to MPI-3 Shared Memory Programming,
and the intermediate-to-advanced MPI tutorial given by Pavan Balaji and Torsten Hoefler at ISC'16, Next Generation MPI Programming: Advanced MPI-2 and New Features in MPI-3.

Those tutorials are written in C, and porting them to Fortran has a few pitfalls, so I also looked up a Fortran-specific question on Stack Overflow: MPI Fortran code: how to share data on node via openMP?

Function definitions

1. MPI_Comm_split_type

Split the world communicator into groups that span the same host/node.

split_type must be set to MPI_COMM_TYPE_SHARED.

MPI_Comm_split_type(comm, split_type, key, info, newcomm, ierror) BIND(C)
        TYPE(MPI_Comm), INTENT(IN) :: comm
        INTEGER, INTENT(IN) :: split_type, key
        TYPE(MPI_Info), INTENT(IN) :: info
        TYPE(MPI_Comm), INTENT(OUT) :: newcomm
        INTEGER, OPTIONAL, INTENT(OUT) :: ierror

MPI_COMM_SPLIT_TYPE(COMM, SPLIT_TYPE, KEY, INFO, NEWCOMM, IERROR)
        INTEGER COMM, SPLIT_TYPE, KEY, INFO, NEWCOMM, IERROR
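
A minimal usage sketch (my own, not taken from the references; hostcomm and hostrank match the example at the end of this post):

  integer :: hostcomm, hostrank, ierr

  ! Group the ranks of MPI_COMM_WORLD by node: every process that can
  ! share memory with this one ends up in the same hostcomm.
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                           MPI_INFO_NULL, hostcomm, ierr)
  ! Rank within the node; hostrank 0 will own the shared allocation later.
  call MPI_Comm_rank(hostcomm, hostrank, ierr)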

2. MPI_Group_translate_ranks

This function is important for determining the relative numbering of the same processes in two different groups. For instance, if one knows the ranks of certain processes in the group of MPI_COMM_WORLD, one might want to know their ranks in a subset of that group.

MPI_Group_translate_ranks(group1, n, ranks1, group2, ranks2, ierror) BIND(C)
        TYPE(MPI_Group), INTENT(IN) :: group1, group2
        INTEGER, INTENT(IN) :: n, ranks1(n)
        INTEGER, INTENT(OUT) :: ranks2(n)
        INTEGER, OPTIONAL, INTENT(OUT) :: ierror
MPI_GROUP_TRANSLATE_RANKS(GROUP1, N, RANKS1, GROUP2, RANKS2, IERROR)
        INTEGER GROUP1, N, RANKS1(*), GROUP2, RANKS2(*), IERROR
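
This routine does not appear in the example below, so here is a small sketch (my own) of how I would use it to find the MPI_COMM_WORLD rank of each node's local rank 0, assuming hostcomm from section 1:

  integer :: worldgroup, hostgroup, ierr
  integer :: hostroot(1), worldroot(1)

  call MPI_Comm_group(MPI_COMM_WORLD, worldgroup, ierr)
  call MPI_Comm_group(hostcomm, hostgroup, ierr)

  ! Translate rank 0 of the node communicator into its world rank.
  hostroot(1) = 0
  call MPI_Group_translate_ranks(hostgroup, 1, hostroot, worldgroup, worldroot, ierr)

  call MPI_Group_free(hostgroup, ierr)
  call MPI_Group_free(worldgroup, ierr)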

3. MPI_Win_allocate_shared

Window That Allocates Shared Memory

MPI_WIN_ALLOCATE_SHARED is collective over comm: each process allocates a segment of size bytes (possibly zero), and the combined memory can be accessed directly, with ordinary loads and stores, by every process in the communicator.

MPI_Win_allocate_shared(size, disp_unit, info, comm, baseptr, win, ierror) BIND(C)
        USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR
        INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) :: size
        INTEGER, INTENT(IN) :: disp_unit
        TYPE(MPI_Info), INTENT(IN) :: info
        TYPE(MPI_Comm), INTENT(IN) :: comm
        TYPE(C_PTR), INTENT(OUT) :: baseptr
        TYPE(MPI_Win), INTENT(OUT) :: win
        INTEGER, OPTIONAL, INTENT(OUT) :: ierror
MPI_WIN_ALLOCATE_SHARED(SIZE, DISP_UNIT, INFO, COMM, BASEPTR, WIN, IERROR)
        INTEGER DISP_UNIT, INFO, COMM, WIN, IERROR
        INTEGER(KIND=MPI_ADDRESS_KIND) SIZE, BASEPTR
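
In the example at the end of this post only hostrank 0 passes a nonzero size; the other common pattern is that every rank contributes its own chunk. A sketch of that variant (my own variable names):

  USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR
  integer(kind=MPI_ADDRESS_KIND) :: windowsize
  integer :: win, ierr
  type(C_PTR) :: baseptr

  ! Every rank contributes 1000 doubles; by default the per-rank segments
  ! are laid out contiguously in rank order (the alloc_shared_noncontig
  ! info key relaxes this).
  windowsize = 1000_MPI_ADDRESS_KIND * 8_MPI_ADDRESS_KIND
  call MPI_Win_allocate_shared(windowsize, 1, MPI_INFO_NULL, hostcomm, &
                               baseptr, win, ierr)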

4. MPI_Win_shared_query

This function queries the process-local address for remote memory segments created with MPI_WIN_ALLOCATE_SHARED.

MPI_Win_shared_query(win, rank, size, disp_unit, baseptr, ierror) BIND(C)
        USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR
        TYPE(MPI_Win), INTENT(IN) :: win
        INTEGER, INTENT(IN) :: rank
        INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(OUT) :: size
        INTEGER, INTENT(OUT) :: disp_unit
        TYPE(C_PTR), INTENT(OUT) :: baseptr
        INTEGER, OPTIONAL, INTENT(OUT) :: ierror
MPI_WIN_SHARED_QUERY(WIN, RANK, SIZE, DISP_UNIT, BASEPTR, IERROR)
        INTEGER WIN, RANK, DISP_UNIT, IERROR
        INTEGER (KIND=MPI_ADDRESS_KIND) SIZE, BASEPTR
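
The example below queries rank 0 explicitly; my reading of the MPI-3 standard is that passing MPI_PROC_NULL as rank instead returns the segment of the lowest rank that allocated a nonzero size, which is convenient when only one process allocates. A sketch (win, dp and arrayshape as in the example below):

  USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR, C_F_POINTER
  integer(kind=MPI_ADDRESS_KIND) :: segsize
  integer :: segdisp, ierr
  type(C_PTR) :: segptr
  real(dp), pointer :: shared(:,:,:,:)

  ! Locate the shared segment in this process's address space and attach
  ! a Fortran pointer to it.
  call MPI_Win_shared_query(win, MPI_PROC_NULL, segsize, segdisp, segptr, ierr)
  call C_F_POINTER(segptr, shared, arrayshape)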

5. MPI-3 RMA access epoch

Several pairs of functions can be used to open and close an access epoch:

MPI_Win_lock / MPI_Win_unlock, together with MPI_Win_sync
MPI_Win_lock_all / MPI_Win_unlock_all, together with MPI_Win_sync
MPI_Win_fence / MPI_Win_fence (one fence opens the epoch, another closes it)

After steps 1-4 have been carried out, the functions from step 5 are used to explicitly delimit the code region in which remote memory is accessed. According to the references above, the overhead of the three pairs increases in the order listed.
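
The example at the end uses the fence pair; for completeness, here is a sketch of the lock_all variant (variable names taken from that example, and the sync/barrier/sync pattern as I understand it from the Hoefler tutorial):

  call MPI_Win_lock_all(0, win, ierr)   ! open a passive-target epoch on all ranks

  if (hostrank == 0) matrix_elementsy = 0.0_dp

  call MPI_Win_sync(win, ierr)          ! flush this process's stores
  call MPI_Barrier(hostcomm, ierr)      ! wait until everyone has reached this point
  call MPI_Win_sync(win, ierr)          ! pick up the other processes' stores

  ! ... every rank on the node can now read what hostrank 0 wrote ...

  call MPI_Win_unlock_all(win, ierr)    ! close the epoch before MPI_Win_free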

Example

The SHM routines involve a fair amount of pointer handling, and in my opinion the C-binding style of the interfaces is the better choice: declare the pointer as type(C_PTR) rather than INTEGER(KIND=MPI_ADDRESS_KIND), and then call C_F_POINTER to associate the C pointer with a Fortran pointer.

program sharedmemtest
  USE, INTRINSIC :: ISO_C_BINDING, ONLY : C_PTR, C_F_POINTER
  use mpi
  implicit none
  integer, parameter :: dp = selected_real_kind(14,200)
  integer :: win,win2,hostcomm,hostrank
  INTEGER(KIND=MPI_ADDRESS_KIND) :: windowsize
  INTEGER :: disp_unit,my_rank,ierr,total
  TYPE(C_PTR) :: baseptr,baseptr2
  real(dp), POINTER :: matrix_elementsy(:,:,:,:)
  integer,allocatable :: arrayshape(:)

  call MPI_INIT( ierr )

  !GET THE RANK OF ONE PROCESS
  call MPI_COMM_RANK(MPI_COMM_WORLD,MY_RANK,IERR)
  !GET THE TOTAL PROCESSES OF THE COMM
  call MPI_COMM_SIZE(MPI_COMM_WORLD,Total,IERR)

  CALL MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, hostcomm,ierr)
  CALL MPI_Comm_rank(hostcomm, hostrank,ierr)

  ! Gratefully based on: 
  !http://stackoverflow.com/questions/24797298/mpi-fortran-code-how-to-share-data-on-node-via-openmp
  ! and https://gcc.gnu.org/onlinedocs/gfortran/C_005fF_005fPOINTER.html
  ! We only want one process per host to allocate memory
  ! Set size to 0 in all processes but one

  allocate(arrayshape(4))
  arrayshape=(/ 10,10,10,10 /)
  if (hostrank == 0) then
     ! Put the actual data size here
     windowsize = int(10**4,MPI_ADDRESS_KIND)*8_MPI_ADDRESS_KIND !*8 for double
  else
     windowsize = 0_MPI_ADDRESS_KIND
  end if
  disp_unit = 1
  CALL MPI_Win_allocate_shared(windowsize, disp_unit, MPI_INFO_NULL, hostcomm, baseptr, win, ierr)

  ! Obtain the location of the memory segment
  if (hostrank /= 0) then
     CALL MPI_Win_shared_query(win, 0, windowsize, disp_unit, baseptr, ierr)
  end if

  ! baseptr can now be associated with a Fortran pointer  
  ! and thus used to access the shared data

  CALL C_F_POINTER(baseptr, matrix_elementsy,arrayshape)

  !!! your code here!

  !!! sample below


  if (hostrank == 0) then
     matrix_elementsy=0.0_dp
     matrix_elementsy(1,2,3,4)=1.0_dp
  end if
  ! The fence acts as a barrier plus memory synchronization: after it,
  ! every rank on the node sees the values written by hostrank 0.
  CALL MPI_WIN_FENCE(0, win, ierr)

  print *,"my_rank=",my_rank,matrix_elementsy(1,2,3,4),matrix_elementsy(1,2,3,5)

  !!! end sample code


  call MPI_WIN_FENCE(0, win, ierr) 
  call MPI_BARRIER(MPI_COMM_WORLD,ierr) 
  call MPI_Win_free(win,ierr)
  call MPI_FINALIZE(IERR)

end program sharedmemtest
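
If the code is saved as sharedmemtest.f90, compiling and running it should look roughly like this (the wrapper names depend on the MPI installation):

  mpif90 sharedmemtest.f90 -o sharedmemtest
  mpirun -np 4 ./sharedmemtest

On a single node every rank should print 1.0 for matrix_elementsy(1,2,3,4) and 0.0 for matrix_elementsy(1,2,3,5), because they all read the same shared array that hostrank 0 initialized.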