Avoiding deadlock in standard-mode point-to-point communication in an MPI program

A program that calls MPI_Send followed by MPI_Recv can remain blocked in MPI_Send when two or more processes exchange large messages (on Ada: the threshold varies with the number of processes; on Turing: >2048 bytes).

This problem is caused by the “standard” MPI send mode (MPI_Send), which leaves the MPI implementation free to choose how a message is sent. In general (and on all the IDRIS machines), small messages are copied into a temporary buffer before being sent, whereas large messages are sent in synchronous mode. In synchronous mode, the send can complete (i.e. the MPI_Send call can return) only once the matching receive has been posted (i.e. MPI_Recv has been called on the destination process).

Deadlock is therefore possible. It can occur, for example, with the following communication pattern, in which each process waits in its send for a receive that the other process never reaches:

          process   0                             1
                    send(  to 1)                  send(  to 0)
                    recv(from 1)                  recv(from 0)

This communication pattern is inadvisable and, moreover, is considered unsafe by the MPI standard because it completes only if the implementation buffers the messages. Fortunately, there are solutions, but they require a slight modification of your source:

  • Replace MPI_Send/MPI_Recv with MPI_Sendrecv:
             process    0                               1
                        sendrecv(to 1, from 1)          sendrecv(to 0, from 0)
  • Replace MPI_Send with MPI_Bsend (note: a dedicated buffer must first be allocated and attached with MPI_Buffer_attach):
             process    0                               1
                        bsend( to 1)                    bsend( to 0)
                        recv(from 1)                    recv(from 0)
  • For one of the processes, swap the order of MPI_Send and MPI_Recv:
             process    0                               1
                        send(  to 1)                    recv(from 0)
                        recv(from 1)                    send(  to 0)
  • Use the non-blocking functions MPI_Isend and MPI_Irecv, followed by a call to MPI_Waitall:
             process    0                               1
                        isend(  to 1)                   isend(  to 0)
                        irecv(from 1)                   irecv(from 0)
                        waitall()                       waitall()

The message size from which synchronous mode is used can be changed; the setting depends on the machine you are using:

  • On Turing: the environment variable PAMID_EAGER sets the message size (in bytes) from which the synchronous sending mode is used (2049 by default),
  • On Ada: the environment variable MP_EAGER_LIMIT sets the message size from which the synchronous sending mode is used.

To make sure that your application is not exposed to this problem, test it with the environment variable set to 0, which forces synchronous mode for all standard sends. If the run completes without blocking, your application should not have this problem (unless you later change your communication patterns).