Problem Description
My distributed cluster jobs on Linux do not start, and I receive error messages from MPI.
Solution
If COMSOL does not work on a Linux cluster, the underlying reason might be that the network interface or the fabric is not detected correctly. On Linux, COMSOL 6.1 is shipped with Intel MPI 2021.6 and COMSOL 6.0 with Intel MPI 2021.2. You can investigate whether there is an incompatibility with Intel MPI using the following steps:
If you find that Intel MPI is not working on your cluster, first make sure that your submission script is configured correctly. Then run the MPI test by calling
comsol hydra mpitest -nn 2 -f hostfile
or, e.g. with Slurm,
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
...
comsol hydra mpitest -nn 2 -nnhost 1
to confirm that MPI is actually the issue. You can add the switch '-mpidebug 10' to get additional debug output.
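The debug output is easier to inspect when it is saved to a file. The following is a minimal sketch and not part of COMSOL: run_logged is a hypothetical helper, mpitest.log is an arbitrary file name, and the sketch assumes that the test command returns a nonzero exit status on failure, which is not stated above.

```shell
#!/bin/sh
# Hypothetical helper (not a COMSOL tool): run a command, save its
# standard output and standard error to a log file, and report the result.
# Usage: run_logged LOGFILE COMMAND [ARGS...]
run_logged() {
    log=$1
    shift
    if "$@" >"$log" 2>&1; then
        echo "passed; output in $log"
    else
        rc=$?   # exit status of the command that was run
        echo "failed with exit status $rc; see $log"
        return "$rc"
    fi
}

# Example call inside a job script (commented out so that this sketch
# can be sourced on a machine without COMSOL):
# run_logged mpitest.log comsol hydra mpitest -nn 2 -nnhost 1 -mpidebug 10
```

If the test fails, the saved log is a convenient record of the MPI error messages to review or to send to your cluster administrator.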
To resolve the problem, you can try suggestions A and B. If A works for you, you should still try B, because that option would offer better performance.
A. Fall Back to TCP
Export the environment variable FI_PROVIDER and set it to 'sockets'. With Slurm, this can be done by means of
#SBATCH --export=FI_PROVIDER=sockets
Otherwise, in a Bourne-type shell such as bash, you can use
export FI_PROVIDER=sockets
or, in csh or tcsh,
setenv FI_PROVIDER sockets
and make sure that this environment variable is handed over to your cluster job.
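One way to see whether the variable actually reaches the job is to print it from inside the job script before COMSOL is launched. This is a minimal sketch; check_fi_provider is a hypothetical helper name, not a COMSOL or Slurm command.

```shell
#!/bin/sh
# Hypothetical helper: report whether FI_PROVIDER is set to 'sockets'
# in the environment of the current shell (for example, the job script).
check_fi_provider() {
    if [ "${FI_PROVIDER:-}" = "sockets" ]; then
        echo "FI_PROVIDER=sockets"
    else
        echo "FI_PROVIDER is '${FI_PROVIDER:-}' (expected 'sockets')" >&2
        return 1
    fi
}

# Example call near the top of a job script (commented out here):
# check_fi_provider || exit 1
```

Stopping the job early when the variable is missing avoids mistaking an environment problem for an MPI problem.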
If you are running cluster jobs from the COMSOL Desktop, add --export=FI_PROVIDER=sockets to the Additional scheduler arguments field. If you are using Slurm, also add the FLROOT environment variable, using a comma as the separator. The value of FLROOT should be the path to the COMSOL installation directory.
--export=FI_PROVIDER=sockets,FLROOT=<COMSOL installation directory>
The downside of this approach is that the communication falls back to TCP, which might be slow if your cluster has a faster fabric.
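Because the comma separates the entries of the --export argument, a comma inside the installation path would be read as a separator. The following minimal sketch assembles the argument and rejects such a path; slurm_export_arg is a hypothetical helper name, and the path in the example call is only a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: build the --export argument for the
# Additional scheduler arguments field.
# Usage: slurm_export_arg COMSOL_INSTALLATION_DIRECTORY
slurm_export_arg() {
    flroot=$1
    case $flroot in
        '')
            echo "FLROOT path is empty" >&2
            return 1
            ;;
        *,*)
            echo "FLROOT path must not contain a comma: $flroot" >&2
            return 1
            ;;
    esac
    printf '%s\n' "--export=FI_PROVIDER=sockets,FLROOT=$flroot"
}

# Example call with a placeholder path (commented out here):
# slurm_export_arg /path/to/comsol
```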
B. Install a Later Intel MPI
Download the latest Intel MPI from here and install it. You can install it in your home directory if you do not have administrator rights on the cluster.
Launch COMSOL with the additional switch
-mpiroot <Intel MPI installation directory>/intel/oneapi/mpi/latest
On Slurm, you can call for example
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
...
comsol hydra mpitest -nn 2 -nnhost 1 -mpiroot <Intel MPI installation directory>/intel/oneapi/mpi/latest
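A mistyped -mpiroot path is easy to overlook, so it can help to check the directory before submitting the job. This is a minimal sketch under one assumption: that the Intel MPI installation contains the launcher bin/mpiexec.hydra below the directory passed to -mpiroot. check_mpiroot is a hypothetical helper name, not a COMSOL or Intel command.

```shell
#!/bin/sh
# Hypothetical helper: check that a directory looks like an Intel MPI
# installation root before passing it to -mpiroot.
# Assumption: the installation provides an executable bin/mpiexec.hydra.
# Usage: check_mpiroot DIRECTORY
check_mpiroot() {
    root=$1
    if [ ! -d "$root" ]; then
        echo "not a directory: $root" >&2
        return 1
    fi
    if [ ! -x "$root/bin/mpiexec.hydra" ]; then
        echo "no executable bin/mpiexec.hydra under $root" >&2
        return 1
    fi
    # Print the verified path so that it can be reused on the command line.
    printf '%s\n' "$root"
}

# Example call (commented out here; replace the placeholder with your path):
# check_mpiroot "<Intel MPI installation directory>/intel/oneapi/mpi/latest"
```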
Remarks:
- You can also point to other MPICH2-based MPI installations (but not to Open MPI, for example).
- In COMSOL 5.6 you can point to the new Intel MPI via -mpiroot as well.
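COMSOL makes every reasonable effort to verify the information on this page. Resources and documents are provided for your information only, and COMSOL makes no explicit or implied claims to their validity. COMSOL does not assume any legal liability for the accuracy of the data disclosed. Any trademarks referenced in this document are the property of their respective owners. Consult your product manuals for complete trademark details.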