Hello Varsha,
Thanks for the suggestion. Here is what I got.
On the HPC cluster, normal queue:
As per your suggestion, I ran the following through the normal queue:
I_MPI_DEBUG=30 FI_LOG_LEVEL=debug mpirun -n 2 -ppn 1 ./hello_mpi > out.txt
The error-channel (stderr) output is as follows:
Loading compiler version 2021.1.1
Loading tbb version 2021.1.1
Loading debugger version 10.0.0
Loading compiler-rt version 2021.1.1
Loading dpl version 2021.1.1
Loading oclfpga version 2021.1.1
Loading init_opencl version 2021.1.1
Warning: Intel PAC device is not found.
Please install the Intel PAC card to execute your program on an FPGA device.
Warning: Intel PAC device is not found.
Please install the Intel PAC card to execute your program on an FPGA device.
Loading compiler/2021.1.1
Loading requirement: tbb/latest debugger/latest compiler-rt/latest dpl/latest
/opt/intel/oneapi/compiler/2021.1.1/linux/lib/oclfpga/modulefiles/init_opencl /opt/intel/oneapi/compiler/2021.1.1/linux/lib/oclfpga/modulefiles/oclfpga
Loading mpi version 2021.1.1
Currently Loaded Modulefiles:
1) tbb/latest
2) debugger/latest
3) compiler-rt/latest
4) dpl/latest
5) /opt/intel/oneapi/compiler/2021.1.1/linux/lib/oclfpga/modulefiles/init_opencl
6) /opt/intel/oneapi/compiler/2021.1.1/linux/lib/oclfpga/modulefiles/oclfpga
7) compiler/2021.1.1
8) mpi/2021.1.1
libfabric:28642:core:core:ofi_hmem_init():202<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:28642:core:core:ofi_hmem_init():202<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:28642:core:core:ofi_hmem_init():202<info> Hmem iface FI_HMEM_ZE not supported
libfabric:28642:core:mr:ofi_default_cache_size():69<info> default cache size=4223952597
libfabric:14757:core:core:ofi_hmem_init():202<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:14757:core:core:ofi_hmem_init():202<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:14757:core:core:ofi_hmem_init():202<info> Hmem iface FI_HMEM_ZE not supported
libfabric:14757:core:mr:ofi_default_cache_size():69<info> default cache size=4223952597
libfabric:28642:core:core:ofi_hmem_init():202<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:28642:core:core:ofi_hmem_init():202<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:28642:core:core:ofi_hmem_init():202<info> Hmem iface FI_HMEM_ZE not supported
libfabric:28642:core:mr:ofi_default_cache_size():69<info> default cache size=4223952597
libfabric:14757:core:core:ofi_hmem_init():202<info> Hmem iface FI_HMEM_CUDA not supported
libfabric:14757:core:core:ofi_hmem_init():202<info> Hmem iface FI_HMEM_ROCR not supported
libfabric:14757:core:core:ofi_hmem_init():202<info> Hmem iface FI_HMEM_ZE not supported
libfabric:14757:core:mr:ofi_default_cache_size():69<info> default cache size=4223952597
libfabric:28642:verbs:fabric:verbs_devs_print():871<info> list of verbs devices found for FI_EP_MSG:
libfabric:28642:verbs:fabric:verbs_devs_print():875<info> #1 mlx5_0 - IPoIB addresses:
libfabric:28642:verbs:fabric:verbs_devs_print():885<info> 192.168.2.2
libfabric:28642:verbs:fabric:verbs_devs_print():885<info> fe80::63f:7203:ae:a922
libfabric:14757:verbs:fabric:verbs_devs_print():871<info> list of verbs devices found for FI_EP_MSG:
libfabric:14757:verbs:fabric:verbs_devs_print():875<info> #1 mlx5_0 - IPoIB addresses:
libfabric:14757:verbs:fabric:verbs_devs_print():885<info> 192.168.2.3
libfabric:14757:verbs:fabric:verbs_devs_print():885<info> fe80::63f:7203:ae:a93e
libfabric:28642:verbs:fabric:vrb_get_device_attrs():617<info> device mlx5_0: first found active port is 1
libfabric:14757:verbs:fabric:vrb_get_device_attrs():617<info> device mlx5_0: first found active port is 1
libfabric:28642:verbs:fabric:vrb_get_device_attrs():617<info> device mlx5_0: first found active port is 1
libfabric:28642:core:core:ofi_register_provider():427<info> registering provider: verbs (111.0)
libfabric:14757:verbs:fabric:vrb_get_device_attrs():617<info> device mlx5_0: first found active port is 1
libfabric:14757:core:core:ofi_register_provider():427<info> registering provider: verbs (111.0)
libfabric:14757:core:core:ofi_register_provider():427<info> registering provider: tcp (111.0)
libfabric:28642:core:core:ofi_register_provider():427<info> registering provider: tcp (111.0)
libfabric:14757:core:core:ofi_register_provider():427<info> registering provider: sockets (111.0)
libfabric:28642:core:core:ofi_register_provider():427<info> registering provider: sockets (111.0)
libfabric:14757:core:core:ofi_register_provider():427<info> registering provider: shm (111.0)
libfabric:28642:core:core:ofi_register_provider():427<info> registering provider: shm (111.0)
libfabric:28642:core:core:ofi_register_provider():427<info> registering provider: ofi_rxm (111.0)
libfabric:14757:core:core:ofi_register_provider():427<info> registering provider: ofi_rxm (111.0)
libfabric:14757:core:core:ofi_register_provider():427<info> registering provider: mlx (1.4)
libfabric:28642:core:core:ofi_register_provider():427<info> registering provider: mlx (1.4)
libfabric:14757:core:core:ofi_register_provider():427<info> registering provider: ofi_hook_noop (111.0)
libfabric:14757:core:core:fi_getinfo_():1117<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:14757:mlx:core:mlx_getinfo():172<info> used inject size = 1024
libfabric:14757:mlx:core:mlx_getinfo():219<info> Loaded MLX version 1.10.0
libfabric:14757:mlx:core:mlx_getinfo():266<warn> MLX: spawn support 0
libfabric:14757:core:core:fi_getinfo_():1144<info> Since mlx can be used, verbs has been skipped. To use verbs, please, set FI_PROVIDER=verbs
libfabric:14757:core:core:fi_getinfo_():1144<info> Since mlx can be used, tcp has been skipped. To use tcp, please, set FI_PROVIDER=tcp
libfabric:14757:core:core:fi_getinfo_():1144<info> Since mlx can be used, sockets has been skipped. To use sockets, please, set FI_PROVIDER=sockets
libfabric:14757:core:core:fi_getinfo_():1144<info> Since mlx can be used, shm has been skipped. To use shm, please, set FI_PROVIDER=shm
libfabric:14757:core:core:fi_getinfo_():1117<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:14757:mlx:core:mlx_getinfo():172<info> used inject size = 1024
libfabric:14757:mlx:core:mlx_getinfo():219<info> Loaded MLX version 1.10.0
libfabric:14757:mlx:core:mlx_getinfo():266<warn> MLX: spawn support 0
libfabric:14757:core:core:fi_getinfo_():1144<info> Since mlx can be used, verbs has been skipped. To use verbs, please, set FI_PROVIDER=verbs
libfabric:14757:core:core:fi_getinfo_():1144<info> Since mlx can be used, tcp has been skipped. To use tcp, please, set FI_PROVIDER=tcp
libfabric:14757:core:core:fi_getinfo_():1144<info> Since mlx can be used, sockets has been skipped. To use sockets, please, set FI_PROVIDER=sockets
libfabric:14757:core:core:fi_getinfo_():1144<info> Since mlx can be used, shm has been skipped. To use shm, please, set FI_PROVIDER=shm
libfabric:14757:mlx:core:mlx_fabric_open():172<info>
libfabric:14757:core:core:fi_fabric_():1397<info> Opened fabric: mlx
libfabric:14757:mlx:core:ofi_check_rx_attr():785<info> Tx only caps ignored in Rx caps
libfabric:14757:mlx:core:ofi_check_tx_attr():883<info> Rx only caps ignored in Tx caps
libfabric:28642:core:core:ofi_register_provider():427<info> registering provider: ofi_hook_noop (111.0)
libfabric:28642:core:core:fi_getinfo_():1117<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:28642:mlx:core:mlx_getinfo():172<info> used inject size = 1024
libfabric:28642:mlx:core:mlx_getinfo():219<info> Loaded MLX version 1.10.0
libfabric:28642:mlx:core:mlx_getinfo():266<warn> MLX: spawn support 0
libfabric:28642:core:core:fi_getinfo_():1144<info> Since mlx can be used, verbs has been skipped. To use verbs, please, set FI_PROVIDER=verbs
libfabric:28642:core:core:fi_getinfo_():1144<info> Since mlx can be used, tcp has been skipped. To use tcp, please, set FI_PROVIDER=tcp
libfabric:28642:core:core:fi_getinfo_():1144<info> Since mlx can be used, sockets has been skipped. To use sockets, please, set FI_PROVIDER=sockets
libfabric:28642:core:core:fi_getinfo_():1144<info> Since mlx can be used, shm has been skipped. To use shm, please, set FI_PROVIDER=shm
libfabric:28642:core:core:fi_getinfo_():1117<info> Found provider with the highest priority mlx, must_use_util_prov = 0
libfabric:28642:mlx:core:mlx_getinfo():172<info> used inject size = 1024
libfabric:28642:mlx:core:mlx_getinfo():219<info> Loaded MLX version 1.10.0
libfabric:28642:mlx:core:mlx_getinfo():266<warn> MLX: spawn support 0
libfabric:28642:core:core:fi_getinfo_():1144<info> Since mlx can be used, verbs has been skipped. To use verbs, please, set FI_PROVIDER=verbs
libfabric:28642:core:core:fi_getinfo_():1144<info> Since mlx can be used, tcp has been skipped. To use tcp, please, set FI_PROVIDER=tcp
libfabric:28642:core:core:fi_getinfo_():1144<info> Since mlx can be used, sockets has been skipped. To use sockets, please, set FI_PROVIDER=sockets
libfabric:28642:core:core:fi_getinfo_():1144<info> Since mlx can be used, shm has been skipped. To use shm, please, set FI_PROVIDER=shm
libfabric:28642:mlx:core:mlx_fabric_open():172<info>
libfabric:28642:core:core:fi_fabric_():1397<info> Opened fabric: mlx
libfabric:28642:mlx:core:ofi_check_rx_attr():785<info> Tx only caps ignored in Rx caps
libfabric:28642:mlx:core:ofi_check_tx_attr():883<info> Rx only caps ignored in Tx caps
libfabric:14757:mlx:core:ofi_check_rx_attr():785<info> Tx only caps ignored in Rx caps
libfabric:14757:mlx:core:ofi_check_tx_attr():883<info> Rx only caps ignored in Tx caps
libfabric:28642:mlx:core:ofi_check_rx_attr():785<info> Tx only caps ignored in Rx caps
libfabric:28642:mlx:core:ofi_check_tx_attr():883<info> Rx only caps ignored in Tx caps
libfabric:14757:mlx:core:mlx_cm_getname_mlx_format():73<info> Loaded UCP address: [307]...
libfabric:28642:mlx:core:mlx_cm_getname_mlx_format():73<info> Loaded UCP address: [307]...
libfabric:28642:mlx:core:mlx_av_insert():179<warn> Try to insert address #0, offset=0 (size=2) fi_addr=0x20eff80
libfabric:14757:mlx:core:mlx_av_insert():179<warn> Try to insert address #0, offset=0 (size=2) fi_addr=0x1b0cfa0
libfabric:14757:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:14757:mlx:core:mlx_av_insert():179<warn> Try to insert address #1, offset=1024 (size=2) fi_addr=0x1b0cfa0
libfabric:14757:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:28642:mlx:core:mlx_av_insert():189<warn> address inserted
libfabric:28642:mlx:core:mlx_av_insert():179<warn> Try to insert address #1, offset=1024 (size=2) fi_addr=0x20eff80
libfabric:28642:mlx:core:mlx_av_insert():189<warn> address inserted
[LOG_CAT_SBGP] libnuma.so: cannot open shared object file: No such file or directory
[LOG_CAT_SBGP] Failed to dlopen libnuma.so. Fallback to GROUP_BY_SOCKET manual.
[LOG_CAT_SBGP] libnuma.so: cannot open shared object file: No such file or directory
[LOG_CAT_SBGP] Failed to dlopen libnuma.so. Fallback to GROUP_BY_SOCKET manual.
The run writes the following to stdout, which I have redirected to a file:
[0] MPI startup(): Intel(R) MPI Library, Version 2021.1 Build 20201112 (id: b9c9d2fc5)
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): Size of shared memory segment (857 MB per rank) * (2 local ranks) = 1715 MB total
[0] MPI startup(): libfabric version: 1.11.0-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): detected mlx provider, set device name to "mlx"
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 1, enable_sep 0, enable_shared_ctxs 0, do_av_insert 1
[0] MPI startup(): addrnamelen: 1024
[0] MPI startup(): Load tuning file: "/opt/intel/oneapi/mpi/2021.1.1/etc/tuning_skx_shm-ofi.dat"
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 153103 hpcvisualization {0,1,2,3,4,5,6,7,8,9,20,21,22,23,24,25,26,27,28,29}
[0] MPI startup(): 1 153104 hpcvisualization {10,11,12,13,14,15,16,17,18,19,30,31,32,33,34,35,36,37,38,39}
[0] MPI startup(): I_MPI_ROOT=/opt/intel/oneapi/mpi/2021.1.1
[0] MPI startup(): I_MPI_MPIRUN=mpirun
[0] MPI startup(): I_MPI_HYDRA_TOPOLIB=hwloc
[0] MPI startup(): I_MPI_INTERNAL_MEM_POLICY=default
[0] MPI startup(): I_MPI_DEBUG=30
Hello World from rank: 1 of 2 total ranks
Hello World from rank: 0 of 2 total ranks
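For reference, a minimal MPI hello-world along the following lines produces exactly this output (just a sketch of the kind of program hello_mpi is, built with the Intel MPI wrapper, e.g. mpiicc hello_mpi.c -o hello_mpi):

/* minimal MPI hello world; prints one line per rank */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this rank's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
    printf("Hello World from rank: %d of %d total ranks\n", rank, size);
    MPI_Finalize();
    return 0;
}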
The following run with
mpirun -check_mpi -n 2 -ppn 1 ./hello_mpi
produces the following on the error channel:
Loading compiler version 2021.1.1
Loading tbb version 2021.1.1
Loading debugger version 10.0.0
Loading compiler-rt version 2021.1.1
Loading dpl version 2021.1.1
Loading oclfpga version 2021.1.1
Loading init_opencl version 2021.1.1
Warning: Intel PAC device is not found.
Please install the Intel PAC card to execute your program on an FPGA device.
Warning: Intel PAC device is not found.
Please install the Intel PAC card to execute your program on an FPGA device.
Loading compiler/2021.1.1
Loading requirement: tbb/latest debugger/latest compiler-rt/latest dpl/latest
/opt/intel/oneapi/compiler/2021.1.1/linux/lib/oclfpga/modulefiles/init_opencl /opt/intel/oneapi/compiler/2021.1.1/linux/lib/oclfpga/modulefiles/oclfpga
Loading mpi version 2021.1.1
Currently Loaded Modulefiles:
1) tbb/latest
2) debugger/latest
3) compiler-rt/latest
4) dpl/latest
5) /opt/intel/oneapi/compiler/2021.1.1/linux/lib/oclfpga/modulefiles/init_opencl
6) /opt/intel/oneapi/compiler/2021.1.1/linux/lib/oclfpga/modulefiles/oclfpga
7) compiler/2021.1.1
8) mpi/2021.1.1
[LOG_CAT_SBGP] libnuma.so: cannot open shared object file: No such file or directory
[LOG_CAT_SBGP] Failed to dlopen libnuma.so. Fallback to GROUP_BY_SOCKET manual.
[LOG_CAT_SBGP] libnuma.so: cannot open shared object file: No such file or directory
[LOG_CAT_SBGP] Failed to dlopen libnuma.so. Fallback to GROUP_BY_SOCKET manual.
[0] INFO: CHECK LOCAL:EXIT:SIGNAL ON
[0] INFO: CHECK LOCAL:EXIT:BEFORE_MPI_FINALIZE ON
[0] INFO: CHECK LOCAL:MPI:CALL_FAILED ON
[0] INFO: CHECK LOCAL:MEMORY:OVERLAP ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_MODIFICATION ON
[0] INFO: CHECK LOCAL:MEMORY:INACCESSIBLE ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_ACCESS OFF
[0] INFO: CHECK LOCAL:MEMORY:INITIALIZATION OFF
[0] INFO: CHECK LOCAL:REQUEST:ILLEGAL_CALL ON
[0] INFO: CHECK LOCAL:REQUEST:NOT_FREED ON
[0] INFO: CHECK LOCAL:REQUEST:PREMATURE_FREE ON
[0] INFO: CHECK LOCAL:DATATYPE:NOT_FREED ON
[0] INFO: CHECK LOCAL:BUFFER:INSUFFICIENT_BUFFER ON
[0] INFO: CHECK GLOBAL:DEADLOCK:HARD ON
[0] INFO: CHECK GLOBAL:DEADLOCK:POTENTIAL ON
[0] INFO: CHECK GLOBAL:DEADLOCK:NO_PROGRESS ON
[0] INFO: CHECK GLOBAL:MSG:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:MSG:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:MSG:PENDING ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:SIZE_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:REDUCTION_OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:ROOT_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:INVALID_PARAMETER ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:COMM_FREE_MISMATCH ON
[0] INFO: maximum number of errors before aborting: CHECK-MAX-ERRORS 1
[0] INFO: maximum number of reports before aborting: CHECK-MAX-REPORTS 0 (= unlimited)
[0] INFO: maximum number of times each error is reported: CHECK-SUPPRESSION-LIMIT 10
[0] INFO: timeout for deadlock detection: DEADLOCK-TIMEOUT 60s
[0] INFO: timeout for deadlock warning: DEADLOCK-WARNING 300s
[0] INFO: maximum number of reported pending messages: CHECK-MAX-PENDING 20
[0] INFO: Error checking completed without finding any problems.
It produces the following relevant output:
Hello World from rank: 0 of 2 total ranks
Hello World from rank: 1 of 2 total ranks
I don't know what changed between the earlier runs and today's, but even a regular run with
mpirun -n 2 -ppn 1 ./hello_mpi > out.txt
also produces the correct output:
Hello World from rank: 1 of 2 total ranks
Hello World from rank: 0 of 2 total ranks
but it prints the following on the error channel. Let me know if this may cause issues in some other run.
[LOG_CAT_SBGP] libnuma.so: cannot open shared object file: No such file or directory
[LOG_CAT_SBGP] Failed to dlopen libnuma.so. Fallback to GROUP_BY_SOCKET manual.
[LOG_CAT_SBGP] libnuma.so: cannot open shared object file: No such file or directory
[LOG_CAT_SBGP] Failed to dlopen libnuma.so. Fallback to GROUP_BY_SOCKET manual.
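In case it helps diagnose the libnuma warning, I can also run a small standalone check on the compute node (just a diagnostic sketch, not part of hello_mpi) that tries to dlopen libnuma the same way the warning suggests, to confirm whether the library is actually missing there:

/* check_numa.c -- hypothetical diagnostic; build with: cc check_numa.c -o check_numa -ldl */
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* try the bare .so first (as in the warning), then the usual runtime soname */
    void *h = dlopen("libnuma.so", RTLD_NOW);
    if (!h)
        h = dlopen("libnuma.so.1", RTLD_NOW);
    if (h) {
        printf("libnuma is present and loadable\n");
        dlclose(h);
    } else {
        printf("libnuma could not be loaded: %s\n", dlerror());
    }
    return 0;
}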
I will report results from the workstation where the original problem exists, following these same steps, later today.
Thanks and Regards,
Keyur