Bug #3200

Segmentation Fault in gmx mdrun in gromacs 2020-beta1 and hwloc 2.0.2

Added by Koushik Choudhury 30 days ago. Updated 7 days ago.

Status: Closed
Priority: Normal
Assignee:
Category: mdrun
Target version:
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

Extracted gromacs-2020-beta1.tar.gz to gromacs-2020-beta1/.
cd gromacs-2020-beta1/
mkdir build
cd ../
Opened build_gromacs.sh and set BASEDIR='/nethome/koushikc', BUILD_VERSIONS='2020-beta1' and GMX_VERSION=${VERSION%%}.
Commented out the following lines in build_gromacs.sh:
# echo $[-d "$SRC_DIR"]
# [ -f "$SRC_TARBALL" ]
# [ -d "$SRC_DIR" ] && rm -rf "$SRC_DIR"
./gromacs.sh
This installed GROMACS with different hardware configurations in 2020-beta1.
mkdir modulefiles
module use ~/modulefiles
cd modulefiles
Created a file named 2020-beta1 and copied the contents of /opt/tcbsys/modulefiles/gromacs/2019.2 into it.
cd ..
Copied get_simd and init_functions from /opt/tcbsys/modulefiles to /nethome/koushikc:
cp /opt/tcbsys/modulefiles/get_simd .
cp /opt/tcbsys/modulefiles/init_functions .
mkdir sysconfig
cd sysconfig
cd
cd /opt/tcbsys/etc/modules/sysconfig/
cp * /nethome/koushikc/sysconfig/
module add 2020-beta1
GROMACS 2020-beta1 is now loaded on the login node.
gmx grompp -f pr_pre_NVT_100K.mdp -c em.gro -r em.gro -p gmx_structure.top -o pr_pre_NVT_100K.tpr
gmx mdrun -s pr_pre_NVT_100K.tpr -deffnm pr_pre_NVT_100K
The CHARMM36 force field was used.
The mdrun step is where I get the segmentation fault.
Attachments:
em.gro (4.55 MB), Koushik Choudhury, 11/06/2019 02:16 PM
gmx_structure.top (1.49 KB), Koushik Choudhury, 11/06/2019 02:16 PM
pr_pre_NVT_100K.mdp (1.47 KB), Koushik Choudhury, 11/06/2019 02:17 PM
pr_pre_NVT_100K.tpr (28.3 MB), Koushik Choudhury, 11/06/2019 02:22 PM
gromacs-2020-beta1.tar.gz (27.3 MB), Koushik Choudhury, 11/06/2019 02:27 PM
build_gromacs.sh (4.44 KB), Koushik Choudhury, 11/06/2019 02:29 PM
pr_pre_NVT_100K.log (3.54 KB), Koushik Choudhury, 11/13/2019 02:32 PM

Associated revisions

Revision 295aafe1 (diff)
Added by Paul Bauer 7 days ago

Check for using correct hwloc headers and runtime

Also add assertion in the code to prevent errors from linking
against the wrong library while running.

Fixes #3200

Change-Id: Ib2f2861702e111f67c38b0c9d65ccbe4c81a0ccd

History

#3 Updated by Eric Irrgang 30 days ago

Beta 2 is now available. If the problem is still present, please include output that would help developers narrow down which code is producing the seg fault. Verbose terminal output is helpful. It is not necessary to attach a copy of the source archive to the issue. If you have altered some of the source code, please just include those files. When attaching archive files to a tracked issue, please describe the contents of the archive and the reason for attaching it.

#4 Updated by Koushik Choudhury 24 days ago

I installed Beta 2 and still get the same problem:

GROMACS: gmx mdrun, version 2020-beta2
Executable: /nethome/koushikc/2020-beta2/AVX2_256/bin/gmx
Data prefix: /nethome/koushikc/2020-beta2/AVX2_256
Working dir: /nethome/koushikc/navab_cryoem_densfit
Command line:
gmx mdrun -s pr_pre_NVT_100K.tpr -deffnm pr_pre_NVT_100K

Back Off! I just backed up pr_pre_NVT_100K.log to ./#pr_pre_NVT_100K.log.5#
Segmentation fault (core dumped)

#5 Updated by Paul Bauer 23 days ago

  • Target version changed from 2020-beta2 to 2020-beta3

I just tried this with the current release-2020 branch and a debug build, but did not trigger the crash.
Can you upload a log file from the run? I need to know whether this run used any GPU features.

#7 Updated by Paul Bauer 23 days ago

It also does not cause a segfault in a release build.

#8 Updated by Koushik Choudhury 23 days ago

Thanks. I think then there must be some problem in the way I compiled it on the cluster.

#9 Updated by Paul Bauer 23 days ago

still waiting on the CUDA build on my other machine, then we will know

#10 Updated by Paul Bauer 21 days ago

  • Status changed from New to Blocked, need info

also not reproduced with the CUDA build

#11 Updated by Koushik Choudhury 21 days ago

Thanks again. I think there must be something wrong in the way I compiled it on the cluster as it runs perfectly on my PC.

#12 Updated by Yuxuan Zhuang 20 days ago

Koushik Choudhury wrote:

Thanks again. I think there must be something wrong in the way I compiled it on the cluster as it runs perfectly on my PC.

I reproduced the problem when I module load hwloc/2.0.2 as stated in the /opt/tcbsys/gromacs/build_gromacs.sh, while the default hwloc version 1.11.11 does not raise any error.

#13 Updated by Koushik Choudhury 19 days ago

Yuxuan Zhuang wrote:

Koushik Choudhury wrote:

Thanks again. I think there must be something wrong in the way I compiled it on the cluster as it runs perfectly on my PC.

I reproduced the problem when I module load hwloc/2.0.2 as stated in the /opt/tcbsys/gromacs/build_gromacs.sh, while the default hwloc version 1.11.11 does not raise any error.

Thank you, Yuxuan. I compiled with hwloc version 1.11.11 and it is working fine now.
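
A quick way to see which hwloc a given build resolves at run time is to inspect the shared libraries of the installed binary. The sketch below assumes a dynamically linked gmx on a Linux system with ldd available; the helper name hwloc_libs is made up for illustration and is not part of GROMACS or hwloc.

```shell
# hwloc_libs: read `ldd` output on stdin and keep only the hwloc entries.
# (Hypothetical helper, not part of GROMACS or hwloc.)
hwloc_libs() {
    grep -i 'libhwloc'
}

# Inspect the gmx currently on PATH, if there is one.
if command -v gmx >/dev/null 2>&1; then
    ldd "$(command -v gmx)" | hwloc_libs || echo "no hwloc library in ldd output"
fi
```

Running this once with the hwloc/2.0.2 module loaded and once without it shows whether the resolved library path changes between the two environments.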

#14 Updated by Paul Bauer 18 days ago

  • Subject changed from Segmentation Fault in gmx mdrun in gromacs 2020-beta1 to Segmentation Fault in gmx mdrun in gromacs 2020-beta1 and hwloc 2.0.2
  • Status changed from Blocked, need info to Accepted
  • Assignee set to Paul Bauer

Thanks @Yuxuan for identifying that the issue is only present with hwloc 2.0.2

#15 Updated by Paul Bauer 11 days ago

ok, for some reason hwloc returns 0x01 as the first child of the hwloc object
@Erik or @Kevin, any idea about this?

#16 Updated by Paul Bauer 10 days ago

I debugged some more and could see that the value of the first_child is set to 0x1 in the hardware topology, but I can't find anything in the documentation that indicates what the value means

#17 Updated by Paul Bauer 10 days ago

This has to be a bug somewhere in the reading/initialization: the number of child objects (arity) is 3948113664.

#18 Updated by Paul Bauer 9 days ago

ok, the issue seems to be that GMX_HWLOC_API_VERSION_IS_2XX is not set correctly, leading to the wrong function being called for setting the flags

#19 Updated by Paul Bauer 9 days ago

Yes, this is a mismatch between the hwloc headers used at compile time and the hwloc library loaded at run time, likely because the build doesn't pick up the correct version during the compile stage.
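
This kind of mismatch can be looked for from the shell by reading the API version recorded in the hwloc.h that the compiler sees and comparing its major number with that of the installed library. This is a rough sketch, not the check added in the fix: it assumes hwloc.h defines HWLOC_API_VERSION as a hex constant of the form 0x00XXYYZZ (major number in XX), and both function names are invented here.

```shell
# header_api_version DIR: print the HWLOC_API_VERSION value defined in DIR/hwloc.h.
# (Hypothetical helper; assumes a line of the form "#define HWLOC_API_VERSION 0x...".)
header_api_version() {
    awk '$1 == "#define" && $2 == "HWLOC_API_VERSION" { print $3; exit }' "$1/hwloc.h"
}

# api_major HEX: print the major number of a version encoded as 0x00XXYYZZ.
api_major() {
    echo $(( $1 >> 16 ))
}

api_major 0x00020000   # prints 2 (a 2.x header)
api_major 0x00010b00   # prints 1 (a 1.11 header)
```

Calling header_api_version on each include directory in the compiler's search order, for example /usr/include and the include directory of the loaded module, shows which header wins. If its major number differs from that of the library the binary loads, the build is affected.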

#20 Updated by Paul Bauer 8 days ago

  • Status changed from Accepted to Fix uploaded

The proposed fix is here:
https://gerrit.gromacs.org/c/gromacs/+/14501

You'll need to specify the correct hwloc search path if the environment has the include directory both under the default system include and through the module loaded afterwards.
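
For the search-path point above, a configure invocation might look like the following. This is only a sketch: the hwloc prefix is a placeholder, GMX_HWLOC is assumed to be the GROMACS switch that enables hwloc support, and CMAKE_PREFIX_PATH is the generic CMake variable for extra search prefixes rather than anything specific to the fix. The command is printed instead of executed so it can be reviewed first.

```shell
# Hypothetical install prefix of the hwloc module; adjust to the cluster layout.
HWLOC_PREFIX="${HWLOC_PREFIX:-/path/to/hwloc-2.0.2}"

# gmx_configure_cmd SRC PREFIX: print a cmake command line that points the
# configure step at one specific hwloc installation. (Illustrative helper.)
gmx_configure_cmd() {
    printf 'cmake %s -DGMX_HWLOC=ON -DCMAKE_PREFIX_PATH=%s\n' "$1" "$2"
}

gmx_configure_cmd .. "$HWLOC_PREFIX"
```

Running the printed command in a fresh build directory avoids reusing hwloc paths cached by an earlier configure run.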

#21 Updated by Paul Bauer 7 days ago

  • Status changed from Fix uploaded to Resolved

#22 Updated by Paul Bauer 7 days ago

  • Status changed from Resolved to Closed
