Bug #3191

TPR file reading from char buffer fails if file is generated on different (32 vs 64 bit) architecture

Added by Szilárd Páll 5 months ago. Updated 5 months ago.

Status:
Closed
Priority:
High
Assignee:
-
Category:
analysis tools
Target version:
Affected version - extra info:
likely everything on master since the tpr file reading changes got merged
Affected version:
Difficulty:
uncategorized

Description

Both in post-submit on bs_jetson_tk1 and on an i686 VM, gmx check failed during the complex/nst_mismatch regression test.
Link to the former: http://jenkins.gromacs.org/job/Matrix_PostSubmit_master/2040/

Backtrace of the latter:

GROMACS:      gmx check, version 2020-beta1-dev-20191030-7aad04af4
Executable:   /home/pszilard/gromacs/build/bin/gmx
Data prefix:  /home/pszilard/gromacs (source tree)
Working dir:  /home/pszilard/gromacs/build/tests/regressiontests-master-22e4ae4/complex/nst_mismatch
Command line:
  gmx check -s1 ./reference_s.tpr -s2 topol.tpr -tol 0.0001 -abstol 0.001

Note: When comparing run input files, default tolerances are reduced.
Reading file ./reference_s.tpr, VERSION 2020-beta1-dev-20191029-c2e5f57-unknown (single precision)

Program received signal SIGSEGV, Segmentation fault.
__memcpy_sse2_unaligned () at ../sysdeps/i386/i686/multiarch/memcpy-sse2-unaligned.S:578
578     ../sysdeps/i386/i686/multiarch/memcpy-sse2-unaligned.S: No such file or directory.
(gdb) bt
#0  __memcpy_sse2_unaligned () at ../sysdeps/i386/i686/multiarch/memcpy-sse2-unaligned.S:578
#1  0xb68eb5ae in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_S_copy_chars(char*, char const*, char const*) ()
   from /usr/lib/i386-linux-gnu/libstdc++.so.6
#2  0xb711fc18 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char const*> (this=0xbfffde14,
    __beg=0x44fe7d "\b", __end=0x4344fe7d <error: Cannot access memory at address 0x4344fe7d>) at /usr/include/c++/7/bits/basic_string.tcc:225
#3  0xb68ee8ac in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, unsigned int, std::allocator<char> const&) () from /usr/lib/i386-linux-gnu/libstdc++.so.6
#4  0xb70f3410 in gmx::InMemoryDeserializer::Impl::doString (this=0x44e9b0, value=0xbfffdea4)
    at /home/pszilard/gromacs/src/gromacs/utility/inmemoryserializer.cpp:209
#5  0xb70f2830 in gmx::InMemoryDeserializer::doString (this=0xbfffe19c, value=0xbfffdea4)
    at /home/pszilard/gromacs/src/gromacs/utility/inmemoryserializer.cpp:311
#6  0xb7930a58 in do_symtab (serializer=0xbfffe19c, symtab=0xbfffe538) at /home/pszilard/gromacs/src/gromacs/fileio/tpxio.cpp:2418
#7  0xb7931348 in do_mtop (serializer=0xbfffe19c, mtop=0xbfffe3cc, file_version=119) at /home/pszilard/gromacs/src/gromacs/fileio/tpxio.cpp:2570
#8  0xb79322ef in do_tpx_mtop (serializer=0xbfffe19c, tpx=0xbfffe2d0, mtop=0xbfffe3cc) at /home/pszilard/gromacs/src/gromacs/fileio/tpxio.cpp:2877
#9  0xb7932bc8 in do_tpx_body (serializer=0xbfffe19c, tpx=0xbfffe2d0, ir=0x44eb10, state=0xbfffe6ec, x=0x0, v=0x0, mtop=0xbfffe3cc)
    at /home/pszilard/gromacs/src/gromacs/fileio/tpxio.cpp:3114
#10 0xb7933419 in completeTprDeserialization (partialDeserializedTpr=0xbfffe2d0, ir=0x44eb10, state=0xbfffe6ec, x=0x0, v=0x0, mtop=0xbfffe3cc)
    at /home/pszilard/gromacs/src/gromacs/fileio/tpxio.cpp:3349
#11 0xb7932f4a in readTpxBody (tpx=0xbfffe380, serializer=0xbfffe2c8, ir=0x44eb10, state=0xbfffe6ec, x=0x0, v=0x0, mtop=0xbfffe3cc)
    at /home/pszilard/gromacs/src/gromacs/fileio/tpxio.cpp:3245
#12 0xb7933559 in read_tpx_state (fn=0x44ea30 "./reference_s.tpr", ir=0x44eb10, state=0xbfffe6ec, mtop=0xbfffe3cc)
    at /home/pszilard/gromacs/src/gromacs/fileio/tpxio.cpp:3373
#13 0xb7638ecc in comp_tpx (fn1=0x44ea30 "./reference_s.tpr", fn2=0x44ea58 "topol.tpr", bRMSD=false, ftol=9.99999975e-05, abstol=0.00100000005)
    at /home/pszilard/gromacs/src/gromacs/tools/check.cpp:107
#14 0xb763c4c8 in gmx_check (argc=9, argv=0xbfffef58) at /home/pszilard/gromacs/src/gromacs/tools/check.cpp:848
#15 0xb7241325 in gmx::(anonymous namespace)::CMainCommandLineModule::run (this=0x4429e0, argc=9, argv=0xbfffef58)
    at /home/pszilard/gromacs/src/gromacs/commandline/cmdlinemodulemanager.cpp:133
#16 0xb7243356 in gmx::CommandLineModuleManager::run (this=0xbfffee78, argc=9, argv=0xbfffef58)
    at /home/pszilard/gromacs/src/gromacs/commandline/cmdlinemodulemanager.cpp:589
#17 0x0041037d in main (argc=10, argv=0xbfffef54) at /home/pszilard/gromacs/src/programs/gmx.cpp:60

Associated revisions

Revision ff3c92df (diff)
Added by Paul Bauer 5 months ago

Use type of explicit size in inmemoryserializer

Prevents issues where memory buffers generated on one architecture are
read in on a different one. Changes the type from size_t to uint64_t.

Fixes #3191

Change-Id: Id1ab0e15f8a2045d0cab45bc61893c28de0a181c
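The idea behind the commit message above can be illustrated with a minimal sketch. This is not the code from inmemoryserializer.cpp: the helper names writeString and readString are hypothetical, and byte order is deliberately ignored to keep the example short. It only shows the principle that a length prefix stored as uint64_t occupies 8 bytes on every architecture, whereas size_t is typically 4 bytes on 32-bit and 8 bytes on 64-bit builds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not the GROMACS API): append a string to a byte
// buffer, prefixed by its length stored as a fixed-width 64-bit integer.
// Byte order is left native here; a portable format would also fix that.
void writeString(std::vector<char>* buffer, const std::string& value)
{
    const std::uint64_t length = value.size(); // 8 bytes on any architecture
    const char*         raw    = reinterpret_cast<const char*>(&length);
    buffer->insert(buffer->end(), raw, raw + sizeof(length));
    buffer->insert(buffer->end(), value.begin(), value.end());
}

// Hypothetical helper: read one string starting at *pos and advance *pos.
std::string readString(const std::vector<char>& buffer, std::size_t* pos)
{
    std::uint64_t length = 0;
    assert(*pos + sizeof(length) <= buffer.size());
    std::memcpy(&length, buffer.data() + *pos, sizeof(length));
    *pos += sizeof(length);
    // Reject lengths that would run past the end of the buffer.
    assert(length <= buffer.size() - *pos);
    std::string value(buffer.data() + *pos, static_cast<std::size_t>(length));
    *pos += static_cast<std::size_t>(length);
    return value;
}
```

Because both sides agree on an 8-byte prefix, a buffer written with writeString can be walked by readString regardless of the width of size_t on the reading machine. The bounds check before constructing the string is an addition of this sketch, not something the commit message mentions.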

History

#1 Updated by Paul Bauer 5 months ago

Some more observations from running on the ARM7 build host:
- The exception occurs in the same routine in position 125 on the read buffer
- Running grompp under gdb and trapping the write routine on long strings does not result in anything
- Running grompp under valgrind only gives the following

Ignoring obsolete mdp entry 'ns_type'
==2054== Invalid write of size 4
==2054==    at 0x6185B8A: ??? (in /lib/arm-linux-gnueabihf/libgcc_s.so.1)
==2054==  Address 0xbde5a3b8 is on thread 1's stack
==2054==  16 bytes below stack pointer
==2054== 
==2054== Conditional jump or move depends on uninitialised value(s)
==2054==    at 0x6187418: __udivmoddi4 (in /lib/arm-linux-gnueabihf/libgcc_s.so.1)
==2054== 
==2054== Use of uninitialised value of size 4
==2054==    at 0x618741A: __udivmoddi4 (in /lib/arm-linux-gnueabihf/libgcc_s.so.1)
==2054== 
==2054== Use of uninitialised value of size 4
==2054==    at 0x6185B9E: ??? (in /lib/arm-linux-gnueabihf/libgcc_s.so.1)
==2054== 
Generated 330891 of the 330891 non-bonded parameter combinations
Generating 1-4 interactions: fudge = 0.5
Generated 330891 of the 330891 1-4 parameter combinations
Excluding 2 bonded neighbours molecule type 'SOL'
Number of degrees of freedom in T-Coupling group System is 1293.00
Estimate for the relative computational load of the PME mesh part: 0.24

Back Off! I just backed up topol.tpr to ./#topol.tpr.5#

GROMACS reminds you: "Wait a Minute, aren't You.... ? (gunshots) Yeah." (Bodycount)

turning H bonds into constraints...
Analysing residue names:
There are:   216      Water residues
Calculating fourier grid dimensions for X Y Z
Using a fourier grid of 16x16x16, spacing 0.116 0.116 0.116
This run will generate roughly 0 Mb of data
==2054== 
==2054== HEAP SUMMARY:
==2054==     in use at exit: 20,572 bytes in 41 blocks
==2054==   total heap usage: 712,960 allocs, 712,919 frees, 293,421,616 bytes allocated
==2054== 
==2054== LEAK SUMMARY:
==2054==    definitely lost: 0 bytes in 0 blocks
==2054==    indirectly lost: 0 bytes in 0 blocks
==2054==      possibly lost: 0 bytes in 0 blocks

I'll investigate the invalid write but I'm not sure what I can find out.

#2 Updated by Paul Bauer 5 months ago

Maybe related, maybe not, but running an older version with the newer TPR files gives the following failure:

Error in user input:
Unknown input values:
  /applied-forces/density-guided-simulation/adaptive-force-scaling

/applied-forces/density-guided-simulation/adaptive-force-scaling-time-constant

#3 Updated by Paul Bauer 5 months ago

Reverting to the state before the changes that moved TPR serialization from pure XDR reading to reading from the in-memory buffer makes this work again.

#4 Updated by Paul Bauer 5 months ago

I'm bisecting now to see where exactly the error first appeared.

#5 Updated by Paul Bauer 5 months ago

So, another data point is now that if I generate the reference_s.tpr file on the arm machine at commit 3836f52778fe36d11e511c6c55a31b6cfb2375fc (or at master HEAD), then I can compare the files without issue.

So the issue is with reading, on a 32-bit machine, a reference file that was produced on a 64-bit architecture. Somehow this doesn't show up in the other tests.

#6 Updated by Paul Bauer 5 months ago

Confirmed: the issue is that files generated on 64-bit architectures cannot be read on 32-bit ones.
I guess all the other TPR files are old enough not to trigger the new TPR reading routines that read them in from the buffer, and the XDR reading code takes care of all of this under the hood.
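To make the failure mode concrete, here is a small simulation of a reader and writer that disagree on the width of a length prefix. It is a sketch under stated assumptions: the helper names are hypothetical, the strings "SOL" and "NA" are arbitrary, and little-endian byte order is chosen only for illustration (the actual buffer layout used by GROMACS may differ). The writer uses 8-byte prefixes, as a 64-bit build storing size_t would, and the reader width is a parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using Buffer = std::vector<unsigned char>;

// Hypothetical helper: append `value` as `width` bytes, least significant
// byte first (little-endian is an assumption made for this illustration).
void appendLength(Buffer* buffer, std::uint64_t value, std::size_t width)
{
    for (std::size_t i = 0; i < width; ++i)
    {
        buffer->push_back(static_cast<unsigned char>((value >> (8 * i)) & 0xFF));
    }
}

// Hypothetical helper: read a `width`-byte length at *pos and advance *pos.
std::uint64_t readLength(const Buffer& buffer, std::size_t* pos, std::size_t width)
{
    assert(*pos + width <= buffer.size());
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < width; ++i)
    {
        value |= static_cast<std::uint64_t>(buffer[*pos + i]) << (8 * i);
    }
    *pos += width;
    return value;
}

// Write "SOL" and "NA" with 8-byte length prefixes, then return the length
// of the second string as seen by a reader expecting `readerWidth`-byte
// prefixes.
std::uint64_t secondLengthSeenBy(std::size_t readerWidth)
{
    const std::vector<std::string> names = { "SOL", "NA" };
    Buffer                         buffer;
    for (const std::string& name : names)
    {
        appendLength(&buffer, name.size(), 8);
        buffer.insert(buffer.end(), name.begin(), name.end());
    }
    std::size_t         pos         = 0;
    const std::uint64_t firstLength = readLength(buffer, &pos, readerWidth);
    pos += static_cast<std::size_t>(firstLength); // skip the first string
    return readLength(buffer, &pos, readerWidth);
}
```

With a matching reader, secondLengthSeenBy(8) returns 2. With a 4-byte reader, secondLengthSeenBy(4) returns 0x4C4F5300: the reader has lost its place in the buffer and interprets a padding byte plus the characters 'S', 'O', 'L' as a length of more than a gigabyte. Constructing a string of that size from the buffer would read far past its end, which is the same kind of failure as the inaccessible __end pointer in the backtrace above.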

#7 Updated by Paul Bauer 5 months ago

  • Subject changed from gmx check segfaults on complex/nst_mismatch to TPR file reading from char buffer fails if file is generated on different (32 vs 64 bit) architecture

#8 Updated by Paul Bauer 5 months ago

  • Priority changed from Normal to High
  • Target version set to 2020-beta2
  • Affected version - extra info set to likely everything on master since the tpr file reading changes got merged

#9 Updated by Paul Bauer 5 months ago

  • Status changed from New to Fix uploaded

#10 Updated by Paul Bauer 5 months ago

  • Status changed from Fix uploaded to Resolved

#11 Updated by Paul Bauer 5 months ago

  • Status changed from Resolved to Closed
