Feature #1064

Optimised kernel for BlueGene/Q

Added by Chris Samuel almost 7 years ago. Updated almost 6 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: mdrun
Target version:
Difficulty: uncategorized

Description

Hi there,

Here at VLSCI at the University of Melbourne we have a 4 rack BlueGene/Q and quite a lot of GROMACS users who will not fit onto our Intel HPC systems.

We are very keen to see a BG/Q optimised kernel for GROMACS (it is really quite important for us), please let us know if there is any way we can help!

All the best
Chris

Associated revisions

Revision 7af9679b (diff)
Added by Mark Abraham almost 7 years ago

Set CMake module search path at a better time

This needs to be set before project() so that
toolchain files for cross-compiling can use the same
search path. Otherwise, the user has to specify
fully-qualified toolchain files.

Refs #1064

Change-Id: I1f1fb4803c9f0dd2b95d236da1c1858c5ea382f9

Revision 4f96586b (diff)
Added by Mark Abraham almost 7 years ago

Re-organize BlueGene toolchain files

Rename /L and /P toolchain files. Add /Q files from
http://www.cmake.org/Bug/view.php?id=13512

Refs #1064

Change-Id: I7492737be936e59dce6217c41986316b88ec1e06

Revision 994af5d7 (diff)
Added by Mark Abraham almost 7 years ago

Manage BlueGene/Q configuration

  • add toolchain
  • set up to use IBM QPX accelerated kernels
  • check the compiler will compile the kernels

Refs #1064

Change-Id: I97ad3c3c96c4c3b0c59318c181f66972ebd9a903

History

#1 Updated by Berk Hess almost 7 years ago

  • Category set to mdrun
  • Assignee set to Mark Abraham
  • Priority changed from High to Normal

As the new Verlet x86 SIMD kernels are written with SIMD intrinsics macros, adding BG/Q macro definitions to include/gmx_x86_simd_macros.h should give nearly complete BG/Q SIMD kernels. There is a bit more work to do in the infrastructure around this, but probably not much. People at EPCC in Edinburgh have expressed interest in implementing this, but a final decision has not been taken yet. All help is welcome, but we should coordinate the efforts to avoid duplicating work.

#2 Updated by Mark Abraham almost 7 years ago

Yes, I've had some email discussion with Simon Wail and others of IBM Australia.

A good first thing would be for someone to grab a copy of the latest beta (http://www.gromacs.org/Downloads) and see if it builds and runs. I expect so, but haven't had time to check. Some feedback on how best to tweak the CMake invocation to work easily (and/or a toolchain file for BlueGene/Q) would be very welcome.
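
A sketch of the kind of invocation that should eventually work, assuming the BlueGeneQ-static-XL-C toolchain file that the associated commits above add (it did not ship with the betas), and with a placeholder path standing in for an FFTW cross-compiled for the compute nodes:

tar xzf gromacs-4.6-beta*.tar.gz
mkdir build-bgq && cd build-bgq
# cross-compile an MPI-enabled build for the compute nodes
cmake ../gromacs-4.6-beta* \
  -DCMAKE_TOOLCHAIN_FILE=BlueGeneQ-static-XL-C \
  -DGMX_MPI=ON \
  -DCMAKE_PREFIX_PATH=/path/to/fftw-for-compute-nodes
make -j 16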

I expect the new "Verlet" kernels to be of most interest on BlueGene/Q because of their intrinsically better scaling properties; however, your user base will all be using the old-style "group" kernels for now. Ideally, the porting work would cover both, but frankly, if the group kernels were never ported I expect I would shed no tears. Users simulating small systems would be the main demand case for efficient group kernels, but they should either
  • not be on BlueGene, or
  • be structuring their work to replicate simulations on subsets of a midplane using GROMACS multi-simulations (or the equivalent BlueGene features)

and in either case offering only the Verlet kernels is fine. However, educating users on how best to use the system is more complex, for the above reasons.

I'd guess that, given the MPMD nature of both kinds of kernels, they would probably work best with 64 OpenMP threads across each compute card (4 hardware threads on each of its 16 cores); however, we have no plans to offer OpenMP with the group kernels and wouldn't encourage anybody else to try. Still, depending on how flexible the OpenMP+MPI setup is, a work-around for group kernels might exist. In (say) a mid-plane-sized job on 16*32 compute cards, we might like to have 4*32 MPI processes for the PME set (each running 64 hardware threads) and (16-4)*32*64 MPI processes for the group-kernel set (each running 1 hardware thread). You might like to find out whether this kind of thing can be done, as part of finding out how best to support users in using the BG/Q well. If so, the case for porting the group kernels to get the SIMD benefit becomes clearer.
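
As a concrete point of reference, a uniform hybrid launch on a midplane might look like the sketch below. It only expresses one rank/thread count per node - the asymmetric PME/PP split described above would need some other mechanism - and the rank counts, file name and PME-rank number are illustrative; the runjob flags are the ones Chris Neale uses later in this thread.

# 512 nodes (one midplane), 16 MPI ranks per node, 4 OpenMP threads per rank,
# with a quarter of the ranks dedicated to PME (all numbers illustrative)
runjob --np 8192 --ranks-per-node=16 --envs OMP_NUM_THREADS=4 : \
  mdrun_mpi -deffnm md -npme 2048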

#3 Updated by Chris Samuel almost 7 years ago

Thanks Berk and Mark for the updates - we've been a bit stretched recently with system issues and people being ill but we do plan on trying out the 4.6 betas on our BG/Q.

We've found with 4.5 that we get best performance with 64 MPI tasks per node for the benchmark problem, so it'll be interesting to see what happens with the hybrid mode.

#4 Updated by Mark Abraham almost 7 years ago

Did some fixing to make sure BlueGene/Q will now build. Will organize regression testing tomorrow. Still just plain C kernels.

#5 Updated by Mark Abraham almost 7 years ago

Plain C kernels now pass tests on BG/Q.

Erik has done some BG/Q kernel work, which I hope to test out on Juelich this evening.

#6 Updated by Chris Samuel almost 7 years ago

Sounds promising. I've just found the Gerrit instance that you use, and this appears to be the changeset you're referring to - is that correct?

https://gerrit.gromacs.org/#/c/1993/

In the beta2 announcement you mentioned that the regression tests for 4.6 would be around in a week, and I can see two commits that mention 4.6; are you happy for us to use those for testing on BG/Q now?

#7 Updated by Erik Lindahl almost 7 years ago

Hi Chris,

For reference: the QPX instructions on BG/Q are somewhat similar to Altivec/VMX (with which I have prior experience), but since the BG/Q in Juelich is down for a couple of days, we have not been able to compile that change yet. I would expect it to fail with a long list of errors (since there are ~100 generated kernels), but the second I have that list it should be very quick to get it to compile, and then I expect to debug it in 2-3 days given BG/Q access.

You are more than welcome to try it and provide that list of errors as feedback earlier if you want to, but don't expect it to work - yet!

#8 Updated by Chris Samuel almost 7 years ago

Hi Erik,

Would you like access to our BG/Q? Admittedly it's not as big as JuQueen but I've just got the OK to give you access. We run Slurm rather than LL but for now I guess it's the compilers you're after.

If so then I just need your email address.

cheers!
Chris

#9 Updated by Erik Lindahl almost 7 years ago

Hi Chris,

Sure, if it's not a significant amount of work for you - we would only need access for a couple of days. Basically, we won't need any significant CPU time, but I need to run a couple of compiles and then a handful of minute-long single-node jobs in order to actually execute the code on the PowerPC A2 processors (or on interactive nodes, if there are any).

Cheers,

Erik

#10 Updated by Chris Samuel almost 7 years ago

Not a problem, your invitation to join our guest user project is on its way via email now. We've given that project 100 CPU hours on Avoca (our BG/Q) for this as a start. Good luck!

#11 Updated by Mark Abraham almost 7 years ago

There's a bunch of support machinery needed before compilation and testing can work. I have all that complete, but it got stranded on Juqueen's front end when they took it offline for upgrades :-) I'll rebase+update https://gerrit.gromacs.org/#/c/1993 so things can work.

#12 Updated by Mark Abraham almost 7 years ago

  • Status changed from New to In Progress
  • Target version changed from 4.6 to 4.6.1

#13 Updated by Chris Samuel over 6 years ago

Mark Abraham wrote:

There's a bunch of support machinery needed before compilation and testing can work. I have all that complete, but it got stranded on Juqueen's front end when they took it offline for upgrades :-) I'll rebase+update https://gerrit.gromacs.org/#/c/1993 so things can work.

Hi Mark,

I've been queried by our users about progress on this again, any news please?

thanks!
Chris

#14 Updated by Chris Samuel over 6 years ago

I noticed that GROMACS 4.6.1 has appeared for download (with a release date a few weeks in the future, can I borrow your TARDIS? :-) ), did BG/Q support make it in by some chance?

BTW: I just went to look on Gerrit and was told its certificate had expired:

gerrit.gromacs.org uses an invalid security certificate.

The certificate expired on 14/02/13 17:30. The current time is 18/03/13 10:35.

(Error code: sec_error_expired_certificate)

#15 Updated by Mark Abraham over 6 years ago

  • Assignee deleted (Mark Abraham)
  • Target version deleted (4.6.1)

There are still plans for BG/Q kernels. I rate the chances of Verlet kernels being written much higher than group kernels.

#16 Updated by Mark Abraham about 6 years ago

I have BlueGene/Q Verlet kernels nearly ready for some beta testing. The implementation is already correct, but I am still tweaking the compiler to do reasonable things with the kernels. If there's a user keen to do some beta testing, that would be great.

#17 Updated by Mark Abraham about 6 years ago

  • Assignee set to Mark Abraham
  • Target version set to 4.6.4

#18 Updated by Jeff Hammond about 6 years ago

ALCF would be happy to beta test. Just tell me the git branch to checkout and I'll get some users to try it out.

Thanks!

Jeff

#19 Updated by Chris Samuel about 6 years ago

Mark Abraham wrote:

I have BlueGene/Q Verlet kernels nearly ready for some beta testing.
The implementation is already correct, but I am still tweaking the
compiler to do reasonable things with the kernels. If there's a user
keen to do some beta testing, that would be great.

We've got an internal RT ticket tracking this and I bet the users on
that would love to help out!

Just tried to refresh my local clone but am getting:

Fetching origin
fatal: remote error: access denied or repository not exported: /gromacs.git
error: Could not fetch origin

My URL is:

url = git://git.gromacs.org/gromacs.git

Thanks Mark!

#20 Updated by Mark Abraham about 6 years ago

It's not yet in our repos, and will live on http://gerrit.gromacs.org while being reviewed and beta-tested.

We had to do some maintenance on git.gromacs.org yesterday, which probably explains Chris's observation. Will fix that also.

#21 Updated by Mark Abraham about 6 years ago

Mark Abraham wrote:

It's not yet in our repos, and will live on http://gerrit.gromacs.org while being reviewed and beta-tested.

Early performance results: at the PME load-balance point, twice as fast as the old plain C kernels using 512 cores with 1 MPI rank per core and 4 OpenMP threads per core. This may not be the best configuration. Note this is only with the Verlet cut-off scheme (so far), but group kernels may be on the table soon as well.

We had to do some maintenance on git.gromacs.org yesterday, which probably explains Chris's observation. Will fix that also.

Should be fixed.

#22 Updated by Chris Samuel about 6 years ago

Mark Abraham wrote:

It's not yet in our repos, and will live on http://gerrit.gromacs.org while being reviewed and beta-tested.

Any news? I've got users who are very interested in testing this. Thanks for sorting git out too!

#23 Updated by Mark Abraham about 6 years ago

It's now ready to share and test. You can check out the development branch from https://gerrit.gromacs.org/#/c/2572/ (not git.gromacs.org). Sorry for the delay, but there was a three-part problem with neighbour searching in double precision. Don't ask!
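
One way to fetch that change for testing (a sketch only: it assumes the project is published as "gromacs" on that Gerrit instance, and the patch-set number below is a placeholder - check the change page for the current one):

git clone git://git.gromacs.org/gromacs.git
cd gromacs
# Gerrit publishes changes under refs/changes/<last two digits>/<change number>/<patch set>
git fetch https://gerrit.gromacs.org/gromacs refs/changes/72/2572/1
git checkout -b bgq-verlet FETCH_HEAD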

Note that this new code only provides accelerated kernels for the Verlet cut-off scheme (new in 4.6), in particular because its scaling is better, and full OpenMP support is only available there. There are draft group cut-off scheme kernels (https://gerrit.gromacs.org/#/c/1993/), but I do not foresee anyone working on them any time soon.

The GROMACS regression tests pass, and as far as I know everything works, but the coverage of those tests is not perfect. Beta testing of this patch in the lead-up to 4.6.4 will be very important.

There are a few minor outstanding question marks, all of which are probably only relevant for runs where PME-PP load balance is good and the run is long enough for DD load balancing to occur (thousands of steps, depending on the simulation system). Using mdrun -resetstep is a good way to measure performance only after the DD load has been balanced (see mdrun -h -hidden). A combined sketch follows the list below.
  • Setting the environment variable GMX_DD_SENDRECV2=1 might be slightly faster than the default - if we observe a clear improvement, I can hard-code that one
  • Likewise GMX_NO_NODECOMM=1
  • I expect that none of the standard process mappings will be optimal for either mdrun -ddorder interleave or -ddorder cartesian, but the difference will be moot until the compute load is nearly balanced
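
Putting those together, a benchmarking run might look something like this sketch (rank and thread counts, the -deffnm name and the reset step are illustrative; on BG/Q the environment variables have to reach the compute nodes via runjob --envs):

# compare timings with and without the two GMX_* variables, resetting the
# performance counters only after DD load balancing has had time to settle
runjob --np 512 --ranks-per-node=16 \
  --envs OMP_NUM_THREADS=4 GMX_DD_SENDRECV2=1 GMX_NO_NODECOMM=1 : \
  mdrun_mpi -deffnm md -ddorder interleave -resetstep 5000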

If someone wants to use g_tune_pme, then I expect that https://gerrit.gromacs.org/#/c/2554/ will be required for the front-end build.

Feedback is most welcome - general stuff might go here. If you want to comment on particular code, you can register an OpenID on gerrit.gromacs.org and participate in the review there. Thanks!

#24 Updated by Mark Abraham about 6 years ago

At this stage, I would recommend using at most 16 MPI ranks per node, and at least 2 OpenMP threads per core. The optimal point will depend on the extent to which the load of the simulation system can be balanced (in particular, with constraints), and the raw number of atoms. I don't yet have a real handle on the atoms-per-core limit in the regime of good load balance.

#25 Updated by Chris Samuel about 6 years ago

Hi Mark,

Thanks for all that, very much appreciated! I've given our users a heads up and I'll arrange a build for them to test.

cheers,
Chris

#26 Updated by Chris Neale about 6 years ago

Can either of you provide me with a compilation script? I've been unable to get it to work:
http://gromacs.5086.x6.nabble.com/BGQ-compilation-with-verlet-kernels-include-file-quot-kernel-impl-h-quot-not-found-td5011259.html

Thank you,
Chris.

#27 Updated by Chris Neale about 6 years ago

Thanks to Mark, I was able to compile as below. Note that the non-MPI compile had errors.

module purge
module load vacpp/12.1 xlf/14.1 mpich2/xl
module load cmake/2.8.8
module load fftw/3.3.2
export FFTW_LOCATION=/scinet/bgq/Libraries/fftw-3.3.2
cmake ../source/ \
-DCMAKE_TOOLCHAIN_FILE=BlueGeneQ-static-XL-C \
-DCMAKE_PREFIX_PATH=$FFTW_LOCATION \
-DCMAKE_INSTALL_PREFIX=$(pwd) \
-DGMX_X11=OFF \
-DGMX_MPI=ON \
-DGMX_PREFER_STATIC_LIBS=ON
make -j 16
make install

#28 Updated by Chris Neale about 6 years ago

I am impressed with the new BG/Q kernel.

For a 30K atom system, using 0.9 nm cutoffs and PME, I can get:

64 cores (4 nodes): 17.0 ns/day
128 cores (8 nodes): 27.5 ns/day
256 cores (16 nodes): 40.6 ns/day

For comparison, the exact same system gets the following rates on an x86 machine (Intel Xeon E5540 at 2.53GHz with 8 cores/node; the "SciNet" cluster in Toronto, Canada):
8 cores: 18.1 ns/day
16 cores: 32.6 ns/day
32 cores: 51.6 ns/day

On the BG/Q, with this system, I always get the best performance using runjob --np $numcores --ranks-per-node=16 --envs OMP_NUM_THREADS=4

Thanks for all the code work and help compiling!

#29 Updated by Mark Abraham about 6 years ago

Thanks for the feedback, Chris!

I plan to prohibit thread-MPI entirely. It could probably be made to work, but you would then need a mechanism to wrap a bundle of such runs into one job to make use of the smallest available partition... and for that you need MPI (e.g. -multi).
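
For anyone who does want to bundle replicates that way, a sketch with mdrun -multi (node and replicate counts are illustrative; with -multi, mdrun appends the simulation index to the input names, so -s topol.tpr expects topol0.tpr ... topol3.tpr):

# four replicate simulations packed into one 4-node job:
# 16 ranks per node, 4 OpenMP threads per rank, 16 ranks per replicate
runjob --np 64 --ranks-per-node=16 --envs OMP_NUM_THREADS=4 : \
  mdrun_mpi -multi 4 -s topol.tpr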

#30 Updated by Jeff Hammond about 6 years ago

Sorry, but I do not understand what is meant by "prohibit thread-MPI entirely". I don't see any indication in this thread that this is a problem. Did I miss something? I guess I should RTFM or something but if this means MPI+OpenMP is disabled, that is very bad...

#31 Updated by Szilárd Páll about 6 years ago

Jeff Hammond wrote:

I guess I should RTFM or something

Yes, start here.

but if this means MPI+OpenMP is disabled, that is very bad...

No.

#32 Updated by Jeff Hammond about 6 years ago

Ah, so that's the workaround when MPI thread support is not present. Yeah, that's pointless on any decent MPI implementation. Gromacs should test for MPI_Init_thread and disable Thread-MPI whenever it's present, not just on BGQ.

#33 Updated by Szilárd Páll about 6 years ago

Jeff Hammond wrote:

Ah, so that's the workaround when MPI thread support is not present. Yeah, that's pointless on any decent MPI implementation. Gromacs should test for MPI_Init_thread and disable Thread-MPI whenever it's present, not just on BGQ.

No, it's not about MPI thread support. You may want to read the linked paragraph again, but to summarize: the "thread-MPI" referenced here is GROMACS' own highly efficient, multi-threading-based MPI implementation.

#34 Updated by Jeff Hammond about 6 years ago

Okay, I understand it now. My understanding of what goes into a (full) implementation of MPI was brain-blocking me from parsing that correctly.

#35 Updated by Chris Neale about 6 years ago

It is inconvenient to only be able to compile the MPI version, which means that I have to run grompp under MPI on the queue, bundled in a call to srun or runjob... or am I missing something? I still cannot compile the non-MPI version, so I don't see an alternative.

Thank you,
Chris.

#36 Updated by Mark Abraham about 6 years ago

There has never been any point in compiling a GROMACS tool with MPI (though that might change for 5.0). One should just use the non-MPI version - on any machine.

Trying to run any version of a GROMACS tool on the back end of BlueGene/Q is even less useful, because the front-end machine likely has faster cores, and does not have to pay back-end-partition creation overhead.

On BG/Q, one should compile mdrun with MPI for the back end, everything else for the front end, and write job scripts accordingly. The scripts execute on the front end, just as srun and runjob do - but the latter know how to spawn MPI processes on the back end. This is the price of a heterogeneous HPC architecture, but the only difference between scripts for BG/Q and those for a homogeneous x86 cluster should be the actual call to mdrun.

Hence, the only build that should work with the BG/Q compiler toolchain is one that uses MPI.
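
In other words, two separate builds from one source tree, roughly along these lines (a sketch; install prefixes, FFTW paths and the parallelism level are omitted or illustrative):

# back-end build: cross-compiled, MPI-enabled mdrun for the compute nodes
mkdir build-backend && cd build-backend
cmake ../source \
  -DCMAKE_TOOLCHAIN_FILE=BlueGeneQ-static-XL-C \
  -DGMX_MPI=ON
make -j 16

# front-end build: plain login-node compilers, no MPI, for grompp and the other tools
cd .. && mkdir build-frontend && cd build-frontend
cmake ../source -DGMX_MPI=OFF
make -j 16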

#37 Updated by Mark Abraham about 6 years ago

You would have the option of compiling with MPI for the front end, too, if you really wanted to keep scripts calling grompp_mpi. :-)

#38 Updated by Chris Neale about 6 years ago

OK, thanks Mark. It is a generally usable method to do the grompp outside of the BG/Q, scp the .tpr onto the BlueGene, and then submit the job there. Is that really what you meant by front end?

On both of the BG/Qs that I have access to, if I include -DCMAKE_TOOLCHAIN_FILE=BlueGeneQ-static-XL-C in the cmake call, then I cannot compile the GROMACS master code unless I use MPI; furthermore, if I omit the -DCMAKE_TOOLCHAIN_FILE=BlueGeneQ-static-XL-C line in cmake, then the compile also fails.

#39 Updated by Mark Abraham about 6 years ago

Chris Neale wrote:

OK, thanks Mark. It is a generally usable method to do the grompp outside of the BG/Q, scp the .tpr onto the BlueGene, and then submit the job there. Is that really what you meant by front end?

The machine to which you log in (Linux on Power7, in my experience) is the front end. It's just a normal machine. The compute nodes are totally different, and require cross-compilation to generate code that can execute there - that's what the BluegeneQ platform file is for.

You should compile the GROMACS tools for the front end, and run them on it. Because it's a normal machine, you don't need a special platform file. You may as well run EM on it, too. Job scripts can just look like

# Magic queue system stuff here 
grompp -your -options -for -em
runjob -np whatever : mdrun_mpi -other -stuff
grompp -your -options -for -equil
runjob -np whatever : mdrun_mpi -other -stuff
grompp -your -options -for -production -md
runjob -np whatever : mdrun_mpi -other -stuff
trjconv -make -me -pretty

but non-mdrun tools are compiled for and execute on the front end, and mdrun_mpi is compiled for the back end. runjob manages the details, just like mpirun would when using OpenMPI.

On both of the BG/Qs that I have access to, if I include -DCMAKE_TOOLCHAIN_FILE=BlueGeneQ-static-XL-C in the cmake call, then I cannot compile the GROMACS master code unless I use MPI; furthermore, if I omit the -DCMAKE_TOOLCHAIN_FILE=BlueGeneQ-static-XL-C line in cmake, then the compile also fails.

The master branch tip is lagging on a few BG/Q details that I submitted to the release-4-6 branch. The only time you want to use the BG/Q toolchain is when you also want MPI, though. So I don't see a problem, except that our use of CMake does not yet prohibit doing the wrong thing.

#40 Updated by Chris Neale about 6 years ago

When I compile without -DCMAKE_TOOLCHAIN_FILE=BlueGeneQ-static-XL-C and without MPI, I get the same error that I started with, which I originally posted on the mailing list ( http://gromacs.5086.x6.nabble.com/BGQ-compilation-with-verlet-kernels-include-file-quot-kernel-impl-h-quot-not-found-td5011259.html ). Namely:

[  2%] Building C object src/gmxlib/CMakeFiles/gmx.dir/copyrite.c.o
[  2%] Building C object src/gmxlib/CMakeFiles/gmx.dir/cinvsqrtdata.c.o
[  2%] Building C object src/gmxlib/CMakeFiles/gmx.dir/writeps.c.o
[  2%] Building C object src/gmxlib/CMakeFiles/gmx.dir/mtop_util.c.o
[  3%] Building C object src/gmxlib/CMakeFiles/gmx.dir/gmx_fatal.c.o
"/home/p/pomes/cneale/exec/gromacs-4.6.3_bgq/source/src/gmxlib/network.c", line 264.10: 1506-296 (S) #include file <spi/include/kernel/location.h> not found.
make[2]: *** [src/gmxlib/CMakeFiles/gmx.dir/network.c.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [src/gmxlib/CMakeFiles/gmx.dir/all] Error 2
make: *** [all] Error 2
[  0%] Building C object src/gmxlib/CMakeFiles/gmx.dir/network.c.o
"/home/p/pomes/cneale/exec/gromacs-4.6.3_bgq/source/src/gmxlib/network.c", line 264.10: 1506-296 (S) #include file <spi/include/kernel/location.h> not found.
make[2]: *** [src/gmxlib/CMakeFiles/gmx.dir/network.c.o] Error 1
make[1]: *** [src/gmxlib/CMakeFiles/gmx.dir/all] Error 2
make: *** [all] Error 2

What am I missing?

If I should take this back to the users mailing list, please let me know.

Thanks again for your help,
Chris.

#41 Updated by Mark Abraham about 6 years ago

There is no purpose in using that toolchain and not using MPI. I've already explained why and what to do instead. Not sure how else I can help. Do you understand that you need to cross-compile mdrun? Do you understand that a BlueGene/Q job script "runs" on two different kinds of machine?

#42 Updated by Chris Neale about 6 years ago

OK, I guess that I didn't make it clear. I am not using that toolchain and am not using MPI, and cannot compile. As far as I know, there is no way to compile any GROMACS tools without MPI on any part of the BG/Q. If anyone ever does it, I'd be eager to get their compilation script. But for now I can grompp on an x86 machine and scp the .tpr over to the BG/Q, or run grompp under MPI on the compute nodes, so it's not a big problem.

Thanks again,
Chris.

#43 Updated by Mark Abraham about 6 years ago

Chris Neale wrote:

OK, I guess that I didn't make it clear. I am not using that toolchain and am not using MPI, and cannot compile. As far as I know, there is no way to compile any GROMACS tools without MPI on any part of the BG/Q. If anyone ever does it, I'd be eager to get their compilation script. But for now I can grompp on an x86 machine and scp the .tpr over to the BG/Q, or run grompp under MPI on the compute nodes, so it's not a big problem.

Sorry, I misread the first few words of your post 40, where you said you were "without" the GROMACS-supplied BlueGene/Q XLC toolchain. However, the output is only consistent with a build that does use a Bluegene/Q toolchain. The missing #include file is protected by an #ifdef that is only true when using a compiler that #defines __bgq__. Two explanations come to mind:
  • with CMake, the compiler/toolchain is an invariant that is fixed as soon as you first run CMake, so unless you remove the cache (or the whole contents of the build directory), you cannot "undo" the choice of the platform file, and should instead start afresh (see the sketch after this list)
  • you were using some other compiler "for the BlueGene/Q" (e.g. gcc or clang) that also #defines __bgq__ - for the non-MPI build, you want a compiler that targets the front end (i.e. the login node). Whether or not you were using such a compiler, it does illustrate that I should further protect that #include with an #ifdef for using MPI. Thanks!
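
On the first point, the safe recipe is simply a fresh build directory per configuration, for example (a sketch; directory names are illustrative):

# CMake caches the compiler/toolchain choice, so switch configurations by
# starting a clean build directory rather than re-running cmake in the old one
rm -rf build-frontend
mkdir build-frontend && cd build-frontend
cmake ../source -DGMX_MPI=OFF
make -j 16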

#44 Updated by Jeff Hammond about 6 years ago

Just to be clear, you are saying that CMake cannot autodetect the POWER7/Linux login-node environment properly? What is your CMake invocation, and what is failing? I recommend that you start with /usr/bin/gcc and friends, since those will not have any modifications for BG/Q relative to the RHEL RPMs.

I will try to work on this for you but it might not happen right away.

#45 Updated by Chris Neale about 6 years ago

gcc works, thank you Jeff.

I was having problems with non-MPI builds while using the bgxlf and bgxlC compilers. I was calling cmake like this:

cmake ../source/ \
-DCMAKE_PREFIX_PATH=$FFTW_LOCATION \
-DCMAKE_INSTALL_PREFIX=$(pwd) \
-DGMX_X11=OFF \
-DGMX_MPI=OFF \
-DGMX_PREFER_STATIC_LIBS=ON

Note that I was previously loading an mpich2 module and an fftw module that was compiled with MPI support, but I was still using -DGMX_MPI=OFF. (I hadn't thought that loading the MPI module would affect the build when I defined -DGMX_MPI=OFF, sorry.) When I stopped loading the bgxlC and MPI modules, I got errors with fftw and had to recompile it without MPI support. After that, gcc 4.4.6 compilation with cmake as above worked fine.

Note that all of my previous builds were always using empty install directories (unless there were dot-files left over from a previous build attempt that didn't get removed by an rm -rf * call).

#46 Updated by Jeff Hammond about 6 years ago

The bgxl* compilers are for the Blue Gene compute nodes; that's the reason for the bg* prefix. On Blue Gene systems, the xl* compilers are not the same as the ones for POWER and may not work for the logins. At best they will generate vanilla PPC64 code that can run on both processors. Thus, you need to always use GCC on the POWER logins, unless you have obtained the XL compilers specifically for POWER/Linux as a separate license from IBM.
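
A quick sanity check on the login node, if in doubt about which compilers you are picking up (the expected output is indicative only):

which xlc bgxlc_r gcc        # see which compiler drivers are actually on your PATH
/usr/bin/gcc -dumpmachine    # a front-end gcc reports a plain PPC64 Linux target
                             # (e.g. ppc64-redhat-linux), not a Blue Gene one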

#47 Updated by Mark Abraham about 6 years ago

Agree with Jeff.

I needed to consider a third option in post 43:
  • that Chris in post 40 was using bgxl* cross-compilers (which #define __bgq__ because they target the back end - these are the compilers specified in the BluegeneQ*.cmake platform files supplied with GROMACS) to try to compile for the front end. You must use different compilers for the front end, but your local sysadmins can tell you what they are.

To repeat: by design, the non-MPI GROMACS build should use compilers that target the login nodes (a.k.a. front end), because those produce executables that work from a login shell (and in a job script) when not using runjob/srun.

#48 Updated by Chris Neale about 6 years ago

Thanks guys. This was very helpful. Sorry for the confusion.

#49 Updated by Mark Abraham almost 6 years ago

  • Status changed from In Progress to Resolved

#50 Updated by Chris Samuel almost 6 years ago

Apologies for the delay folks, I've been buried here.

We eventually got some feedback about this version; they said:

It seems that the new code speeds up the verlet scheme relative to the
pair list scheme - which makes sense, because only the verlet scheme is
using the new kernel (I believe). However, it seems that using the pair
lists, the new version is considerably slower than the old version.

So I'm not sure they're going to consider GROMACS on BG/Q usable (I'll check). :-(

#51 Updated by Mark Abraham almost 6 years ago

A performance regression of the group kernels seems wildly unlikely, but I will try it out. The group kernels themselves are so unfriendly to BG/Q (even if we were to port them with QPX SIMD, which we won't) that switching to the Verlet kernels for the new performance would be a no-brainer (unless one of the algorithms not yet implemented for the Verlet scheme is important, which would itself have been interesting to know about before...).

#52 Updated by Mark Abraham almost 6 years ago

I have tested the group scheme on BG/Q in 4.6.3 and at release-4-6 HEAD (i.e. with the above patch included), and their performance is identical (as I expected). The Verlet scheme on the same simulation was nearly 3 times faster. All runs were approximately balanced between PP and PME. So, pending further information from Chris's user, I see no problem to address.

#53 Updated by Chris Samuel almost 6 years ago

Hi Mark,

Mark Abraham wrote:

I have tested the group scheme on BG/Q in 4.6.3 and release-4-6 HEAD (i.e. with the above patch included), and their performance is identical (as I expected). The Verlet scheme on the same simulation was nearly 3 times faster. All runs were approximately balanced with PP and PME. So pending further information from Chris's user, I see no problem to address.

Perfect, thanks - that means we do have a local issue then that's caused a performance regression.

I'll get our apps people to chase this locally.

#54 Updated by Rossen Apostolov almost 6 years ago

  • Status changed from Resolved to Closed
