Bug #937

OpenMP bug with MSVC 2010

Added by Roland Schulz about 7 years ago. Updated almost 7 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: mdrun
Target version: 4.6
Affected version - extra info:
Affected version:
Difficulty: uncategorized

Description

Running PME with OpenMP with MSVC 2010 gives:
"Fatal User Error 1002: '#pragma omp barrier' improperly nested in a work-sharing construct"

For the nacl regression test it crashes on the 2nd barrier the 4th time fft5d is executed.
It is fine when compiling without thread-mpi.
I have no idea why this is crashing. It looks to me like a bug in the compiler/OpenMP runtime.

Should we auto-disable OpenMP if thread-mpi is activated for MSVC 2010 (as we did for #900)?
I'm not sure whether this bug also occurs with Intel/VS2008 on Windows, because thread-mpi doesn't compile at all for them (#936).

For now I have changed the Jenkins configuration so that OpenMP and thread-mpi aren't used together on Windows.

Associated revisions

Revision 38cef96f (diff)
Added by Roland Schulz almost 7 years ago

Fix illegal barrier in worksharing

According to section 2.10 of the OpenMP standard:
A barrier region may not be closely nested inside a worksharing
... region.

Replaced parallel-for with parallel plus gmx_omp_get_thread_num.

Fixes #937

Change-Id: Ief31f2097fb084d405e5981913cce2c2bfee868c

History

#1 Updated by Roland Schulz about 7 years ago

VS2008 without thread-mpi also produces the same OpenMP error with PME:
http://jenkins.gromacs.org/job/Gromacs_Gerrit_4_6/650/testReport/

#2 Updated by Roland Schulz almost 7 years ago

What is the reason we use

#pragma omp parallel for num_threads(pme->nthread) schedule(static)
        for (thread = 0; thread < pme->nthread; thread++)
        {

instead of

#pragma omp parallel num_threads(pme->nthread)
        {
            thread = omp_get_thread_num();

?
Did we benchmark this and find the first option faster? The problem goes away when changing to the 2nd option.
And according to http://msdn.microsoft.com/en-us/library/azb90x99(v=vs.100).aspx the reason the first option doesn't work is:

barrier directives are not permitted in the dynamic extent of for [] regions if the directives bind to the same parallel as the regions.

Thus I think for MSVC we need to use the 2nd option. Ideally we would use the 2nd option for all compilers. If we know of compilers for which the 2nd option is slower, we could add a macro that selects option 1 or 2 depending on the compiler.
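
For concreteness, a minimal self-contained sketch of the 2nd option (using plain omp_get_thread_num() rather than the gmx_omp_get_thread_num() wrapper, and a constant standing in for pme->nthread); here the barrier binds directly to the parallel region, which is what MSVC accepts:

#include <stdlib.h>
#include <omp.h>

int main(void)
{
    int  nthread = 4;              /* stands in for pme->nthread */
    int *data    = calloc(nthread, sizeof(int));

#pragma omp parallel num_threads(nthread)
    {
        /* Each thread computes its own index instead of receiving it
           from a worksharing loop. */
        int thread = omp_get_thread_num();

        data[thread] += 1;
#pragma omp barrier            /* legal: binds to the enclosing parallel region */
        data[thread] += 1;
    }

    free(data);
    return 0;
}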

#3 Updated by Berk Hess almost 7 years ago

I used the second option nearly everywhere in Gromacs, but changed all of them to for loops.
At the Barcelona Supercomputing Center they prefer for loops, as those can automatically be converted to parallel tasks by their task parallelization scheme. We could reconsider this, though.

#4 Updated by Roland Schulz almost 7 years ago

There is a difference between the two approaches when they are combined with OpenMP's "dynamic adjustment of the number of threads". If I understand correctly, a runtime which supports dynamic adjustment only uses as many threads as are still available (cores can be unavailable because they are already used by other tasks/nested parallelism). With the 1st option under dynamic adjustment all pme->nthread iterations are executed, whereas with option 2 only as many as there are available threads (if that is fewer than nthread). But neither is correct for PME, because the FFT assumes that all other nthread tasks have completed when a barrier is reached (so that the subsequent transpose is valid). Option 1 would also deadlock if fewer than nthread threads executed the loop: some threads would execute the loop body a 2nd time, and then only some of the threads would reach the barrier. Thus, to support dynamic adjustment, we would need to split every region currently separated by a barrier into a separate omp-for.

As far as I can see we are not using dynamic adjustment. Are we planning to use it? If we are not, and thus don't split the code into separate omp-for constructs, we should disable it with "omp_set_dynamic(false)", to make sure we don't get wrong results in cases where the runtime for some reason thinks that fewer threads are available.
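
For illustration, a minimal sketch of that safeguard (in C the argument is an int, so 0 rather than false; nthread again stands in for pme->nthread):

#include <assert.h>
#include <omp.h>

int main(void)
{
    int nthread = 4;    /* stands in for pme->nthread */

    /* Disable dynamic adjustment once at startup, so the runtime
       always delivers exactly the requested team size. */
    omp_set_dynamic(0);

#pragma omp parallel num_threads(nthread)
    {
        /* With dynamic adjustment off (and the implementation's thread
           limit not exceeded), the team has exactly nthread threads,
           so barrier-based schemes like fft5d's are safe. */
        assert(omp_get_num_threads() == nthread);
    }
    return 0;
}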

Interestingly, even GCC complains with "error: barrier region may not be closely nested inside of work-sharing, critical, ordered, master or explicit task region" for:

#include <stdlib.h>
#include <omp.h>

int main(void)
{
    /* omp_get_max_threads() rather than omp_get_num_threads(), which
       returns 1 outside of a parallel region */
    int  nthread = omp_get_max_threads();
    int  thread;
    int *data = calloc(nthread, sizeof(int));
#pragma omp parallel for num_threads(nthread) schedule(static)
    for (thread = 0; thread < nthread; thread++)
    {
        data[thread] += 1;
#pragma omp barrier    /* this is the barrier gcc rejects */
        data[thread] += 1;
    }
    free(data);
    return 0;
}

Only when the loop body is moved into a separate function does gcc stop reporting an error. But it seems to me that this doesn't change the logic, so I suspect the only reason gcc doesn't give an error in our case is that it doesn't check beyond function boundaries.
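
For illustration, this is roughly what the function-boundary variant looks like (a sketch, not the actual pme/fft5d code); gcc compiles it without complaint even though the logic is identical to the rejected version above:

#include <stdlib.h>
#include <omp.h>

/* The same loop body as above, with the barrier hidden behind a
   function call; gcc's nesting check does not look across function
   boundaries, so it no longer diagnoses anything. */
static void body(int *data, int thread)
{
    data[thread] += 1;
#pragma omp barrier
    data[thread] += 1;
}

int main(void)
{
    int  nthread = omp_get_max_threads();
    int  thread;
    int *data = calloc(nthread, sizeof(int));
#pragma omp parallel for num_threads(nthread) schedule(static)
    for (thread = 0; thread < nthread; thread++)
    {
        body(data, thread);
    }
    free(data);
    return 0;
}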

I can't find anything in the OpenMP standard prohibiting a barrier in an omp-parallel-for. All it says under the barrier restrictions is:

• Each barrier region must be encountered by all threads in a team or by none at all.
• The sequence of worksharing regions and barrier regions encountered must be the same for every thread in a team.

So I'm not sure why gcc and msvc complain. The restriction makes sense in the general case, but in the special case that the number of threads equals the number of iterations it seems OK according to the standard. Still, since in that case both options are equivalent anyway, it might be better to use option 2.

Do the Barcelona parallel tasks make sure that as many tasks as pme->nthread are used? Because otherwise they would have the same problem as described above for dynamic adjustment.

Do you have another idea or should I go ahead and change it to option 2?

My idea of wrapping the whole parallel region into a macro, so that one could make option 1/2 configurable, doesn't work because pragmas are not allowed in macros and _Pragma requires C99.
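
For reference, the C99 variant would look roughly like the sketch below; _Pragma, unlike #pragma, may appear in a macro expansion, so option 1/2 could in principle be selected at compile time. The macro names are hypothetical, purely illustrative:

#include <omp.h>

/* Stringizing helper: turns the macro argument into the string
   literal that _Pragma requires. C99 only, hence unusable for us. */
#define GMX_PRAGMA(directive) _Pragma(#directive)

#ifdef GMX_OMP_PREFER_FOR    /* hypothetical compile-time switch */
#define GMX_PARALLEL_PRAGMA(n) GMX_PRAGMA(omp parallel for num_threads(n) schedule(static))
#else
#define GMX_PARALLEL_PRAGMA(n) GMX_PRAGMA(omp parallel num_threads(n))
#endif

/* Usage with GMX_OMP_PREFER_FOR undefined (option 2): */
int main(void)
{
    int nthread = 4;

    GMX_PARALLEL_PRAGMA(nthread)
    {
        int thread = omp_get_thread_num();
        (void)thread;
    }
    return 0;
}

Even then the code following the pragma would still have to differ between the two options (a for statement vs. a block using omp_get_thread_num()), so the macro alone would not make the choice transparent.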

#5 Updated by Roland Schulz almost 7 years ago

Berk, do you want me to go ahead and change it or do you have some other idea to fix it?

#6 Updated by Szilárd Páll almost 7 years ago

Do the Barcelona parallel tasks make sure that as many tasks as pme->nthread are used? Because otherwise they would have the same problem as described above for dynamic adjustment.

It does if one specifies it. These kinds of inter-dependencies will be an issue, especially as they are lurking in the code here and are probably not documented very well.

Is there any performance penalty to making the dynamic parallelism work (by splitting the omp for)?

Do you have another idea or should I go ahead and change it to option 2?

I would very much prefer not to have to start by ifdef-ing this pragma to turn it back into an omp for for OmpSs, but if there is no other way to eliminate the warnings, I'll have to be fine with it.

My idea of wrapping the whole parallel region into a macro, so that one could make option 1/2 configurable, doesn't work because pragmas are not allowed in macros and _Pragma requires C99.

Right.

#7 Updated by Szilárd Páll almost 7 years ago

  • Category set to mdrun
  • Target version set to 4.6

#8 Updated by Roland Schulz almost 7 years ago

Szilárd Páll wrote:

Do the Barcelona parallel tasks make sure that as many tasks as pme->nthread are used? Because otherwise they would have the same problem as described above for dynamic adjustment.

It does if one specifies it.

Is it then still better for them than the 2nd option? Since at this point the two are equivalent, if they can convert one but not the other I would think it is a problem with their code, not ours. And even in the dynamic case, why is Barcelona's code not able to extract that from a simple "omp parallel"? I can't see any reason why that shouldn't be possible. Thus I think they should fix it, not us.

Is there any performance penalty to making the dynamic parallelism work (by splitting the omp for)?

I don't think so, but it isn't possible without significant code rearrangements. E.g. one of the "omp for" constructs is in pme, and from within the parallel region it calls fft5d, which has barriers. If we were to split that, we would somehow need an "omp for" starting in pme and ending in fft5d. I don't know of any way other than actually copying fft5d into pme, because a for loop obviously cannot start in one function and end in another.

#9 Updated by Szilárd Páll almost 7 years ago

Roland Schulz wrote:

Szilárd Páll wrote:

Do the Barcelona parallel tasks make sure that as many tasks as pme->nthread are used? Because otherwise they would have the same problem as described above for dynamic adjustment.

It does if one specifies it.

Is it then still better for them than the 2nd option? Since at this point the two are equivalent, if they can convert one but not the other I would think it is a problem with their code, not ours. And even in the dynamic case, why is Barcelona's code not able to extract that from a simple "omp parallel"? I can't see any reason why that shouldn't be possible. Thus I think they should fix it, not us.

Yes, 1) is better than 2), because parallel is not (fully) supported. Btw, we are not fixing anything; as you said, the code should be correct as it is. Anyway, there is no point in arguing about this: if you want to change it, go ahead. As I said, the only drawback of doing that is that I/someone else will have to change it back in the respective branch.

Is there any performance penalty to making the dynamic parallelism work (by splitting the omp for)?

I don't think so, but it isn't possible without significant code rearrangements. E.g. one of the "omp for" constructs is in pme, and from within the parallel region it calls fft5d, which has barriers. If we were to split that, we would somehow need an "omp for" starting in pme and ending in fft5d. I don't know of any way other than actually copying fft5d into pme, because a for loop obviously cannot start in one function and end in another.

I see.

#10 Updated by Szilárd Páll almost 7 years ago

FYI: I get a segv with gcc 4.7 if I replace the #pragma omp parallel for with #pragma omp parallel in pme.c:4326.

#11 Updated by Roland Schulz almost 7 years ago

Did you remember to replace for(thread=0; thread<pme->nthread; thread++) with thread = gmx_omp_get_thread_num()?
I.e., did you end up with:

#pragma omp parallel num_threads(pme->nthread)
        {
            thread = gmx_omp_get_thread_num();

? (Note that the schedule(static) clause has to go as well; schedule is only valid on worksharing constructs such as omp for, not on a plain omp parallel.)

BTW: by "fix" I meant that our current code tries to work around the Barcelona software not understanding (or optimizing) the plain parallel construct. But that makes the current code not OK according to the standard.

#12 Updated by Roland Schulz almost 7 years ago

I also get a segfault. I'm looking into it and will let you know when I know why.

#13 Updated by Szilárd Páll almost 7 years ago

Roland Schulz wrote:

BTW: by "fix" I meant that our current code tries to work around the Barcelona software not understanding (or optimizing) the plain parallel construct. But that makes the current code not OK according to the standard.

Well, same as you, I could not find a single reference in the standard for that barrier to be considered illegal, so I would rather call it a deficiency/restriction imposed by the implementation. This doesn't change the situation much, though: if we want to consider MSVC 2010 supported, we should fix this.

#14 Updated by Roland Schulz almost 7 years ago

I now found it in the standard: section 2.10. "A barrier region may not be closely nested inside a worksharing..."

#15 Updated by Roland Schulz almost 7 years ago

  • Status changed from New to Closed
