Project

General

Profile

Task #2010

Use size_t instead of int for indexing

Added by Christian Blau about 3 years ago. Updated about 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
core library
Target version:
-
Difficulty:
uncategorized
Close

Description

The type size_t is expressing most clearly the intend of indexing something, and it is guaranteed that it can index the largest object the system can hold.

Changing code to include size_t in most places needs concentrated effort because

- negative indices might still be in use
- local changes to size_t are only contained by undesired type-casts.
- printf functions will have to change for conversion from int to size_t
- ...

For some previous discussion on size_t see:

[[https://gerrit.gromacs.org/#/c/6020/]]
[[https://gerrit.gromacs.org/#/c/4734/]]
[[https://gerrit.gromacs.org/#/c/5146/]]


Related issues

Related to GROMACS - Bug #2014: GROMACS free energy memory footprintClosed

Associated revisions

Revision 118b2fe8 (diff)
Added by Roland Schulz about 1 year ago

clang-tidy: misc-misplaced-widening-cast

Adds gmx::index to be used as type for index calculations and
comparisons.

given:
int a,b;
size_t c;
static_cast<size_t>(a+b) < c

clang-tidy warns because the static_cast does two things:
- it widens the sum from 32 bit to 64 bit
- it sign casts from signed to unsigned

If the widening is actually required (because while a and b
fit into 32 bit the sum is bigger), then this gives the
wrong result. Thus the terms should be widened prior to summing.
But this gives the wrong result if either of them is negative,
unless the terms are first widened to signed 64 bit and the sum
is then sign converted.

Additionally sign converting the signed to unsigned is unsafe
because it gives the wrong result if it negative. Converting
the unsigned is safe for 64 bit up to 2^63 which is safe to assume
won't be reached.

For all cases clang-tidy complained about (widening after sum), the
sign conversion is changed from signed->unsigned to unsigned->
signed. Introduced gmx::index to have a type for this purpose.

None of the potential bug highlighted by clang-tidy are fixed by
this change. Instead it assumes that in all cases here the intention
was just the signed conversion and no widening prior to summing
is required.

Related #2010

Change-Id: I9251196bf6ed744317868ce9bfc34876e5cfd43b

Revision 52fd9ad7 (diff)
Added by Roland Schulz about 1 year ago

Change ArrayRef::size_type to signed

Follows gsl:span and std::span design, and Google and
Core guidelines.

Related #2010

Change-Id: Ia7b977d0d1a14b5a81f7a6390a9dc4fc24bfa1ce

History

#1 Updated by Berk Hess about 3 years ago

What about performance? size_t will usually be 64 bits. This might result in slower code for simple loops. And it will lead to more memory (cache) usage or type conversions when we need to store indices in data structures.

#2 Updated by Berk Hess about 3 years ago

  • Related to Bug #2014: GROMACS free energy memory footprint added

#3 Updated by Szilárd Páll almost 3 years ago

While in general size_t is better than just using plain int, AFAIK, strictly speaking it is only intended to be able to hold the size of the largest object on the respective platform and that that does not mean it is intended to be used for indexing, it just happens to be suitable to avoid overflow for the aforementioned spec. If we want to be strict and safe in our design, what we should be using is custom types for the ranges of different kind of data (e.g. local_atom_count_t). I doubt we want to go there, though. Also note that size_t is hostile to int-accustomed coders and existing code as decrementing loops can end up running into infinity.

On the other hand, Berk's point is an important one: size_t is guaranteed to be a qword on a 64-bit platform and it is guaranteed to be very often slower to manipulate than int. Unless I'm mistaken, using size_t for indexing everywhere will inevitably have a great impact on most of the code in the main MD loop, especially hot kernels.

C99 solved this issue by clearly separating integer types based on the intended use; they provide types with exact or at least a certain width, fastest types with at least a certain width, wide enough to hold pointers, and those with greatest width, see: http://pubs.opengroup.org/onlinepubs/009695399/basedefs/stdint.h.html

So I think it'd be best to use the fastest dword-sized integer in most cases when performance matters through a custom type, native qword when needed, and avoid size_t-based indexing.

I know we ditched C, but there must be equivalents of this in the garden of C++.

#4 Updated by Aleksei Iupinov almost 3 years ago

Szilárd Páll wrote:

I know we ditched C, but there must be equivalents of this in the garden of C++.

Well, looking around, there is cstdint (http://en.cppreference.com/w/cpp/types/integer) with funky typedefs like uint_fast32_t.

#5 Updated by Erik Lindahl almost 3 years ago

Over the last few years, we've had a bunch of bugs related to integer overflows (for instance in file manipulation, memory allocation, and normal mode analysis). As far as I know, this far we haven't had a single performance regression due to the usage of size_t. Virtually all modern architectures are also 64-bit, and spilling/filling a 64-bit register to L1 should be pretty darned fast.

There are no guarantees about signed/unsigned for size_t (so one should indeed make sure never to check for <0), but there is a guarantee this will be the type returned for all sizes of vectors and other objects. Thus, those "int-accustomed coders" don't solve the problem by using integers for their loop variables - they just exchange one problem for another (although they might ignore the second problem :-) It is a whole lot safer to use size_t when comparing anything against a length/size rather than worrying about signed/unsigned conversion for every single such occasion.

Let's avoid premature optimization that causes bugs. Use size_t, apart from cases where you can show that it causes severe performance regression.

#6 Updated by Roland Schulz almost 3 years ago

I agree that avoiding bugs is most important everywhere where it isn't shown to be otherwise. But that's why I think it is a bad idea to use size_t. One shouldn't use unsigned for arithmethic to avoid surprises (1). Thus the counter type should be signed. I think using unsigned is more likely to cause a bug than 32bit does right now. So this would be a step backward. I'm fine to use int64_t or ptrdiff_t or similar.

But why specify a type at all and not avoid the problem and use: for(auto i=v.begin(); i<v.end(); i++) or even for (auto e: v)? Of course the first is still often (internally) signed but by expressing it as a loop over a container (or similar) the intent is clear and it shouldn't be surprising that negative values don't work and difficult to accidentally use.

1) It also is worse for performance (because the compiler needs to proof that the loop doesn't overflow for things like auto-vectorization) but see above why that's not my main argument.

#7 Updated by Berk Hess almost 3 years ago

I agree with Roland.
Most of our trivial loops will be over all elements in a std::vector, there I think we should use auto which completely avoids the loop variable.
In most non-trivial cases we use the loop variable for arithmetic and using unsigned here is much more bug prone than using signed.

#8 Updated by Roland Schulz almost 3 years ago

Another problem with the proposal is that for x32 ABI size_t (and ptrdiff_t) is 32bit. Thus if we were using it everywhere (including in places where we use gmx_int64_t atm), then we could actually introduce new integer overflows for those compiling with x32. Thus counters really should be int64_t if we agree that we want to always use 64bit counters (I'm not sure about that).

#9 Updated by Roland Schulz about 1 year ago

It is widely acknowledged that making vector::size_type unsigned was a mistake: Because of historical accident, the C++ standard also uses unsigned integers to represent the size of containers - many members of the standards body believe this to be a mistake, but it is effectively impossible to fix at this point. As a consequence span::index_type is signed. Also the guidelines we usually use as reference (Google+cppcore) both agree that unsigned shouldn't be used for anything other than bit-wise/bitfields https://google.github.io/styleguide/cppguide.html#Integer_Types Don't use unsigned for subscripts, prefer gsl::index
Don't try to avoid negative values by using unsigned. int64_t or ptrdiff_t (same as gsl::index and span::index_type) don't have an overflow issue compared to size_t (assuming 2^63 is sufficient).

I suggest we follow span/gsl/google advice and create a gmx::index type which is either int64_t or ptrdiff_t. To deal with the sign-compare warnings we could disable them (that's what e.g. absl does). But it is safer to be explicit and use constructor syntax to avoid being very verbose

for (index i=0; i<index(v.size()); i++)

The proposal (and our existing practice which is a mixture of different styles) can easily lead to (especially because we don't have Wsign-conversion enabled):
int f() { return -1; } // in some other file so it isn't easy to see that it returns a negative value
for (std::size_t n=f(); n<2; n++) //results in 0 iterations

#10 Updated by Gerrit Code Review Bot about 1 year ago

Gerrit received a related patchset '1' for Issue #2010.
Uploader: Roland Schulz ()
Change-Id: gromacs~master~I9251196bf6ed744317868ce9bfc34876e5cfd43b
Gerrit URL: https://gerrit.gromacs.org/8041

#11 Updated by Gerrit Code Review Bot about 1 year ago

Gerrit received a related patchset '1' for Issue #2010.
Uploader: Roland Schulz ()
Change-Id: gromacs~master~Ia7b977d0d1a14b5a81f7a6390a9dc4fc24bfa1ce
Gerrit URL: https://gerrit.gromacs.org/8042

Also available in: Atom PDF