We need better hashing

We need better hashing

Talin-3
LLVM currently has a bunch of different hashing algorithms scattered throughout the code base.

There are also a number of places in the code where a FoldingSetNodeID is created for the purpose of calculating a hash, and then discarded. From an efficiency standpoint, this isn't all that bad unless the number of individual items being hashed exceeds 32, at which point the SmallVector overflows and memory is allocated.

I personally want to see a better approach to hashing because of the cleanup work I've been doing - I've been replacing std::map and FoldingSet with DenseMap in a number of places, and plan to do more of this. The thing is, for complex key types, you really need to have a custom DenseMapInfo, and that's where having a good hash function comes in.
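
For illustration, a minimal sketch of the kind of custom DenseMapInfo a complex key needs; the key type, sentinel values, and hash formula here are all hypothetical:

  #include "llvm/ADT/DenseMap.h"
  #include <stdint.h>

  struct SigKey {
    unsigned Opcode;
    const void *Operand;
  };

  namespace llvm {
  template<> struct DenseMapInfo<SigKey> {
    // Sentinel keys, assumed never to occur as real entries.
    static SigKey getEmptyKey()     { SigKey K = { ~0U, 0 };     return K; }
    static SigKey getTombstoneKey() { SigKey K = { ~0U - 1, 0 }; return K; }
    static unsigned getHashValue(const SigKey &K) {
      // This is the spot where a common, good-quality hash function
      // would go instead of an ad hoc formula.
      return K.Opcode * 37U ^ unsigned((uintptr_t)K.Operand >> 4);
    }
    static bool isEqual(const SigKey &A, const SigKey &B) {
      return A.Opcode == B.Opcode && A.Operand == B.Operand;
    }
  };
  }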

There are a bunch of hash functions out there (FNV-1, SuperFastHash, and many others). The best overall hash function that I am currently aware of is Austin Appleby's MurmurHash3 (http://code.google.com/p/smhasher/).

For LLVM's use, we want a hash function that can handle mixed data - that is, pointers, ints, strings, and so on. Most of the high-performance hash functions will work well on mixed data types, but you have to put everything into a flat buffer - that is, an array of machine words whose starting address is aligned on a machine-word boundary. The inner loops of the hash functions are designed to exploit instruction-level parallelism, and if you feed in values one at a time you can lose a lot of speed. (Although I am not an expert in this area, so feel free to correct me on this point.) On the other hand, if your input values aren't already arranged into a flat buffer, the cost of writing them to memory has to be taken into account.

Also, most of the maps in LLVM are fairly small (<1000 entries), so the speed of the hash function itself is probably more important than getting the best possible mixing of bits.

It seems that for LLVM's purposes, something that has an interface similar to FoldingSetNodeID would make for an easy transition. One approach would be to start with something very much like FoldingSetNodeID, except with a fixed-length buffer instead of a SmallVector - the idea is that when you are about to overflow, instead of allocating more space, you would compute an intermediate hash value and then start over at the beginning of the buffer.
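
A minimal sketch of that fixed-buffer idea (the names and the mixing step are illustrative, not a finished design):

  #include <stdint.h>
  #include <stddef.h>

  class IncrementalHasher {
    static const size_t BufferWords = 32;
    uint32_t Buffer[BufferWords];
    size_t Count;    // words currently buffered
    uint32_t Hash;   // running intermediate hash

    // Fold the buffered words into the running hash and reset the
    // buffer; MurmurHash2A-style mixing stands in for whatever block
    // function is ultimately chosen.
    void flush() {
      const uint32_t M = 0x5bd1e995U;
      for (size_t I = 0; I != Count; ++I) {
        uint32_t K = Buffer[I];
        K *= M; K ^= K >> 24; K *= M;
        Hash *= M; Hash ^= K;
      }
      Count = 0;
    }

  public:
    IncrementalHasher() : Count(0), Hash(0) {}

    void add(uint32_t Value) {
      if (Count == BufferWords)
        flush();   // about to overflow: fold, then start over
      Buffer[Count++] = Value;
    }

    uint32_t finish() {
      flush();
      return Hash;
    }
  };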

Another question is whether you would want to replace the hash functions in DenseMapInfo, which are designed to be efficient for very small keys - most of the high-performance hash functions have a fairly substantial fixed overhead (usually in the form of a final mixing step) and thus only make sense for larger key sizes.
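
For contrast, a sketch modeled on the kind of tiny-key hashing DenseMapInfo does today (exact in-tree constants may differ): a single multiply or a couple of shifts, with no final mixing step to amortize.

  #include <stdint.h>

  // Integer keys: one multiply, no finalization.
  static unsigned hashUnsigned(unsigned Val) {
    return Val * 37U;
  }

  // Pointer keys: two shifts and an xor.
  static unsigned hashPointer(const void *P) {
    uintptr_t V = (uintptr_t)P;
    return unsigned(V >> 4) ^ unsigned(V >> 9);
  }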

--
-- Talin

Re: We need better hashing

Talin-3
By the way, the reason I'm bringing this up is that a number of folks are currently working on optimizing the use of hash tables within LLVM's code base, and unless we can come up with a common hashing facility, there will be an increasing proliferation of cut & paste copies of hash functions. So feedback would be nice.

--
-- Talin


Re: We need better hashing

Talin-3
Here's my latest version of Hashing.h, which I propose to add to llvm/ADT. Comments welcome and encouraged.

--
-- Talin


[Attachment: Hashing.h (5K)]

Re: We need better hashing

Jay Foad-2
On 13 February 2012 00:59, Talin <[hidden email]> wrote:
> Here's my latest version of Hashing.h, which I propose to add to llvm/ADT.
> Comments welcome and encouraged.

> /// Adapted from MurmurHash2 by Austin Appleby

Just out of curiosity, why not MurmurHash3? This page seems to
suggest that #2 has some flaw, and #3 is better all round:

https://sites.google.com/site/murmurhash/

Would it be possible to use CityHash instead for strings?

http://code.google.com/p/cityhash/

Thanks,
Jay.

Re: We need better hashing

Jay Foad-2
On 13 February 2012 09:22, Jay Foad <[hidden email]> wrote:
> Would it be possible to use CityHash instead for strings?
>
> http://code.google.com/p/cityhash/

Incidentally there was talk of using CityHash for LLVM's StringMap
last year, but I don't think it ever came to anything:

http://lists.cs.uiuc.edu/pipermail/cfe-dev/2011-April/014656.html

Jay.

Re: We need better hashing

Talin-3
In reply to this post by Jay Foad-2
On Mon, Feb 13, 2012 at 1:22 AM, Jay Foad <[hidden email]> wrote:
On 13 February 2012 00:59, Talin <[hidden email]> wrote:
> Here's my latest version of Hashing.h, which I propose to add to llvm/ADT.
> Comments welcome and encouraged.

> /// Adapted from MurmurHash2 by Austin Appleby

Just out of curiosity, why not MurmurHash3? This page seems to
suggest that #2 has some flaw, and #3 is better all round:

https://sites.google.com/site/murmurhash/

The main reason is that there's no incremental version of 3. If you look at the source, you'll notice that the very first thing 3 does is Hash ^= Length, but in the incremental case we don't know the length until we're done. Also, 2 has fewer instructions per block hashed than 3; 3 also requires bit rotations, whereas 2 only uses bit shifts.
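
To make that concrete, here is a sketch of the finalization step following the published MurmurHash2A (tail-byte handling omitted): the length is folded in only at the end, which is what allows the incremental formulation.

  #include <stdint.h>

  static uint32_t murmur2aFinish(uint32_t Hash, uint32_t TotalLength) {
    const uint32_t M = 0x5bd1e995U;
    // The length, known only now, is mixed in as one last word; this
    // is the step that MurmurHash3's up-front Hash ^= Length makes
    // impossible to defer.
    uint32_t K = TotalLength;
    K *= M; K ^= K >> 24; K *= M;
    Hash *= M; Hash ^= K;
    // Final avalanche.
    Hash ^= Hash >> 13;
    Hash *= M;
    Hash ^= Hash >> 15;
    return Hash;
  }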

Bear in mind that the "flaw" in MurmurHash2 is a fairly esoteric case which (I suspect) LLVM is unlikely to ever encounter in practice. Austin is trying to develop the best possible hash for a wide range of key probability distributions, and his testing methodologies are quite strict.

LLVM's needs, on the other hand, are fairly modest. I'm guessing that most DenseMaps contain less than a few thousand entries. Even a bad hash function wouldn't hurt performance that much, and the time taken to calculate the hash is probably more of a factor in overall performance than getting the optimum distribution of hash values.

Would it be possible to use CityHash instead for strings?

http://code.google.com/p/cityhash/

OK by me. My intention is that "Hashing.h" should contain a variety of different hashing algorithms for various specific needs. Right now, I am mainly focusing on hashing of mixed data types - that is, you have some structure which contains pointers, ints, strings, and you want to calculate a hash of the entire struct. I need this because I'm working on improving the uniquing of constants and similar data structures. My next target is to improve the uniquing of MDNodes, but I want to get the hashing stuff squared away first.

It is also my intent that some person who is more of an expert in hashing than I am can do detailed performance analysis under real-world loads (such as compiling actual programs with clang) and tweak and optimize the hashing algorithm, without needing to modify the API and/or all of the places that call it.
 
Thanks,
Jay.

--
-- Talin


Re: We need better hashing

Chris Lattner-2
In reply to this post by Jay Foad-2

On Feb 13, 2012, at 1:27 AM, Jay Foad wrote:

> On 13 February 2012 09:22, Jay Foad <[hidden email]> wrote:
>> Would it be possible to use CityHash instead for strings?
>>
>> http://code.google.com/p/cityhash/
>
> Incidentally there was talk of using CityHash for LLVM's StringMap
> last year, but I don't think it ever came to anything:
>
> http://lists.cs.uiuc.edu/pipermail/cfe-dev/2011-April/014656.html

At that point, there wasn't a clear win.  CityHash (iirc) is optimized for huge strings, which aren't the usual case in the compiler AFAIK.

-Chris

Re: We need better hashing

Chris Lattner-2
In reply to this post by Talin-3
On Feb 13, 2012, at 2:00 AM, Talin wrote:
Just out of curiosity, why not MurmurHash3? This page seems to
suggest that #2 has some flaw, and #3 is better all round:

https://sites.google.com/site/murmurhash/

The main reason is that there's no incremental version of 3.

I think that that is a great reason.

LLVM's needs, on the other hand, are fairly modest. I'm guessing that most DenseMaps contain less than a few thousand entries. Even a bad hash function wouldn't hurt performance that much, and the time taken to calculate the hash is probably more of a factor in overall performance than getting the optimum distribution of hash values.

Indeed.  It can't be hard to be better than our existing ad hoc stuff :).  That said, please do not change the hash function used by StringMap without doing careful performance analysis of the clang preprocessor.  The identifier uniquing is a very hot path in "clang -E" performance.


Would it be possible to use CityHash instead for strings?

http://code.google.com/p/cityhash/

OK by me. My intention is that "Hashing.h" should contain a variety of different hashing algorithms for various specific needs. Right now, I am mainly focusing on hashing of mixed data types - that is, you have some structure which contains pointers, ints, strings, and you want to calculate a hash of the entire struct. I need this because I'm working on improving the uniquing of constants and similar data structures. My next target is to improve the uniquing of MDNodes, but I want to get the hashing stuff squared away first.

It is also my intent that some person who is more of an expert in hashing than I am can do detailed performance analysis under real-world loads (such as compiling actual programs with clang) and tweak and optimize the hashing algorithm, without needing to modify the API and/or all of the places that call it.

I think that this is a great way to stage it.  I personally care more about the interface than the implementation.  Someone can tweak and tune it after enough code is using the API to get reasonable performance numbers.


#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Compiler.h"
#include "llvm/Support/DataTypes.h"
#include "llvm/Support/PointerLikeTypeTraits.h"
#include "llvm/Support/type_traits.h"

Do you actually need all of these includes?  PointerLikeTypeTraits doesn't seem necessary.  Is type_traits?

  enum {
    BufferSize = 32,

BufferSize is dead.


 /// Add a pointer value
  template<typename T>
  void add(const T *PtrVal) {
    addImpl(
        reinterpret_cast<const uint32_t *>(&PtrVal),
        reinterpret_cast<const uint32_t *>(&PtrVal + 1));
  }

This violates TBAA rules and looks pretty dangerous to expose as public API.  Is this really needed?  Also, addImpl is dereferencing the pointers as uint32_t's, but there is nothing that guarantees that T is a multiple of 4 bytes.  The ArrayRef version has the same problem.

Though it is more verbose, I think it would be better to expose a template specialization approach to getting the hash_value of T.
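
A sketch of that shape (the trait and key names are hypothetical, not from the patch): clients opt their types in by specializing a trait, instead of the hasher reinterpret_casting arbitrary T.

  #include <stdint.h>

  template<typename T> struct hash_value_traits;   // no default: opt-in only

  template<> struct hash_value_traits<int> {
    static uint32_t get(int V) { return (uint32_t)V * 0x9e3779b9U; }
  };

  struct MyKey { int ID; const void *Ptr; };
  template<> struct hash_value_traits<MyKey> {
    static uint32_t get(const MyKey &K) {
      uint32_t H = hash_value_traits<int>::get(K.ID);
      return H * 33U ^ (uint32_t)(uintptr_t)K.Ptr;   // combine both fields
    }
  };

  // The hasher then accepts any type with a specialization:
  template<typename T> uint32_t hashOf(const T &V) {
    return hash_value_traits<T>::get(V);
  }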

  /// Add a float
  void add(float Data) {
    addImpl(
      reinterpret_cast<const uint32_t *>(&Data),
      reinterpret_cast<const uint32_t *>(&Data + 1));
  }

  /// Add a double
  void add(double Data) {
    addImpl(
      reinterpret_cast<const uint32_t *>(&Data),
      reinterpret_cast<const uint32_t *>(&Data + 1));
  }

Similarly, these need to go through a union to avoid TBAA problems.



 void add(StringRef StrVal) {
    addImpl(StrVal.data(), StrVal.size());
  }

I'm contradicting my stance above about not caring about the implementation :), but is MurmurHash a good hash for string data?  The Bernstein hash function works really well and is much cheaper to compute than Murmur.  It is used by HashString (and thus by StringMap).
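
For reference, the Bernstein hash that HashString computes is essentially this (a sketch):

  #include <stddef.h>

  static unsigned bernsteinHash(const char *S, size_t Len, unsigned R = 0) {
    for (size_t I = 0; I != Len; ++I)
      R = R * 33 + (unsigned char)S[I];   // one multiply-add per byte
    return R;
  }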

  // Add a possibly unaligned sequence of bytes.
  void addImpl(const char *I, size_t Length) {

This should probably be moved out of line to avoid code bloat.

Overall, the design of the class is making sense to me!  Thanks for tackling this!

-Chris




Re: We need better hashing

Talin-3
On Tue, Feb 14, 2012 at 2:44 AM, Chris Lattner <[hidden email]> wrote:
On Feb 13, 2012, at 2:00 AM, Talin wrote:
Just out of curiosity, why not MurmurHash3? This page seems to
suggest that #2 has some flaw, and #3 is better all round:

https://sites.google.com/site/murmurhash/

The main reason is that there's no incremental version of 3.

I think that that is a great reason.

LLVM's needs, on the other hand, are fairly modest. I'm guessing that most DenseMaps contain less than a few thousand entries. Even a bad hash function wouldn't hurt performance that much, and the time taken to calculate the hash is probably more of a factor in overall performance than getting the optimum distribution of hash values.

Indeed.  It can't be hard to be better than our existing ad hoc stuff :).  That said, please do not change the hash function used by StringMap without doing careful performance analysis of the clang preprocessor.  The identifier uniquing is a very hot path in "clang -E" performance.

I'm not planning on touching StringMap. 

Would it be possible to use CityHash instead for strings?

http://code.google.com/p/cityhash/

OK by me. My intention is that "Hashing.h" should contain a variety of different hashing algorithms for various specific needs. Right now, I am mainly focusing on hashing of mixed data types - that is, you have some structure which contains pointers, ints, strings, and you want to calculate a hash of the entire struct. I need this because I'm working on improving the uniquing of constants and similar data structures. My next target is to improve the uniquing of MDNodes, but I want to get the hashing stuff squared away first.

It is also my intent that some person who is more of an expert in hashing than I am can do detailed performance analysis under real-world loads (such as compiling actual programs with clang) and tweak and optimize the hashing algorithm, without needing to modify the API and/or all of the places that call it.

I think that this is a great way to stage it.  I personally care more about the interface than the implementation.  Someone can tweak and tune it after enough code is using the API to get reasonable performance numbers.


#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Compiler.h"
#include "llvm/Support/DataTypes.h"
#include "llvm/Support/PointerLikeTypeTraits.h"
#include "llvm/Support/type_traits.h"

Do you actually need all of these includes?  PointerLikeTypeTraits doesn't seem necessary.  Is type_traits?

Ooops, this was a cut & paste error from FoldingSet.cpp.
 
  enum {
    BufferSize = 32,

BufferSize is dead.


 /// Add a pointer value
  template<typename T>
  void add(const T *PtrVal) {
    addImpl(
        reinterpret_cast<const uint32_t *>(&PtrVal),
        reinterpret_cast<const uint32_t *>(&PtrVal + 1));
  }

This violates TBAA rules and looks pretty dangerous to expose as public API.  Is this really needed?  Also, addImpl is dereferencing the pointers as uint32_t's, but there is nothing that guarantees that T is a multiple of 4 bytes.  The ArrayRef version has the same problem.

So as far as hashing pointer values is concerned, I was just copying the code from FoldingSet. Since a lot of the keys that we're going to be dealing with are uniqued pointers, it makes sense to be able to calculate a hash of the bit-value of the pointer, rather than hashing the thing pointed to. That being said, renaming it to "addPointer" and adding a comment might be in order. Similarly, I can make the ArrayRef version 'addPointers' and have it take an ArrayRef<T*>.

Now, as to the 4 bytes issue, I think I can solve that with some clever template methods.
 
Though it is more verbose, I think it would be better to expose a template specialization approach to getting the hash_value of T.

  /// Add a float
  void add(float Data) {
    addImpl(
      reinterpret_cast<const uint32_t *>(&Data),
      reinterpret_cast<const uint32_t *>(&Data + 1));
  }

  /// Add a double
  void add(double Data) {
    addImpl(
      reinterpret_cast<const uint32_t *>(&Data),
      reinterpret_cast<const uint32_t *>(&Data + 1));
  }

Similarly, these need to go through a union to avoid TBAA problems.

I'm not sure how that works. Can you give an example?

 void add(StringRef StrVal) {
    addImpl(StrVal.data(), StrVal.size());
  }

I'm contradicting my stance above about not caring about the implementation :), but is MurmurHash a good hash for string data?  The Bernstein hash function works really well and is much cheaper to compute than Murmur.  It is used by HashString (and thus by StringMap).

So, MurmurHash is intended for blocks of arbitrary binary data, which may contain character data, integers, or whatever - it's designed to do such a thorough job of mixing the bits that it really doesn't matter what data types you feed it. You are correct that for purely string data, you'd want to use a less expensive algorithm (I'm partial to FNV-1, which is as cheap as the Bernstein hash and is AFAIK more mathematically sound.)
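
For reference, a sketch of FNV-1 with the published 32-bit parameters (note that FNV-1 multiplies before the xor, unlike FNV-1a):

  #include <stdint.h>
  #include <stddef.h>

  static uint32_t fnv1Hash(const char *S, size_t Len) {
    uint32_t H = 2166136261U;   // FNV offset basis
    for (size_t I = 0; I != Len; ++I) {
      H *= 16777619U;           // FNV prime
      H ^= (unsigned char)S[I];
    }
    return H;
  }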
 
  // Add a possibly unaligned sequence of bytes.
  void addImpl(const char *I, size_t Length) {

This should probably be moved out of line to avoid code bloat.

OK 

Overall, the design of the class is making sense to me!  Thanks for tackling this!

-Chris





--
-- Talin


Re: We need better hashing

Talin-3
On Tue, Feb 14, 2012 at 10:47 PM, Talin <[hidden email]> wrote:
So as far as hashing pointer values is concerned, I was just copying the code from FoldingSet. Since a lot of the keys that we're going to be dealing with are uniqued pointers, it makes sense to be able to calculate a hash of the bit-value of the pointer, rather than hashing the thing pointed to. That being said, renaming it to "addPointer" and adding a comment might be in order. Similarly, I can make the ArrayRef version 'addPointers' and have it take an ArrayRef<T*>.

Also, something I forgot to mention. Since it's possible to convert any single T into an ArrayRef of size 1, technically you could just have one method that accepts an ArrayRef<T>, which would work for all cases - all of the other methods are just optimizations. In which case, my question is: do you still need a union? In other words, would the following work as expected?

   void add(float Data) {
     addArray(ArrayRef<float>(Data));
   }
 
--
-- Talin


Re: We need better hashing

Jeffrey Yasskin-2
In reply to this post by Chris Lattner-2
On Tue, Feb 14, 2012 at 2:44 AM, Chris Lattner <[hidden email]> wrote:

>   /// Add a float
>   void add(float Data) {
>     addImpl(
>       reinterpret_cast<const uint32_t *>(&Data),
>       reinterpret_cast<const uint32_t *>(&Data + 1));
>   }
>
>   /// Add a double
>   void add(double Data) {
>     addImpl(
>       reinterpret_cast<const uint32_t *>(&Data),
>       reinterpret_cast<const uint32_t *>(&Data + 1));
>   }
>
> Similarly, these need to go through a union to avoid TBAA problems.

These are just wrong: they'll hash +0 and -0 to different values even
though they compare ==.
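
A standalone sketch demonstrating the point:

  #include <cassert>
  #include <cstring>
  #include <stdint.h>

  int main() {
    double A = 0.0, B = -0.0;
    assert(A == B);   // the two values compare equal...
    uint64_t IA, IB;
    std::memcpy(&IA, &A, sizeof A);
    std::memcpy(&IB, &B, sizeof B);
    // ...but their bit patterns differ in the sign bit, so a bitwise
    // hash sends them to different buckets.
    assert(IA != IB);
    return 0;
  }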

>
>  void add(StringRef StrVal) {
>     addImpl(StrVal.data(), StrVal.size());
>   }
>
> I'm contradicting my stance above about not caring about the implementation
> :), but is MurmurHash a good hash for string data?  The Bernstein hash
> function works really well and is much cheaper to compute than Murmur.

Have you seen benchmarks saying that, or are you just guessing from
the length of the code? The benchmarks I've seen say the opposite:
http://code.google.com/p/smhasher/wiki/MurmurHash3#Bulk_speed_test,_hashing_an_8-byte-aligned_256k_block
and http://code.google.com/p/cityhash/source/browse/trunk/README.


Re: We need better hashing

Jay Foad-2
In reply to this post by Chris Lattner-2
On 14 February 2012 10:44, Chris Lattner <[hidden email]> wrote:
>  /// Add a pointer value
>   template<typename T>
>   void add(const T *PtrVal) {
>     addImpl(
>         reinterpret_cast<const uint32_t *>(&PtrVal),
>         reinterpret_cast<const uint32_t *>(&PtrVal + 1));
>   }

> Also, addImpl is dereferencing the pointers as
> uint32_t's, but there is nothing that guarantees that T is a multiple of 4
> bytes.

I think you've misread the code. We're not passing PtrVal to addImpl,
we're passing &PtrVal. So the constraint is that the pointer type
"const T *" must be at least as aligned as a uint32_t, which seems
safe.

Jay.


Re: We need better hashing

Chris Lattner-2
In reply to this post by Jeffrey Yasskin-2
On Feb 15, 2012, at 12:05 AM, Jeffrey Yasskin wrote:

>>  void add(StringRef StrVal) {
>>     addImpl(StrVal.data(), StrVal.size());
>>   }
>>
>> I'm contradicting my stance above about not caring about the implementation
>> :), but is MurmurHash a good hash for string data?  The Bernstein hash
>> function works really well and is much cheaper to compute than Murmur.
>
> Have you seen benchmarks saying that, or are you just guessing from
> the length of the code? The benchmarks I've seen say the opposite:
> http://code.google.com/p/smhasher/wiki/MurmurHash3#Bulk_speed_test,_hashing_an_8-byte-aligned_256k_block
> and http://code.google.com/p/cityhash/source/browse/trunk/README.

The compiler's use case is not typically a 256K long string.  I'm speaking from experience tuning the clang preprocessor identifier lookup table.

-Chris

Re: We need better hashing

Chris Lattner-2
In reply to this post by Talin-3
On Feb 14, 2012, at 10:47 PM, Talin wrote:
 /// Add a pointer value
  template<typename T>
  void add(const T *PtrVal) {
    addImpl(
        reinterpret_cast<const uint32_t *>(&PtrVal),
        reinterpret_cast<const uint32_t *>(&PtrVal + 1));
  }

This violates TBAA rules and looks pretty dangerous to expose as public API.  Is this really needed?  Also, addImpl is dereferencing the pointers as uint32_t's, but there is nothing that guarantees that T is a multiple of 4 bytes.  The ArrayRef version has the same problem.

So as far as hashing pointer values is concerned, I was just copying the code from FoldingSet. Since a lot of the keys that we're going to be dealing with are uniqued pointers, it makes sense to be able to calculate a hash of the bit-value of the pointer, rather than hashing the thing pointed to. That being said, renaming it to "addPointer" and adding a comment might be in order. Similarly, I can make the ArrayRef version 'addPointers' and have it take an ArrayRef<T*>.

Ah, Jay was right, I misread this code!

Though it is more verbose, I think it would be better to expose a template specialization approach to getting the hash_value of T.

  /// Add a float
  void add(float Data) {
    addImpl(
      reinterpret_cast<const uint32_t *>(&Data),
      reinterpret_cast<const uint32_t *>(&Data + 1));
  }

  /// Add a double
  void add(double Data) {
    addImpl(
      reinterpret_cast<const uint32_t *>(&Data),
      reinterpret_cast<const uint32_t *>(&Data + 1));
  }

Similarly, these need to go through a union to avoid TBAA problems.

I'm not sure how that works. Can you give an example?

Just use:

void add(double Data) {
  union {
    double D;
    uint64_t I;
  };
  D = Data;
  add(I);
}


 void add(StringRef StrVal) {
    addImpl(StrVal.data(), StrVal.size());
  }

I'm contradicting my stance above about not caring about the implementation :), but is MurmurHash a good hash for string data?  The Bernstein hash function works really well and is much cheaper to compute than Murmur.  It is used by HashString (and thus by StringMap).

So, MurmurHash is intended for blocks of arbitrary binary data, which may contain character data, integers, or whatever - it's designed to do such a thorough job of mixing the bits that it really doesn't matter what data types you feed it. You are correct that for purely string data, you'd want to use a less expensive algorithm (I'm partial to FNV-1, which is as cheap as the Bernstein hash and is AFAIK more mathematically sound.)

Ok, so what's the answer? :)   We can do different things for ArrayRef<char> and StringRef.

-Chris





Re: We need better hashing

Bryce Cogswell
In reply to this post by Jeffrey Yasskin-2

On Feb 15, 2012, at 2:05 AM, Jeffrey Yasskin wrote:
On Tue, Feb 14, 2012 at 2:44 AM, Chris Lattner <[hidden email]> wrote:
I'm contradicting my stance above about not caring about the implementation
:), but is MurmurHash a good hash for string data?  The Bernstein hash
function works really well and is much cheaper to compute than Murmur.

Have you seen benchmarks saying that, or are you just guessing from
the length of the code? The benchmarks I've seen say the opposite:
http://code.google.com/p/smhasher/wiki/MurmurHash3#Bulk_speed_test,_hashing_an_8-byte-aligned_256k_block
and http://code.google.com/p/cityhash/source/browse/trunk/README.


These are the numbers I come up with, graphing time (ns) vs. string length (chars), using the current Xcode compiler, 64-bit, -O3.
Bernstein is inlined while Murmur and CityHash are not:

[graph not preserved in this archive]

Re: We need better hashing

Talin-3
In reply to this post by Chris Lattner-2
OK here's a patch with the latest, including unit tests. I've also tried to make the comments clear that this is intended for the case of "generic" key types, where you are either hashing multiple data types together, or you don't know in advance what the data type will be - for cases such as strings where a tailored algorithm is available, you should use that instead of this.

There are a couple of things I don't like: First, there are too many levels of inlined function calls - my experience is that compilers aren't as good at inlining nested calls as is often advertised; however, I couldn't figure out a way to get the same effect without the nested calls.

Similarly, the addBitsImpl helper types are a little too complex and magical for my taste, but again I couldn't think of a better way to automatically detect whether to use the aligned vs. unaligned hashing routine.


--
-- Talin


[Attachment: hashing.patch (15K)]

Re: We need better hashing

Jay Foad-2
On 17 February 2012 08:26, Talin <[hidden email]> wrote:
> +  // Helper class to detect if the input type is 4-byte aligned.
> +  template<typename T>
> +  struct is_4byte_aligned {
> +    static const bool value = (sizeof(T) & 3) == 0;
> +  };

I don't think this is safe, e.g. for:

struct {
  char c[4];
}

Jay.

Re: We need better hashing

Jay Foad-2
On 17 February 2012 08:56, Jay Foad <[hidden email]> wrote:

> On 17 February 2012 08:26, Talin <[hidden email]> wrote:
>> +  // Helper class to detect if the input type is 4-byte aligned.
>> +  template<typename T>
>> +  struct is_4byte_aligned {
>> +    static const bool value = (sizeof(T) & 3) == 0;
>> +  };
>
> I don't think this is safe, e.g. for:
>
> struct {
>  char c[4];
> }

... so you should probably use the stuff in <llvm/Support/AlignOf.h> instead!
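
A sketch of that fix (assuming the AlignOf<T> utility from llvm/Support/AlignOf.h): keying on alignment rather than size rejects struct { char c[4]; }, which is four bytes but only one-byte aligned.

  #include "llvm/Support/AlignOf.h"

  template<typename T>
  struct is_4byte_aligned {
    static const bool value = (llvm::AlignOf<T>::Alignment % 4) == 0;
  };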

Jay.


Re: We need better hashing

Chris Lattner-2
In reply to this post by Talin-3
On Feb 17, 2012, at 12:26 AM, Talin wrote:
> OK here's a patch with the latest, including unit tests. I've also tried to make the comments clear that this is intended for the case of "generic" key types, where you are either hashing multiple data types together, or you don't know in advance what the data type will be - for cases such as strings where a tailored algorithm is available, you should use that instead of this.

Makes sense.

+  /// Add a pointer value.
+  /// Note: this adds pointers to the hash using sizes and endianness that
+  /// depend on the host.  It doesn't matter however, because hashing on
+  /// pointer values is inherently unstable.

This makes perfect sense.

+  /// Add an ArrayRef of arbitrary data.
+  template<typename T>
+  GeneralHash& add(ArrayRef<T> ArrayVal) {
+    addBits(ArrayVal.begin(), ArrayVal.end());
+    return *this;
+  }

Doesn't this have the same host-specificity problem, except that it will cause things that *are* stable to vary, such as arrays of char, or is the alignment check enough?

+  /// Add a float
+  GeneralHash& add(float Data) {

It is worth adding a comment here that this does a bitwise hash, so -0.0 and +0.0 will hash to different values even though they compare equal and two identical NaN's will hash to the same value even though they compare unequal.

The mix logic is inherently a series of 32-bit operations.  Would it be possible and make more sense to implement them as 64-bit operations?  64-bit hosts are the vast majority of the machines that run a compiler these days.  OTOH, making it depend on the host brings up host instability issues.  

Actually, how much do we care about host instability here?  If this is used by hashing containers, we can just declare that iteration order is undefined even for non-pointer values.  The real problem I guess is that we'd have to go check out all the existing DenseMap's etc to make sure people aren't doing well-defined iteration right now and fixing the code.  What do you think?

> There are a couple of things I don't like: First, there are too many levels of inlined function calls - my experience is that compilers aren't as good at inlining nested calls as is often advertised; however, I couldn't figure out a way to get the same effect without the nested calls.

Is there a specific observed concern here (e.g. when built with Clang) or is this a historical problem?  Compilers have gotten much better about inlining stuff that is actually small, if Clang handles it then I think we're good.  Marking these attribute(always_inline) is massive overkill.

Overall the code is looking great.  I'd recommend checking in the new API separately from switching Constants.cpp to use it though (just so that any problems don't cause both to get reverted).

-Chris

Re: We need better hashing

Chandler Carruth-2
In reply to this post by Chris Lattner-2
Sorry for coming late to this thread. I've been very active in recent work on hashing routines inside and outside of Google, so I've a keen interest in this...

Some initial clarifications:

On Tue, Feb 14, 2012 at 2:26 AM, Chris Lattner <[hidden email]> wrote:

On Feb 13, 2012, at 1:27 AM, Jay Foad wrote:

> On 13 February 2012 09:22, Jay Foad <[hidden email]> wrote:
>> Would it be possible to use CityHash instead for strings?
>>
>> http://code.google.com/p/cityhash/
>
> Incidentally there was talk of using CityHash for LLVM's StringMap
> last year, but I don't think it ever came to anything:
>
> http://lists.cs.uiuc.edu/pipermail/cfe-dev/2011-April/014656.html

At that point, there wasn't a clear win.  CityHash (iirc) is optimized for huge strings, which aren't the usual case in the compiler AFAIK.

This isn't entirely accurate. Let me try to clarify.

For reference, I'm the one that did this. I looked at both DenseMap and StringMap quite closely. I replaced all of the hashing I could to use CityHash. The reason I did this was because collisions in the hash table showed up on the profile in a few test cases with Clang.

However, when measuring it, what I observed was that while the collisions did show up, they were still infrequent enough overall that increasing the complexity of the hash function to achieve fewer collisions was not in fact a good tradeoff. They just didn't happen *that* often.


CityHash has some specializations for larger strings to make it a reasonably good hash function, but for truly huge strings, it still isn't ideal. There are some very interesting CRC instruction based hashing strategies that are much more promising for *huge* datasets. The goal of CityHash is to be an excellent hashing function for *general* usage. That includes small strings, ints, structs, and other common keys.

That said, any really high quality hash function is going to spend more cycles-per-byte on small inputs than an large inputs because it is harder to "mix" smaller inputs sufficiently. This doesn't make CityHash bad for general usage, it's just something to be aware of.

I measured *no* performance degradation from switching to CityHash. These benchmarks as well as others led to libc++ adopting CityHash for its unordered containers. The reason I didn't push forward is that I also measured no performance improvement from it for LLVM and Clang. The reason seemed pretty clear: there aren't enough collisions for a higher-quality hashing algorithm to have much impact. LLVM's and Clang's tables are (relatively speaking) very small. That doesn't mean a bad algorithm should be kept around, but it limited the amount of time I wanted to invest.
