We need better hashing

31 messages
Re: We need better hashing

Chandler Carruth-2
Jeffrey and I are working on future standard library functionality for hashing user-defined types:


I would much rather have an interface that is close to or mirrors this one. We already have some field experience with it, and using it in LLVM and Clang would provide more. Also, it would be possible to essentially share code between such an implementation and libc++.

We looked closely at 'hasher' objects and using add methods on them and they tended to have some serious drawbacks:

1) they require some amount of "incrementality", limiting the quality and performance of the hashing algorithm
2) they require more boilerplate
3) they compose recursively less cleanly

Even given your interface, there is no actual requirement for an incremental hash. Simply store intermediate state in the object, and provide a 'finalize' step that produces the final hash code.

On Sun, Feb 12, 2012 at 4:59 PM, Talin <[hidden email]> wrote:
Here's my latest version of Hashing.h, which I propose to add to llvm/ADT. Comments welcome and encouraged.


On Thu, Feb 9, 2012 at 11:23 AM, Talin <[hidden email]> wrote:
By the way, the reason I'm bringing this up is that a number of folks are currently working on optimizing the use of hash tables within LLVM's code base, and unless we can come up with a common hashing facility, there will be an increasing proliferation of cut & paste copies of hash functions. So feedback would be nice.


On Tue, Feb 7, 2012 at 10:58 PM, Talin <[hidden email]> wrote:
LLVM currently has a bunch of different hashing algorithms scattered throughout the code base.

There's also a number of places in the code where a FoldingSetNodeID is created for the purpose of calculating a hash, and then discarded. From an efficiency standpoint, this isn't all that bad unless the number of individual items being hashed > 32, at which point the SmallVector overflows and memory is allocated.

I personally want to see a better approach to hashing because of the cleanup work I've been doing - I've been replacing std::map and FoldingSet with DenseMap in a number of places, and plan to do more of this. The thing is, for complex key types, you really need to have a custom DenseMapInfo, and that's where having a good hash function comes in.

There are a bunch of hash functions out there (FNV1, SuperFastHash, and many others). The best overall hash function that I am currently aware of is Austin Appleby's MurmurHash3 (http://code.google.com/p/smhasher/).

For LLVM's use, we want a hash function that can handle mixed data - that is, pointers, ints, strings, and so on. Most of the high-performance hash functions will work well on mixed data types, but you have to put everything into a flat buffer - that is, an array of machine words whose starting address is aligned on a machine-word boundary. The inner loops of the hash functions are designed to take advantage of parallelism of the instruction pipeline, and if you try feeding in values one at a time it's possible that you can lose a lot of speed. (Although I am not an expert in this area, so feel free to correct me on this point.) On the other hand, if your input values aren't already arranged into a flat buffer, the cost of writing them to memory has to be taken into account.

Also, most of the maps in LLVM are fairly small (<1000 entries), so the speed of the hash function itself is probably more important than getting the best possible mixing of bits.

It seems that for LLVM's purposes, something that has an interface similar to FoldingSetNodeID would make for an easy transition. One approach would be to start with something very much like FoldingSetNodeID, except with a fixed-length buffer instead of a SmallVector - the idea is that when you are about to overflow, instead of allocating more space, you would compute an intermediate hash value and then start over at the beginning of the buffer.

Another question is whether or not you would want to replace the hash functions in DenseMapInfo, which are designed to be efficient for very small keys - most of the high-performance hash functions have a fairly substantial fixed overhead (usually in the form of a final mixing step) and thus only make sense for larger key sizes.

--
-- Talin



--
-- Talin



--
-- Talin

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev




Re: We need better hashing

Chandler Carruth-2
In reply to this post by Chris Lattner-2
On Tue, Feb 14, 2012 at 2:44 AM, Chris Lattner <[hidden email]> wrote:
I'm contradicting my stance above about not caring about the implementation :), but is MurmurHash a good hash for string data?  The Bernstein hash function works really well and is much cheaper to compute than Murmur.  It is used by HashString (and thus by StringMap).

If you want a good string hashing function, CityHash is by a fair margin the best one out there. Look at the comparison done by Craig, Howard, and several others when discussing what hashing function to use for libc++.

The only downside to CityHash is code size. It is heavily tuned, and that results in several special case routines to get maximal efficiency and hash quality for short strings (yep, not just huge ones). That said, the code size increase was measured carefully for libc++ and it's really quite small.

That said, I have no benchmarks showing this matters for our uses of StringMap. Switching reduced collisions, and hashing didn't show up as a hot function, but collisions and hashing simply didn't dominate any profiles I looked at....


Re: We need better hashing

Matt Johnson-12
In reply to this post by Talin-3
On Fri, Feb 17, 2012 at 3:59 AM, Chandler Carruth <chandlerc at google.com> wrote:
On Tue, Feb 14, 2012 at 2:44 AM, Chris Lattner <clattner at apple.com> wrote:

> I'm contradicting my stance above about not caring about the
> implementation :), but is MurmurHash a good hash for string data?
>  The Bernstein hash function works really well and is much cheaper to
> compute than Murmur.  It is used by HashString (and thus by StringMap).


If you want a good string hashing function, CityHash is by a fair margin
the best one out there. Look at the comparison done by Craig, Howard, and
several others when discussing what hashing function to use for libc++.

The only downside to CityHash is code size. It is heavily tuned, and that
results in several special case routines to get maximal efficiency and hash
quality for short strings (yep, not just huge ones). That said, the code
size increase was measured carefully for libc++ and it's really quite small.
What about machines that are big endian, 32-bit, or less optimized for unaligned accesses than x86? From http://code.google.com/p/cityhash/source/browse/trunk/README:

> 1) The current version of CityHash is intended for little-endian 64-bit CPUs.
> Functions that don't use the CRC32 instruction should also work, slowly,
> in little-endian 32-bit code.

CityHash should work on big-endian CPUs as well; I ported CityHash about a year ago to a niche research architecture that is all three of the above, and found that the byteswapping and unaligned loads killed performance (the target only supports naturally aligned loads). It's quite possible that x86 is the only target we care about in terms of compile-time performance; I just wanted to see if CityHash had since been generalized to work well on less general-purpose-focused targets, or if there was something else I was overlooking when I did my initial experiments.

-Matt

Re: We need better hashing

Talin-3
In reply to this post by Chandler Carruth-2
On Fri, Feb 17, 2012 at 1:53 AM, Chandler Carruth <[hidden email]> wrote:
Jeffrey and I are working on future standard library functionality for hashing user-defined types:


I would much rather have an interface that is close to or mirrors this one. We already have some field experience with it, and using it in LLVM and Clang would provide more. Also, it would be possible to essentially share code between such an implementation and libc++.

We looked closely at 'hasher' objects and using add methods on them and they tended to have some serious drawbacks:

1) they require some amount of "incrementality", limiting the quality and performance of the hashing algorithm
2) they require more boilerplate
3) they compose recursively less cleanly

Even given your interface, there is no actual requirement for an incremental hash. Simply store intermediate state in the object, and provide a 'finalize' step that produces the final hash code.

I'm not sure I understand what you are proposing here. I don't know what you mean by "intermediate state". However, I really do need an incremental hash for the various uniquing maps which I'm attempting to optimize. Take for example the case of uniquing a constant array - the key consists of a type* pointer and an array of constant*. Those data fields are not stored contiguously in memory, so I need to hash them separately and then combine the hashes. Being able to hash the data fields in place (as opposed to copying them to a contiguous buffer) turns out to be a fairly significant win for the uniquing maps - otherwise you end up having to do a malloc just to look up a key, and that's going to be slower than any incremental hash algorithm.



--
-- Talin


Re: We need better hashing

Chandler Carruth-2
On Fri, Feb 17, 2012 at 10:58 PM, Talin <[hidden email]> wrote:
However, I really do need an incremental hash for the various uniquing maps which I'm attempting to optimize. Take for example the case of uniquing a constant array - the key consists of a type* pointer and an array of constant*. Those data fields are not stored contiguously in memory, so I need to hash them separately and then combine the hashes. Being able to hash the data fields in place (as opposed to copying them to a contiguous buffer) turns out to be a fairly significant win for the uniquing maps - otherwise you end up having to do a malloc just to look up a key, and that's going to be slower than any incremental hash algorithm.

I think you have a different idea of what 'incremental hash' means from what I do, and that's leading to the confusion, because we're talking about two very different things.

The term "incremental hash function" usually is a term of art referring to a hash function with the following property:

Given a series of bytes (or other units if you like) in 's', and using a python-like syntax:

 H(s) = H(H(s[:-1]), s[-1])

This means that you can re-use a hash for a smaller chunk of data when computing the hash for a larger chunk. I don't think this is what you're looking for, and I know that most efficient, high-quality, non-cryptographic hash functions don't satisfy this property.


What you're talking about is just being able to hash a non-contiguous set of data. That is clearly important, but there are a lot of ways to achieve it.

Most of the functions I'm referring to are essentially block based. For example, CityHash is based on a 64-byte block hashing system. While this can necessitate copying the data, it should never require a malloc. You simply fill a block, and flush it out. The block can be on the stack, and modern hashing algorithms will find it much faster to memcpy small (pointer-sized) bits of data into a single contiguous N-byte (64 in the case of CityHash) block first, so I don't think this "costs" us anything; we actually want to copy the data in, and do the hashing algorithm once.

Note that the current CityHash algorithm (especially in its current implementation) isn't really set up for this, but we've talked to its author as well as Austin when we were designing the standard interface, and they seemed very positive on being able to adapt these high-quality hashing algorithms to the proposed interface.


Re: We need better hashing

Talin-3
On Sat, Feb 18, 2012 at 12:00 AM, Chandler Carruth <[hidden email]> wrote:
On Fri, Feb 17, 2012 at 10:58 PM, Talin <[hidden email]> wrote:
However, I really do need an incremental hash for the various uniquing maps which I'm attempting to optimize. Take for example the case of uniquing a constant array - the key consists of a type* pointer and an array of constant*. Those data fields are not stored contiguously in memory, so I need to hash them separately and then combine the hashes. Being able to hash the data fields in place (as opposed to copying them to a contiguous buffer) turns out to be a fairly significant win for the uniquing maps - otherwise you end up having to do a malloc just to look up a key, and that's going to be slower than any incremental hash algorithm.

I think you have a different idea of what 'incremental hash' means from what I do, and that's leading to the confusion, because we're talking about two very different things.

The term "incremental hash function" usually is a term of art referring to a hash function with the following property:

Given a series of bytes (or other units if you like) in 's', and using a python-like syntax:

 H(s) = H(H(s[:-1]), s[-1])

This means that you can re-use a hash for a smaller chunk of data when computing the hash for a larger chunk. I don't think this is what you're looking for, and I know that most efficient, high-quality, non-cryptographic hash functions don't satisfy this property.


What you're talking about is just being able to hash a non-contiguous set of data. That is clearly important, but there are a lot of ways to achieve it.

Most of the functions I'm referring to are essentially block based. For example, CityHash is based on a 64-byte block hashing system. While this can necessitate copying the data, it should never require a malloc. You simply fill a block, and flush it out. The block can be on the stack, and modern hashing algorithms will find it much faster to memcpy small (pointer-sized) bits of data into a single contiguous N-byte (64 in the case of CityHash) block first, so I don't think this "costs" us anything; we actually want to copy the data in, and do the hashing algorithm once.

Note that the current CityHash algorithm (especially in its current implementation) isn't really set up for this, but we've talked to its author as well as Austin when we were designing the standard interface, and they seemed very positive on being able to adapt these high-quality hashing algorithms to the proposed interface.

OK. I actually coded something like this in an earlier incarnation of the patch, and at some point it kind of got optimized away. But that's unimportant - what I am primarily interested in is the interface, not the hashing function per se. (Note that the interface didn't change much when I converted from a block-based version to the current one. Which is not surprising, as the current interface is essentially isomorphic to the one used by FoldingSet.)

One of the things I'm focusing on right now is taking old container classes that were written before the advent of ArrayRef and modernizing them so that they don't do so much allocating and copying of keys. All of this hashing stuff is merely yak shaving as far as I am concerned - if someone has a better idea I'm open to it, as long as they understand what my requirements are, and the fact that my work on the containers is kind of blocked until this gets resolved.

--
-- Talin


Re: We need better hashing

Talin-3
In reply to this post by Chris Lattner-2
On Fri, Feb 17, 2012 at 1:32 AM, Chris Lattner <[hidden email]> wrote:
On Feb 17, 2012, at 12:26 AM, Talin wrote:
> OK here's a patch with the latest, including unit tests. I've also tried to make the comments clear that this is intended for the case of "generic" key types, where you are either hashing multiple data types together, or you don't know in advance what the data type will be - for cases such as strings where a tailored algorithm is available, you should use that instead of this.

Makes sense.

+  /// Add a pointer value.
+  /// Note: this adds pointers to the hash using sizes and endianness that
+  /// depend on the host.  It doesn't matter however, because hashing on
+  /// pointer values is inherently unstable.

This makes perfect sense.

It should, as the comment was copied verbatim from FoldingSetNodeID :)
 
+  /// Add an ArrayRef of arbitrary data.
+  template<typename T>
+  GeneralHash& add(ArrayRef<T> ArrayVal) {
+    addBits(ArrayVal.begin(), ArrayVal.end());
+    return *this;
+  }

Doesn't this have the same host-specificity problem, except that it will cause things that *are* stable to vary, such as arrays of char, or is the alignment check enough?

I thought about whether it would be possible to prevent people from passing in ArrayRefs with unstable things, and I came to the conclusion that there's no simple way to distinguish between stable and unstable ArrayRefs. This is why I decided not to make a special "addPointer" method for pointers, because you could easily subvert it by wrapping it in a singleton ArrayRef.

+  /// Add a float
+  GeneralHash& add(float Data) {

It is worth adding a comment here that this does a bitwise hash, so -0.0 and +0.0 will hash to different values even though they compare equal and two identical NaN's will hash to the same value even though they compare unequal.

BTW, are there in fact any maps in LLVM that use floats as keys, other than uniquing of constants? And in the latter case, would you not want to distinguish between a -0.0 and +0.0 constant? 
 
The mix logic is inherently a series of 32-bit operations.  Would it be possible and make more sense to implement them as 64-bit operations?  64-bit hosts are the vast majority of the machines that run a compiler these days.  OTOH, making it depend on the host brings up host instability issues.

Actually, how much do we care about host instability here?  If this is used by hashing containers, we can just declare that iteration order is undefined even for non-pointer values.  The real problem I guess is that we'd have to go check out all the existing DenseMap's etc to make sure people aren't doing well-defined iteration right now and fixing the code.  What do you think?

I think that you are thinking that existing uses of DenseMap and other ADT containers will be affected by this. That wasn't my plan - I was going to basically use the hashing class to create a custom DenseMapInfo for specific maps which could benefit from the optimization. Other DenseMaps would remain as they are.
 
> There's a couple of things I don't like: First, there's too many levels of inlined function calls - my experience is that compilers aren't as good at inlining nested calls as is often advertised, however I couldn't figure out a way to get the same effect without the nested calls.

Is there a specific observed concern here (e.g. when built with Clang) or is this a historical problem?  Compilers have gotten much better about inlining stuff that is actually small, if Clang handles it then I think we're good.  Marking these attribute(always_inline) is massive overkill.

Well, it is historical from about 5 years ago when I was working on EASTL. The compilers we were using at the time were gcc and MSVC. We found cases in the standard STL where inlines were nested up to 10 levels deep, and in some of those cases the compiler just gave up trying to inline things that deeply.
 
Overall the code is looking great.  I'd recommend checking in the new API separately from switching Constants.cpp to use it though (just so that any problems doesn't cause both to get reverted).

OK. I'm still working on getting a consensus from the hashing experts :)
 
-Chris

--
-- Talin


Re: We need better hashing

Chris Lattner-2
On Feb 18, 2012, at 1:11 AM, Talin wrote:
+  /// Add an ArrayRef of arbitrary data.
+  template<typename T>
+  GeneralHash& add(ArrayRef<T> ArrayVal) {
+    addBits(ArrayVal.begin(), ArrayVal.end());
+    return *this;
+  }

Doesn't this have the same host-specificity problem, except that it will cause things that *are* stable to vary, such as arrays of char, or is the alignment check enough?

I thought about whether it would be possible to prevent people from passing in ArrayRefs with unstable things, and I came to the conclusion that there's no simple way to distinguish between stable and unstable ArrayRefs. This is why I decided not to make a special "addPointer" method for pointers, because you could easily subvert it by wrapping it in a singleton ArrayRef.

Ok.

+  /// Add a float
+  GeneralHash& add(float Data) {

It is worth adding a comment here that this does a bitwise hash, so -0.0 and +0.0 will hash to different values even though they compare equal and two identical NaN's will hash to the same value even though they compare unequal.

BTW, are there in fact any maps in LLVM that use floats as keys, other than uniquing of constants? And in the latter case, would you not want to distinguish between a -0.0 and +0.0 constant?

I don't know of any.  I think that the ConstantFP uniquing code specifically has to unique things as integers to avoid problems with FP comparisons.   In any case, I'm sure that the desired behavior is client specific - adding the comment just makes it obvious what the semantics are.  

   Actually, how much do we care about host instability here?  If this is used by hashing containers, we can just declare that iteration order is undefined even for non-pointer values.  The real problem I guess is that we'd have to go check out all the existing DenseMap's etc to make sure people aren't doing well-defined iteration right now and fixing the code.  What do you think?

I think that you are thinking that existing uses of DenseMap and other ADT containers will be affected by this. That wasn't my plan - I was going to basically use the hashing class to create a custom DenseMapInfo for specific maps which could benefit from the optimization. Other DenseMaps would remain as they are.

Ok.


Is there a specific observed concern here (e.g. when built with Clang) or is this a historical problem?  Compilers have gotten much better about inlining stuff that is actually small, if Clang handles it then I think we're good.  Marking these attribute(always_inline) is massive overkill.

Well, it is historical from about 5 years ago when I was working on EASTL. The compilers we were using at the time were gcc and MSVC. We found cases in the standard STL where inlines were nested up to 10 levels deep, and in some of those cases the compiler just gave up trying to inline things that deeply.

Ok, LLVM uses a completely different (bottom-up, incremental, simplifying as it goes) approach to inlining.  I wouldn't worry about it.  Please just remove the always inline markers and if it is a measured performance problem later we can add them back.

 
Overall the code is looking great.  I'd recommend checking in the new API separately from switching Constants.cpp to use it though (just so that any problems doesn't cause both to get reverted).

OK. I'm still working on getting a consensus from the hashing experts :)

My advice is to check in what you have to make forward progress.  If people want to reshave your yak into another hairdo, they can do that at some later time.  No reason to block your progress as long as the API is good.

Thanks again for working on this!

-Chris



Re: We need better hashing

Talin-3
OK committed. Let the shaving begin! :)


--
-- Talin


Re: We need better hashing

Chandler Carruth-2
In reply to this post by Chris Lattner-2
On Sat, Feb 18, 2012 at 3:20 AM, Chris Lattner <[hidden email]> wrote:
My advise is to check in when you have to make forward progress.  If people want to reshave your yak into another hairdo, then then can do that at some later time.  No reason to block your progress as long as the API is good.

I was trying to give feedback on the API, and specifically suggest an alternative that might be both easier to use, and naturally dovetail with an upcoming standard. =/ Is there no interest in this? I'm happy to contribute an implementation using this API (we'll need to implement it anyways), but I'm actually not interested in just shaving yaks. I'm interested in getting a really good API here, because this is the support library that is usually held to a very high bar for APIs...


Re: We need better hashing

Chris Lattner-2

On Feb 18, 2012, at 1:26 PM, Chandler Carruth wrote:

On Sat, Feb 18, 2012 at 3:20 AM, Chris Lattner <[hidden email]> wrote:
My advise is to check in when you have to make forward progress.  If people want to reshave your yak into another hairdo, then then can do that at some later time.  No reason to block your progress as long as the API is good.

I was trying to give feedback on the API, and specifically suggest an alternative that might be both easier to use, and naturally dovetail with an upcoming standard. =/ Is there no interest in this? I'm happy to contribute an implementation using this API (we'll need to implement it anyways), but I'm actually not interested in just shaving yaks. I'm interested in getting a really good API here, because this is the support library that is usually held to a very high bar for APIs...

Which API?

-Chris
