design for an accurate ODR-checker with clang

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
16 messages Options
Reply | Threaded
Open this post in threaded view
|

design for an accurate ODR-checker with clang

Nick Lewycky-2
Hi! A few of us over at Google think a nice feature in clang would be ODR violation checking, and we thought for a while about how to do this and wrote it down, but we aren't actively working on it at the moment nor plan to in the near future. I'm posting this to share our design and hopefully save anyone else the design work if they're interested in it.

For some background, C++'s ODR rule roughly means that two definitions of the same symbol must come from "the same tokens with the same interpretation". Given the same token stream, the interpretation can be different due to different name lookup results, or different types through typedefs or using declarations, or due to a different point of instantiation in two translation units.

Unlike existing approaches (the ODR checker in the gold linker for example), clang lets us do this with no false positives and very few false negatives. The basis of the idea is that we produce a hash of all the ODR-relevant pieces, and to try to pick the largest possible granularity. By granularity I mean that we would hash the entire definition of a class including all methods defined lexically inline and emit a single value for that class.

The first step is to build a new visitor over the clang AST that calculates a hash of the ODR-relevant pieces of the code. (StmtProfiler doesn’t work here because it includes pointers addresses which will be different across different translation units.) Hash the outermost declaration with external-linkage. For example, given a class with a method defined inline, we start the visitor at the class, not at the method. The entirety of the class must be ODR-equivalent across two translation units, including any inline methods.

Although the standard mentions that the tokens must be the same, we do not actually include the tokens in the hash. The structure of the AST includes everything about the code which is semantically relevant. Any false positives that would be fixed by hashing the tokens either do not impact the behaviour of the program or could be fixed by hashing more of the AST. References to globals should be hashed by name, but references to locals should be hashed by an ordinal number.

Instantiated templates are also visited by the hashing visitor. If we did not, we would have false negatives where the code is not conforming due to different points of instantiation in two translation units. We can skip uninstantiated templates since they don’t affect the behaviour of the program, and we need to visit the instantiations regardless.

In LLVM IR, create a new named metadata node !llvm.odr_checking which contains a list of <mangled name, hash value> pairs. The names do not necessarily correspond to symbols, for instance, a class will have a hash value but does not have a corresponding symbol. For ease of implementation, names should be mangled per the C++ Itanium ABI (demanglable with c++filt -t). Merging modules that contain these will need to do ODR checking as part of that link, and the resulting module will have the union of these tables.

In the .o file, emit a sorted table of <mangled name, hash value> in a non-loadable section intended to be read by the linker. All entries in the table must be checked if any symbol from this .o file is involved in the link (note that there is no mapping from symbol to odr table name). If two .o files contain different hash values for the same name, we have detected an ODR violation and issue a diagnostic.

Finally, teach the loader (RuntimeDyld) to do verification and catch ODR violations when dlopen'ing a shared library.

Nick


_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: [cfe-dev] design for an accurate ODR-checker with clang

John McCall-2
On Jul 11, 2013, at 5:45 PM, Nick Lewycky <[hidden email]> wrote:

> Hi! A few of us over at Google think a nice feature in clang would be ODR violation checking, and we thought for a while about how to do this and wrote it down, but we aren't actively working on it at the moment nor plan to in the near future. I'm posting this to share our design and hopefully save anyone else the design work if they're interested in it.
>
> For some background, C++'s ODR rule roughly means that two definitions of the same symbol must come from "the same tokens with the same interpretation". Given the same token stream, the interpretation can be different due to different name lookup results, or different types through typedefs or using declarations, or due to a different point of instantiation in two translation units.
>
> Unlike existing approaches (the ODR checker in the gold linker for example), clang lets us do this with no false positives and very few false negatives. The basis of the idea is that we produce a hash of all the ODR-relevant pieces, and to try to pick the largest possible granularity. By granularity I mean that we would hash the entire definition of a class including all methods defined lexically inline and emit a single value for that class.
>
> The first step is to build a new visitor over the clang AST that calculates a hash of the ODR-relevant pieces of the code. (StmtProfiler doesn’t work here because it includes pointers addresses which will be different across different translation units.) Hash the outermost declaration with external-linkage. For example, given a class with a method defined inline, we start the visitor at the class, not at the method. The entirety of the class must be ODR-equivalent across two translation units, including any inline methods.
>
> Although the standard mentions that the tokens must be the same, we do not actually include the tokens in the hash. The structure of the AST includes everything about the code which is semantically relevant. Any false positives that would be fixed by hashing the tokens either do not impact the behaviour of the program or could be fixed by hashing more of the AST. References to globals should be hashed by name, but references to locals should be hashed by an ordinal number.
>
> Instantiated templates are also visited by the hashing visitor. If we did not, we would have false negatives where the code is not conforming due to different points of instantiation in two translation units. We can skip uninstantiated templates since they don’t affect the behaviour of the program, and we need to visit the instantiations regardless.
>
> In LLVM IR, create a new named metadata node !llvm.odr_checking which contains a list of <mangled name, hash value> pairs. The names do not necessarily correspond to symbols, for instance, a class will have a hash value but does not have a corresponding symbol. For ease of implementation, names should be mangled per the C++ Itanium ABI (demanglable with c++filt -t). Merging modules that contain these will need to do ODR checking as part of that link, and the resulting module will have the union of these tables.
>
> In the .o file, emit a sorted table of <mangled name, hash value> in a non-loadable section intended to be read by the linker. All entries in the table must be checked if any symbol from this .o file is involved in the link (note that there is no mapping from symbol to odr table name). If two .o files contain different hash values for the same name, we have detected an ODR violation and issue a diagnostic.
>
> Finally, teach the loader (RuntimeDyld) to do verification and catch ODR violations when dlopen'ing a shared library.

This is the right basic design, but I'm curious why you're suggesting that the payload should just be a hash instead of an arbitrary string.  This isn't going to be performant enough to do unconditionally at every load no matter how much you shrink it.

Also, you should have something analogous to symbol visibility as a way to tell the static linker that something only needs to be ODR-checked within a linkage unit.  It would be informed by actual symbol visibility, of course.

You should expect that there may be multiple hashing schemes (or versions thereof) in play and therefore build a simple prefixing scheme on your ODR symbols.

John.


_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: [cfe-dev] design for an accurate ODR-checker with clang

Nick Lewycky-2
On 11 July 2013 18:02, John McCall <[hidden email]> wrote:
On Jul 11, 2013, at 5:45 PM, Nick Lewycky <[hidden email]> wrote:
> Hi! A few of us over at Google think a nice feature in clang would be ODR violation checking, and we thought for a while about how to do this and wrote it down, but we aren't actively working on it at the moment nor plan to in the near future. I'm posting this to share our design and hopefully save anyone else the design work if they're interested in it.
>
> For some background, C++'s ODR rule roughly means that two definitions of the same symbol must come from "the same tokens with the same interpretation". Given the same token stream, the interpretation can be different due to different name lookup results, or different types through typedefs or using declarations, or due to a different point of instantiation in two translation units.
>
> Unlike existing approaches (the ODR checker in the gold linker for example), clang lets us do this with no false positives and very few false negatives. The basis of the idea is that we produce a hash of all the ODR-relevant pieces, and to try to pick the largest possible granularity. By granularity I mean that we would hash the entire definition of a class including all methods defined lexically inline and emit a single value for that class.
>
> The first step is to build a new visitor over the clang AST that calculates a hash of the ODR-relevant pieces of the code. (StmtProfiler doesn’t work here because it includes pointers addresses which will be different across different translation units.) Hash the outermost declaration with external-linkage. For example, given a class with a method defined inline, we start the visitor at the class, not at the method. The entirety of the class must be ODR-equivalent across two translation units, including any inline methods.
>
> Although the standard mentions that the tokens must be the same, we do not actually include the tokens in the hash. The structure of the AST includes everything about the code which is semantically relevant. Any false positives that would be fixed by hashing the tokens either do not impact the behaviour of the program or could be fixed by hashing more of the AST. References to globals should be hashed by name, but references to locals should be hashed by an ordinal number.
>
> Instantiated templates are also visited by the hashing visitor. If we did not, we would have false negatives where the code is not conforming due to different points of instantiation in two translation units. We can skip uninstantiated templates since they don’t affect the behaviour of the program, and we need to visit the instantiations regardless.
>
> In LLVM IR, create a new named metadata node !llvm.odr_checking which contains a list of <mangled name, hash value> pairs. The names do not necessarily correspond to symbols, for instance, a class will have a hash value but does not have a corresponding symbol. For ease of implementation, names should be mangled per the C++ Itanium ABI (demanglable with c++filt -t). Merging modules that contain these will need to do ODR checking as part of that link, and the resulting module will have the union of these tables.
>
> In the .o file, emit a sorted table of <mangled name, hash value> in a non-loadable section intended to be read by the linker. All entries in the table must be checked if any symbol from this .o file is involved in the link (note that there is no mapping from symbol to odr table name). If two .o files contain different hash values for the same name, we have detected an ODR violation and issue a diagnostic.
>
> Finally, teach the loader (RuntimeDyld) to do verification and catch ODR violations when dlopen'ing a shared library.

This is the right basic design, but I'm curious why you're suggesting that the payload should just be a hash instead of an arbitrary string.

What are you suggesting goes into this string?
 
This isn't going to be performant enough to do unconditionally at every load no matter how much you shrink it.

Every load of a shared object? That's not a fast operation even without odr checking, but the idea is to keep the total number of entries in the odr table small. It's less than the number of symbols, closer to the number of top-level decls.

I feel maybe I'm not understanding what you meant here?

Also, you should have something analogous to symbol visibility as a way to tell the static linker that something only needs to be ODR-checked within a linkage unit.  It would be informed by actual symbol visibility, of course.

Great point, and that needs to flow into the .o files as well. If a class has one visibility and its method has another, we want to skip the method when hashing the class, and need to emit an additional entry for the method alone? Is that right?

You should expect that there may be multiple hashing schemes (or versions thereof) in play and therefore build a simple prefixing scheme on your ODR symbols.

We could put the choice of hashing algorithm in the name of the llvm named metadata node, and in the name of the section in the .o files.

Nick


_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: [cfe-dev] design for an accurate ODR-checker with clang

John McCall-2
On Jul 11, 2013, at 6:13 PM, Nick Lewycky <[hidden email]> wrote:
On 11 July 2013 18:02, John McCall <[hidden email]> wrote:
On Jul 11, 2013, at 5:45 PM, Nick Lewycky <[hidden email]> wrote:
> Hi! A few of us over at Google think a nice feature in clang would be ODR violation checking, and we thought for a while about how to do this and wrote it down, but we aren't actively working on it at the moment nor plan to in the near future. I'm posting this to share our design and hopefully save anyone else the design work if they're interested in it.
>
> For some background, C++'s ODR rule roughly means that two definitions of the same symbol must come from "the same tokens with the same interpretation". Given the same token stream, the interpretation can be different due to different name lookup results, or different types through typedefs or using declarations, or due to a different point of instantiation in two translation units.
>
> Unlike existing approaches (the ODR checker in the gold linker for example), clang lets us do this with no false positives and very few false negatives. The basis of the idea is that we produce a hash of all the ODR-relevant pieces, and to try to pick the largest possible granularity. By granularity I mean that we would hash the entire definition of a class including all methods defined lexically inline and emit a single value for that class.
>
> The first step is to build a new visitor over the clang AST that calculates a hash of the ODR-relevant pieces of the code. (StmtProfiler doesn’t work here because it includes pointers addresses which will be different across different translation units.) Hash the outermost declaration with external-linkage. For example, given a class with a method defined inline, we start the visitor at the class, not at the method. The entirety of the class must be ODR-equivalent across two translation units, including any inline methods.
>
> Although the standard mentions that the tokens must be the same, we do not actually include the tokens in the hash. The structure of the AST includes everything about the code which is semantically relevant. Any false positives that would be fixed by hashing the tokens either do not impact the behaviour of the program or could be fixed by hashing more of the AST. References to globals should be hashed by name, but references to locals should be hashed by an ordinal number.
>
> Instantiated templates are also visited by the hashing visitor. If we did not, we would have false negatives where the code is not conforming due to different points of instantiation in two translation units. We can skip uninstantiated templates since they don’t affect the behaviour of the program, and we need to visit the instantiations regardless.
>
> In LLVM IR, create a new named metadata node !llvm.odr_checking which contains a list of <mangled name, hash value> pairs. The names do not necessarily correspond to symbols, for instance, a class will have a hash value but does not have a corresponding symbol. For ease of implementation, names should be mangled per the C++ Itanium ABI (demanglable with c++filt -t). Merging modules that contain these will need to do ODR checking as part of that link, and the resulting module will have the union of these tables.
>
> In the .o file, emit a sorted table of <mangled name, hash value> in a non-loadable section intended to be read by the linker. All entries in the table must be checked if any symbol from this .o file is involved in the link (note that there is no mapping from symbol to odr table name). If two .o files contain different hash values for the same name, we have detected an ODR violation and issue a diagnostic.
>
> Finally, teach the loader (RuntimeDyld) to do verification and catch ODR violations when dlopen'ing a shared library.

This is the right basic design, but I'm curious why you're suggesting that the payload should just be a hash instead of an arbitrary string.

What are you suggesting goes into this string?

The same sorts of things that you were planning on hashing, but maybe not hashed.  It's up to you; having a full string would let you actually show a useful error message, but it definitely inflates binary sizes.  If you really think you can make this performant enough to do on every load, I can see how the latter would be important.

This isn't going to be performant enough to do unconditionally at every load no matter how much you shrink it.

Every load of a shared object? That's not a fast operation even without odr checking, but the idea is to keep the total number of entries in the odr table small. It's less than the number of symbols, closer to the number of top-level decls.

Your ABI dependencies are every declaration *that you ever rely on*.  You've got to figure that that's going to be very large.  For a library of any significance, I'd be expecting this check to touch about half a megabyte of data, even with a 32-bit hash and some sort of clever prefixing scheme on the symbols.  That's a pretty major regression in library loading.

Also, you should have something analogous to symbol visibility as a way to tell the static linker that something only needs to be ODR-checked within a linkage unit.  It would be informed by actual symbol visibility, of course.

Great point, and that needs to flow into the .o files as well. If a class has one visibility and its method has another, we want to skip the method when hashing the class, and need to emit an additional entry for the method alone? Is that right?

Class hashes should probably only include virtual methods anyway, but yes, I think this is a good starting point.

What do you want in the hash for a function anyway?  Almost everything is already captured by (1) the separate hashes for the nominal types mentioned and (2) the symbol mangling.  You're pretty much only missing the return type.  Oh, I guess you need the body's dependencies for inline functions.

You should expect that there may be multiple hashing schemes (or versions thereof) in play and therefore build a simple prefixing scheme on your ODR symbols.

We could put the choice of hashing algorithm in the name of the llvm named metadata node, and in the name of the section in the .o files.

A header on the section sounds good.

John.

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: [cfe-dev] design for an accurate ODR-checker with clang

Richard Smith-33
On Mon, Jul 15, 2013 at 3:12 PM, John McCall <[hidden email]> wrote:
On Jul 11, 2013, at 6:13 PM, Nick Lewycky <[hidden email]> wrote:
On 11 July 2013 18:02, John McCall <[hidden email]> wrote:
On Jul 11, 2013, at 5:45 PM, Nick Lewycky <[hidden email]> wrote:
> Hi! A few of us over at Google think a nice feature in clang would be ODR violation checking, and we thought for a while about how to do this and wrote it down, but we aren't actively working on it at the moment nor plan to in the near future. I'm posting this to share our design and hopefully save anyone else the design work if they're interested in it.
>
> For some background, C++'s ODR rule roughly means that two definitions of the same symbol must come from "the same tokens with the same interpretation". Given the same token stream, the interpretation can be different due to different name lookup results, or different types through typedefs or using declarations, or due to a different point of instantiation in two translation units.
>
> Unlike existing approaches (the ODR checker in the gold linker for example), clang lets us do this with no false positives and very few false negatives. The basis of the idea is that we produce a hash of all the ODR-relevant pieces, and to try to pick the largest possible granularity. By granularity I mean that we would hash the entire definition of a class including all methods defined lexically inline and emit a single value for that class.
>
> The first step is to build a new visitor over the clang AST that calculates a hash of the ODR-relevant pieces of the code. (StmtProfiler doesn’t work here because it includes pointers addresses which will be different across different translation units.) Hash the outermost declaration with external-linkage. For example, given a class with a method defined inline, we start the visitor at the class, not at the method. The entirety of the class must be ODR-equivalent across two translation units, including any inline methods.
>
> Although the standard mentions that the tokens must be the same, we do not actually include the tokens in the hash. The structure of the AST includes everything about the code which is semantically relevant. Any false positives that would be fixed by hashing the tokens either do not impact the behaviour of the program or could be fixed by hashing more of the AST. References to globals should be hashed by name, but references to locals should be hashed by an ordinal number.
>
> Instantiated templates are also visited by the hashing visitor. If we did not, we would have false negatives where the code is not conforming due to different points of instantiation in two translation units. We can skip uninstantiated templates since they don’t affect the behaviour of the program, and we need to visit the instantiations regardless.
>
> In LLVM IR, create a new named metadata node !llvm.odr_checking which contains a list of <mangled name, hash value> pairs. The names do not necessarily correspond to symbols, for instance, a class will have a hash value but does not have a corresponding symbol. For ease of implementation, names should be mangled per the C++ Itanium ABI (demanglable with c++filt -t). Merging modules that contain these will need to do ODR checking as part of that link, and the resulting module will have the union of these tables.
>
> In the .o file, emit a sorted table of <mangled name, hash value> in a non-loadable section intended to be read by the linker. All entries in the table must be checked if any symbol from this .o file is involved in the link (note that there is no mapping from symbol to odr table name). If two .o files contain different hash values for the same name, we have detected an ODR violation and issue a diagnostic.
>
> Finally, teach the loader (RuntimeDyld) to do verification and catch ODR violations when dlopen'ing a shared library.

This is the right basic design, but I'm curious why you're suggesting that the payload should just be a hash instead of an arbitrary string.

What are you suggesting goes into this string?

The same sorts of things that you were planning on hashing, but maybe not hashed.  It's up to you; having a full string would let you actually show a useful error message, but it definitely inflates binary sizes.  If you really think you can make this performant enough to do on every load, I can see how the latter would be important.

This isn't going to be performant enough to do unconditionally at every load no matter how much you shrink it.

Every load of a shared object? That's not a fast operation even without odr checking, but the idea is to keep the total number of entries in the odr table small. It's less than the number of symbols, closer to the number of top-level decls.

Your ABI dependencies are every declaration *that you ever rely on*.  You've got to figure that that's going to be very large.  For a library of any significance, I'd be expecting this check to touch about half a megabyte of data, even with a 32-bit hash and some sort of clever prefixing scheme on the symbols.  That's a pretty major regression in library loading.

Also, you should have something analogous to symbol visibility as a way to tell the static linker that something only needs to be ODR-checked within a linkage unit.  It would be informed by actual symbol visibility, of course.

Great point, and that needs to flow into the .o files as well. If a class has one visibility and its method has another, we want to skip the method when hashing the class, and need to emit an additional entry for the method alone? Is that right?

Class hashes should probably only include virtual methods anyway, but yes, I think this is a good starting point.

What do you want in the hash for a function anyway?  Almost everything is already captured by (1) the separate hashes for the nominal types mentioned and (2) the symbol mangling.  You're pretty much only missing the return type.  Oh, I guess you need the body's dependencies for inline functions.

We want to enforce the C++ ODR as much as is reasonably possible, so we want to include the body for both classes and functions. That is, we explicitly want to check for cases where two functions or classes happen to have the same declaration but different definitions.

(Perhaps this also clarifies why we want a hash: an unhashed string would contain as much entropy as the entirety of the source code...)

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: [cfe-dev] design for an accurate ODR-checker with clang

John McCall-2
On Jul 15, 2013, at 3:20 PM, Richard Smith <[hidden email]> wrote:
On Mon, Jul 15, 2013 at 3:12 PM, John McCall <[hidden email]> wrote:
On Jul 11, 2013, at 6:13 PM, Nick Lewycky <[hidden email]> wrote:
On 11 July 2013 18:02, John McCall <[hidden email]> wrote:
On Jul 11, 2013, at 5:45 PM, Nick Lewycky <[hidden email]> wrote:
> Hi! A few of us over at Google think a nice feature in clang would be ODR violation checking, and we thought for a while about how to do this and wrote it down, but we aren't actively working on it at the moment nor plan to in the near future. I'm posting this to share our design and hopefully save anyone else the design work if they're interested in it.
>
> For some background, C++'s ODR rule roughly means that two definitions of the same symbol must come from "the same tokens with the same interpretation". Given the same token stream, the interpretation can be different due to different name lookup results, or different types through typedefs or using declarations, or due to a different point of instantiation in two translation units.
>
> Unlike existing approaches (the ODR checker in the gold linker for example), clang lets us do this with no false positives and very few false negatives. The basis of the idea is that we produce a hash of all the ODR-relevant pieces, and to try to pick the largest possible granularity. By granularity I mean that we would hash the entire definition of a class including all methods defined lexically inline and emit a single value for that class.
>
> The first step is to build a new visitor over the clang AST that calculates a hash of the ODR-relevant pieces of the code. (StmtProfiler doesn’t work here because it includes pointers addresses which will be different across different translation units.) Hash the outermost declaration with external-linkage. For example, given a class with a method defined inline, we start the visitor at the class, not at the method. The entirety of the class must be ODR-equivalent across two translation units, including any inline methods.
>
> Although the standard mentions that the tokens must be the same, we do not actually include the tokens in the hash. The structure of the AST includes everything about the code which is semantically relevant. Any false positives that would be fixed by hashing the tokens either do not impact the behaviour of the program or could be fixed by hashing more of the AST. References to globals should be hashed by name, but references to locals should be hashed by an ordinal number.
>
> Instantiated templates are also visited by the hashing visitor. If we did not, we would have false negatives where the code is not conforming due to different points of instantiation in two translation units. We can skip uninstantiated templates since they don’t affect the behaviour of the program, and we need to visit the instantiations regardless.
>
> In LLVM IR, create a new named metadata node !llvm.odr_checking which contains a list of <mangled name, hash value> pairs. The names do not necessarily correspond to symbols, for instance, a class will have a hash value but does not have a corresponding symbol. For ease of implementation, names should be mangled per the C++ Itanium ABI (demanglable with c++filt -t). Merging modules that contain these will need to do ODR checking as part of that link, and the resulting module will have the union of these tables.
>
> In the .o file, emit a sorted table of <mangled name, hash value> in a non-loadable section intended to be read by the linker. All entries in the table must be checked if any symbol from this .o file is involved in the link (note that there is no mapping from symbol to odr table name). If two .o files contain different hash values for the same name, we have detected an ODR violation and issue a diagnostic.
>
> Finally, teach the loader (RuntimeDyld) to do verification and catch ODR violations when dlopen'ing a shared library.

This is the right basic design, but I'm curious why you're suggesting that the payload should just be a hash instead of an arbitrary string.

What are you suggesting goes into this string?

The same sorts of things that you were planning on hashing, but maybe not hashed.  It's up to you; having a full string would let you actually show a useful error message, but it definitely inflates binary sizes.  If you really think you can make this performant enough to do on every load, I can see how the latter would be important.

This isn't going to be performant enough to do unconditionally at every load no matter how much you shrink it.

Every load of a shared object? That's not a fast operation even without odr checking, but the idea is to keep the total number of entries in the odr table small. It's less than the number of symbols, closer to the number of top-level decls.

Your ABI dependencies are every declaration *that you ever rely on*.  You've got to figure that that's going to be very large.  For a library of any significance, I'd be expecting this check to touch about half a megabyte of data, even with a 32-bit hash and some sort of clever prefixing scheme on the symbols.  That's a pretty major regression in library loading.

Also, you should have something analogous to symbol visibility as a way to tell the static linker that something only needs to be ODR-checked within a linkage unit.  It would be informed by actual symbol visibility, of course.

Great point, and that needs to flow into the .o files as well. If a class has one visibility and its method has another, we want to skip the method when hashing the class, and need to emit an additional entry for the method alone? Is that right?

Class hashes should probably only include virtual methods anyway, but yes, I think this is a good starting point.

What do you want in the hash for a function anyway?  Almost everything is already captured by (1) the separate hashes for the nominal types mentioned and (2) the symbol mangling.  You're pretty much only missing the return type.  Oh, I guess you need the body's dependencies for inline functions.

We want to enforce the C++ ODR as much as is reasonably possible, so we want to include the body for both classes and functions. That is, we explicitly want to check for cases where two functions or classes happen to have the same declaration but different definitions.

Mmm.  So you want to warn the user that two libraries using different assertion settings both use the standard library?

I think warning about actual differences in code, as opposed to differences in type/vtable layout, is going to be pretty fraught with uninteresting positives, but if you want to chase that rabbit, it's your time spent.

Anyway, you only need to hash in function bodies for inline functions unless this is also an ELF abuse dectector.  (*Whether* a function is inline seems like a legitimate thing to hash for the function signature.)

John.

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: [cfe-dev] design for an accurate ODR-checker with clang

JF Bastien
> Mmm.  So you want to warn the user that two libraries using different
> assertion settings both use the standard library?
>
> I think warning about actual differences in code, as opposed to differences
> in type/vtable layout, is going to be pretty fraught with uninteresting
> positives, but if you want to chase that rabbit, it's your time spent.

It's probably desirable to choose to detect ODR violations on classes
or functions independently, although detecting calling convention
differences in functions (without looking at the rest of the code)
could also be useful. I've debugged enough debug+release mixing issues
shared-libraries to feel that pain.
_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: [cfe-dev] design for an accurate ODR-checker with clang

Richard Smith-33
In reply to this post by John McCall-2
On Mon, Jul 15, 2013 at 3:42 PM, John McCall <[hidden email]> wrote:
On Jul 15, 2013, at 3:20 PM, Richard Smith <[hidden email]> wrote:
On Mon, Jul 15, 2013 at 3:12 PM, John McCall <[hidden email]> wrote:
On Jul 11, 2013, at 6:13 PM, Nick Lewycky <[hidden email]> wrote:
On 11 July 2013 18:02, John McCall <[hidden email]> wrote:
On Jul 11, 2013, at 5:45 PM, Nick Lewycky <[hidden email]> wrote:
> Hi! A few of us over at Google think a nice feature in clang would be ODR violation checking, and we thought for a while about how to do this and wrote it down, but we aren't actively working on it at the moment nor plan to in the near future. I'm posting this to share our design and hopefully save anyone else the design work if they're interested in it.
>
> For some background, C++'s ODR rule roughly means that two definitions of the same symbol must come from "the same tokens with the same interpretation". Given the same token stream, the interpretation can be different due to different name lookup results, or different types through typedefs or using declarations, or due to a different point of instantiation in two translation units.
>
> Unlike existing approaches (the ODR checker in the gold linker for example), clang lets us do this with no false positives and very few false negatives. The basis of the idea is that we produce a hash of all the ODR-relevant pieces, and to try to pick the largest possible granularity. By granularity I mean that we would hash the entire definition of a class including all methods defined lexically inline and emit a single value for that class.
>
> The first step is to build a new visitor over the clang AST that calculates a hash of the ODR-relevant pieces of the code. (StmtProfiler doesn’t work here because it includes pointers addresses which will be different across different translation units.) Hash the outermost declaration with external-linkage. For example, given a class with a method defined inline, we start the visitor at the class, not at the method. The entirety of the class must be ODR-equivalent across two translation units, including any inline methods.
>
> Although the standard mentions that the tokens must be the same, we do not actually include the tokens in the hash. The structure of the AST includes everything about the code which is semantically relevant. Any false positives that would be fixed by hashing the tokens either do not impact the behaviour of the program or could be fixed by hashing more of the AST. References to globals should be hashed by name, but references to locals should be hashed by an ordinal number.
>
> Instantiated templates are also visited by the hashing visitor. If we did not, we would have false negatives where the code is not conforming due to different points of instantiation in two translation units. We can skip uninstantiated templates since they don’t affect the behaviour of the program, and we need to visit the instantiations regardless.
>
> In LLVM IR, create a new named metadata node !llvm.odr_checking which contains a list of <mangled name, hash value> pairs. The names do not necessarily correspond to symbols, for instance, a class will have a hash value but does not have a corresponding symbol. For ease of implementation, names should be mangled per the C++ Itanium ABI (demanglable with c++filt -t). Merging modules that contain these will need to do ODR checking as part of that link, and the resulting module will have the union of these tables.
>
> In the .o file, emit a sorted table of <mangled name, hash value> in a non-loadable section intended to be read by the linker. All entries in the table must be checked if any symbol from this .o file is involved in the link (note that there is no mapping from symbol to odr table name). If two .o files contain different hash values for the same name, we have detected an ODR violation and issue a diagnostic.
>
> Finally, teach the loader (RuntimeDyld) to do verification and catch ODR violations when dlopen'ing a shared library.

This is the right basic design, but I'm curious why you're suggesting that the payload should just be a hash instead of an arbitrary string.

What are you suggesting goes into this string?

The same sorts of things that you were planning on hashing, but maybe not hashed.  It's up to you; having a full string would let you actually show a useful error message, but it definitely inflates binary sizes.  If you really think you can make this performant enough to do on every load, I can see how the latter would be important.

This isn't going to be performant enough to do unconditionally at every load no matter how much you shrink it.

Every load of a shared object? That's not a fast operation even without odr checking, but the idea is to keep the total number of entries in the odr table small. It's less than the number of symbols, closer to the number of top-level decls.

Your ABI dependencies are every declaration *that you ever rely on*.  You've got to figure that that's going to be very large.  For a library of any significance, I'd be expecting this check to touch about half a megabyte of data, even with a 32-bit hash and some sort of clever prefixing scheme on the symbols.  That's a pretty major regression in library loading.

Also, you should have something analogous to symbol visibility as a way to tell the static linker that something only needs to be ODR-checked within a linkage unit.  It would be informed by actual symbol visibility, of course.

Great point, and that needs to flow into the .o files as well. If a class has one visibility and its method has another, we want to skip the method when hashing the class, and need to emit an additional entry for the method alone? Is that right?

Class hashes should probably only include virtual methods anyway, but yes, I think this is a good starting point.

What do you want in the hash for a function anyway?  Almost everything is already captured by (1) the separate hashes for the nominal types mentioned and (2) the symbol mangling.  You're pretty much only missing the return type.  Oh, I guess you need the body's dependencies for inline functions.

We want to enforce the C++ ODR as much as is reasonably possible, so we want to include the body for both classes and functions. That is, we explicitly want to check for cases where two functions or classes happen to have the same declaration but different definitions.

Mmm.  So you want to warn the user that two libraries using different assertion settings both use the standard library?

libstdc++ does not use assert. IIRC, nor does libc++ unless you use its "no exceptions" mode.
 
I think warning about actual differences in code, as opposed to differences in type/vtable layout, is going to be pretty fraught with uninteresting positives, but if you want to chase that rabbit, it's your time spent.

For code that already passes gold's --detect-odr-violations, the extra testing for definitions of inline functions would effectively be checking that we don't have two functions that are defined from the same token sequence but are interpreted in different ways, so the uninteresting positive rate should be rather low (or if not, then we've learned something important...). For non-inline functions and classes, the checking would be more novel, so the uninteresting positive rate is hard to be sure about.
 
Anyway, you only need to hash in function bodies for inline functions unless this is also an ELF abuse dectector.  (*Whether* a function is inline seems like a legitimate thing to hash for the function signature.)

Giving different definitions (for either functions or classes) in different source files is one of the things we'd like to catch (although there are probably more direct ways to do so than a full ODR checker, such as maybe -Wmissing-prototypes).

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: [cfe-dev] design for an accurate ODR-checker with clang

John McCall-2
On Jul 15, 2013, at 4:47 PM, Richard Smith <[hidden email]> wrote:
On Mon, Jul 15, 2013 at 3:42 PM, John McCall <[hidden email]> wrote:
On Jul 15, 2013, at 3:20 PM, Richard Smith <[hidden email]> wrote:
On Mon, Jul 15, 2013 at 3:12 PM, John McCall <[hidden email]> wrote:
On Jul 11, 2013, at 6:13 PM, Nick Lewycky <[hidden email]> wrote:
On 11 July 2013 18:02, John McCall <[hidden email]> wrote:
On Jul 11, 2013, at 5:45 PM, Nick Lewycky <[hidden email]> wrote:
> Hi! A few of us over at Google think a nice feature in clang would be ODR violation checking, and we thought for a while about how to do this and wrote it down, but we aren't actively working on it at the moment nor plan to in the near future. I'm posting this to share our design and hopefully save anyone else the design work if they're interested in it.
>
> For some background, C++'s ODR rule roughly means that two definitions of the same symbol must come from "the same tokens with the same interpretation". Given the same token stream, the interpretation can be different due to different name lookup results, or different types through typedefs or using declarations, or due to a different point of instantiation in two translation units.
>
> Unlike existing approaches (the ODR checker in the gold linker for example), clang lets us do this with no false positives and very few false negatives. The basis of the idea is that we produce a hash of all the ODR-relevant pieces, and to try to pick the largest possible granularity. By granularity I mean that we would hash the entire definition of a class including all methods defined lexically inline and emit a single value for that class.
>
> The first step is to build a new visitor over the clang AST that calculates a hash of the ODR-relevant pieces of the code. (StmtProfiler doesn’t work here because it includes pointers addresses which will be different across different translation units.) Hash the outermost declaration with external-linkage. For example, given a class with a method defined inline, we start the visitor at the class, not at the method. The entirety of the class must be ODR-equivalent across two translation units, including any inline methods.
>
> Although the standard mentions that the tokens must be the same, we do not actually include the tokens in the hash. The structure of the AST includes everything about the code which is semantically relevant. Any false positives that would be fixed by hashing the tokens either do not impact the behaviour of the program or could be fixed by hashing more of the AST. References to globals should be hashed by name, but references to locals should be hashed by an ordinal number.
>
> Instantiated templates are also visited by the hashing visitor. If we did not, we would have false negatives where the code is not conforming due to different points of instantiation in two translation units. We can skip uninstantiated templates since they don’t affect the behaviour of the program, and we need to visit the instantiations regardless.
>
> In LLVM IR, create a new named metadata node !llvm.odr_checking which contains a list of <mangled name, hash value> pairs. The names do not necessarily correspond to symbols, for instance, a class will have a hash value but does not have a corresponding symbol. For ease of implementation, names should be mangled per the C++ Itanium ABI (demanglable with c++filt -t). Merging modules that contain these will need to do ODR checking as part of that link, and the resulting module will have the union of these tables.
>
> In the .o file, emit a sorted table of <mangled name, hash value> in a non-loadable section intended to be read by the linker. All entries in the table must be checked if any symbol from this .o file is involved in the link (note that there is no mapping from symbol to odr table name). If two .o files contain different hash values for the same name, we have detected an ODR violation and issue a diagnostic.
>
> Finally, teach the loader (RuntimeDyld) to do verification and catch ODR violations when dlopen'ing a shared library.

This is the right basic design, but I'm curious why you're suggesting that the payload should just be a hash instead of an arbitrary string.

What are you suggesting goes into this string?

The same sorts of things that you were planning on hashing, but maybe not hashed.  It's up to you; having a full string would let you actually show a useful error message, but it definitely inflates binary sizes.  If you really think you can make this performant enough to do on every load, I can see how the latter would be important.

This isn't going to be performant enough to do unconditionally at every load no matter how much you shrink it.

Every load of a shared object? That's not a fast operation even without odr checking, but the idea is to keep the total number of entries in the odr table small. It's less than the number of symbols, closer to the number of top-level decls.

Your ABI dependencies are every declaration *that you ever rely on*.  You've got to figure that that's going to be very large.  For a library of any significance, I'd be expecting this check to touch about half a megabyte of data, even with a 32-bit hash and some sort of clever prefixing scheme on the symbols.  That's a pretty major regression in library loading.

Also, you should have something analogous to symbol visibility as a way to tell the static linker that something only needs to be ODR-checked within a linkage unit.  It would be informed by actual symbol visibility, of course.

Great point, and that needs to flow into the .o files as well. If a class has one visibility and its method has another, we want to skip the method when hashing the class, and need to emit an additional entry for the method alone? Is that right?

Class hashes should probably only include virtual methods anyway, but yes, I think this is a good starting point.

What do you want in the hash for a function anyway?  Almost everything is already captured by (1) the separate hashes for the nominal types mentioned and (2) the symbol mangling.  You're pretty much only missing the return type.  Oh, I guess you need the body's dependencies for inline functions.

We want to enforce the C++ ODR as much as is reasonably possible, so we want to include the body for both classes and functions. That is, we explicitly want to check for cases where two functions or classes happen to have the same declaration but different definitions.

Mmm.  So you want to warn the user that two libraries using different assertion settings both use the standard library?

libstdc++ does not use assert. IIRC, nor does libc++ unless you use its "no exceptions" mode.

libc++ does appear to use assertions if _LIBCPP_DEBUG is defined, but that causes an ABI break.

Regardless, that is clearly not the point.  A single library, compiled in release mode, vends an interface with an inline function containing an assert.  This is an ODR violation.

I think warning about actual differences in code, as opposed to differences in type/vtable layout, is going to be pretty fraught with uninteresting positives, but if you want to chase that rabbit, it's your time spent.

For code that already passes gold's --detect-odr-violations, the extra testing for definitions of inline functions would effectively be checking that we don't have two functions that are defined from the same token sequence but are interpreted in different ways, so the uninteresting positive rate should be rather low (or if not, then we've learned something important...). For non-inline functions and classes, the checking would be more novel, so the uninteresting positive rate is hard to be sure about.

Hands: washed.

John.

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: design for an accurate ODR-checker with clang

John Bytheway
In reply to this post by Richard Smith-33
On 2013-07-15 18:20, Richard Smith wrote:

> On Mon, Jul 15, 2013 at 3:12 PM, John McCall
> <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     On Jul 11, 2013, at 6:13 PM, Nick Lewycky
>     <[hidden email]
>     <mailto:[hidden email]>> wrote:
>>         This is the right basic design, but I'm curious why you're
>>         suggesting that the payload should just be a hash instead of
>>         an arbitrary string.
>>
>>
>>     What are you suggesting goes into this string?
>
>     The same sorts of things that you were planning on hashing, but
>     maybe not hashed.  It's up to you; having a full string would let
>     you actually show a useful error message, but it definitely inflates
>     binary sizes.  If you really think you can make this performant
>     enough to do on every load, I can see how the latter would be important.
>
> (Perhaps this also clarifies why we want a hash: an unhashed string
> would contain as much entropy as the entirety of the source code...)

Maybe you can't afford to store the unhashed data for everything, but
what about an option to store it for particular function(s)/class(es).
That way, once an ODR violation has been detected through the hash
infrastructure the compilation/linking can be repeated with more data
stored, and yield a decent error message about what exactly changed
between the two definitions.

John Bytheway


_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: [cfe-dev] design for an accurate ODR-checker with clang

Daniel Albuschat
In reply to this post by JF Bastien
2013/7/16 JF Bastien <[hidden email]>
I've debugged enough debug+release mixing issues
shared-libraries to feel that pain.

ACK.

It's not only debug+release, but very often an outdated object file for me. Because the build system didn't pick up code changes in some headers and did not re-compile compilation units that were affected. (Hello Visual Studio, it's you that I am looking at.)

Would have saved me numerous hours.


Greetings,
Daniel Albuschat

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: [cfe-dev] design for an accurate ODR-checker with clang

Nick Lewycky-2
In reply to this post by John McCall-2
On 15 July 2013 15:12, John McCall <[hidden email]> wrote:
On Jul 11, 2013, at 6:13 PM, Nick Lewycky <[hidden email]> wrote:
On 11 July 2013 18:02, John McCall <[hidden email]> wrote:
On Jul 11, 2013, at 5:45 PM, Nick Lewycky <[hidden email]> wrote:
> Hi! A few of us over at Google think a nice feature in clang would be ODR violation checking, and we thought for a while about how to do this and wrote it down, but we aren't actively working on it at the moment nor plan to in the near future. I'm posting this to share our design and hopefully save anyone else the design work if they're interested in it.
>
> For some background, C++'s ODR rule roughly means that two definitions of the same symbol must come from "the same tokens with the same interpretation". Given the same token stream, the interpretation can be different due to different name lookup results, or different types through typedefs or using declarations, or due to a different point of instantiation in two translation units.
>
> Unlike existing approaches (the ODR checker in the gold linker for example), clang lets us do this with no false positives and very few false negatives. The basis of the idea is that we produce a hash of all the ODR-relevant pieces, and to try to pick the largest possible granularity. By granularity I mean that we would hash the entire definition of a class including all methods defined lexically inline and emit a single value for that class.
>
> The first step is to build a new visitor over the clang AST that calculates a hash of the ODR-relevant pieces of the code. (StmtProfiler doesn’t work here because it includes pointers addresses which will be different across different translation units.) Hash the outermost declaration with external-linkage. For example, given a class with a method defined inline, we start the visitor at the class, not at the method. The entirety of the class must be ODR-equivalent across two translation units, including any inline methods.
>
> Although the standard mentions that the tokens must be the same, we do not actually include the tokens in the hash. The structure of the AST includes everything about the code which is semantically relevant. Any false positives that would be fixed by hashing the tokens either do not impact the behaviour of the program or could be fixed by hashing more of the AST. References to globals should be hashed by name, but references to locals should be hashed by an ordinal number.
>
> Instantiated templates are also visited by the hashing visitor. If we did not, we would have false negatives where the code is not conforming due to different points of instantiation in two translation units. We can skip uninstantiated templates since they don’t affect the behaviour of the program, and we need to visit the instantiations regardless.
>
> In LLVM IR, create a new named metadata node !llvm.odr_checking which contains a list of <mangled name, hash value> pairs. The names do not necessarily correspond to symbols, for instance, a class will have a hash value but does not have a corresponding symbol. For ease of implementation, names should be mangled per the C++ Itanium ABI (demanglable with c++filt -t). Merging modules that contain these will need to do ODR checking as part of that link, and the resulting module will have the union of these tables.
>
> In the .o file, emit a sorted table of <mangled name, hash value> in a non-loadable section intended to be read by the linker. All entries in the table must be checked if any symbol from this .o file is involved in the link (note that there is no mapping from symbol to odr table name). If two .o files contain different hash values for the same name, we have detected an ODR violation and issue a diagnostic.
>
> Finally, teach the loader (RuntimeDyld) to do verification and catch ODR violations when dlopen'ing a shared library.

This is the right basic design, but I'm curious why you're suggesting that the payload should just be a hash instead of an arbitrary string.

What are you suggesting goes into this string?

The same sorts of things that you were planning on hashing, but maybe not hashed.  It's up to you; having a full string would let you actually show a useful error message, but it definitely inflates binary sizes.  If you really think you can make this performant enough to do on every load, I can see how the latter would be important.

I was thinking we could add more things to help diagnostics, next to the hash. I *think* there are two cases that matter, but there may be more. Either we have an ODR violation where the file:line are different, or if file and line are the same then the preprocessor state was different. We could emit file and line from the starting loc of each of the hashes, and we could emit a preprocessor table with the list of initial defines and changes to those defines as the TU went along -- at each hash we could point to an index into that table to indicate where we are. Both of those give us enough information for the linker to say why the ODRs failed to match.

There are other situations where the file, line and preprocessor settings were the same. Then I'd expect the md5 of the file was different, and the file was changed between two builds. Anything beyond that is pretty exotic.
This isn't going to be performant enough to do unconditionally at every load no matter how much you shrink it.

Every load of a shared object? That's not a fast operation even without odr checking, but the idea is to keep the total number of entries in the odr table small. It's less than the number of symbols, closer to the number of top-level decls.

Your ABI dependencies are every declaration *that you ever rely on*.  You've got to figure that that's going to be very large.  For a library of any significance, I'd be expecting this check to touch about half a megabyte of data, even with a 32-bit hash and some sort of clever prefixing scheme on the symbols.  That's a pretty major regression in library loading.

Fair point. If we want to include less in the hash just to make it more palatable for dynamic library users, we can have a flag for that. Some sort of ODR-checking lite.

I really don't care about .so files myself.
Also, you should have something analogous to symbol visibility as a way to tell the static linker that something only needs to be ODR-checked within a linkage unit.  It would be informed by actual symbol visibility, of course.

Great point, and that needs to flow into the .o files as well. If a class has one visibility and its method has another, we want to skip the method when hashing the class, and need to emit an additional entry for the method alone? Is that right?

Class hashes should probably only include virtual methods anyway, but yes, I think this is a good starting point.

What do you want in the hash for a function anyway?  Almost everything is already captured by (1) the separate hashes for the nominal types mentioned and (2) the symbol mangling.  You're pretty much only missing the return type.  Oh, I guess you need the body's dependencies for inline functions.

On the contrary, this is the case I care about! Two different definitions of the same function.

Nick
You should expect that there may be multiple hashing schemes (or versions thereof) in play and therefore build a simple prefixing scheme on your ODR symbols.

We could put the choice of hashing algorithm in the name of the llvm named metadata node, and in the name of the section in the .o files.

A header on the section sounds good.

John.


_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: [cfe-dev] design for an accurate ODR-checker with clang

John McCall-2
On Aug 5, 2013, at 3:04 PM, Nick Lewycky <[hidden email]> wrote:
On 15 July 2013 15:12, John McCall <[hidden email]> wrote:
On Jul 11, 2013, at 6:13 PM, Nick Lewycky <[hidden email]> wrote:
On 11 July 2013 18:02, John McCall <[hidden email]> wrote:
On Jul 11, 2013, at 5:45 PM, Nick Lewycky <[hidden email]> wrote:

> Hi! A few of us over at Google think a nice feature in clang would be ODR violation checking, and we thought for a while about how to do this and wrote it down, but we aren't actively working on it at the moment nor plan to in the near future. I'm posting this to share our design and hopefully save anyone else the design work if they're interested in it.
>
> For some background, C++'s ODR rule roughly means that two definitions of the same symbol must come from "the same tokens with the same interpretation". Given the same token stream, the interpretation can be different due to different name lookup results, or different types through typedefs or using declarations, or due to a different point of instantiation in two translation units.
>
> Unlike existing approaches (the ODR checker in the gold linker for example), clang lets us do this with no false positives and very few false negatives. The basis of the idea is that we produce a hash of all the ODR-relevant pieces, and to try to pick the largest possible granularity. By granularity I mean that we would hash the entire definition of a class including all methods defined lexically inline and emit a single value for that class.
>
> The first step is to build a new visitor over the clang AST that calculates a hash of the ODR-relevant pieces of the code. (StmtProfiler doesn’t work here because it includes pointers addresses which will be different across different translation units.) Hash the outermost declaration with external-linkage. For example, given a class with a method defined inline, we start the visitor at the class, not at the method. The entirety of the class must be ODR-equivalent across two translation units, including any inline methods.
>
> Although the standard mentions that the tokens must be the same, we do not actually include the tokens in the hash. The structure of the AST includes everything about the code which is semantically relevant. Any false positives that would be fixed by hashing the tokens either do not impact the behaviour of the program or could be fixed by hashing more of the AST. References to globals should be hashed by name, but references to locals should be hashed by an ordinal number.
>
> Instantiated templates are also visited by the hashing visitor. If we did not, we would have false negatives where the code is not conforming due to different points of instantiation in two translation units. We can skip uninstantiated templates since they don’t affect the behaviour of the program, and we need to visit the instantiations regardless.
>
> In LLVM IR, create a new named metadata node !llvm.odr_checking which contains a list of <mangled name, hash value> pairs. The names do not necessarily correspond to symbols, for instance, a class will have a hash value but does not have a corresponding symbol. For ease of implementation, names should be mangled per the C++ Itanium ABI (demanglable with c++filt -t). Merging modules that contain these will need to do ODR checking as part of that link, and the resulting module will have the union of these tables.
>
> In the .o file, emit a sorted table of <mangled name, hash value> in a non-loadable section intended to be read by the linker. All entries in the table must be checked if any symbol from this .o file is involved in the link (note that there is no mapping from symbol to odr table name). If two .o files contain different hash values for the same name, we have detected an ODR violation and issue a diagnostic.
>
> Finally, teach the loader (RuntimeDyld) to do verification and catch ODR violations when dlopen'ing a shared library.

This is the right basic design, but I'm curious why you're suggesting that the payload should just be a hash instead of an arbitrary string.

What are you suggesting goes into this string?

The same sorts of things that you were planning on hashing, but maybe not hashed.  It's up to you; having a full string would let you actually show a useful error message, but it definitely inflates binary sizes.  If you really think you can make this performant enough to do on every load, I can see how the latter would be important.

I was thinking we could add more things to help diagnostics, next to the hash. I *think* there are two cases that matter, but there may be more. Either we have an ODR violation where the file:line are different, or if file and line are the same then the preprocessor state was different.  We could emit file and line from the starting loc of each of the hashes, and we could emit a preprocessor table with the list of initial defines and changes to those defines as the TU went along -- at each hash we could point to an index into that table to indicate where we are. Both of those give us enough information for the linker to say why the ODRs failed to match.

Do you have any idea how much data that is?  Your .o files are going to be *huge*.

There are other situations where the file, line and preprocessor settings were the same. Then I'd expect the md5 of the file was different, and the file was changed between two builds. Anything beyond that is pretty exotic.

If the diagnostics you want the linker to report are literally as coarse as “this file changed, maybe that has something to do with it?”, I think you might be able to find something that doesn’t require quite so much of a core dump.

This isn't going to be performant enough to do unconditionally at every load no matter how much you shrink it.

Every load of a shared object? That's not a fast operation even without odr checking, but the idea is to keep the total number of entries in the odr table small. It's less than the number of symbols, closer to the number of top-level decls.

Your ABI dependencies are every declaration *that you ever rely on*.  You've got to figure that that's going to be very large.  For a library of any significance, I'd be expecting this check to touch about half a megabyte of data, even with a 32-bit hash and some sort of clever prefixing scheme on the symbols.  That's a pretty major regression in library loading.

Fair point. If we want to include less in the hash just to make it more palatable for dynamic library users, we can have a flag for that. Some sort of ODR-checking lite.

I really don't care about .so files myself.

Maybe you should just spec this out as a static-linker-only technology, then.

Also, you should have something analogous to symbol visibility as a way to tell the static linker that something only needs to be ODR-checked within a linkage unit.  It would be informed by actual symbol visibility, of course.

Great point, and that needs to flow into the .o files as well. If a class has one visibility and its method has another, we want to skip the method when hashing the class, and need to emit an additional entry for the method alone? Is that right?

Class hashes should probably only include virtual methods anyway, but yes, I think this is a good starting point.

What do you want in the hash for a function anyway?  Almost everything is already captured by (1) the separate hashes for the nominal types mentioned and (2) the symbol mangling.  You're pretty much only missing the return type.  Oh, I guess you need the body's dependencies for inline functions.

On the contrary, this is the case I care about! Two different definitions of the same function.

Allow me to suggest a structural hash of the statements, then, instead of dumping the entire preprocessor history.

John.

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: [cfe-dev] design for an accurate ODR-checker with clang

Nick Lewycky-2
On 5 August 2013 15:33, John McCall <[hidden email]> wrote:
On Aug 5, 2013, at 3:04 PM, Nick Lewycky <[hidden email]> wrote:
On 15 July 2013 15:12, John McCall <[hidden email]> wrote:
On Jul 11, 2013, at 6:13 PM, Nick Lewycky <[hidden email]> wrote:
On 11 July 2013 18:02, John McCall <[hidden email]> wrote:
On Jul 11, 2013, at 5:45 PM, Nick Lewycky <[hidden email]> wrote:
> Hi! A few of us over at Google think a nice feature in clang would be ODR violation checking, and we thought for a while about how to do this and wrote it down, but we aren't actively working on it at the moment nor plan to in the near future. I'm posting this to share our design and hopefully save anyone else the design work if they're interested in it.
>
> For some background, C++'s ODR rule roughly means that two definitions of the same symbol must come from "the same tokens with the same interpretation". Given the same token stream, the interpretation can be different due to different name lookup results, or different types through typedefs or using declarations, or due to a different point of instantiation in two translation units.
>
> Unlike existing approaches (the ODR checker in the gold linker for example), clang lets us do this with no false positives and very few false negatives. The basis of the idea is that we produce a hash of all the ODR-relevant pieces, and to try to pick the largest possible granularity. By granularity I mean that we would hash the entire definition of a class including all methods defined lexically inline and emit a single value for that class.
>
> The first step is to build a new visitor over the clang AST that calculates a hash of the ODR-relevant pieces of the code. (StmtProfiler doesn’t work here because it includes pointers addresses which will be different across different translation units.) Hash the outermost declaration with external-linkage. For example, given a class with a method defined inline, we start the visitor at the class, not at the method. The entirety of the class must be ODR-equivalent across two translation units, including any inline methods.
>
> Although the standard mentions that the tokens must be the same, we do not actually include the tokens in the hash. The structure of the AST includes everything about the code which is semantically relevant. Any false positives that would be fixed by hashing the tokens either do not impact the behaviour of the program or could be fixed by hashing more of the AST. References to globals should be hashed by name, but references to locals should be hashed by an ordinal number.
>
> Instantiated templates are also visited by the hashing visitor. If we did not, we would have false negatives where the code is not conforming due to different points of instantiation in two translation units. We can skip uninstantiated templates since they don’t affect the behaviour of the program, and we need to visit the instantiations regardless.
>
> In LLVM IR, create a new named metadata node !llvm.odr_checking which contains a list of <mangled name, hash value> pairs. The names do not necessarily correspond to symbols, for instance, a class will have a hash value but does not have a corresponding symbol. For ease of implementation, names should be mangled per the C++ Itanium ABI (demanglable with c++filt -t). Merging modules that contain these will need to do ODR checking as part of that link, and the resulting module will have the union of these tables.
>
> In the .o file, emit a sorted table of <mangled name, hash value> in a non-loadable section intended to be read by the linker. All entries in the table must be checked if any symbol from this .o file is involved in the link (note that there is no mapping from symbol to odr table name). If two .o files contain different hash values for the same name, we have detected an ODR violation and issue a diagnostic.
>
> Finally, teach the loader (RuntimeDyld) to do verification and catch ODR violations when dlopen'ing a shared library.

This is the right basic design, but I'm curious why you're suggesting that the payload should just be a hash instead of an arbitrary string.

What are you suggesting goes into this string?

The same sorts of things that you were planning on hashing, but maybe not hashed.  It's up to you; having a full string would let you actually show a useful error message, but it definitely inflates binary sizes.  If you really think you can make this performant enough to do on every load, I can see how the latter would be important.

I was thinking we could add more things to help diagnostics, next to the hash. I *think* there are two cases that matter, but there may be more. Either we have an ODR violation where the file:line are different, or if file and line are the same then the preprocessor state was different.  We could emit file and line from the starting loc of each of the hashes, and we could emit a preprocessor table with the list of initial defines and changes to those defines as the TU went along -- at each hash we could point to an index into that table to indicate where we are. Both of those give us enough information for the linker to say why the ODRs failed to match.

Do you have any idea how much data that is?  Your .o files are going to be *huge*.

The point of the design is to minimize the number of entries, so adding more data per-entry doesn't seem like a big problem.

That said, I haven't implemented even an experiment that counts how many entries we would produce. I can do that and come back with numbers, possibly with my tail between my legs.
There are other situations where the file, line and preprocessor settings were the same. Then I'd expect the md5 of the file was different, and the file was changed between two builds. Anything beyond that is pretty exotic.

If the diagnostics you want the linker to report are literally as coarse as “this file changed, maybe that has something to do with it?”, I think you might be able to find something that doesn’t require quite so much of a core dump.

This isn't going to be performant enough to do unconditionally at every load no matter how much you shrink it.

Every load of a shared object? That's not a fast operation even without odr checking, but the idea is to keep the total number of entries in the odr table small. It's less than the number of symbols, closer to the number of top-level decls.

Your ABI dependencies are every declaration *that you ever rely on*.  You've got to figure that that's going to be very large.  For a library of any significance, I'd be expecting this check to touch about half a megabyte of data, even with a 32-bit hash and some sort of clever prefixing scheme on the symbols.  That's a pretty major regression in library loading.

Fair point. If we want to include less in the hash just to make it more palatable for dynamic library users, we can have a flag for that. Some sort of ODR-checking lite.

I really don't care about .so files myself.

Maybe you should just spec this out as a static-linker-only technology, then.

That's fair criticism. I did exactly that, then was asked to please consider the dynamic linker case and lazily said "oh sure, I don't see any reason the same scheme can't verify at load time too". I'll retract any claims about handling the dynamic loading case.

Also, you should have something analogous to symbol visibility as a way to tell the static linker that something only needs to be ODR-checked within a linkage unit.  It would be informed by actual symbol visibility, of course.

Great point, and that needs to flow into the .o files as well. If a class has one visibility and its method has another, we want to skip the method when hashing the class, and need to emit an additional entry for the method alone? Is that right?

Class hashes should probably only include virtual methods anyway, but yes, I think this is a good starting point.

What do you want in the hash for a function anyway?  Almost everything is already captured by (1) the separate hashes for the nominal types mentioned and (2) the symbol mangling.  You're pretty much only missing the return type.  Oh, I guess you need the body's dependencies for inline functions.

On the contrary, this is the case I care about! Two different definitions of the same function.

Allow me to suggest a structural hash of the statements, then, instead of dumping the entire preprocessor history.

But I am suggesting a structural hash of the statements!

Consider a system that produces a structural hash of the statements. We should do this with something that behaves similarly to the StmtProfiler, but with specialized handling for things like labels and global variables. Suppose we get a hash for each function, and emit a table of hashed function name to hashed AST. Great! The basis of an ODR checker.

But can we do better? Instead of storing a hash for each function, we can group them (such that we produce a single hash for more than one function), for instance by class they're in when the methods were written in the body? What I proposed is to have a single has for each top-level-declaration and all of its lexically-included contents, dramatically reducing the number of hashes we need to store. The tradeoff is that we can't quickly map from symbol-name to an entry in the ODR table. That in turn means need to check all ODR entries in a .o if we bring in any symbol from that .o file (we can't just check the individual function because it may not have been hashed individually).

You were asking why we should hash things when we could just store the source code. I don't see how storing the text of the .cc would be better than storing a condensed version of the changes we care about (the preprocessor history). It's a strict subset, and if we really care, we can shrink it further by being clever (collapse multiple changes between the start of two top-level-decls, don't include changes in macros that weren't expanded during the parse of the body of a top-level-decl). The sort of output I'm expecting would be along the lines of "a.o was produced with _DEBUG while b.o was produced with NDEBUG".

Nick


_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: [cfe-dev] design for an accurate ODR-checker with clang

John McCall-2
On Aug 5, 2013, at 4:43 PM, Nick Lewycky <[hidden email]> wrote:
On 5 August 2013 15:33, John McCall <[hidden email]> wrote:
On Aug 5, 2013, at 3:04 PM, Nick Lewycky <[hidden email]> wrote:
On 15 July 2013 15:12, John McCall <[hidden email]> wrote:
The same sorts of things that you were planning on hashing, but maybe not hashed.  It's up to you; having a full string would let you actually show a useful error message, but it definitely inflates binary sizes.  If you really think you can make this performant enough to do on every load, I can see how the latter would be important.

I was thinking we could add more things to help diagnostics, next to the hash. I *think* there are two cases that matter, but there may be more. Either we have an ODR violation where the file:line are different, or if file and line are the same then the preprocessor state was different.  We could emit file and line from the starting loc of each of the hashes, and we could emit a preprocessor table with the list of initial defines and changes to those defines as the TU went along -- at each hash we could point to an index into that table to indicate where we are. Both of those give us enough information for the linker to say why the ODRs failed to match.

Do you have any idea how much data that is?  Your .o files are going to be *huge*.

The point of the design is to minimize the number of entries, so adding more data per-entry doesn't seem like a big problem.

No, I mean the “preprocessor table with the list of initial defines and changes to those defines as the TU went along”.  That’s global, sure, but it’s enormous.

This isn't going to be performant enough to do unconditionally at every load no matter how much you shrink it.

Every load of a shared object? That's not a fast operation even without odr checking, but the idea is to keep the total number of entries in the odr table small. It's less than the number of symbols, closer to the number of top-level decls.

Your ABI dependencies are every declaration *that you ever rely on*.  You've got to figure that that's going to be very large.  For a library of any significance, I'd be expecting this check to touch about half a megabyte of data, even with a 32-bit hash and some sort of clever prefixing scheme on the symbols.  That's a pretty major regression in library loading.

Fair point. If we want to include less in the hash just to make it more palatable for dynamic library users, we can have a flag for that. Some sort of ODR-checking lite.

I really don't care about .so files myself.

Maybe you should just spec this out as a static-linker-only technology, then.

That's fair criticism. I did exactly that, then was asked to please consider the dynamic linker case and lazily said "oh sure, I don't see any reason the same scheme can't verify at load time too". I'll retract any claims about handling the dynamic loading case.

The same scheme could work, but doing it at load time forces completely different performance considerations, so it really demands a different design.

Also, you should have something analogous to symbol visibility as a way to tell the static linker that something only needs to be ODR-checked within a linkage unit.  It would be informed by actual symbol visibility, of course.

Great point, and that needs to flow into the .o files as well. If a class has one visibility and its method has another, we want to skip the method when hashing the class, and need to emit an additional entry for the method alone? Is that right?

Class hashes should probably only include virtual methods anyway, but yes, I think this is a good starting point.

What do you want in the hash for a function anyway?  Almost everything is already captured by (1) the separate hashes for the nominal types mentioned and (2) the symbol mangling.  You're pretty much only missing the return type.  Oh, I guess you need the body's dependencies for inline functions.

On the contrary, this is the case I care about! Two different definitions of the same function.

Allow me to suggest a structural hash of the statements, then, instead of dumping the entire preprocessor history.

But I am suggesting a structural hash of the statements!

With pointers into a complete preprocessor history for some reason!

By “structural”, I mean “considering only the result of preprocessing and semantic analysis”.

John.

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: [cfe-dev] design for an accurate ODR-checker with clang

Nick Lewycky-2
On 5 August 2013 16:55, John McCall <[hidden email]> wrote:
On Aug 5, 2013, at 4:43 PM, Nick Lewycky <[hidden email]> wrote:
On 5 August 2013 15:33, John McCall <[hidden email]> wrote:
On Aug 5, 2013, at 3:04 PM, Nick Lewycky <[hidden email]> wrote:
On 15 July 2013 15:12, John McCall <[hidden email]> wrote:
The same sorts of things that you were planning on hashing, but maybe not hashed.  It's up to you; having a full string would let you actually show a useful error message, but it definitely inflates binary sizes.  If you really think you can make this performant enough to do on every load, I can see how the latter would be important.

I was thinking we could add more things to help diagnostics, next to the hash. I *think* there are two cases that matter, but there may be more. Either we have an ODR violation where the file:line are different, or if file and line are the same then the preprocessor state was different.  We could emit file and line from the starting loc of each of the hashes, and we could emit a preprocessor table with the list of initial defines and changes to those defines as the TU went along -- at each hash we could point to an index into that table to indicate where we are. Both of those give us enough information for the linker to say why the ODRs failed to match.

Do you have any idea how much data that is?  Your .o files are going to be *huge*.

The point of the design is to minimize the number of entries, so adding more data per-entry doesn't seem like a big problem.

No, I mean the “preprocessor table with the list of initial defines and changes to those defines as the TU went along”.  That’s global, sure, but it’s enormous.

Ah. Even more convincing to me is that this data is not normally included in .o files at all. You're right. This could get big.
This isn't going to be performant enough to do unconditionally at every load no matter how much you shrink it.

Every load of a shared object? That's not a fast operation even without odr checking, but the idea is to keep the total number of entries in the odr table small. It's less than the number of symbols, closer to the number of top-level decls.

Your ABI dependencies are every declaration *that you ever rely on*.  You've got to figure that that's going to be very large.  For a library of any significance, I'd be expecting this check to touch about half a megabyte of data, even with a 32-bit hash and some sort of clever prefixing scheme on the symbols.  That's a pretty major regression in library loading.

Fair point. If we want to include less in the hash just to make it more palatable for dynamic library users, we can have a flag for that. Some sort of ODR-checking lite.

I really don't care about .so files myself.

Maybe you should just spec this out as a static-linker-only technology, then.

That's fair criticism. I did exactly that, then was asked to please consider the dynamic linker case and lazily said "oh sure, I don't see any reason the same scheme can't verify at load time too". I'll retract any claims about handling the dynamic loading case.

The same scheme could work, but doing it at load time forces completely different performance considerations, so it really demands a different design.

I wonder whether the people who ask for ODR-violation detection at dynamic link time are looking for something different? A way to verify that a new release of their C++ library is free from a certain class of ABI mismatch against the previous version. This is something entirely different from what my goal was, even though it's also under the banner of enforcing C++'s ODR rule.
Also, you should have something analogous to symbol visibility as a way to tell the static linker that something only needs to be ODR-checked within a linkage unit.  It would be informed by actual symbol visibility, of course.

Great point, and that needs to flow into the .o files as well. If a class has one visibility and its method has another, we want to skip the method when hashing the class, and need to emit an additional entry for the method alone? Is that right?

Class hashes should probably only include virtual methods anyway, but yes, I think this is a good starting point.

What do you want in the hash for a function anyway?  Almost everything is already captured by (1) the separate hashes for the nominal types mentioned and (2) the symbol mangling.  You're pretty much only missing the return type.  Oh, I guess you need the body's dependencies for inline functions.

On the contrary, this is the case I care about! Two different definitions of the same function.

Allow me to suggest a structural hash of the statements, then, instead of dumping the entire preprocessor history.

But I am suggesting a structural hash of the statements!

With pointers into a complete preprocessor history for some reason!

By “structural”, I mean “considering only the result of preprocessing and semantic analysis”.

Yes. There's two things here being conflated. The ODR detection itself -- determining whether there is an ODR violation -- should be done considering only the result of preprocessing and semantic analysis. Absolutely. Violent agreement.

Once an ODR violation is found, how can the linker produce a better error message than "nope!"? That's where my suggestion of tagging each entry with the macro changes since previous top-level-decl came from. That should be small in most cases (who defines things in the middle of their TU?) but in reality we can't discard macros that come in from the system headers (and are consequently used to declare functions that are used by definitions later on). Ouch.

If I were to suppose a flag we could provide that means "help, I encountered an ODR violation and need the compiler's help to solve it" then there's no reason we couldn't just have that flag include snippets of source code in the .o instead of a condensed list of preprocessor changes. Though upon further consideration, maybe that should just be a clang-tool.

Nick


_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev