[llvm-dev] getelementptr inbounds with offset 0

Ralf Jung via llvm-dev
Hi all,

What exactly are the rules for `getelementptr inbounds` with offset 0?

In Rust, we are relying on the fact that if we use, for example, `inttoptr` to
turn `4` into a pointer, we can then do `getelementptr inbounds` with offset 0
on that without LLVM deducing that there actually is any dereferenceable memory
at location 4.  The argument is that we can think of there being a zero-sized
allocation. Is that a reasonable assumption?  Can something like this be
documented in the LangRef?

Relatedly, how does the situation change if the pointer is not created "out of
thin air" from a fixed integer, but is actually a dangling pointer obtained
previously from `malloc` (or `alloca` or whatever)?  Is `getelementptr inbounds`
with offset 0 on such a pointer a NOP, or does it result in `poison`?  And if
that makes a difference, how does that square with the fact that, e.g., the
integer `0x4000` could well be inside such an allocation, but doing
`getelementptr inbounds` with offset 0 on that would fall under the first
question above?

Kind regards,
Ralf
_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] getelementptr inbounds with offset 0

Bruce Hoult via llvm-dev
LLVM has no idea whether the address computed by GEP is actually
within a legal object. The "inbounds" keyword is just you, the
programmer, promising LLVM that you know it's ok and that you don't
care what happens if it is actually out of bounds.

https://llvm.org/docs/GetElementPtr.html#what-happens-if-an-array-index-is-out-of-bounds

Re: [llvm-dev] getelementptr inbounds with offset 0

Finkel, Hal J. via llvm-dev


On Mon, 25 Feb 2019 at 13:11, Bruce Hoult via llvm-dev <[hidden email]> wrote:
> LLVM has no idea whether the address computed by GEP is actually
> within a legal object. The "inbounds" keyword is just you, the
> programmer, promising LLVM that you know it's ok and that you don't
> care what happens if it is actually out of bounds.
>
> https://llvm.org/docs/GetElementPtr.html#what-happens-if-an-array-index-is-out-of-bounds

Hi Bruce,

it's not true in general that LLVM has no idea about (or doesn't care about) object sizes. It can infer object size and other things from allocas, global variables, and calls to built-in functions such as malloc(). In the case of Rust we even have an out-of-tree patch to teach LLVM the same for Rust's (global) heap allocation functions. You can see this information being computed in lib/Analysis/MemoryBuiltins.cpp.

More importantly, the question is *what* actually is being promised to LLVM, more specifically, what the definitions of the terms "out of bounds" and "object" are in this context. It is easy enough to answer intuitively in many specific cases whether a GEP should be considered "out of bounds", but in the cases Ralf described, where offsets and "object sizes" are equal to 0, it is not so clear-cut and depends on tricky matters such as whether zero-sized allocations exist. We (Rust developers) very much care what happens in those cases (it should be a NOP), so it's important to check whether that is compatible with the Rust compiler emitting inbounds GEPs.

It is true that in practice in many cases LLVM won't be able to determine conclusively whether an object exists or not and what its bounds are, but that doesn't answer the question.

Cheers,
Robin
 
Re: [llvm-dev] getelementptr inbounds with offset 0

Ralf Jung via llvm-dev
Hi Bruce,

On 25.02.19 13:10, Bruce Hoult wrote:
> LLVM has no idea whether the address computed by GEP is actually
> within a legal object. The "inbounds" keyword is just you, the
> programmer, promising LLVM that you know it's ok and that you don't
> care what happens if it is actually out of bounds.
>
> https://llvm.org/docs/GetElementPtr.html#what-happens-if-an-array-index-is-out-of-bounds

The LangRef says I get a poison value when I am violating the bounds. What I am
asking is what exactly this means when the offset is 0 -- what *are* the
conditions under which an offset-by-0 is "out of bounds" and hence yields poison?
Of course LLVM cannot always statically determine this, but it relies on
(dynamically, on the "LLVM abstract machine") such things not happening, and I
am asking what exactly these dynamic conditions are.

Kind regards,
Ralf

Re: [llvm-dev] getelementptr inbounds with offset 0

Johannes Doerfert via llvm-dev
Hi Ralf,

I wanted to restart this discussion as it is important for my IPO
attribute deduction work as well. Let me share my take on the situation,
no guarantees!


From the Lang-Ref statement

  "With the inbounds keyword, the result value of the GEP is undefined
  if the address is outside the actual underlying allocated object and
  not the address one-past-the-end."

I'd argue that the actual offset value (here 0) is irrelevant. The GEP
value is undefined if inbounds is present and the resulting pointer does
not point into, or one-past-the-end, of an allocated object. This
object, in my understanding, has to be the same one the base pointer of
the GEP points into, or one-past-the-end, or you get again an undefined
result.
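
As a rough executable model of this reading (a sketch only, not LLVM's
implementation): the names `Allocation`, `gep_inbounds` and the `POISON`
marker are hypothetical, addresses are plain integers, and the set of live
allocations is assumed to be fully known, which a real compiler cannot
assume.

```python
from dataclasses import dataclass

POISON = "poison"  # hypothetical marker standing in for LLVM's poison value


@dataclass(frozen=True)
class Allocation:
    base: int  # first address of the object
    size: int  # size in bytes; may be 0

    def in_bounds(self, addr: int) -> bool:
        # "Into the object, or one-past-the-end": base <= addr <= base + size.
        return self.base <= addr <= self.base + self.size


def gep_inbounds(live, base_addr, offset):
    """Return base_addr + offset, or POISON if no single live allocation
    covers both the base pointer and the result (one-past-the-end allowed)."""
    result = base_addr + offset
    for alloc in live:
        if alloc.in_bounds(base_addr) and alloc.in_bounds(result):
            return result
    return POISON


heap = [Allocation(base=0x1000, size=12)]
print(gep_inbounds(heap, 0x1000, 12))  # one-past-the-end: still in bounds
print(gep_inbounds(heap, 0x1000, 13))  # beyond one-past-the-end: poison
print(gep_inbounds(heap, 4, 0))        # no object in `heap` covers address 4
```

In this model the offset value itself plays no special role: the last call
yields poison only because nothing in `heap` covers address 4.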


That being said, your initial "gep inbounds (int2ptr 4) 0" might cause
an undefined value if 4 is not part of a valid allocation, or
one-past-the-end.

Now, whether that causes any problems, e.g., whether LLVM is able to act on
this fact, depends on various factors, including what you do with the
GEP. Your initial problem seemed to be that LLVM "might be able to
deduce dereferenceable memory at location 4", but that should never be the
case if you only form the aforementioned GEP, with or without the
inbounds. Forming a pointer that has an undefined value is just
that, a pointer with an undefined value. A side effect based on the GEP
will, however, __locally__ introduce a dereferenceability assumption (in
my opinion at least). Let's say the code looks like this:


  %G = gep inbounds (int2ptr 4) 0
  ; We don't know anything about the dereferenceability of
  ; the memory at address 4 here.
  br %cnd, %BB0, %BB1

BB0:
  ; We don't know anything about the dereferenceability of
  ; the memory at address 4 here.
  load %G
  ; We know the memory at address 4 is dereferenceable here.
  ; Though, that is due to the load and not the inbounds.
  ...
  br %BB1

BB1:
  ; We don't know anything about the dereferenceability of
  ; the memory at address 4 here.


It is a different story if you start to use the GEP in other operations,
e.g., to alter control flow. Then the (potential) undefined value can
propagate.


Any thoughts on this? Did I at least get your problem description right?

Cheers,
  Johannes



P.S. Sorry if this breaks the thread and apologies that I had to remove
     Bruce from the CC. It turns out replying to an email you did not
     receive is complicated and getting on the LLVM-Dev list is nowadays
     as well...


--

Johannes Doerfert
Researcher

Argonne National Laboratory
Lemont, IL 60439, USA

[hidden email]

Re: [llvm-dev] getelementptr inbounds with offset 0

Finkel, Hal J. via llvm-dev

After I read the message again I think the BB0 comments were wrong. It should have been:


BB0:
  ; We know the memory at address 4 is dereferenceable here.
  ; Though, that is due to the load and not the inbounds.

  load %G
  ; We know the memory at address 4 is dereferenceable here.
  ; Though, that is due to the load and not the inbounds.
  ...
  br %BB1


From: Johannes Doerfert <[hidden email]>
Sent: Thursday, March 7, 2019 11:52:55 AM
To: Ralf Jung
Cc: LLVM Dev
Subject: Re: [llvm-dev] getelementptr inbounds with offset 0
 
Re: [llvm-dev] getelementptr inbounds with offset 0

Ralf Jung via llvm-dev
Hi Johannes,

> From the Lang-Ref statement
>
>   "With the inbounds keyword, the result value of the GEP is undefined
>   if the address is outside the actual underlying allocated object and
>   not the address one-past-the-end."
>
> I'd argue that the actual offset value (here 0) is irrelevant. The GEP
> value is undefined if inbounds is present and the resulting pointer does
> not point into, or one-past-the-end, of an allocated object. This
> object, in my understanding, has to be the same one the base pointer of
> the GEP points into, or one-past-the-end, or you get again an undefined
> result.

Yes, I agree with that reading.

However, the notion of "allocated object" here is not entirely clear.  LLVM has
to operate under the assumption that there are allocations and allocators it does
not know anything about.  Just imagine some embedded project writing to
well-known address 0xDeadCafe because there is a hardware register there.

So, the thinking here is: LLVM cannot exclude the possibility of an object of
size 0 existing at any given address.  The pointer returned by "GEPi p 0" then
would be one-past-the-end of such a 0-sized object.  Thus, "GEPi p 0" is the
identity function for any p; it will not return poison.
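
The argument can be made concrete with a toy model (hypothetical helper
names, integer addresses, not LLVM code): if a zero-sized object is assumed
to exist at `p`, then `p` is that object's one-past-the-end address, so the
in-bounds check for offset 0 succeeds while no byte at `p` becomes
dereferenceable.

```python
def in_bounds(base: int, size: int, addr: int) -> bool:
    # Inside the object or exactly one-past-the-end.
    return base <= addr <= base + size


def dereferenceable(base: int, size: int, addr: int, width: int = 1) -> bool:
    # All `width` bytes starting at addr must lie inside the object.
    return base <= addr and addr + width <= base + size


def gepi_zero(p: int) -> int:
    """Model of "GEPi p 0", assuming a zero-sized object may exist at any p."""
    base, size = p, 0                # the assumed zero-sized object at p
    assert in_bounds(base, size, p)  # p is one-past-the-end of that object
    return p                         # identity: never poison in this model


for p in (4, 0x4000, 0xDEADCAFE):
    # Address, whether GEPi is the identity, whether p can be dereferenced.
    print(hex(p), gepi_zero(p) == p, dereferenceable(p, 0, p))
```

The model only covers offset 0; any non-zero offset leaves the assumed
zero-sized object and is outside the scope of the argument.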

> Now, whether that causes any problems, e.g., whether LLVM is able to act on
> this fact, depends on various factors, including what you do with the
> GEP. Your initial problem seemed to be that LLVM "might be able to
> deduce dereferenceable memory at location 4", but that should never be the
> case if you only form the aforementioned GEP, with or without the
> inbounds. Forming a pointer that has an undefined value is just
> that, a pointer with an undefined value.

Ah, good point.  First of all I was indeed unclear; the case I am worried about
here is GEPi returning poison.  (These values might be used in further
computations and eventually surface as UB.)
But also, clearly a "GEPi 0" alone cannot introduce any dereferenceability
assumption because of the "one-past-the-end" case. That point is inbounds but
cannot be dereferenced.

So, for the sake of a more concrete example (and please excuse me butchering
LLVM syntax, I usually deal with this in terms of C or Rust syntax): Can %G in
the following programs be poison?  If yes, what is the analysis that would be
weakened or the optimization that could no longer happen if "GEPi %P 0" was
instead defined to always return %P?

# example1

%P = inttoptr i64 4 to i8*
%G = getelementptr inbounds i8, i8* %P, i64 0

# example2

%P = call noalias i8* @malloc(i64 12)
call void @free(i8* %P)
%G = getelementptr inbounds i8, i8* %P, i64 0

The first happens in Rust all the time, and we rely on not getting poison.  The
second doesn't occur in Rust (to my knowledge), but it seems somewhat
inconsistent to return poison in one case and not the other.

Kind regards,
Ralf

Re: [llvm-dev] getelementptr inbounds with offset 0

Johannes Doerfert via llvm-dev
Hi Ralf,

On 03/15, Ralf Jung wrote:

> > From the Lang-Ref statement
> >
> >   "With the inbounds keyword, the result value of the GEP is undefined
> >   if the address is outside the actual underlying allocated object and
> >   not the address one-past-the-end."
> >
> > I'd argue that the actual offset value (here 0) is irrelevant. The GEP
> > value is undefined if inbounds is present and the resulting pointer does
> > not point into, or one-past-the-end, of an allocated object. This
> > object, in my understanding, has to be the same one the base pointer of
> > the GEP points into, or one-past-the-end, or you get again an undefined
> > result.
>
> Yes, I agree with that reading.
That's reassuring for me ;)


> However, the notion of "allocated object" here is not entirely clear.

True.


> LLVM has to operate under the assumption that there are allocations
> and allocators it does not know anything about.  Just imagine some
> embedded project writing to well-known address 0xDeadCafe because
> there is a hardware register there.

True.


> So, the thinking here is: LLVM cannot exclude the possibility of an
> object of size 0 existing at any given address.  The pointer returned
> by "GEPi p 0" then would be one-past-the-end of such a 0-sized object.
> Thus, "GEPi p 0" is the identity function for any p; it will not
> return poison.

I don't see the problem. The behavior I hope we want and implement is:

Either LLVM knows that %p points to an invalid address (= non-object) or
it doesn't. If it does, %p and all GEPs on it yield poison. If it
doesn't, it has to assume %p points to a valid address, and offsets 0, 1,
2, ... might all yield valid pointers. The special case is when we know
%p is valid and has an extent of (at most) S; then all offsets <= S,
including 0, are potentially valid (negative extents are similar).


> > Now, whether that causes any problems, e.g., whether LLVM is able to act
> > on this fact, depends on various factors, including what you do with
> > the GEP. Your initial problem seemed to be that LLVM "might be able
> > to deduce dereferenceable memory at location 4", but that should never
> > be the case if you only form the aforementioned GEP, with or without
> > the inbounds. Forming a pointer that has an undefined value
> > is just that, a pointer with an undefined value.
>
> Ah, good point.  First of all I was indeed unclear; the case I am
> worried about here is GEPi returning poison.  (These values might be
> used in further computations and eventually surface as UB.) But also,
> clearly a "GEPi 0" alone cannot introduce any dereferenceability
> assumption because of the "one-past-the-end" case. That point is
> inbounds but cannot be dereferenced.
>
> So, for the sake of a more concrete example (and please excuse me
> butchering LLVM syntax, I usually deal with this in terms of C or Rust
> syntax): Can %G in the following programs be poison?  If yes, what is
> the analysis that would be weakened or the optimization that could no
> longer happen if "GEPi %P 0" was instead defined to always return %P?
>
> # example1
>
> %P1 = inttoptr i64 4 to i8*
> %G1 = getelementptr inbounds i8, i8* %P1, i64 0
>
> # example2
>
> %P2 = call noalias i8* @malloc(i64 12)
> call void @free(i8* %P2)
> %G2 = getelementptr inbounds i8, i8* %P2, i64 0
>
> The first happens in Rust all the time, and we rely on not getting
> poison.  The second doesn't occur in Rust (to my knowledge), but it
> seems somewhat inconsistent to return poison in one case and not the
> other.
Let's start with example2; note that I renamed the values above.

%P2 is dangling (and we know it) after the free. %P2 is therefore
poison* and so is %G2.

* or undef; I always confuse the two, which might be bad in this conversation.



In example1, without further information, I'd say that there is no
poison (statically). Address 4 could be an allocated object until proven
otherwise.


I am still a little confused about the problem you see. If what I wrote
about the implemented behavior holds true (which I am not totally sure
of), you should not have a problem with poison even if you
sprinkle GEP (inbounds) %p 0 all over the place. Either %p was known to
be invalid and so is the GEP, or %p was not known to be invalid and
neither is the GEP. Am I missing something here?

Cheers,
  Johannes

> > A side-effect based on the GEP will however __locally__ introduce a
> > dereferenceability assumption (in my opinion at least). Let's say the
> > code looks like this:
> >
> >
> >   %G = gep inbounds (int2ptr 4) 0
> >   ; We don't know anything about the dereferenceability of
> >   ; the memory at address 4 here.
> >   br %cnd, %BB0, %BB1
> >
> > BB0:
> >   ; We don't know anything about the dereferenceability of
> >   ; the memory at address 4 here.
> >   load %G
> >   ; We know the memory at address 4 is dereferenceable here.
> >   ; Though, that is due to the load and not the inbounds.
> >   ...
> >   br %BB1
> >
> > BB1:
> >   ; We don't know anything about the dereferenceability of
> >   ; the memory at address 4 here.
> >
> >
> > It is a different story if you start to use the GEP in other
> > operations, e.g., to alter control flow. Then the (potential)
> > undefined value can propagate.
> >
> >
> > Any thought on this? Did I at least get your problem description
> > right?
> >
> > Cheers, Johannes
> >
> >
> >
> > P.S. Sorry if this breaks the thread and apologies that I had to
> > remove Bruce from the CC. It turns out replying to an email you did
> > not receive is complicated and getting on the LLVM-Dev list is
> > nowadays as well...
> >
> >
> > On 02/25, Ralf Jung via llvm-dev wrote:
> >> Hi Bruce,
> >>
> >> On 25.02.19 13:10, Bruce Hoult wrote:
> >>> LLVM has no idea whether the address computed by GEP is actually
> >>> within a legal object. The "inbounds" keyword is just you, the
> >>> programmer, promising LLVM that you know it's ok and that you
> >>> don't care what happens if it is actually out of bounds.
> >>>
> >>> https://llvm.org/docs/GetElementPtr.html#what-happens-if-an-array-index-is-out-of-bounds
> >>
> >> The LangRef says I get a poison value when I am violating the
> >> bounds. What I am asking is what exactly this means when the offset
> >> is 0 -- what *are* the conditions under which an offset-by-0 is
> >> "out of bounds" and hence yields poison?  Of course LLVM cannot
> >> always statically determine this, but it relies on (dynamically, on
> >> the "LLVM abstract machine") such things not happening, and I am
> >> asking what exactly these dynamic conditions are.
> >>
> >> Kind regards, Ralf
> >>
> >
--

Johannes Doerfert
Researcher

Argonne National Laboratory
Lemont, IL 60439, USA

[hidden email]

Re: [llvm-dev] getelementptr inbounds with offset 0

Finkel, Hal J. via llvm-dev
Hi Johannes,

>> So, the thinking here is: LLVM cannot exclude the possibility of an
>> object of size 0 existing at any given address.  The pointer returned
>> by "GEPi p 0" then would be one-past-the-end of such a 0-sized object.
>> Thus, "GEPi p 0" is the identity function for any p, it will not
>> return poison.
>
> I don't see the problem. The behavior I hope we want and implement is:
>
> Either LLVM knows that %p points to an invalid address (=non-object) or
> it doesn't. If it does, %p and all GEPs on it yield poison. If it
> doesn't, it has to assume %p points to a valid address and offset 0, 1,
> 2, ... might all yield valid pointers. The special case is when we know
> %p is valid and has an extent of (at most) S, then all offsets <= S,
> including 0, are potentially valid (negative extents are similar).

So you are basically saying that whether the offset is 0 or not does not matter,
but whether the base is an object LLVM can know about or not does?  I see.  That
makes sense.

The reason I restricted myself to offset 0 is that we'd like to do this without
actually having any accessible objects anywhere, which works out if the objects
have size 0.

FWIW, in <https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf> we anyway had to
make "getelementptr inbounds" on integer pointers (pointers obtained by casting
an integer to a pointer) never yield poison directly and instead defer the
in-bound check to the time when the actual access happens.  That nicely
accommodates all uses of getelementptr that just compute addresses without ever
using them for a memory access (using them only, e.g. to compute offsets or
compare pointers).  But this is not how the LLVM LangRef is written, unfortunately.

>> # example1
>>
>> %P1 = int2ptr 4
>> %G1 = gep inbounds %P1 0
>>
>> # example2
>>
>> %P2 = call noalias i8* @malloc(i64 12)
>> call void @free(i8* %P2)
>> %G2 = gep inbounds %P2 0
>>
>> The first happens in Rust all the time, and we rely on not getting
>> poison.  The second doesn't occur in Rust (to my knowledge), but it
>> seems somewhat inconsistent to return poison in one case and not the
>> other.
>
> Let's start with example2, note that I renamed the values above.
>
> %P2 is dangling (and we know it) after the free. %P2 is therefore
> poison* and so is %G2.
>
> * or undef I'm always confused which might be bad in this conversation.

Wait, I know that C has a rule that dangling pointers are "indeterminate", but
this is the first time I hear that LLVM has it as well.  Is that written down
anywhere?  Rust relies heavily on dangling pointers being well-behaved when used
only in comparisons and casts (no accesses), so this would be a big deal.
(Also, this rule in C is pretty much impossible to formalize and serves no
purpose that I know of, but that is a separate discussion.)
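
To make "comparisons and casts (no accesses)" concrete, here is a small sketch on the Rust side. The function names are made up for this illustration, and it says nothing about what LLVM is or is not allowed to assume; it only shows the kind of use of a dangling raw pointer that is meant here.

```rust
/// Returns the address of a heap allocation as observed before and after
/// the allocation is freed. Only casts are performed on the dangling pointer.
pub fn dangling_addr_roundtrip() -> (usize, usize) {
    let boxed = Box::new(12u64);
    let p: *const u64 = &*boxed;
    let before = p as usize;
    drop(boxed); // `p` is dangling from here on
    let after = p as usize; // pointer-to-integer cast, no memory access
    (before, after)
}

/// Compares two copies of a dangling raw pointer for equality.
pub fn dangling_compare() -> bool {
    let boxed = Box::new(0u8);
    let p: *const u8 = &*boxed;
    let q = p;
    drop(boxed); // both `p` and `q` are dangling now
    p == q // comparison only, no memory access
}
```

Both functions are safe Rust: the raw pointers are never dereferenced after the `drop`.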

> In example1, without further information, I'd say that there is no
> poison (statically). Address 4 could be an allocated object until proven
> otherwise.
>
>
> I am still a little confused about the problem you see. If what I wrote
> about the implemented behavior holds true (which I am not totally sure
> of), you should not have a problem with poison even if you would
> sprinkle GEP (inbounds) %p 0 all over the place. Either %p was known to
> be invalid and so is the GEP, or %p was not known to be invalid and
> neither is the GEP. Am I missing something here?

The thing is, I am not asking about the behavior implemented today but about the
behavior of the "abstract LLVM machine" that is described by the LangRef and
that the optimizer has to justify its transformations against.  Analyses become
smarter every day, so looking at what LLVM deduces from certain instructions is
but a snapshot.

But also, your response assumes "dangling pointers are undef/poison", which is
new to me.  I'd be rather shocked if this is something LLVM actually relies on
anywhere.

Kind regards,
Ralf

Re: [llvm-dev] getelementptr inbounds with offset 0

Finkel, Hal J. via llvm-dev
Hi Ralf,

On 03/26, Ralf Jung wrote:

> >> So, the thinking here is: LLVM cannot exclude the possibility of an
> >> object of size 0 existing at any given address.  The pointer returned
> >> by "GEPi p 0" then would be one-past-the-end of such a 0-sized object.
> >> Thus, "GEPi p 0" is the identity function for any p, it will not
> >> return poison.
> >
> > I don't see the problem. The behavior I hope we want and implement is:
> >
> > Either LLVM knows that %p points to an invalid address (=non-object) or
> > it doesn't. If it does, %p and all GEPs on it yield poison. If it
> > doesn't, it has to assume %p points to a valid address and offset 0, 1,
> > 2, ... might all yield valid pointers. The special case is when we know
> > %p is valid and has an extent of (at most) S, then all offsets <= S,
> > including 0, are potentially valid (negative extents are similar).
>
> So you are basically saying whether the offset is 0 or not does not matter, but
> whether the base is an object LLVM can know about or not does?  I see.  That
> makes sense.
Yes, if we are not in the special case (object valid and extent is known).

> The reason I restricted myself to offset 0 is that we'd like to do this without
> actually having any accessible objects anywhere, which works out if the objects
> have size 0.

Now that reasoning works from a conceptual standpoint only for
non-inbounds GEPs, I think. From a practical standpoint my above
description will probably make sure everything works out just fine (see
also my rephrased answer down below!). I say this because I think the
following lang-ref passage makes sure everything, not only memory
accesses, involving a non-pointer-to-object* GEP is poison:
  "If the inbounds keyword is present, the result value of the
   getelementptr is a poison value if the base pointer is not an in
   bounds address of an allocated object"

* I would argue every object needs to have an extent, hence cannot be
  zero-sized.


> FWIW, in <https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf> we anyway had to
> make "getelementptr inbounds" on integer pointers (pointers obtained by casting
> an integer to a pointer) never yield poison directly and instead defer the
> in-bound check to the time when the actual access happens.  That nicely
> accommodates all uses of getelementptr that just compute addresses without ever
> using them for a memory access (using them only, e.g. to compute offsets or
> compare pointers).  But this is not how the LLVM LangRef is written, unfortunately.

I see. Is there a quick answer to the question of why you need inbounds
GEPs in that case? Can't you just use non-inbounds GEPs if you know you
might not have a valid base ptr and "optimize" it to inbounds once that
is proven?
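
For readers following along from the Rust side, a rough sketch of what "just use non-inbounds" could look like at the source level. It assumes, without verifying it here, that rustc lowers `wrapping_add` to a GEP without `inbounds`, while `add` and field accesses get `inbounds`. The function name is hypothetical.

```rust
/// Offsets an integer-derived pointer by zero elements without making any
/// in-bounds promise, and returns the resulting address.
pub fn offset_zero_no_inbounds(addr: usize) -> usize {
    let p = addr as *const u8;
    // `wrapping_add` is a safe method: the result does not have to stay
    // inside any allocation as long as it is never dereferenced.
    let q = p.wrapping_add(0);
    q as usize
}
```

This only covers explicit pointer arithmetic; it does not address compiler-generated GEPs for field accesses and array indexing.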

> >> # example1
> >>
> >> %P1 = int2ptr 4
> >> %G1 = gep inbounds %P1 0
> >>
> >> # example2
> >>
> >> %P2 = call noalias i8* @malloc(i64 12)
> >> call void @free(i8* %P2)
> >> %G2 = gep inbounds %P2 0
> >>
> >> The first happens in Rust all the time, and we rely on not getting
> >> poison.  The second doesn't occur in Rust (to my knowledge), but it
> >> seems somewhat inconsistent to return poison in one case and not the
> >> other.
> >
> > Let's start with example2, note that I renamed the values above.
> >
> > %P2 is dangling (and we know it) after the free. %P2 is therefore
> > poison* and so is %G2.
> >
> > * or undef I'm always confused which might be bad in this conversation.
>
> Wait, I know that C has a rule that dangling pointers are "indeterminate" but
> this is the first time I hear that LLVM has it as well.  Is that written down
> anywhere?  Rust relies heavily on dangling pointers being well-behaved when used
> only in comparisons and casts (no accesses), so this would be a big deal.
> (Also, this rule in C is pretty much impossible to formalize and serves no
> purpose that I know of, but that is a separate discussion.)
I am not very formal in this thread and I realize that this might be a
problem, sorry. The above quote from the lang-ref [0] is why I think
"dangling" inbounds GEPs are poison, do you concur?

[0] https://llvm.org/docs/LangRef.html#getelementptr-instruction


> > In example1, without further information, I'd say that there is no
> > poison (statically). Address 4 could be an allocated object until proven
> > otherwise.
> >
> >
> > I am still a little confused about the problem you see. If what I wrote
> > about the implemented behavior holds true (which I am not totally sure
> > of), you should not have a problem with poison even if you would
> > sprinkle GEP (inbounds) %p 0 all over the place. Either %p was known to
> > be invalid and so is the GEP, or %p was not known to be invalid and
> > neither is the GEP. Am I missing something here?
>
> The thing is, I am not asking about the behavior implemented today but about the
> behavior of the "abstract LLVM machine" that is described by the LangRef and
> that the optimizer has to justify its transformations against.  Analyses become
> smarter every day, so looking at what LLVM deduces from certain instructions is
> but a snapshot.
I agree with your intent, but my argument here was not "we cannot
figure X out today, so all is good". What I wanted to say (or should have
said) is something more along the lines of:
  Undefined behavior in C/LLVM-IR is often (runtime) value dependent and
  therefore statically not decidable. If it is not decidable, the code must
  be assumed to have defined (="the normal") behavior statically. This
  should be preserved by current and future LLVM passes. Your particular
  example (example1) seems to me like such a case in which the semantics
  is statically not decidable and therefore I do not see any problem.

Again, I might just be wrong about this. Please don't pin it on me at the
end of the day.

> But also, your response assumes "dangling pointers are undef/poison", which is
> new to me.  I'd be rather shocked if this is something LLVM actually relies on
> anywhere.

Again, that is how I read the quoted lang-ref wording above for
inbounds GEPs. I agree with you that non-inbounds GEPs have a "normal"
value that can be used for all non-access instructions in the usual way
without producing undef/poison.

Cheers,
  Johannes


Re: [llvm-dev] getelementptr inbounds with offset 0

Finkel, Hal J. via llvm-dev
Hi Johannes,

> Now that reasoning works from a conceptual standpoint only for
> non-inbounds GEPs, I think. From a practical standpoint my above
> description will probably make sure everything works out just fine (see
> also my rephrased answer down below!). I say this because I think the
> following lang-ref passage makes sure everything, not only memory
> accesses, involving a non-pointer-to-object* GEP is poison:
>   "If the inbounds keyword is present, the result value of the
>    getelementptr is a poison value if the base pointer is not an in
>    bounds address of an allocated object"
>
> * I would argue every object needs to have an extent, hence cannot be
>   zero-sized.

I would find that a rather surprising exception / special case.  There's nothing
wrong with objects of size 0.

>> FWIW, in <https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf> we anyway had to
>> make "getelementptr inbounds" on integer pointers (pointers obtained by casting
>> an integer to a pointer) never yield poison directly and instead defer the
>> in-bound check to the time when the actual access happens.  That nicely
>> accommodates all uses of getelementptr that just compute addresses without ever
>> using them for a memory access (using them only, e.g. to compute offsets or
>> compare pointers).  But this is not how the LLVM LangRef is written, unfortunately.
>
> I see. Is there a quick answer to the questions why you need inbounds
> GEPs in that case? Can't you just use non-inbounds GEPs if you know you
> might not have a valid base ptr and "optimize" it to inbounds once that
> is proven?

You mean on the Rust side?  We emit GEPi for field accesses and array indexing.
We cannot always statically determine if this is happening for a ZST (zero-sized
type) or not.  At the same time, given that no memory access ever happens for a
ZST, allocating a ZST (Box::new in Rust, think of it like new in C++) does not
actually allocate any memory; it just returns an integer (sufficiently aligned)
cast to a pointer.
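
A minimal sketch of the ZST case, for illustration only. The type `Zst` and the helper functions are made up for this example; the code deliberately checks only that the boxed pointer is non-null and aligned, since the exact address is an implementation detail.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical zero-sized type with a non-trivial alignment requirement.
#[repr(align(4))]
pub struct Zst;

/// Size and alignment of `Zst`, in that order.
pub fn zst_layout() -> (usize, usize) {
    (size_of::<Zst>(), align_of::<Zst>())
}

/// Boxes a `Zst` and returns the address stored in the box.
pub fn zst_box_addr() -> usize {
    let boxed: Box<Zst> = Box::new(Zst);
    let raw: *mut Zst = Box::into_raw(boxed);
    let addr = raw as usize;
    // Rebuild the box so it is dropped normally; for a ZST this frees nothing.
    unsafe { drop(Box::from_raw(raw)) };
    addr
}
```

Field accesses and indexing through such a pointer are where the GEPi with offset 0 on an integer-derived pointer comes from.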

>>> Let's start with example2, note that I renamed the values above.
>>>
>>> %P2 is dangling (and we know it) after the free. %P2 is therefore
>>> poison* and so is %G2.
>>>
>>> * or undef I'm always confused which might be bad in this conversation.
>>
>> Wait, I know that C has a rule that dangling pointers are "indeterminate" but
>> this is the first time I hear that LLVM has it as well.  Is that written down
>> anywhere?  Rust relies heavily on dangling pointers being well-behaved when used
>> only in comparisons and casts (no accesses), so this would be a big deal.
>> (Also, this rule in C is pretty much impossible to formalize and serves no
>> purpose that I know of, but that is a separate discussion.)
>
> I am not very formal in this thread and I realize that this might be a
> problem, sorry. The above quote from the lang-ref [0] is why I think
> "dangling" inbounds GEPs are poison, do you concur?
>
> [0] https://llvm.org/docs/LangRef.html#getelementptr-instruction

You said above that even the *input* (%P2) would be poison.  That is the part I
am doubting.
If the input is not poison (just dangling), then we come back to the original
question behind example2 -- and yes, I can see a reading of the GEPi spec that
makes this poison.  On the other hand, this would make a dangling pointer
(formerly pointing to an object) behave differently from an integer pointer that
never pointed to any object, which seems odd.

> I agree with your intent, but: My argument here was not to say we cannot
> figure X out today so all is good. What I wanted to say/should have said
> is something more along the lines of:
>   Undefined behavior in C/LLVM-IR is often (runtime) value dependent and
>   therefore statically not decidable. If it is not, the code must be
>   assumed to have defined (="the normal") behavior statically. This
>   should be preserved by current and future LLVM passes. Your particular
>   example (example1) seems to me like such a case in which the semantics
>   is statically not decidable and therefore I do not see any problem.
>
> Again, I might just be wrong about this. Please don't pin it on me at the
> end of the day.

Sure, UB is definitely *defined* in a runtime-value dependent way.  The problem
here is that it is not defined in a precise way -- something where one could
write an interpreter that tracks all the extra state that is needed (like
poison/undef and where allocations lie) and then says precisely under which
conditions we have UB and under which we do not.
What I am asking here for is the exact definition of GEPi if, *at run-time*, the
offset is 0, and the base pointer is (a) an integer, or (b) dangling.
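
To illustrate what such an interpreter-style definition might look like, here is a toy sketch. Everything in it is hypothetical: the `Alloc` and `Rule` types, the function name, and both rules are candidate readings for discussion, not what the LangRef says or what LLVM implements. The sketch also ignores provenance and decides purely by address.

```rust
/// One allocation in a toy abstract machine: addresses [base, base + size),
/// with a flag recording whether it has been freed.
#[derive(Clone, Copy)]
pub struct Alloc {
    pub base: u64,
    pub size: u64,
    pub live: bool,
}

/// Two candidate definitions of "GEPi p 0" (both hypothetical).
#[derive(Clone, Copy)]
pub enum Rule {
    /// Offset 0 is always the identity, as if a zero-sized object could
    /// exist at any address.
    ZeroOffsetIdentity,
    /// The base must lie within, or one past the end of, a live allocation.
    NeedsLiveObject,
}

/// Evaluates "GEPi p 0": `Some(address)` is a normal result, `None` is poison.
pub fn gep_inbounds_zero(allocs: &[Alloc], p: u64, rule: Rule) -> Option<u64> {
    match rule {
        Rule::ZeroOffsetIdentity => Some(p),
        Rule::NeedsLiveObject => {
            // The `p >= a.base` check runs first, so the subtraction cannot underflow.
            let in_bounds = allocs
                .iter()
                .any(|a| a.live && p >= a.base && p - a.base <= a.size);
            if in_bounds { Some(p) } else { None }
        }
    }
}
```

Case (a) corresponds to calling this with an address such as 4 that no allocation covers, and case (b) to an address inside an `Alloc` whose `live` flag is false. Under `NeedsLiveObject` both yield poison; under `ZeroOffsetIdentity` neither does.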

>> But also, your response assumes "dangling pointers are undef/poison", which is
>> new to me.  I'd be rather shocked if this is something LLVM actually relies on
>> anywhere.
>
> Again, that is how I read the quoted lang-ref wording above for
> inbounds GEPs. I agree with you that non-inbounds GEPs have a "normal"
> value that can be used for all non-access instructions in the usual way
> without producing undef/poison.

I must be missing something here.  You said above "%P2 is dangling (and we know
it) after the free. %P2 is therefore poison" -- at this point, GEPi has not even
happened yet!  If GEPi does something, it will make the *output* poison (%G2),
but you are saying the *input* becomes poison (%P2), and that cannot be a
consequence of GEPi at all.

Kind regards,
Ralf

>
> Cheers,
>   Johannes
>
>
>>>>> A side-effect based on the GEP will however __locally__ introduce an
>>>>> dereferencability assumption (in my opinion at least). Let's say the
>>>>> code looks like this:
>>>>>
>>>>>
>>>>>   %G = gep inbounds (int2ptr 4) 0 ; We don't know anything about the
>>>>>   dereferencability of ; the memory at address 4 here.  br %cnd,
>>>>>   %BB0, %BB1
>>>>>
>>>>> BB0: ; We don't know anything about the dereferencability of ; the
>>>>> memory at address 4 here.  load %G ; We know the memory at address 4
>>>>> is dereferenceable here.  ; Though, that is due to the load and not
>>>>> the inbounds.  ...  br %BB1
>>>>>
>>>>> BB1: ; We don't know anything about the dereferencability of ; the
>>>>> memory at address 4 here.
>>>>>
>>>>>
>>>>> It is a different story if you start to use the GEP in other
>>>>> operations, e.g., to alter control flow. Then the (potential)
>>>>> undefined value can propagate.
>>>>>
>>>>>
>>>>> Any thought on this? Did I at least get your problem description
>>>>> right?
>>>>>
>>>>> Cheers, Johannes
>>>>>
>>>>>
>>>>>
>>>>> P.S. Sorry if this breaks the thread and apologies that I had to
>>>>> remove Bruce from the CC. It turns out replying to an email you did
>>>>> not receive is complicated and getting on the LLVM-Dev list is
>>>>> nowadays as well...
>>>>>
>>>>>
>>>>> On 02/25, Ralf Jung via llvm-dev wrote:
>>>>>> Hi Bruce,
>>>>>>
>>>>>> On 25.02.19 13:10, Bruce Hoult wrote:
>>>>>>> LLVM has no idea whether the address computed by GEP is actually
>>>>>>> within a legal object. The "inbounds" keyword is just you, the
>>>>>>> programmer, promising LLVM that you know it's ok and that you
>>>>>>> don't care what happens if it is actually out of bounds.
>>>>>>>
>>>>>>> https://llvm.org/docs/GetElementPtr.html#what-happens-if-an-array-index-is-out-of-bounds
>>>>>>
>>>>>> The LangRef says I get a poison value when I am violating the
>>>>>> bounds. What I am asking is what exactly this means when the offset
>>>>>> is 0 -- what *are* the conditions under which an offset-by-0 is
>>>>>> "out of bounds" and hence yields poison?  Of course LLVM cannot
>>>>>> always statically determine this, but it relies on (dynamically, on
>>>>>> the "LLVM abstract machine") such things not happening, and I am
>>>>>> asking what exactly these dynamic conditions are.
>>>>>>
>>>>>> Kind regards, Ralf
>>>>>>
>>>>>>>
>>>>>>> On Sun, Feb 24, 2019 at 9:05 AM Ralf Jung via llvm-dev
>>>>>>> <[hidden email]> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> What exactly are the rules for `getelementptr inbounds` with
>>>>>>>> offset 0?
>>>>>>>>
>>>>>>>> In Rust, we are relying on the fact that if we use, for example,
>>>>>>>> `inttoptr` to turn `4` into a pointer, we can then do
>>>>>>>> `getelementptr inbounds` with offset 0 on that without LLVM
>>>>>>>> deducing that there actually is any dereferencable memory at
>>>>>>>> location 4.  The argument is that we can think of there being a
>>>>>>>> zero-sized allocation. Is that a reasonable assumption?  Can
>>>>>>>> something like this be documented in the LangRef?
>>>>>>>>
>>>>>>>> Relatedly, how does the situation change if the pointer is not
>>>>>>>> created "out of thin air" from a fixed integer, but is actually a
>>>>>>>> dangling pointer obtained previously from `malloc` (or `alloca`
>>>>>>>> or whatever)?  Is `getelementptr inbounds` with offset 0 on such a
>>>>>>>> pointer a NOP, or does it result in `poison`?  And if that makes
>>>>>>>> a difference, how does that square with the fact that, e.g., the
>>>>>>>> integer `0x4000` could well be inside such an allocation, but
>>>>>>>> doing `getelementptr inbounds` with offset 0 on that would fall
>>>>>>>> under the first question above?
>>>>>>>>
>>>>>>>> Kind regards, Ralf
>>>>>>>> _______________________________________________ LLVM Developers
>>>>>>>> mailing list [hidden email]
>>>>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>> _______________________________________________ LLVM Developers
>>>>>> mailing list [hidden email]
>>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>
>>>
>

Re: [llvm-dev] getelementptr inbounds with offset 0

Finkel, Hal J. via llvm-dev
On 03/27, Ralf Jung wrote:

> > Now that reasoning works from a conceptual standpoint only for
> > non-inbounds GEPs, I think. From a practical standpoint my above
> > description will probably make sure everything works out just fine (see
> > also my rephrased answer down below!). I say this because I think the
> > following lang-ref passage makes sure everything, not only memory
> > accesses, involving a non-pointer-to-object* GEP is poison:
> >   "If the inbounds keyword is present, the result value of the
> >    getelementptr is a poison value if the base pointer is not an in
> >    bounds address of an allocated object"
> >
> > * I would argue every object needs to have an extent, hence cannot be
> >   zero-sized.
>
> I would find that a rather surprising exception / special case.  There's nothing
> wrong with objects of size 0.
I guess you're right. It will not change my argument above, though:
if an object is known to have extent 0, it falls under the known-extent
special case.

> >> FWIW, in <https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf> we anyway had to
> >> make "getelementptr inbounds" on integer pointers (pointers obtained by casting
> >> an integer to a pointer) never yield poison directly and instead defer the
> >> in-bound check to the time when the actual access happens.  That nicely
> >> accommodates all uses of getelementptr that just compute addresses without ever
> >> using them for a memory access (using them only, e.g. to compute offsets or
> >> compare pointers).  But this is not how the LLVM LangRef is written, unfortunately.
> >
> > I see. Is there a quick answer to the questions why you need inbounds
> > GEPs in that case? Can't you just use non-inbounds GEPs if you know you
> > might not have a valid base ptr and "optimize" it to inbounds once that
> > is proven?
>
> You mean on the Rust side?  We emit GEPi for field accesses and array indexing.
>  We cannot always statically determine if this is happening for a ZST or not.
> At the same time, given that no memory access ever happens for a ZST, allocating
> a ZST (Box::new in Rust, think of it like new in C++) does not actually allocate
> any memory, it just returns an integer (sufficiently aligned) cast to a pointer.
OK, but why not emit non-inbounds GEPs instead? They do not come with
the problems you have now, or maybe I misunderstand.

> >>> Let's start with example2, note that I renamed the values above.
> >>>
> >>> %P2 is dangling (and we know it) after the free. %P2 is therefore
> >>> poison* and so is %G2.
> >>>
> >>> * or undef I'm always confused which might be bad in this conversation.
> >>
> >> Wait, I know that C has a rule that dangling pointers are "indeterminate" but
> >> this is the first time I hear that LLVM has it as well.  Is that written down
> >> anywhere?  Rust relies heavily in dangling pointers being well-behaved when used
> >> only on comparisons and casts (no accesses), so this would be a big deal.
> >> (Also, this rule in C is pretty much impossible to formalize and serves no
> >> purpose that I know of, but that is a separate discussion.)
> >
> > I am not very formal in this thread and I realize that this might be a
> > problem, sorry. The above quote from the lang-ref [0] is why I think
> > "dangling" inbounds GEPs are poison, do you concur?
> >
> > [0] https://llvm.org/docs/LangRef.html#getelementptr-instruction
>
> You said above that even the *input* (%P2) would be poison.  That is the part I
> am doubting.
You are right again I guess. %P2, the input, is probably not poison
after the free but just a dangling pointer.


> If the input is not poison (just dangling), then we come back to the original
> question behind example2 -- and yes, I can see a reading of the GEPi spec that
> makes this poison.  On the other hand, this would make a dangling pointer
> (formerly pointing to an object) behave different than an integer pointer that
> never pointed to any object, which seems odd.

I do not know how to read this, but one way to make sense of it is to
assume that the GEPi on an object is fine but that its value becomes poison
the moment you deallocate the object. That is, %G2 in the example is
non-poison until the free and poison afterwards. That is at least what I
assume the semantics to be here.


> > I agree with your intent, but: My argument here was not to say we cannot
> > figure X out today so all is good. What I wanted to say/should have said
> > is something more along the line of:
> >   Undefined behavior in C/LLVM-IR is often (runtime) value dependent and
> >   therefore statically not decidable. If it is not, the code must be
> >   assumed to have defined (="the normal") behavior statically. This
> >   should be preserved by current and future LLVM passes. Your particular
> >   example (example1) seems to me like such a case in which the semantics
> >   is statically not decidable and therefore I do not see any problem.
> >
> > Again, I might just be wrong about this. Please don't pin it on me at the end
> > of the day.
>
> Sure, UB is definitely *defined* in a runtime-value dependent way.  The problem
> here is that it is not defined in a precise way -- something where one could
> write an interpreter that tracks all the extra state that is needed (like
> poison/undef and where allocations lie) and then says precisely under which
> conditions we have UB and under which we do not.
> What I am asking here for is the exact definition of GEPi if, *at run-time*, the
> offset is 0, and the base pointer is (a) an integer, or (b) dangling.
That last part is given by the lang-ref (imo):
  "If the inbounds keyword is present, the result value of the
   getelementptr is a poison value if the base pointer is not an in
   bounds address of an allocated object"

I read this as: If you have a GEPi, you get poison if the base pointer
does not point into an allocated object. That is, a dangling pointer (b)
causes the GEPi to be poison, and a pointer from an integer (a) may do so
too, if the address denoted by the integer is not inside, or one past the
end of, an allocated object. Any offset other than 0 adds more possible
ways to generate a poison value.
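
As a rough illustration of this reading, here is a toy Rust model of the
check. It is only a sketch: `Allocation` and `gep_inbounds` are hypothetical
names, the model tracks plain addresses (no provenance), and `None` stands in
for poison. It is not how LLVM implements anything.

```rust
/// Toy model of one reading of the LangRef sentence quoted above.
/// All names are hypothetical; this is not LLVM code.
struct Allocation {
    base: u64,
    size: u64,
    live: bool, // false once the object has been deallocated
}

impl Allocation {
    /// True if `addr` lies inside the live object or one past its end.
    fn in_bounds(&self, addr: u64) -> bool {
        self.live && addr >= self.base && addr - self.base <= self.size
    }
}

/// Returns `Some(address)` for a regular result and `None` to stand for poison.
fn gep_inbounds(allocs: &[Allocation], base_ptr: u64, offset: i64) -> Option<u64> {
    let result = base_ptr.checked_add_signed(offset)?;
    // Both the base pointer and the result must be in bounds of the same
    // live allocation; otherwise the result is poison.
    allocs
        .iter()
        .any(|a| a.in_bounds(base_ptr) && a.in_bounds(result))
        .then_some(result)
}

fn main() {
    let allocs = [
        Allocation { base: 0x4000, size: 16, live: true },
        Allocation { base: 0x5000, size: 16, live: false },
    ];
    // (a) integer pointer 4 with no object there: poison under this reading.
    assert_eq!(gep_inbounds(&allocs, 4, 0), None);
    // (a) integer 0x4000 that happens to lie inside a live object: fine.
    assert_eq!(gep_inbounds(&allocs, 0x4000, 0), Some(0x4000));
    // (b) dangling pointer into a freed object: poison even at offset 0.
    assert_eq!(gep_inbounds(&allocs, 0x5000, 0), None);

    // If one accepts a zero-sized object at address 4, offset 0 is fine there.
    let with_zst = [Allocation { base: 4, size: 0, live: true }];
    assert_eq!(gep_inbounds(&with_zst, 4, 0), Some(4));
}
```

Whether a zero-sized object may be assumed at an arbitrary address is exactly
the open point for case (a); the last assertion only shows what would follow
if it may.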

> >> But also, your response assumes "dangling pointers are undef/poison", which is
> >> new to me.  I'd be rather shocked if this is something LLVM actually relies on
> >> anywhere.
> >
> > Again, that is how I read the quoted lang-ref wording above for
> > inbounds GEPs. I agree with you that non-inbounds GEPs have a "normal"
> > value that can be used for all non-access instructions in the usual way
> > without producing undef/poison.
>
> I must be missing something here.  You said above "%P2 is dangling (and we know
> it) after the free. %P2 is therefore poison" -- at this point, GEPi has not even
> happened yet!  If GEPi does something, it will make the *output* poison (%G2),
> but you are saying the *input* becomes poison (%P2), and that cannot be a
> consequence of GEPi at all.
True and corrected above. Sorry for the confusion.


> > Cheers,
> >   Johannes
> >
> >
> >>>>> A side-effect based on the GEP will however __locally__ introduce a
> >>>>> dereferenceability assumption (in my opinion at least). Let's say the
> >>>>> code looks like this:
> >>>>>
> >>>>>
> >>>>>   %G = gep inbounds (int2ptr 4) 0
> >>>>>   ; We don't know anything about the dereferencability of
> >>>>>   ; the memory at address 4 here.
> >>>>>   br %cnd, %BB0, %BB1
> >>>>>
> >>>>> BB0:
> >>>>>   ; We don't know anything about the dereferencability of
> >>>>>   ; the memory at address 4 here.
> >>>>>   load %G
> >>>>>   ; We know the memory at address 4 is dereferenceable here.
> >>>>>   ; Though, that is due to the load and not the inbounds.
> >>>>>   ...
> >>>>>   br %BB1
> >>>>>
> >>>>> BB1:
> >>>>>   ; We don't know anything about the dereferencability of
> >>>>>   ; the memory at address 4 here.
> >>>>>
> >>>>>
> >>>>> It is a different story if you start to use the GEP in other
> >>>>> operations, e.g., to alter control flow. Then the (potential)
> >>>>> undefined value can propagate.
> >>>>>
> >>>>>
> >>>>> Any thought on this? Did I at least get your problem description
> >>>>> right?
> >>>>>
> >>>>> Cheers, Johannes
> >>>>>
> >>>>>
> >>>>>
> >>>>> P.S. Sorry if this breaks the thread and apologies that I had to
> >>>>> remove Bruce from the CC. It turns out replying to an email you did
> >>>>> not receive is complicated and getting on the LLVM-Dev list is
> >>>>> nowadays as well...
--

Johannes Doerfert
Researcher

Argonne National Laboratory
Lemont, IL 60439, USA

[hidden email]


Re: [llvm-dev] getelementptr inbounds with offset 0

Finkel, Hal J. via llvm-dev
Hi,

>>> I see. Is there a quick answer to the questions why you need inbounds
>>> GEPs in that case? Can't you just use non-inbounds GEPs if you know you
>>> might not have a valid base ptr and "optimize" it to inbounds once that
>>> is proven?
>>
>> You mean on the Rust side?  We emit GEPi for field accesses and array indexing.
>>  We cannot always statically determine if this is happening for a ZST or not.
>> At the same time, given that no memory access ever happens for a ZST, allocating
>> a ZST (Box::new in Rust, think of it like new in C++) does not actually allocate
>> any memory, it just returns an integer (sufficiently aligned) cast to a pointer.
>
> OK, but why not emit non-inbounds GEPs instead? They do not come with
> the problems you have now, or maybe I misunderstand.

The problem is statically figuring out whether it should be inbounds or
non-inbounds.  When we have code like `&x[n]`, this might be an offset-by-0 in
an empty slice and hence fall into the scope of my question, or it might be a
"normal" array access where we definitely want inbounds.

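To make the slice case concrete, here is a small Rust sketch. The helper
`element_ptr` is hypothetical, and it assumes (as described in this thread)
that rustc lowers this kind of pointer arithmetic to `getelementptr inbounds`.
The same code path handles a "normal" element and an offset of 0 from the
dangling pointer of an empty slice.

```rust
/// Hypothetical helper: address of element `n` of `xs`, allowing `n == xs.len()`.
fn element_ptr(xs: &[u32], n: usize) -> *const u32 {
    assert!(n <= xs.len(), "index past the end");
    // SAFETY: `n <= xs.len()`, so the result stays inside the slice or one
    // past its end. For an empty slice this is an offset of 0 from a
    // non-null, well-aligned pointer that points to no real allocation.
    unsafe { xs.as_ptr().add(n) }
}

fn main() {
    let full = [10u32, 20, 30];
    let empty: &[u32] = &[];

    // "Normal" array access: in bounds of a real object.
    assert_eq!(unsafe { *element_ptr(&full, 1) }, 20);

    // Offset-by-0 on an empty slice: only the address is computed,
    // nothing is ever loaded through it.
    assert_eq!(element_ptr(empty, 0), empty.as_ptr());
    assert!(!element_ptr(empty, 0).is_null());
}
```

The compiler sees the same function body in both calls, which is why the
choice between inbounds and non-inbounds cannot be made statically here.
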
>> Sure, UB is definitely *defined* in a runtime-value dependent way.  The problem
>> here is that it is not defined in a precise way -- something where one could
>> write an interpreter that tracks all the extra state that is needed (like
>> poison/undef and where allocations lie) and then says precisely under which
>> conditions we have UB and under which we do not.
>> What I am asking here for is the exact definition of GEPi if, *at run-time*, the
>> offset is 0, and the base pointer is (a) an integer, or (b) dangling.
>
> That last part is given by the lang-ref (imo):
>   "If the inbounds keyword is present, the result value of the
>    getelementptr is a poison value if the base pointer is not an in
>    bounds address of an allocated object"
>
> I read this as: If you have a GEPi, you get poison if the base pointer
> is not an allocated object. That is a dangling pointer (b) causes the
> GEPi to be poison and a pointer from integer (a) may, if the address
> denoted by the integer is not inside, or one past, an allocated object.
> Now any offset except 0 will add more possible ways to generate a poison
> value.

Thanks.  That makes sense from reading the docs (though I am not convinced that
it actually helps with optimizations to be this strict here).

For the (a) case, the question about "0-sized objects" remains, but it doesn't
seem like the answer could affect what LLVM does.

It would be really nice to have a reference interpreter for LLVM IR that can
explicitly check for all the UB.  Maybe, one day... ;)

Kind regards,
Ralf

Re: [llvm-dev] getelementptr inbounds with offset 0

Finkel, Hal J. via llvm-dev
Hi Ralf,

On 04/10, Ralf Jung wrote:

> >>> I see. Is there a quick answer to the questions why you need inbounds
> >>> GEPs in that case? Can't you just use non-inbounds GEPs if you know you
> >>> might not have a valid base ptr and "optimize" it to inbounds once that
> >>> is proven?
> >>
> >> You mean on the Rust side?  We emit GEPi for field accesses and array indexing.
> >>  We cannot always statically determine if this is happening for a ZST or not.
> >> At the same time, given that no memory access ever happens for a ZST, allocating
> >> a ZST (Box::new in Rust, think of it like new in C++) does not actually allocate
> >> any memory, it just returns an integer (sufficiently aligned) cast to a pointer.
> >
> > OK, but why not emit non-inbounds GEPs instead? They do not come with
> > the problems you have now, or maybe I misunderstand.
>
> The problem is statically figuring out whether it should be inbounds or
> non-inbounds.  When we have code like `&x[n]`, this might be an offset-by-0 in
> an empty slice and hence fall into the scope of my question, or it might be a
> "normal" array access where we definitely want inbounds.
I'd argue, after all this discussion at least, that you should use
non-inbounds GEPs if you do not know that you have a valid object (and want
to avoid undef and all that it entails). This might cause performance
regressions; if you try it, it would be interesting to know how much. We
could even look into an "inbounds" detection in the "Attributor framework"
[0] to get some of the performance back.

[0] https://reviews.llvm.org/D59919 (but see also the "Stack" tab that shows
                                     related commits)
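
On the Rust side, the conservative choice could look roughly like the sketch
below. `element_addr_wrapping` is a hypothetical helper; `wrapping_add` is the
standard-library method that is documented as safe for any pointer, and the
assumption here (not confirmed in this thread) is that rustc lowers it to a
GEP without `inbounds`.

```rust
/// Hypothetical helper: computes the address of element `n` without promising
/// that `base` points into any allocated object.
fn element_addr_wrapping(base: *const u32, n: usize) -> *const u32 {
    // Plain address arithmetic; the result may only be dereferenced if it
    // ends up inside the object that `base` was derived from.
    base.wrapping_add(n)
}

fn main() {
    // An integer cast to a pointer, as in the original question.
    let p = 4usize as *const u32;

    // Offset 0 gives back the same address, with no bounds promise involved.
    assert_eq!(element_addr_wrapping(p, 0), p);

    // Two `u32` elements further is 8 bytes further.
    assert_eq!(element_addr_wrapping(p, 2) as usize, 12);
}
```

The cost is that the optimizer gets less information for every access that
goes through such a helper, which is the possible regression mentioned above.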

> >> Sure, UB is definitely *defined* in a runtime-value dependent way.  The problem
> >> here is that it is not defined in a precise way -- something where one could
> >> write an interpreter that tracks all the extra state that is needed (like
> >> poison/undef and where allocations lie) and then says precisely under which
> >> conditions we have UB and under which we do not.
> >> What I am asking here for is the exact definition of GEPi if, *at run-time*, the
> >> offset is 0, and the base pointer is (a) an integer, or (b) dangling.
> >
> > That last part is given by the lang-ref (imo):
> >   "If the inbounds keyword is present, the result value of the
> >    getelementptr is a poison value if the base pointer is not an in
> >    bounds address of an allocated object"
> >
> > I read this as: If you have a GEPi, you get poison if the base pointer
> > is not an allocated object. That is a dangling pointer (b) causes the
> > GEPi to be poison and a pointer from integer (a) may, if the address
> > denoted by the integer is not inside, or one past, an allocated object.
> > Now any offset except 0 will add more possible ways to generate a poison
> > value.
>
> Thanks.  That makes sense from reading the docs (though I am not convinced that
> it actually helps with optimizations to be this strict here).
I never argued it does "make sense" ;)


> For the (a) case, the question about "0-sized objects" remains, but it doesn't
> seem like the answer could affect what LLVM does.

I think I now see (maybe part of) your point.
Something like:

  x = malloc(0);
  // ... anything except free(x) or equivalent
  y = gep inbounds x, 0
  // ... anything except free(x) or equivalent
  use_but_not_dereference(y);

should be OK (= no undef/poison appears). Does that at least go in the
right direction? I think this should be OK according to the IR definition,
or else something is broken. Obviously, there is always the possibility, or
rather the certainty, that the implementation is broken somewhere ;)


> It would be really nice to have a reference interpreter for LLVM IR that can
> explicitly check for all the UB.  Maybe, one day... ;)

Let me know once you start working on one, I'd be quite interested ;)

Cheers,
  Johannes


--

Johannes Doerfert
Researcher

Argonne National Laboratory
Lemont, IL 60439, USA

[hidden email]


Re: [llvm-dev] getelementptr inbounds with offset 0

Finkel, Hal J. via llvm-dev
Hi,

>> For the (a) case, the question about "0-sized objects" remains, but it doesn't
>> seem like the answer could affect what LLVM does.
>
> I think I now see (maybe part of) your point.
> Something like:
>
>   x = malloc(0);
>   // ... anything except free(x) or equivalent
>   y = gep inbounds x, 0
>   // ... anything except free(x) or equivalent
>   use_but_not_dereference(y);
>
> should be OK (= no undef/poison appears). Does that at least go in the
> right direction? I think this should be OK from the IR definition or
> something is broken. Obviously, there is always the possibility, or
> better the certainty, that the implementation is somewhere broken ;)

I guess that is a way to look at it -- though malloc can return NULL, and that
may (or may not) change the rules here.
But yes, this is the closest you can get in C when trying to mirror what
we do in Rust.  C does not have 0-sized types, Rust does, so there is no more
direct equivalent.
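
For readers less familiar with Rust, a minimal sketch of the zero-sized case
follows. The alias `Zst` and the helper `boxed_zst_addr` are names made up for
this example; the only properties relied on are that a `Box` is non-null and
properly aligned.

```rust
use std::mem::{align_of, size_of};

/// A zero-sized type with 8-byte alignment (an empty array of `u64`).
type Zst = [u64; 0];

/// Address stored in a `Box` that holds a zero-sized value.
fn boxed_zst_addr() -> usize {
    let b: Box<Zst> = Box::new([]);
    let p: *const Zst = &*b;
    p as usize
}

fn main() {
    // The type really occupies no memory.
    assert_eq!(size_of::<Zst>(), 0);

    // The box still carries a non-null, sufficiently aligned address,
    // even though no memory access ever happens through it.
    let addr = boxed_zst_addr();
    assert_ne!(addr, 0);
    assert_eq!(addr % align_of::<Zst>(), 0);
}
```

The example deliberately does not assert which address is used, since that is
an implementation detail.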

Kind regards,
Ralf