[llvm-dev] What should a truncating store do?


[llvm-dev] What should a truncating store do?

Joel E. Denny via llvm-dev
For example, a truncating store of an i32 to i6. My assumption was that this should write the low six bits of the i32 somewhere in memory.

Should the top 24 bits of a corresponding 32-bit region of memory be unchanged, zero, or undefined?

Should the two bits that would round the i6 up to a byte be preserved, zero, or undefined?

I can't write six bits directly, so I am trying to determine what set of bitwise ops to apply between a load and a subsequent store to emulate the truncating store.
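For concreteness, the read-modify-write emulation being asked about might look like this sketch (purely illustrative C++, not LLVM code; which branch is correct is exactly the question being posed):

```cpp
#include <cstdint>

// Emulate a truncating i32 -> i6 store into a single byte via
// read-modify-write. `PreservePadBits` selects between the two candidate
// semantics for bits 6-7 of the byte.
uint8_t emulatedTruncStoreByte(uint8_t OldByte, uint32_t Value,
                               bool PreservePadBits) {
  uint8_t Low6 = Value & 0x3F;      // low six bits of the stored value
  if (PreservePadBits)
    return (OldByte & 0xC0) | Low6; // bits 6-7 keep their previous contents
  return Low6;                      // bits 6-7 are zeroed
}
```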

Thanks!

Jon

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] What should a truncating store do?

On 9/15/2017 5:49 AM, Jon Chesterfield via llvm-dev wrote:
> For example, truncating store of an i32 to i6. My assumption was that
> this should write the low six bits of the i32 to somewhere in memory.
>
> Should the top 24 bits of a corresponding 32 bit region of memory be
> unchanged, zero,  undefined?

Unchanged.

> Should the two bits that would round the i6 up to a byte be preserved,
> zero, undefined?

Zero.  Legalization will normally handle this for you, though, by
transforming it to an i8 store.
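A sketch of the semantics described above (illustrative C++, not the legalizer itself): the truncating store behaves like an i8 store of the value zero-extended from six bits, and the remaining bytes of the 32-bit region are simply not written:

```cpp
#include <cstdint>

// Truncating store of an i32 to i6, per the answers above: only the first
// byte is written, with the two pad bits forced to zero; the remaining
// bytes of the surrounding region are left unchanged.
void truncStoreI6(uint8_t *Mem, uint32_t V) {
  Mem[0] = static_cast<uint8_t>(V & 0x3F);
}
```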

-Eli

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project


Re: [llvm-dev] What should a truncating store do?

Interesting, thank you. I expected both answers to be "unchanged" so was surprised by the zero-extend in the legaliser.

The motivation here is that it's faster for us to load N bytes, apply whatever masks are necessary to reproduce the truncating store, then store all N bytes. This is only a good plan if there's no change to the semantics :)

Are scalar integer types zero-extended to the next multiple of 8 or to the next power of 2 greater than 7? For example, i17 => i24 or i17 => i32?

I think this means truncating stores of vector types will introduce zero bits at the end of each element instead of grouping all the zeros at the end. For example, <i6 63, i6 63> writes to sixteen bits as 0b0011111100111111, not as 0b0000111111111111?


Thanks!

Jon




Re: [llvm-dev] What should a truncating store do?

On 9/15/2017 11:30 AM, Jon Chesterfield wrote:
> Interesting, thank you. I expected both answers to be "unchanged" so
> was surprised by the zero extend in the legaliser.
>
> The motivation here is that it's faster for us to load N bytes, apply
> whatever masks are necessary to reproduce the truncating store then
> store all N bytes. This is only a good plan if there's no change to
> the semantics :)

See http://llvm.org/docs/LangRef.html#store-instruction .  In general,
you have to be careful to avoid data races, but that might not apply to
your target.

> Are scalar integer types zero extended to the next multiple of 8 or to
> the next power of 2 greater than 7? For example, i17 => i24 or i17 => i32?

Multiple of 8.
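That is, the store size rounds the bit width up to whole 8-bit bytes, matching the `(getSizeInBits() + 7) / 8` computation in `getStoreSize()` (illustrative standalone C++):

```cpp
// Store size in bytes of an iN scalar: round N up to a whole number of
// 8-bit bytes, so i17 occupies 3 bytes (i24), not 4 (i32).
unsigned storeSizeBytes(unsigned Bits) {
  return (Bits + 7) / 8;
}
```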

> I think this means truncating stores of vector types will introduce
> zero bits at the end of each element instead grouping all the zeros at
> the end. For example, <i6 63, i6 63> writes to sixteen bits as
> 0b0011111100111111, not as 0b0000111111111111?

Vector types are tightly packed, so <8 x i1> is 1 byte, not 8 bytes.
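A sketch of what "tightly packed" means for sub-byte elements (illustrative C++; assumes element 0 lands in the lowest bits, which matches the second bit pattern in the question above):

```cpp
#include <cstdint>

// Pack a <2 x i6> vector value into a 12-bit integer: element k occupies
// bits [6k, 6k+6), so <i6 63, i6 63> becomes 0b0000111111111111.
uint16_t packV2I6(uint8_t E0, uint8_t E1) {
  return static_cast<uint16_t>((E0 & 0x3F) | ((E1 & 0x3F) << 6));
}
```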

-Eli

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project


Re: [llvm-dev] What should a truncating store do?

OK, I'm clear on scalars. Data races are thankfully OK in this context.

Densely packing vectors sounds efficient and is clear in the case where lanes * width is a multiple of 8 bits. I don't think I understand how it works in other cases.

Take a store of <4 x i8> truncating to <4 x i7> as an example. This can be converted into four scalar i8 -> i7 stores with corresponding increments to the address, in which case the final layout in memory is 0b01111111011111110111111101111111. Or it can be written as a packed vector, which I think would resemble 0b00001111111111111111111111111111.

This would mean the memory layout changes depending on how/whether the legaliser breaks large vectors down into smaller types. Is this the case? For example, <4 x i32> => <4 x i31> converts to two <2 x i32> => <2 x i31> stores on a target with <2 x i32> legal, but would not be split if <4 x i32> were declared legal.
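The two candidate layouts for the <4 x i7> example can be spelled out explicitly (illustrative C++, all four elements 0x7F, element 0 in the lowest bits):

```cpp
#include <cstdint>

// One byte per element, each zero-extended from i7 to i8.
uint32_t scalarizedLayout() {
  uint32_t R = 0;
  for (int I = 0; I < 4; ++I)
    R |= 0x7Fu << (8 * I);  // element I at bit offset 8*I
  return R;                 // 0b01111111011111110111111101111111
}

// Tightly packed: element I at bit offset 7*I, pad bits grouped at the top.
uint32_t packedLayout() {
  uint32_t R = 0;
  for (int I = 0; I < 4; ++I)
    R |= 0x7Fu << (7 * I);
  return R;                 // 0b00001111111111111111111111111111
}
```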

Thanks

Jon



Re: [llvm-dev] What should a truncating store do?

On 9/15/2017 12:10 PM, Jon Chesterfield wrote:

> OK, I'm clear on scalars. Data races are thankfully OK in this context.
>
> Densely packing vectors sounds efficient and is clear in the case
> where lanes * width is a multiple of 8 bits. I don't think I
> understand how it works in other cases.
>
> If we could take store <4 x i8> truncating to <4 x i7> as an example.
> This can be converted into four scalar i8 -> i7 stores with
> corresponding increments to the address, in which case the final
> layout in memory is 0b01111111011111110111111101111111. Or it can be
> written as a packed vector which I think would
> resemble 0b00001111111111111111111111111111.
>
> This would mean the memory layout changes depending on how/whether the
> legaliser breaks large vectors down into smaller types. Is this the
> case? For example, <4xi32> => <4 x i31> converts to two <2 x i32> =>
> <2 x i31> stores on a target with <2 x i32> legal but would not be
> split if <4 x i32> were declared legal.

Vectors get complicated; I don't recall all the details of what the code
generator currently does/is supposed to do.  See also
https://bugs.llvm.org/show_bug.cgi?id=31265 .

-Eli

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project


Re: [llvm-dev] What should a truncating store do?

They are starting to look complicated. The patch linked is interesting; perhaps v1 vectors are special-cased. It shouldn't be too onerous to work out what one or two in-tree backends do by experimentation.

Thanks again, it's great to have context beyond the source.


Re: [llvm-dev] What should a truncating store do?


(Not sure if this exactly maps to “truncating store”, but I think it at least touches some of the subjects discussed in this thread)

 

Our out-of-tree target needs several patches to get things working correctly for us.

We have introduced i24 and i40 types in ValueTypes/MachineValueTypes (in addition to the normal pow-of-2 types). And we have vectors of those (v2i40, v4i40).

And the byte size in our target is 16 bits.

 

When storing an i40 we need to store it as three 16-bit bytes, i.e. 48 bits.

When storing a v4i40 vector it will be stored as 4x48 bits.

 

One thing that we have had to patch is the getStoreSize() method in ValueTypes/MachineValueTypes where we assume that vectors are bitpacked when the element size is smaller than the byte size (“BitsPerByte”):

 

     /// Return the number of bytes overwritten by a store of the specified value
     /// type.
     unsigned getStoreSize() const {
-      return (getSizeInBits() + 7) / 8;
+      // We assume that vectors with elements smaller than the byte size are
+      // bitpacked. And that elements larger than the byte size should be padded
+      // (e.g. i40 type for Phoenix is stored using 3 bytes (48 bits)).
+      bool PadElementsToByteSize =
+        isVector() && getScalarSizeInBits() >= BitsPerByte;
+      if (PadElementsToByteSize)
+        return getVectorNumElements() * getScalarType().getStoreSize();
+      return (getSizeInBits() + (BitsPerByte-1)) / BitsPerByte;
     }
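As a numeric check of the patched rule (a hypothetical standalone reimplementation, not the LLVM method itself): with the 16-bit byte described above, an i40 element pads to 3 bytes, so v4i40 stores as 12 bytes, i.e. 4x48 bits:

```cpp
// Mirror of the patched getStoreSize() logic for a vector type:
// elements at least one byte wide are padded to whole bytes and then
// multiplied out; sub-byte elements are assumed bitpacked.
unsigned vectorStoreSize(unsigned ScalarBits, unsigned NumElts,
                         unsigned BitsPerByte) {
  if (ScalarBits >= BitsPerByte) {
    unsigned PerElt = (ScalarBits + BitsPerByte - 1) / BitsPerByte;
    return NumElts * PerElt;  // e.g. v4i40 with a 16-bit byte: 4 * 3 = 12
  }
  return (ScalarBits * NumElts + BitsPerByte - 1) / BitsPerByte;
}
```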

 

The patch seems to work for in-tree-target tests as well as our out-of-tree target.

Whether it is a correct assumption for all targets is beyond my knowledge. Maybe only i1 vectors should be bitpacked?

 

Anyway, I think the bitpacked case is very special (we do not use it for our target…).

AFAIK bitcast is defined as writing to memory followed by a load using a different type. And I think that doing several scalar operations should give the same result as when using vectors. So bitcast of bitpacked vectors should probably be avoided?

 

 

This also reminded me of the following test case that is in trunk:  test/CodeGen/X86/pr20011.ll

 

%destTy = type { i2, i2 }

define void @crash(i64 %x0, i64 %y0, %destTy* nocapture %dest) nounwind {
; X64-LABEL: crash:
; X64:       # BB#0:
; X64-NEXT:    andl $3, %esi
; X64-NEXT:    movb %sil, (%rdx)
; X64-NEXT:    andl $3, %edi
; X64-NEXT:    movb %dil, (%rdx)
; X64-NEXT:    retq
  %x1 = trunc i64 %x0 to i2
  %y1 = trunc i64 %y0 to i2
  %1 = bitcast %destTy* %dest to <2 x i2>*
  %2 = insertelement <2 x i2> undef, i2 %x1, i32 0
  %3 = insertelement <2 x i2> %2, i2 %y1, i32 1
  store <2 x i2> %3, <2 x i2>* %1, align 1
  ret void
}

 

As you can see by the “X64” checks, the behavior is quite weird.

Both movb instructions write to the same address, so the result of the store <2 x i2> will be the same as when only storing one of the elements.

Is this really expected?
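For comparison, a bit-packed one-byte store of <2 x i2> would be expected to combine both lanes, whereas the generated code above issues two movb instructions to the same address, so the later one wins (illustrative C++, not generated code):

```cpp
#include <cstdint>

// Expected result of a packed <2 x i2> store: both lanes survive,
// element 0 (%x1) in the low bits.
uint8_t packedStore(uint8_t X1, uint8_t Y1) {
  return static_cast<uint8_t>((X1 & 3) | ((Y1 & 3) << 2));
}

// What the generated X64 code above actually does: %y1 is stored first,
// then %x1 is stored to the same byte, overwriting it.
uint8_t generatedStore(uint8_t X1, uint8_t Y1) {
  uint8_t Mem = static_cast<uint8_t>(Y1 & 3); // movb %sil, (%rdx)
  Mem = static_cast<uint8_t>(X1 & 3);         // movb %dil, (%rdx)
  return Mem;
}
```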

 

We have emailed Simon Pilgrim, who added the test case to show that we no longer crash on it (see https://bugs.llvm.org/show_bug.cgi?id=20011). But even if the compiler doesn’t crash, the behavior seems wrong to me.

 

 

Regards,

Björn Pettersson

 


Re: [llvm-dev] What should a truncating store do?

On 9/25/2017 9:14 AM, Björn Pettersson A wrote:

(Not sure if this exactly maps to “truncating store”, but I think it at least touches some of the subjects discussed in this thread)

 

Our out-of-tree-target need several patches to get things working correctly for us.

We have introduced i24 and i40 types in ValueTypes/MachineValueTypes (in addition to the normal pow-of-2 types). And we have vectors of those (v2i40, v4i40).

And the byte size in our target is 16 bits.

 

When storing an i40 we need to store it as three 16-bit bytes, i.e. 48 bits.

When storing a v4i40 vector it will be stored as 4x48 bits.

 

One thing that we have had to patch is the getStoreSize() method in ValueTypes/MachineValueTypes where we assume that vectors are bitpacked when the element size is smaller than the byte size (“BitsPerByte”):

 

     /// Return the number of bytes overwritten by a store of the specified value
     /// type.
     unsigned getStoreSize() const {
-      return (getSizeInBits() + 7) / 8;
+      // We assume that vectors with elements smaller than the byte size are
+      // bitpacked. And that elements larger than the byte size should be padded
+      // (e.g. i40 type for Phoenix is stored using 3 bytes (48 bits)).
+      bool PadElementsToByteSize =
+        isVector() && getScalarSizeInBits() >= BitsPerByte;
+      if (PadElementsToByteSize)
+        return getVectorNumElements() * getScalarType().getStoreSize();
+      return (getSizeInBits() + (BitsPerByte-1)) / BitsPerByte;
     }

 

The patch seems to work for in-tree-target tests as well as our out-of-tree target.

If it is a correct assumption for all targets is beyond my knowledge. Maybe only i1 vectors should be bitpacked?

 

Anyway, I think the bitpacked cases is very special (we do not use it for our target…).

AFAIK bitcast is defined as writing to memory followed by a load using a different type. And I think that doing several scalar operations should give the same result as when using vectors. So bitcast of bitpacked vectors should probably be avoided?


Yes, store+load is the right definition of bitcast.  And in fact, the backend will lower a bitcast to a store+load to a stack temporary in cases where there isn't some other lowering specified.

The end result is probably going to be pretty inefficient unless your target has a special instruction to handle it (x86 has pmovmskb for i1 vector bitcasts, but otherwise you probably end up with some terrible lowering involving a lot of shifts).

 

 

This also reminded me of the following test case that is in trunk:  test/CodeGen/X86/pr20011.ll

 

%destTy = type { i2, i2 }

define void @crash(i64 %x0, i64 %y0, %destTy* nocapture %dest) nounwind {
; X64-LABEL: crash:
; X64:       # BB#0:
; X64-NEXT:    andl $3, %esi
; X64-NEXT:    movb %sil, (%rdx)
; X64-NEXT:    andl $3, %edi
; X64-NEXT:    movb %dil, (%rdx)
; X64-NEXT:    retq
  %x1 = trunc i64 %x0 to i2
  %y1 = trunc i64 %y0 to i2
  %1 = bitcast %destTy* %dest to <2 x i2>*
  %2 = insertelement <2 x i2> undef, i2 %x1, i32 0
  %3 = insertelement <2 x i2> %2, i2 %y1, i32 1
  store <2 x i2> %3, <2 x i2>* %1, align 1
  ret void
}

 

As you can see by the “X64” checks the behavior is quite weird.

Both movb instructions writes to the same address. So the result of the store <2 x i2> will be the same as when only storing one of the elements.

Is this really expected?

 

We have emailed Simon Pilgrim who added the test case to show that we no longer crash on this test case (see https://bugs.llvm.org/show_bug.cgi?id=20011). But even if the compiler doesn’t crash, the behavior seems wrong to me.

Yes, the behavior here is wrong.  DAGTypeLegalizer::SplitVecOp_STORE/DAGTypeLegalizer::SplitVecRes_LOAD/etc. assume the element size is a multiple of 8.  I'm sure this has been discussed before, but I guess nobody ever wrote a patch to fix it...?

-Eli

-- 
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project


Re: [llvm-dev] What should a truncating store do?

So what is the correct behavior of the store <2 x i2>? Storing two bytes with a zext:ed i2 in each, or a bit-packed vector?

I can't remember any documentation mentioning that vectors are bit-packed. But if LLVM is supposed to bit-pack vectors, should we do it for any element size, or only when the element size is less than the byte size, or only for i1 vectors?

Maybe bit packing should be optional (target specific)?

Btw, the IR/ISD node for a <2 x i2> store indicates that it is a one-byte (ST1) store, due to the getStoreSize() method saying one byte for this kind of vector. So alias analysis etc. thinks that only one byte is written.

Re: [llvm-dev] What should a truncating store do?

On 9/25/2017 12:13 PM, Björn Pettersson A wrote:
> So what is the correct behavior of the store <2 x i2>. Storing two bytes with a zext:ed i2 in each, or a bit packed vector?
>
> I can't remember any documentation mentioning that vectors are bit packed. But if LLVM is supposed to bit pack vectors, should we do it for any element size, or only when element size is less than the byte size, or only for i1 vectors?
For lack of any formal documentation, the most canonical source of
information about the size of a type is DataLayout.  And currently
DataLayout::getTypeSizeInBits() assumes vectors are bit-packed.
> Maybe bit packing should be optional (target specific)?

That's... maybe possible?  It would break the current rule that whether
a bitcast is legal is independent of the target, and I don't see a
compelling reason.  (If you're trying to comply with some external ABI,
you can always explicitly extend a value before you store it.)

-Eli

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project


Re: [llvm-dev] What should a truncating store do?

Bit-packing vectors doesn't appear to interact well with splitting or concatenating vectors, since it redistributes where the padding goes.

The behaviour I find strange is writing zeros to round up to a byte boundary when storing, instead of preserving whatever was already there. Writing a vector of size N bits shouldn't clobber more than N bits of memory.


Re: [llvm-dev] What should a truncating store do?


On 25 Sep 2017, at 19:08, Friedman, Eli via llvm-dev <[hidden email]> wrote:

On 9/25/2017 9:14 AM, Björn Pettersson A wrote:
(Not sure if this exactly maps to “truncating store”, but I think it at least touches some of the subjects discussed in this thread)
 
Our out-of-tree-target need several patches to get things working correctly for us.
We have introduced i24 and i40 types in ValueTypes/MachineValueTypes (in addition to the normal pow-of-2 types). And we have vectors of those (v2i40, v4i40).
And the byte size in our target is 16 bits.
 
When storing an i40 we need to store it as three 16-bit bytes, i.e. 48 bits.
When storing a v4i40 vector it will be stored as 4x48 bits.
 
One thing that we have had to patch is the getStoreSize() method in ValueTypes/MachineValueTypes where we assume that vectors are bitpacked when the element size is smaller than the byte size (“BitsPerByte”):
 
     /// Return the number of bytes overwritten by a store of the specified value
     /// type.
     unsigned getStoreSize() const {
-      return (getSizeInBits() + 7) / 8;
+      // We assume that vectors with elements smaller than the byte size are
+      // bitpacked. And that elements larger than the byte size should be padded
+      // (e.g. i40 type for Phoenix is stored using 3 bytes (48 bits)).
+      bool PadElementsToByteSize =
+        isVector() && getScalarSizeInBits() >= BitsPerByte;
+      if (PadElementsToByteSize)
+        return getVectorNumElements() * getScalarType().getStoreSize();
+      return (getSizeInBits() + (BitsPerByte-1)) / BitsPerByte;
     }
 
The patch seems to work for in-tree-target tests as well as our out-of-tree target.
If it is a correct assumption for all targets is beyond my knowledge. Maybe only i1 vectors should be bitpacked?
 
Anyway, I think the bitpacked cases is very special (we do not use it for our target…).
AFAIK bitcast is defined as writing to memory followed by a load using a different type. And I think that doing several scalar operations should give the same result as when using vectors. So bitcast of bitpacked vectors should probably be avoided?

Yes, store+load is the right definition of bitcast.  And in fact, the backend will lower a bitcast to a store+load to a stack temporary in cases where there isn't some other lowering specified.

The end result is probably going to be pretty inefficient unless your target has a special instruction to handle it (x86 has pmovmskb for i1 vector bitcasts, but otherwise you probably end up with some terrible lowering involving a lot of shifts).

We still struggle with this in many cases - llvm/test/CodeGen/X86/vector-compare-results.ll has some pretty shocking cases that haven’t been addressed.

Weren’t the Embecosm chaps working on better support for targets with base types other than 8 bits?

This also reminded me of the following test case that is in trunk:  test/CodeGen/X86/pr20011.ll
 
%destTy = type { i2, i2 }
define void @crash(i64 %x0, i64 %y0, %destTy* nocapture %dest) nounwind {
; X64-LABEL: crash:
; X64:       # BB#0:
; X64-NEXT:    andl $3, %esi
; X64-NEXT:    movb %sil, (%rdx)
; X64-NEXT:    andl $3, %edi
; X64-NEXT:    movb %dil, (%rdx)
; X64-NEXT:    retq
  %x1 = trunc i64 %x0 to i2
  %y1 = trunc i64 %y0 to i2
  %1 = bitcast %destTy* %dest to <2 x i2>*
  %2 = insertelement <2 x i2> undef, i2 %x1, i32 0
  %3 = insertelement <2 x i2> %2, i2 %y1, i32 1
  store <2 x i2> %3, <2 x i2>* %1, align 1
  ret void
}
 
As you can see from the “X64” checks, the behavior is quite weird: both movb instructions write to the same address, so the result of the store <2 x i2> is the same as storing only one of the elements.
Is this really expected?
 
We have emailed Simon Pilgrim, who added the test case to show that we no longer crash on it (see https://bugs.llvm.org/show_bug.cgi?id=20011). But even if the compiler doesn’t crash, the behavior seems wrong to me.

Yes, the behavior here is wrong.  DAGTypeLegalizer::SplitVecOp_STORE/DAGTypeLegalizer::SplitVecRes_LOAD/etc. assume the element size is a multiple of 8.  I'm sure this has been discussed before, but I guess nobody ever wrote a patch to fix it...?

Sorry, I might have missed that email. I ended up creating PR31265 as a meta bug because there are far too many cases like this where we don’t correctly pack illegal vector types; PR1784 goes into more detail on this. My particular interest was in bool vectors, especially after AVX512 went with the bitpacked data representation for mask registers.

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] What should a truncating store do?

Joel E. Denny via llvm-dev
In reply to this post by Joel E. Denny via llvm-dev
On 09/15/2017 10:55 AM, Friedman, Eli via llvm-dev wrote:

> On 9/15/2017 5:49 AM, Jon Chesterfield via llvm-dev wrote:
>> For example, truncating store of an i32 to i6. My assumption was that this should write the low six bits of the i32 to somewhere in memory.
>>
>> Should the top 24 bits of a corresponding 32 bit region of memory be unchanged, zero,  undefined?
>
> Unchanged.
>
>> Should the two bits that would round the i6 up to a byte be preserved, zero, undefined?
>
> Zero.  Legalization will normally handle this for you, though, by transforming it to an i8 store.
>

Why is this Zero?  The language ref says the values of those bits are
unspecified.

-Tom

> -Eli
>


Re: [llvm-dev] What should a truncating store do?

Joel E. Denny via llvm-dev
On 5/9/2018 9:54 AM, Tom Stellard wrote:
> On 09/15/2017 10:55 AM, Friedman, Eli via llvm-dev wrote:
>> On 9/15/2017 5:49 AM, Jon Chesterfield via llvm-dev wrote:
>>> For example, truncating store of an i32 to i6. My assumption was that this should write the low six bits of the i32 to somewhere in memory.
>>>
>>> Should the top 24 bits of a corresponding 32 bit region of memory be unchanged, zero,  undefined?
>>
>> Unchanged.
>>
>>> Should the two bits that would round the i6 up to a byte be preserved, zero, undefined?
>>
>> Zero.  Legalization will normally handle this for you, though, by transforming it to an i8 store.
>
> Why is this Zero?  The language ref says the values of those bits are
> unspecified.

Zero, because that makes it more convenient to lower the corresponding "load" instruction.  Note the corresponding language for loads: "When loading a value of a type like i20 with a size that is not an integral number of bytes, the result is undefined if the value was not originally written using a store of the same type."

Maybe we should fix the bits to zero instead, to match what happens in practice.

-Eli
-- 
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
