[llvm-dev] Vectorizer has trouble with vpmovmskb and store


[llvm-dev] Vectorizer has trouble with vpmovmskb and store

Johan Engelen via llvm-dev
Hi all,
  I've run into a case where the optimizer seems to be having trouble doing the "obvious" thing.

Consider this code:
```
define i16 @foo(<8 x i16>* dereferenceable(16) %egress, <16 x i8> %a0) {
    %a1 = icmp slt <16 x i8> %a0, zeroinitializer
    %a2 = bitcast <16 x i1> %a1 to i16
    %astore = getelementptr inbounds <8 x i16>, <8 x i16>* %egress, i64 0, i64 7
    ;store i16 %a2, i16* %astore
    ret i16 %a2
}
```
The optimizer recognizes this and llc nicely outputs a vpmovmskb instruction:
```
foo: # @foo
    vpmovmskb eax, xmm0
    ret
```

Writing to the output vector also works well:
```
define void @writing(<8 x i16>* dereferenceable(16) %egress, <16 x i8> %a0) {
    %astore = getelementptr inbounds <8 x i16>, <8 x i16>* %egress, i64 0, i64 7
    store i16 123, i16* %astore
    ret void
}
```
outputs:
```
writing: # @writing
    mov word ptr [rdi + 14], 123
    ret
```

Now, combining these two by uncommenting the store in `foo()` suddenly results in a very large function, instead of just:
```
    vpmovmskb eax, xmm0
    mov word ptr [rdi + 14], ax
    ret
```
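
For completeness, the combined testcase is simply `foo()` with the store uncommented:
```
define i16 @foo(<8 x i16>* dereferenceable(16) %egress, <16 x i8> %a0) {
    %a1 = icmp slt <16 x i8> %a0, zeroinitializer
    %a2 = bitcast <16 x i1> %a1 to i16
    %astore = getelementptr inbounds <8 x i16>, <8 x i16>* %egress, i64 0, i64 7
    store i16 %a2, i16* %astore
    ret i16 %a2
}
```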

Is there something wrong with my IR code, or is the optimizer somehow confused? Can I rewrite the code so that the optimizer does understand it?


Thanks a lot for the help.
Cheers,
  Johan


Re: [llvm-dev] Vectorizer has trouble with vpmovmskb and store

Craig Topper via llvm-dev
Here's a quick patch that fixes this. I don't know how to avoid it in IR. I haven't checked any other tests, but it does fix your case. I'll try to put up a real Phabricator review tonight or tomorrow.

diff --git a/lib/Target/X86/X86ISelLowering.cpp b/lib/Target/X86/X86ISelLowering.cpp
index e31f2a6..d79c0be 100644
--- a/lib/Target/X86/X86ISelLowering.cpp
+++ b/lib/Target/X86/X86ISelLowering.cpp
@@ -4837,6 +4837,11 @@ bool X86TargetLowering::isCheapToSpeculateCtlz() const {

 bool X86TargetLowering::isLoadBitCastBeneficial(EVT LoadVT,
                                                 EVT BitcastVT) const {
+  if (!LoadVT.isVector() && BitcastVT.isVector() &&
+      BitcastVT.getVectorElementType() == MVT::i1 &&
+      !Subtarget.hasAVX512())
+    return false;
+
   if (!Subtarget.hasDQI() && BitcastVT == MVT::v8i1)
     return false;


~Craig


Re: [llvm-dev] Vectorizer has trouble with vpmovmskb and store

Craig Topper via llvm-dev
We should handle this a lot better after r34763.

~Craig


Re: [llvm-dev] Vectorizer has trouble with vpmovmskb and store

Johan Engelen via llvm-dev
Hello Craig,
  Thank you for the quick response and fix.
However, the improvement turns out to be quite fragile. If I run `opt` on the original testcase and then run the output through `llc`, I get the previous very long assembly output again. (Things work for a bitcast from <16 x i1> to i16, but not for a store through a <16 x i1>* pointer.)
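
Roughly, `opt` folds the store of the bitcast into a store of the <16 x i1> value through a bitcast pointer. A hand-written sketch of the post-`opt` IR (not the exact output, and the value names are mine):
```
define i16 @foo(<8 x i16>* dereferenceable(16) %egress, <16 x i8> %a0) {
    %a1 = icmp slt <16 x i8> %a0, zeroinitializer
    %a2 = bitcast <16 x i1> %a1 to i16
    %astore = getelementptr inbounds <8 x i16>, <8 x i16>* %egress, i64 0, i64 7
    ; the i16 store of the bitcast becomes a direct <16 x i1> store
    %vstore = bitcast i16* %astore to <16 x i1>*
    store <16 x i1> %a1, <16 x i1>* %vstore, align 2
    ret i16 %a2
}
```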

regards,
  Johan



Re: [llvm-dev] Vectorizer has trouble with vpmovmskb and store

Craig Topper via llvm-dev
I was afraid of that. I thought I had checked whether InstCombine would remove the bitcast here, but I guess I didn't, or didn't do it right. I'll see what I can do to fix this.

~Craig


Re: [llvm-dev] Vectorizer has trouble with vpmovmskb and store

Craig Topper via llvm-dev
OK, I've made another fix in r348104.

~Craig


_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev