[llvm-dev] MatchLoadCombine(): handling for vectorized loop.


[llvm-dev] MatchLoadCombine(): handling for vectorized loop.

Jonas Paulsson via llvm-dev
Hi,

I have noticed some loops that build a wider element by loading small
elements, zero-extending them, shifting them by different amounts, and
then 'or'ing them all together. The result is equivalent either to a
single wider load or to a byte-swapped one.

DAGCombiner::MatchLoadCombine() will combine this to a single wide load,
but only in the scalar cases of i16, i32 and i64. The result is that
these loops (I have seen a dozen or so on SPEC) get vectorized with a
lot of ugly code.

I have begun experimenting with handling the vectorized loop as well, and
would like to know whether people think this would be a good idea. Also,
am I right to assume that it should probably run before type legalization?

/Jonas


_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] MatchLoadCombine(): handling for vectorized loop.

Eli Friedman via llvm-dev
On 12/3/2018 8:20 AM, Jonas Paulsson wrote:

> Hi,
>
> I have noticed some loops that build a wider element by loading small
> elements, zero-extending them, shifting them (with different amounts)
> to then 'or' them all together. They are either equivalent of a wider
> load, or to that of a byte-swapped one.
>
> DAGCombiner::MatchLoadCombine() will combine this to a single wide
> load, but only in the scalar cases of i16, i32 and i64. The result is
> that these loops (I have seen a dozen or so on SPEC) get vectorized
> with a lot of ugly code.
>
> I have begun to experiment with handling the vectorized loop also, and
> would like to know if people think this would be a good idea? Also, am
> I right to assume that it probably should be run before type
> legalization?
>
You mean, trying to merge some combination of vector loads and shuffles
into a single vector load in DAGCombine?  That seems sort of late, given
the cost modeling involved in vectorization.

See also
http://lists.llvm.org/pipermail/llvm-dev/2018-February/121000.html ?

-Eli

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project


Re: [llvm-dev] MatchLoadCombine(): handling for vectorized loop.

Jonas Paulsson via llvm-dev
Hi Eli,

On 2018-12-04 00:37, Friedman, Eli wrote:

> On 12/3/2018 8:20 AM, Jonas Paulsson wrote:
>> Hi,
>>
>> I have noticed some loops that build a wider element by loading small
>> elements, zero-extending them, shifting them (with different amounts)
>> to then 'or' them all together. They are either equivalent of a wider
>> load, or to that of a byte-swapped one.
>>
>> DAGCombiner::MatchLoadCombine() will combine this to a single wide
>> load, but only in the scalar cases of i16, i32 and i64. The result is
>> that these loops (I have seen a dozen or so on SPEC) get vectorized
>> with a lot of ugly code.
>>
>> I have begun to experiment with handling the vectorized loop also,
>> and would like to know if people think this would be a good idea?
>> Also, am I right to assume that it probably should be run before type
>> legalization?
>>
> You mean, trying to merge some combination of vector loads and
> shuffles into a single vector load in DAGCombine?  That seems sort of
> late, given the cost modeling involved in vectorization.
>
What happens, specifically, is that a scalar loop written (most likely at
the source level) like

///  i8 *a = ...
///  i32 val = a[0] | (a[1] << 8) | (a[2] << 16) | (a[3] << 24)

becomes during DAGCombine (this is from the comment in DAGCombiner.cpp:5750)

/// =>
///  i32 val = *((i32)a)

I also wondered why this is done so late, but found in the commit message
(b52af07 / r293036) that it had deliberately been placed late in the
pipeline: "...We came to the conclusion that we
want to do this transformation late in the pipeline because in presence
of atomic loads load widening is irreversible transformation and it
might hinder other optimizations..."

The loop vectorizer looks at such a loop and estimates the scalar loop's
cost at 13 and the VF=4 loop's at 5. The cost for the vectorized loop is
about right, but the scalar loop becomes much better with the help of
MatchLoadCombine(): just 2 instructions (load + store). The vector loop
could also have been just 2 instructions, a Vector Load + Vector Store,
but instead it does the Vector Load + vector shuffles, vector zero
extensions, vector element shifts, vector ors, and the Vector Store :-(
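For concreteness, a loop of the shape under discussion might look like
the following in C (a hypothetical reconstruction; `pack_u32` and the
layout are illustrative, not taken from the actual benchmarks):

```c
#include <stdint.h>

/* Combine groups of four bytes into 32-bit values.  When vectorized at
   VF=4, the ideal loop body on a little-endian target is one 16-byte
   vector load and one 16-byte vector store; the complaint above is that
   the vectorizer instead emits shuffles, zero extensions, element
   shifts, and ors per iteration. */
void pack_u32(uint32_t *dst, const uint8_t *src, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = (uint32_t)src[4 * i]
               | ((uint32_t)src[4 * i + 1] << 8)
               | ((uint32_t)src[4 * i + 2] << 16)
               | ((uint32_t)src[4 * i + 3] << 24);
}
```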

It would be very far-fetched to correct the cost for the scalar loop in
the LoopVectorizer. It seems like a better idea to implement the
handling for the vectorized loop also in DAGCombiner. As an alternative,
we could just ignore this I guess...

My patch does a recursive search starting at an ISD::Or node and then
tries to prove that the sub-DAG with all the ors, shifts and extends is
equivalent to the original Vector Load, possibly in reversed order. I
think that's a somewhat simpler problem than what the scalar loop
handling of MatchLoadCombine() is doing.


> See also
> http://lists.llvm.org/pipermail/llvm-dev/2018-February/121000.html ?
>
+@Florian:

I think SLP awareness in the LoopVectorizer would be great, and I wonder
what happened with this. (It seems, however, that SLP is not able to
handle this type of loop.)

/Jonas



Re: [llvm-dev] MatchLoadCombine(): handling for vectorized loop.

Eli Friedman via llvm-dev
On 12/4/2018 1:01 AM, Jonas Paulsson wrote:

> Hi Eli,
>
> On 2018-12-04 00:37, Friedman, Eli wrote:
>> On 12/3/2018 8:20 AM, Jonas Paulsson wrote:
>>> Hi,
>>>
>>> I have noticed some loops that build a wider element by loading
>>> small elements, zero-extending them, shifting them (with different
>>> amounts) to then 'or' them all together. They are either equivalent
>>> of a wider load, or to that of a byte-swapped one.
>>>
>>> DAGCombiner::MatchLoadCombine() will combine this to a single wide
>>> load, but only in the scalar cases of i16, i32 and i64. The result
>>> is that these loops (I have seen a dozen or so on SPEC) get
>>> vectorized with a lot of ugly code.
>>>
>>> I have begun to experiment with handling the vectorized loop also,
>>> and would like to know if people think this would be a good idea?
>>> Also, am I right to assume that it probably should be run before
>>> type legalization?
>>>
>> You mean, trying to merge some combination of vector loads and
>> shuffles into a single vector load in DAGCombine?  That seems sort of
>> late, given the cost modeling involved in vectorization.
>>
> What happens specifically is that a scalar loop that is written (most
> likely at source level) like
>
> ///  i8 *a = ...
> ///  i32 val = a[0] | (a[1] << 8) | (a[2] << 16) | (a[3] << 24)
>
> becomes during DAGCombine (this is from the comment in
> DAGCombiner.cpp:5750)
>
> /// =>
> ///  i32 val = *((i32)a)
>
> I also wondered why this is done so late but found in the commit
> message (b52af07 / r293036) that there had been a discussion where
> this had intentionally been addressed late: "...We came to the
> conclusion that we want to do this transformation late in the pipeline
> because in presence of atomic loads load widening is irreversible
> transformation and it might hinder other optimizations..."

Yes.  (I'm not completely convinced that we need to delay the transform
all the way to DAGCombine, though...)

> The loop vectorizer looks at such a loop and thinks the scalar loop
> costs 13 and VF=4 costs 5. The cost for the vectorized loop is about
> right, but the scalar loop becomes much better with the help of
> MatchLoadCombine(): just 2 instructions (load + store). The vector
> loop could also have been just 2 instructions: Vector Load + Vector
> Store, but instead it does Vector Load + vector shuffles, vector zero
> extensions, vector element shifts, vector ors and the Vector Store :-(
>
> It would be very far-fetched to correct the cost for the scalar loop
> in the LoopVectorizer. It seems like a better idea to implement the
> handling for the vectorized loop also in DAGCombiner. As an
> alternative, we could just ignore this I guess...

The current vectorizer doesn't really have infrastructure for this, at
least.

>
> My patch does a recursive search starting at an ISD::Or node and then
> tries to prove that the sub-DAG with all the ors, shifts and extends
> is equivalent to the original Vector Load, possibly in reversed order.
> I think that's a somewhat simpler problem than what the scalar loop
> handling of MatchLoadCombine() is doing.

Oh, it's simpler because it doesn't actually have to deal with memory? 
Makes sense.  Still seems kind of awkward to generate the terrible
vector loop and recover it later, though; it's fragile based on the
shuffle costs.

If you are going to do this, probably simpler to do in instcombine?

>
>
>> See also
>> http://lists.llvm.org/pipermail/llvm-dev/2018-February/121000.html ?
>>
> +@Florian:
>
> I think SLP awareness in LoopVectorizer would be great and wonder now
> what happened with this? (It seems however SLP is not able to handle
> this type of loop).

The existing SLP pass actually could handle this sort of sequence, but
we currently suppress it if the result would be too narrow.  See also
https://reviews.llvm.org/D48725 .

-Eli



Re: [llvm-dev] MatchLoadCombine(): handling for vectorized loop.

Jonas Paulsson via llvm-dev
Hi Eli,


>>>>
>>>> I have noticed some loops that build a wider element by loading
>>>> small elements, zero-extending them, shifting them (with different
>>>> amounts) to then 'or' them all together. They are either equivalent
>>>> of a wider load, or to that of a byte-swapped one.
>>>>
>>>> DAGCombiner::MatchLoadCombine() will combine this to a single wide
>>>> load, but only in the scalar cases of i16, i32 and i64. The result
>>>> is that these loops (I have seen a dozen or so on SPEC) get
>>>> vectorized with a lot of ugly code.
>>>>
>>>> I have begun to experiment with handling the vectorized loop also,
>>>> and would like to know if people think this would be a good idea?
>>>> Also, am I right to assume that it probably should be run before
>>>> type legalization?
>>>>
>>> You mean, trying to merge some combination of vector loads and
>>> shuffles into a single vector load in DAGCombine?  That seems sort
>>> of late, given the cost modeling involved in vectorization.
>>>
>> What happens specifically is that a scalar loop that is written (most
>> likely at source level) like
>>
>> ///  i8 *a = ...
>> ///  i32 val = a[0] | (a[1] << 8) | (a[2] << 16) | (a[3] << 24)
>>
>> becomes during DAGCombine (this is from the comment in
>> DAGCombiner.cpp:5750)
>>
>> /// =>
>> ///  i32 val = *((i32)a)
>>
>> I also wondered why this is done so late but found in the commit
>> message (b52af07 / r293036) that there had been a discussion where
>> this had intentionally been addressed late: "...We came to the
>> conclusion that we want to do this transformation late in the
>> pipeline because in presence of atomic loads load widening is
>> irreversible transformation and it might hinder other optimizations..."
>
> Yes.  (I'm not completely convinced that we need to delay the
> transform all the way to DAGCombine, though...)
>
>> The loop vectorizer looks at such a loop and thinks the scalar loop
>> costs 13 and VF=4 costs 5. The cost for the vectorized loop is about
>> right, but the scalar loop becomes much better with the help of
>> MatchLoadCombine(): just 2 instructions (load + store). The vector
>> loop could also have been just 2 instructions: Vector Load + Vector
>> Store, but instead it does Vector Load + vector shuffles, vector zero
>> extensions, vector element shifts, vector ors and the Vector Store :-(
>>
>> It would be very far-fetched to correct the cost for the scalar loop
>> in the LoopVectorizer. It seems like a better idea to implement the
>> handling for the vectorized loop also in DAGCombiner. As an
>> alternative, we could just ignore this I guess...
>
> The current vectorizer doesn't really have infrastructure for this, at
> least.
>
>>
>> My patch does a recursive search starting at an ISD::Or node and then
>> tries to prove that the sub-DAG with all the ors, shifts and extends
>> is equivalent to the original Vector Load, possibly in reversed
>> order. I think that's a somewhat simpler problem than what the scalar
>> loop handling of MatchLoadCombine() is doing.
>
> Oh, it's simpler because it doesn't actually have to deal with
> memory?  Makes sense.  Still seems kind of awkward to generate the
> terrible vector loop and recover it later, though; it's fragile based
> on the shuffle costs.
>
> If you are going to do this, probably simpler to do in instcombine?
>
I have previously tried to extend instcombine with optimizations of
vector shuffles. I was told then that instcombine should be very careful
with optimizing permutations, since the backends may then produce worse
code. I suppose that in this case, where we would get just the vector
load instruction instead of the shifts, ors, etc., it might be acceptable?

The reason I chose DAGCombine was the explanation given for the scalar
case, which deliberately does this very late. I thought I might as well
put it in the same spot...

I am not sure I want to pursue this seriously without at least one other
person thinking it would be valuable and being willing to review, and
also not before knowing which pass should do it. I have not seen any real
need for this in benchmarks; it's just an observation...

> The existing SLP pass actually could handle this sort of sequence, but
> we currently suppress it if the result would be too narrow.  See also
> https://reviews.llvm.org/D48725 .

Could SLP really handle this case so as to get rid of the shifts and ors?

/Jonas



Re: [llvm-dev] MatchLoadCombine(): handling for vectorized loop.

Eli Friedman via llvm-dev
On 12/10/2018 4:51 AM, Jonas Paulsson wrote:

> Hi Eli,
>
>
>>>>>
>>>>> I have noticed some loops that build a wider element by loading
>>>>> small elements, zero-extending them, shifting them (with different
>>>>> amounts) to then 'or' them all together. They are either
>>>>> equivalent of a wider load, or to that of a byte-swapped one.
>>>>>
>>>>> DAGCombiner::MatchLoadCombine() will combine this to a single wide
>>>>> load, but only in the scalar cases of i16, i32 and i64. The result
>>>>> is that these loops (I have seen a dozen or so on SPEC) get
>>>>> vectorized with a lot of ugly code.
>>>>>
>>>>> I have begun to experiment with handling the vectorized loop also,
>>>>> and would like to know if people think this would be a good idea?
>>>>> Also, am I right to assume that it probably should be run before
>>>>> type legalization?
>>>>>
>>>> You mean, trying to merge some combination of vector loads and
>>>> shuffles into a single vector load in DAGCombine?  That seems sort
>>>> of late, given the cost modeling involved in vectorization.
>>>>
>>> What happens specifically is that a scalar loop that is written
>>> (most likely at source level) like
>>>
>>> ///  i8 *a = ...
>>> ///  i32 val = a[0] | (a[1] << 8) | (a[2] << 16) | (a[3] << 24)
>>>
>>> becomes during DAGCombine (this is from the comment in
>>> DAGCombiner.cpp:5750)
>>>
>>> /// =>
>>> ///  i32 val = *((i32)a)
>>>
>>> I also wondered why this is done so late but found in the commit
>>> message (b52af07 / r293036) that there had been a discussion where
>>> this had intentionally been addressed late: "...We came to the
>>> conclusion that we want to do this transformation late in the
>>> pipeline because in presence of atomic loads load widening is
>>> irreversible transformation and it might hinder other optimizations..."
>>
>> Yes.  (I'm not completely convinced that we need to delay the
>> transform all the way to DAGCombine, though...)
>>
>>> The loop vectorizer looks at such a loop and thinks the scalar loop
>>> costs 13 and VF=4 costs 5. The cost for the vectorized loop is about
>>> right, but the scalar loop becomes much better with the help of
>>> MatchLoadCombine(): just 2 instructions (load + store). The vector
>>> loop could also have been just 2 instructions: Vector Load + Vector
>>> Store, but instead it does Vector Load + vector shuffles, vector
>>> zero extensions, vector element shifts, vector ors and the Vector
>>> Store :-(
>>>
>>> It would be very far-fetched to correct the cost for the scalar loop
>>> in the LoopVectorizer. It seems like a better idea to implement the
>>> handling for the vectorized loop also in DAGCombiner. As an
>>> alternative, we could just ignore this I guess...
>>
>> The current vectorizer doesn't really have infrastructure for this,
>> at least.
>>
>>>
>>> My patch does a recursive search starting at an ISD::Or node and
>>> then tries to prove that the sub-DAG with all the ors, shifts and
>>> extends is equivalent to the original Vector Load, possibly in
>>> reversed order. I think that's a somewhat simpler problem than what
>>> the scalar loop handling of MatchLoadCombine() is doing.
>>
>> Oh, it's simpler because it doesn't actually have to deal with
>> memory?  Makes sense.  Still seems kind of awkward to generate the
>> terrible vector loop and recover it later, though; it's fragile based
>> on the shuffle costs.
>>
>> If you are going to do this, probably simpler to do in instcombine?
>>
> I have tried previously to extend instcombine with optimizations of
> vector shuffles. I was told then that instcombine should be very
> careful with optimizing permutations since the backends may then
> produce worse code. I suppose that in this case, if we get just the
> vector load instruction instead of the shifts and ors etc, that might
> be acceptable?

Yes, the primary issue with shuffle combining in instcombine is avoiding
creating new shuffles that backends don't know how to handle (a hard
problem in general for longer vectors). Removing shuffles is fine.
And I think the vector case is less likely to run into the same issues
as scalars, because the loaded values are interleaved.

There's also been more movement towards distinguishing between early and
late combines... so passes can run on IR, but late in the pass pipeline
so they don't interfere with GVN etc.

>
> The reason I chose the DAGCombine was due to the explanation for the
> scalar case to deliberately make this very late. I thought I might as
> well put it in the same spot...
>
> I am not sure I want to do this seriously without at least one other
> person thinking it would be valuable and willing to review, and also
> not before I know what pass should do it. I have not seen any real
> need for this in benchmarks, it's just an observation...

I've seen code like this before, but not in a context where vectorizing
it would be helpful.

>> The existing SLP pass actually could handle this sort of sequence,
>> but we currently suppress it if the result would be too narrow.  See
>> also https://reviews.llvm.org/D48725 .
>
> Could SLP really handle this case so as to get rid of the shifts and ors?

Oh, no, probably not at the moment; I didn't think that through. It
could theoretically be extended, but currently it doesn't try to split
an integer into multiple lanes.

-Eli

