[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics


Chris Lattner via llvm-dev

Hi all,

 

While working on a patch to improve codegen for fadd reductions on AArch64, I stumbled upon an outstanding issue with the experimental.vector.reduce.fadd/fmul intrinsics where the accumulator argument is ignored when fast-math is set on the intrinsic call. This behaviour is described in the LangRef (https://www.llvm.org/docs/LangRef.html#id1905) and is mentioned in https://bugs.llvm.org/show_bug.cgi?id=36734 and further discussed in D45336 and D59356.

 

This means that for example:

    %res = call fast float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float undef, <4 x float> %v)

 

does not result in %res being 'undef', but rather in a reduction of <4 x float> %v. The definitions of these intrinsics differ from their corresponding SelectionDAG nodes, which explicitly split out a non-strict VECREDUCE_FADD that does not take a start-value operand and a VECREDUCE_STRICT_FADD that does.
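To make the pitfall concrete, here is a rough Python model of the current (pre-change) behaviour; the function name and flag are invented for illustration and are not LLVM APIs:

```python
def current_reduce_fadd(acc, vec, fast=False):
    """Model of the *current* intrinsic semantics: with fast-math set
    on the call, the accumulator operand is silently ignored."""
    if fast:
        return sum(vec)  # acc is dropped entirely -- even if it is undef
    result = acc
    for x in vec:
        result += x
    return result

v = [1.0, 2.0, 3.0, 4.0]
print(current_reduce_fadd(100.0, v, fast=True))   # 10.0 -- accumulator ignored
print(current_reduce_fadd(100.0, v))              # 110.0
```

The surprise is that flipping a fast-math flag changes which operands contribute to the result at all, not just the evaluation order.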

 

With the vector reduction intrinsics still experimental, I would like to propose changing this behaviour. I would also like to take this opportunity to ask what other changes are required so that we can make an effort towards dropping the 'experimental' prefix at some point.

 

 

Proposed change:

----------------------------

In this RFC I propose changing the intrinsics for llvm.experimental.vector.reduce.fadd and llvm.experimental.vector.reduce.fmul (see options A and B). I also propose renaming the 'accumulator' operand to 'start value', because for fmul this is the start value of the reduction rather than a value into which the fmul reduction is accumulated.

 

[Option A] Always using the start value operand in the reduction (https://reviews.llvm.org/D60261)

 

  declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %vec)

 

This means that if the start value is 'undef', the result will be undef, and all code creating such a reduction will need to ensure it has a sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When using 'fast' or 'reassoc' on the call, it will be implemented using an unordered reduction; otherwise it will be implemented with an ordered reduction. Note that a new intrinsic is required to capture the new semantics. In this proposal the intrinsic is prefixed with a 'v2' for the time being, with the expectation that this will be dropped when we remove 'experimental' from the reduction intrinsics in the future.
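A sketch of the Option A semantics in Python (the function name and the boolean standing in for the 'reassoc'/'fast' flag are invented for illustration):

```python
from functools import reduce

def reduce_v2_fadd(start, vec, reassoc=False):
    """Model of the proposed 'v2' semantics: the start value always
    participates in the reduction, whatever the fast-math flags."""
    if reassoc:
        # Unordered lowering: any association of the elements is
        # allowed, but the start value is still part of the result.
        return start + sum(vec)
    # Ordered lowering: strict left-to-right fold from 'start'.
    return reduce(lambda acc, x: acc + x, vec, start)

v = [1.0, 2.0, 3.0, 4.0]
print(reduce_v2_fadd(0.0, v))                   # 10.0
print(reduce_v2_fadd(100.0, v, reassoc=True))   # 110.0 -- start no longer ignored
```

The flag now only selects the evaluation strategy; the set of contributing operands is the same either way.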

 

[Option B] Having separate ordered and unordered intrinsics (https://reviews.llvm.org/D60262).

 

  declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value, <4 x float> %vec)

  declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)

 

This means that the behaviour is explicit in the intrinsic, and the use of 'fast' or 'reassoc' on the call has no effect on how that intrinsic is lowered. The ordered reduction intrinsic will take a scalar start-value operand, whereas the unordered reduction intrinsic will only take a vector operand.
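Modelled in Python (function names invented here), the Option B pair looks like this; a caller that wants a start value with the unordered form simply adds it afterwards, which reassociation already permits:

```python
from functools import reduce

def reduce_fadd_ordered(start, vec):
    # Strict left-to-right fold: (((start + v0) + v1) + ...) + vN
    return reduce(lambda acc, x: acc + x, vec, start)

def reduce_fadd_unordered(vec):
    # No start operand; any association of the elements is permitted.
    return sum(vec)

v = [1.5, 2.5, 4.0]
print(reduce_fadd_ordered(0.0, v))      # 8.0
print(0.0 + reduce_fadd_unordered(v))   # 8.0
```

With exactly representable values the two agree; the point of keeping them separate is that no flag on the call can silently switch one behaviour for the other.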

 

Both options auto-upgrade the IR to use the new (version of the) intrinsics. I'm personally slightly in favour of [Option B], because it better aligns with the definition of the SelectionDAG nodes and is more explicit in its semantics. We also avoid having to use an artificial 'v2' like prefix to denote the new behaviour of the intrinsic.

 

 

Further efforts:

----------------------------

Here is a non-exhaustive list of items that I think work towards making the intrinsics non-experimental:

  • Adding SelectionDAG legalization for the _STRICT reduction SDNodes. After some great work from Nikita in D58015, unordered reductions are now legalized/expanded in SelectionDAG, so if we add expansion in SelectionDAG for strict reductions this would make the ExpandReductionsPass redundant.
  • Better enforcing the constraints of the intrinsics (see https://reviews.llvm.org/D60260 ).
  • I think we'll also want to be able to overload the result operand based on the vector element type for the intrinsics that have the constraint that the result type must match the vector element type, e.g. dropping the redundant 'i32' in:

    i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32> %a) => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> %a)

    since i32 is implied by <4 x i32>. This would have the added benefit that LLVM would automatically check for the operands to match.

  • Dropping the 'experimental' from the reduction intrinsics and auto-upgrading to the new intrinsics once there is agreement on stability of the definitions.

 

Please let me know if anything obvious is missing here.

 

Thanks,

 

Sander


_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Sander De Smalen via llvm-dev <[hidden email]> writes:

> This means that for example:
>
>     %res = call fast float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float undef, <4 x float> %v)
>
>  
>
> does not result in %res being 'undef', but rather a reduction of <4 x
> float> %v. The definition of these intrinsics are different from their
> corresponding SelectionDAG nodes which explicitly split out a
> non-strict VECREDUCE_FADD that explicitly does not take a start-value
> operand, and a VECREDUCE_STRICT_FADD which does.

This seems very strange to me.  What was the rationale for ignoring the
first argument?  What was the rationale for the first argument existing
at all?  Because that's how SVE reductions work?  The asymmetry with
llvm.experimental.vector.reduce.add is odd.

>
> [Option B] Having separate ordered and unordered intrinsics (https://reviews.llvm.org/D60262).
>
>  
>
>   declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>
>   declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)
>
>  
>
> This will mean that the behaviour is explicit from the intrinsic and
> the use of 'fast' or ‘reassoc’ on the call has no effect on how that
> intrinsic is lowered. The ordered reduction intrinsic will take a
> scalar start-value operand, where the unordered reduction intrinsic
> will only take a vector operand.

This seems by far the better solution.  I'd much rather have things be
explicit in the IR than implicit via flags that might accidentally get
dropped.

Again, the asymmetry between these (one with a start value and one
without) seems strange and arbitrary.  Why do we need start values at
all?  Is it really difficult for isel to match s + vector.reduce(v)?

                       -David

Re: [llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Hi David,

The reason for the asymmetry and for requiring an explicit start-value operand is to be able to describe strict reductions that need to preserve the same associativity as a scalarized reduction.

For example:
  %res = call float @llvm.experimental.vector.reduce.fadd(float %start, <4 x float> <float %elt0, float %elt1, float %elt2, float %elt3>)

describes the following reduction:
  %res = (((%start + %elt0) + %elt1) + %elt2) + %elt3

Where:
  %tmp = call float @llvm.experimental.vector.reduce.fadd(<4 x float> <float %elt0, float %elt1, float %elt2, float %elt3>)
  %res = fadd float %start, %tmp

Describes:
  %res = %start + (((%elt0 + %elt1) + %elt2) + %elt3)

This is not the same, which is why the start operand is needed in the intrinsic itself. For fast-math (specifically the 'reassoc' property) the compiler is free to reassociate the expression, so the start/accumulator operand isn't needed.
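The two groupings really do produce different IEEE results; a quick Python check (doubles standing in for the floats above):

```python
start, elts = 0.1, [0.2, 0.3]

# Strict/ordered: ((start + e0) + e1), the in-intrinsic start value.
ordered = start
for e in elts:
    ordered += e

# Reduce the vector first, then add the start value afterwards.
separate = start + sum(elts)

print(ordered)              # 0.6000000000000001
print(separate)             # 0.6
print(ordered == separate)  # False
```

Under 'reassoc' either answer is acceptable, which is exactly why the unordered form can drop the operand while the strict form cannot.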

Cheers,

Sander


On 04/04/2019, 16:44, "David Greene" <[hidden email]> wrote:

    Sander De Smalen via llvm-dev <[hidden email]> writes:
   
    > This means that for example:
    >
    >     %res = call fast float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float undef, <4 x float> %v)
    >
    >  
    >
    > does not result in %res being 'undef', but rather a reduction of <4 x
    > float> %v. The definition of these intrinsics are different from their
    > corresponding SelectionDAG nodes which explicitly split out a
    > non-strict VECREDUCE_FADD that explicitly does not take a start-value
    > operand, and a VECREDUCE_STRICT_FADD which does.
   
    This seems very strange to me.  What was the rationale for ignoring the
    first argument?  What was the rationale for the first argument existing
    at all?  Because that's how SVE reductions work?  The asymmetry with
    llvm.experimental.vector.reduce.add is odd.
    >
    > [Option B] Having separate ordered and unordered intrinsics (https://reviews.llvm.org/D60262).
    >
    >  
    >
    >   declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
    >
    >   declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)
    >
    >  
    >
    > This will mean that the behaviour is explicit from the intrinsic and
    > the use of 'fast' or ‘reassoc’ on the call has no effect on how that
    > intrinsic is lowered. The ordered reduction intrinsic will take a
    > scalar start-value operand, where the unordered reduction intrinsic
    > will only take a vector operand.
   
    This seems by far the better solution.  I'd much rather have things be
    explicit in the IR than implicit via flags that might accidentally get
    dropped.
   
    Again, the asymmetry between these (one with a start value and one
    without) seems strange and arbitrary.  Why do we need start values at
    all?  Is it really difficult for isel to match s + vector.reduce(v)?
   
                           -David
   


Re: [llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Sander De Smalen <[hidden email]> writes:

> Hi David,
>
> The reason for the asymmetry and requiring an explicit start-value
> operand is to be able to describe strict reductions that need to
> preserve the same associativity of a scalarized reduction.
>
> For example:
>   %res = call float @llvm.experimental.vector.reduce.fadd(float %start, <4 x float> <float %elt0, float %elt1, float %elt2, float %elt3>)
>
> describes the following reduction:
>   %res = (((%start + %elt0) + %elt1) + %elt2) + %elt3
>
> Where:
>   %tmp = call float @llvm.experimental.vector.reduce.fadd(<4 x float> <float %elt0, float %elt1, float %elt2, float %elt3>)
>   %res = fadd float %start, %tmp
>
> Describes:
>   %res = %start + (((%elt0 + %elt1) + %elt2) + %elt3)
>
> Which is not the same, hence why the start operand is needed in the
> intrinsic itself. For fast-math (specifically the 'reassoc' property)
> the compiler is free to reassociate the expression, so the
> start/accumulator operand isn't needed.

Ok, I see.  I was assuming the scalar would just be folded into the
vector and then the vector would be reduced, but that could be awkward,
necessitating the use of insertelement/extractelement.

Thanks for explaining.

                     -David

Re: [llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

In reply to this post by Chris Lattner via llvm-dev
Hi Sander,

thank you for pushing these intrinsics forward. I am not familiar
enough with the details to say whether anything else might be missing
for them to become less experimental, but the changes you propose all
look like clear-cut improvements to me.

On Thu, 4 Apr 2019 at 15:11, Sander De Smalen via llvm-dev
<[hidden email]> wrote:

>
> Hi all,
>
>
>
> While working on a patch to improve codegen for fadd reductions on AArch64, I stumbled upon an outstanding issue with the experimental.vector.reduce.fadd/fmul intrinsics where the accumulator argument is ignored when fast-math is set on the intrinsic call. This behaviour is described in the LangRef (https://www.llvm.org/docs/LangRef.html#id1905) and is mentioned in https://bugs.llvm.org/show_bug.cgi?id=36734 and further discussed in D45336 and D59356.
>
>
>
> This means that for example:
>
>     %res = call fast float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float undef, <4 x float> %v)
>
>
>
> does not result in %res being 'undef', but rather a reduction of <4 x float> %v. The definition of these intrinsics are different from their corresponding SelectionDAG nodes which explicitly split out a non-strict VECREDUCE_FADD that explicitly does not take a start-value operand, and a VECREDUCE_STRICT_FADD which does.
>
>
>
> With the vector reduction intrinsics still experimental, I would like to propose to change this behaviour. I would also like to take this opportunity to ask what other changes are required, so that we can make an effort towards dropping the 'experimental' prefix at some point.

Big +1 for removing this pitfall. It's a very counterintuitive design
IMO and is partially responsible for some unnecessarily complex code
in the Rust compiler, so I'd be happy to see it go.

>
>
>
> Proposed change:
>
> ----------------------------
>
> In this RFC I propose changing the intrinsics for llvm.experimental.vector.reduce.fadd and llvm.experimental.vector.reduce.fmul (see options A and B). I also propose renaming the 'accumulator' operand to 'start value' because for fmul this is the start value of the reduction, rather than a value to which the fmul reduction is accumulated into.
>
>
>
> [Option A] Always using the start value operand in the reduction (https://reviews.llvm.org/D60261)
>
>
>
>   declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>
>
>
> This means that if the start value is 'undef', the result will be undef and all code creating such a reduction will need to ensure it has a sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When using 'fast' or ‘reassoc’ on the call it will be implemented using an unordered reduction, otherwise it will be implemented with an ordered reduction. Note that a new intrinsic is required to capture the new semantics. In this proposal the intrinsic is prefixed with a 'v2' for the time being, with the expectation this will be dropped when we remove 'experimental' from the reduction intrinsics in the future.

I like that this option will allow some code (e.g., frontends,
constant folding, certain instcombines) to treat unordered reductions
as essentially the same as ordered ones without having to go out of
its way to cover two different intrinsics with slightly different
argument lists.

Cheers,
Robin

Re: [llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

In reply to this post by Chris Lattner via llvm-dev
On 04/04/2019 14:11, Sander De Smalen wrote:
Proposed change:

----------------------------

In this RFC I propose changing the intrinsics for llvm.experimental.vector.reduce.fadd and llvm.experimental.vector.reduce.fmul (see options A and B). I also propose renaming the 'accumulator' operand to 'start value' because for fmul this is the start value of the reduction, rather than a value to which the fmul reduction is accumulated into.

 

[Option A] Always using the start value operand in the reduction (https://reviews.llvm.org/D60261)

 

  declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %vec)

 

This means that if the start value is 'undef', the result will be undef and all code creating such a reduction will need to ensure it has a sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When using 'fast' or ‘reassoc’ on the call it will be implemented using an unordered reduction, otherwise it will be implemented with an ordered reduction. Note that a new intrinsic is required to capture the new semantics. In this proposal the intrinsic is prefixed with a 'v2' for the time being, with the expectation this will be dropped when we remove 'experimental' from the reduction intrinsics in the future.

 

[Option B] Having separate ordered and unordered intrinsics (https://reviews.llvm.org/D60262).

 

  declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value, <4 x float> %vec)

  declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)

 

This will mean that the behaviour is explicit from the intrinsic and the use of 'fast' or ‘reassoc’ on the call has no effect on how that intrinsic is lowered. The ordered reduction intrinsic will take a scalar start-value operand, where the unordered reduction intrinsic will only take a vector operand.

 

Both options auto-upgrade the IR to use the new (version of the) intrinsics. I'm personally slightly in favour of [Option B], because it better aligns with the definition of the SelectionDAG nodes and is more explicit in its semantics. We also avoid having to use an artificial 'v2' like prefix to denote the new behaviour of the intrinsic.

Do we have any targets with instructions that can actually use the start value? TBH I'd be tempted to suggest we just make the initial extractelement/fadd/insertelement pattern a manual extra stage and avoid having that argument entirely.

Further efforts:

----------------------------

Here a non-exhaustive list of items I think work towards making the intrinsics non-experimental:

  • Adding SelectionDAG legalization for the  _STRICT reduction SDNodes. After some great work from Nikita in D58015, unordered reductions are now legalized/expanded in SelectionDAG, so if we add expansion in SelectionDAG for strict reductions this would make the ExpandReductionsPass redundant.
  • Better enforcing the constraints of the intrinsics (see https://reviews.llvm.org/D60260 ).
  • I think we'll also want to be able to overload the result operand based on the vector element type for the intrinsics having the constraint that the result type must match the vector element type. e.g. dropping the redundant 'i32' in:
     
    i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32> %a) => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> %a)

since i32 is implied by <4 x i32>. This would have the added benefit that LLVM would automatically check for the operands to match.

Won't this cause issues with overflow? Isn't the point of an add (or mul) reduction of, say, <64 x i8> to give a larger (i32 or i64) result so we don't lose anything? I agree that for bitop reductions it doesn't make sense, though.
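The overflow concern can be illustrated with a small Python model (masks standing in for the i8/i32 types; purely illustrative):

```python
vals = [200] * 64   # a <64 x i8> whose lanes all hold the unsigned value 200

# Reducing in the element type wraps modulo 2**8 at every step:
acc_i8 = 0
for v in vals:
    acc_i8 = (acc_i8 + v) & 0xFF

# Widening each element to i32 first preserves the true sum:
acc_i32 = sum(v & 0xFFFFFFFF for v in vals)

print(acc_i8)    # 0      (64 * 200 = 12800 = 50 * 256, wraps to 0)
print(acc_i32)   # 12800
```

With the result type forced to match the element type, the wider sum has to come from extending the input vector before the reduction rather than from the intrinsic itself.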

Simon.



Re: [llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics



On 05/04/2019 09:37, Simon Pilgrim via llvm-dev wrote:
On 04/04/2019 14:11, Sander De Smalen wrote:
Proposed change:

----------------------------

In this RFC I propose changing the intrinsics for llvm.experimental.vector.reduce.fadd and llvm.experimental.vector.reduce.fmul (see options A and B). I also propose renaming the 'accumulator' operand to 'start value' because for fmul this is the start value of the reduction, rather than a value to which the fmul reduction is accumulated into.

 

[Option A] Always using the start value operand in the reduction (https://reviews.llvm.org/D60261)

 

  declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %vec)

 

This means that if the start value is 'undef', the result will be undef and all code creating such a reduction will need to ensure it has a sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When using 'fast' or ‘reassoc’ on the call it will be implemented using an unordered reduction, otherwise it will be implemented with an ordered reduction. Note that a new intrinsic is required to capture the new semantics. In this proposal the intrinsic is prefixed with a 'v2' for the time being, with the expectation this will be dropped when we remove 'experimental' from the reduction intrinsics in the future.

 

[Option B] Having separate ordered and unordered intrinsics (https://reviews.llvm.org/D60262).

 

  declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value, <4 x float> %vec)

  declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)

 

This will mean that the behaviour is explicit from the intrinsic and the use of 'fast' or ‘reassoc’ on the call has no effect on how that intrinsic is lowered. The ordered reduction intrinsic will take a scalar start-value operand, where the unordered reduction intrinsic will only take a vector operand.

 

Both options auto-upgrade the IR to use the new (version of the) intrinsics. I'm personally slightly in favour of [Option B], because it better aligns with the definition of the SelectionDAG nodes and is more explicit in its semantics. We also avoid having to use an artificial 'v2' like prefix to denote the new behaviour of the intrinsic.

Do we have any targets with instructions that can actually use the start value? TBH I'd be tempted to suggest we just make the initial extractelement/fadd/insertelement pattern a manual extra stage and avoid having that argument entirely.

Further efforts:

----------------------------

Here a non-exhaustive list of items I think work towards making the intrinsics non-experimental:

  • Adding SelectionDAG legalization for the  _STRICT reduction SDNodes. After some great work from Nikita in D58015, unordered reductions are now legalized/expanded in SelectionDAG, so if we add expansion in SelectionDAG for strict reductions this would make the ExpandReductionsPass redundant.
  • Better enforcing the constraints of the intrinsics (see https://reviews.llvm.org/D60260 ).
  • I think we'll also want to be able to overload the result operand based on the vector element type for the intrinsics having the constraint that the result type must match the vector element type. e.g. dropping the redundant 'i32' in:
     
    i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32> %a) => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> %a)

since i32 is implied by <4 x i32>. This would have the added benefit that LLVM would automatically check for the operands to match.

Won't this cause issues with overflow? Isn't the point of an add (or mul) reduction of, say, <64 x i8> to give a larger (i32 or i64) result so we don't lose anything? I agree that for bitop reductions it doesn't make sense, though.

Sorry - I forgot to add: this raises the question - should we be considering signed/unsigned add/mul and possibly saturating reductions?


Re: [llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Hi Simon,

Thanks for your feedback! See my comments inline.

On 5 Apr 2019, at 09:47, Simon Pilgrim via llvm-dev <[hidden email]> wrote:


On 05/04/2019 09:37, Simon Pilgrim via llvm-dev wrote:
On 04/04/2019 14:11, Sander De Smalen wrote:
Proposed change:
----------------------------
In this RFC I propose changing the intrinsics for llvm.experimental.vector.reduce.fadd and llvm.experimental.vector.reduce.fmul (see options A and B). I also propose renaming the 'accumulator' operand to 'start value' because for fmul this is the start value of the reduction, rather than a value to which the fmul reduction is accumulated into.
 
[Option A] Always using the start value operand in the reduction (https://reviews.llvm.org/D60261)
 
  declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
 
This means that if the start value is 'undef', the result will be undef and all code creating such a reduction will need to ensure it has a sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When using 'fast' or ‘reassoc’ on the call it will be implemented using an unordered reduction, otherwise it will be implemented with an ordered reduction. Note that a new intrinsic is required to capture the new semantics. In this proposal the intrinsic is prefixed with a 'v2' for the time being, with the expectation this will be dropped when we remove 'experimental' from the reduction intrinsics in the future.
 
[Option B] Having separate ordered and unordered intrinsics (https://reviews.llvm.org/D60262).
 
  declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
  declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)
 
This will mean that the behaviour is explicit from the intrinsic and the use of 'fast' or ‘reassoc’ on the call has no effect on how that intrinsic is lowered. The ordered reduction intrinsic will take a scalar start-value operand, where the unordered reduction intrinsic will only take a vector operand.
 
Both options auto-upgrade the IR to use the new (version of the) intrinsics. I'm personally slightly in favour of [Option B], because it better aligns with the definition of the SelectionDAG nodes and is more explicit in its semantics. We also avoid having to use an artificial 'v2' like prefix to denote the new behaviour of the intrinsic.

Do we have any targets with instructions that can actually use the start value? TBH I'd be tempted to suggest we just make the initial extractelement/fadd/insertelement pattern a manual extra stage and avoid having that argument entirely.

ARM SVE has the FADDA instruction for strict fadd reductions (see for example test/MC/AArch64/SVE/fadda.s). This instruction takes an explicit start-value operand. The reduction intrinsics were originally introduced for SVE where we modelled the fadd/fmul reductions with this instruction in mind.

Just to clarify, is this what you are suggesting regarding extract/fadd/insert?

  %first = extractelement <4 x float> %input, i32 0
  %first.new = fadd float %start, %first
  %input.new = insertelement <4 x float> %input, float %first.new, i32 0
  %red = call float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(<4 x float> %input.new)

My only reservation here is that LLVM might obfuscate this code so that CodeGen couldn't easily match the extract/fadd/insert pattern, thus adding an extra fadd instruction. This could, for example, happen if the loop were rotated/pipelined to load the next iteration's data and perform the first 'fadd' before the next iteration.
In that case, having the extra operand would be more descriptive.

Further efforts:
----------------------------
Here a non-exhaustive list of items I think work towards making the intrinsics non-experimental:
  • Adding SelectionDAG legalization for the  _STRICT reduction SDNodes. After some great work from Nikita in D58015, unordered reductions are now legalized/expanded in SelectionDAG, so if we add expansion in SelectionDAG for strict reductions this would make the ExpandReductionsPass redundant.
  • Better enforcing the constraints of the intrinsics (see https://reviews.llvm.org/D60260 ).
  • I think we'll also want to be able to overload the result operand based on the vector element type for the intrinsics having the constraint that the result type must match the vector element type. e.g. dropping the redundant 'i32' in:
      
    i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32> %a) => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> %a)
since i32 is implied by <4 x i32>. This would have the added benefit that LLVM would automatically check for the operands to match.

Won't this cause issues with overflow? Isn't the point  of an add (or mul....) reduction of say, <64 x i8> giving a larger (i32 or i64) result so we don't lose anything? I agree for bitop reductions it doesn't make sense though.

Sorry - I forgot to add: which asks the question - should we be considering signed/unsigned add/mul and possibly saturation reductions?
The current intrinsics explicitly specify that:
   "The return type matches the element-type of the vector input"

This was done to avoid having explicit signed/unsigned add reductions, reasoning that zero- and sign-extension can be done on the input values to the reduction. We had a bit of debate on this internally, and it comes down to similar reasons as for the extra 'start value' operand to fadd reductions. I think we'd welcome the signed/unsigned variants, as they would be more descriptive and would safeguard the code from transformations that make it difficult to fold the sign/zero extend into the operation during CodeGen. The downside, however, is that for signed/unsigned add reductions both operations are the same when the result type equals the element type.
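To make the sign/zero-extension point concrete, here is a small Python model (bit tricks standing in for sext/zext; names invented here):

```python
def sext_i8(b):
    # Sign-extend an i8 bit pattern: interpret the high bit as negative.
    return b - 256 if b & 0x80 else b

def zext_i8(b):
    # Zero-extend: the bit pattern read as an unsigned value.
    return b & 0xFF

vec = [0xFF, 0x01, 0x7F]   # i8 bit patterns

# Widen first, then a single sign-agnostic add reduction suffices:
print(sum(sext_i8(b) for b in vec))   # 127  (-1 + 1 + 127)
print(sum(zext_i8(b) for b in vec))   # 383  (255 + 1 + 127)
```

The extension step, not the reduction itself, carries the signedness; the question is whether keeping it foldable through CodeGen justifies dedicated signed/unsigned intrinsics.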

Saturating vector reductions sound sensible, but are there any targets that implement these at the moment?


Re: [llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

On 05/04/2019 16:26, Sander De Smalen wrote:
Hi Simon,

Thanks for your feedback! See my comments inline.

On 5 Apr 2019, at 09:47, Simon Pilgrim via llvm-dev <[hidden email]> wrote:


On 05/04/2019 09:37, Simon Pilgrim via llvm-dev wrote:
On 04/04/2019 14:11, Sander De Smalen wrote:
Proposed change:
----------------------------
In this RFC I propose changing the intrinsics for llvm.experimental.vector.reduce.fadd and llvm.experimental.vector.reduce.fmul (see options A and B). I also propose renaming the 'accumulator' operand to 'start value' because for fmul this is the start value of the reduction, rather than a value to which the fmul reduction is accumulated into.
 
[Option A] Always using the start value operand in the reduction (https://reviews.llvm.org/D60261)
 
  declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
 
This means that if the start value is 'undef', the result will be undef and all code creating such a reduction will need to ensure it has a sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When using 'fast' or ‘reassoc’ on the call it will be implemented using an unordered reduction, otherwise it will be implemented with an ordered reduction. Note that a new intrinsic is required to capture the new semantics. In this proposal the intrinsic is prefixed with a 'v2' for the time being, with the expectation this will be dropped when we remove 'experimental' from the reduction intrinsics in the future.
 
[Option B] Having separate ordered and unordered intrinsics (https://reviews.llvm.org/D60262).
 
  declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
  declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)
 
This will mean that the behaviour is explicit from the intrinsic and the use of 'fast' or ‘reassoc’ on the call has no effect on how that intrinsic is lowered. The ordered reduction intrinsic will take a scalar start-value operand, where the unordered reduction intrinsic will only take a vector operand.
 
Both options auto-upgrade the IR to use the new (version of the) intrinsics. I'm personally slightly in favour of [Option B], because it better aligns with the definition of the SelectionDAG nodes and is more explicit in its semantics. We also avoid having to use an artificial 'v2' like prefix to denote the new behaviour of the intrinsic.
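To make the ordered/unordered distinction concrete, here is a small Python model (illustrative only, not LLVM IR or actual codegen): an ordered reduction accumulates strictly left to right from the start value, while an unordered ('fast'/'reassoc') lowering is free to pick any association order, such as a pairwise tree. With floating point, the two can give different results:

```python
def fadd_reduce_ordered(start, vec):
    # Model of an ordered (strict) fadd reduction: strictly left to right,
    # starting from the explicit start value.
    acc = start
    for x in vec:
        acc = acc + x
    return acc

def fadd_reduce_unordered(vec):
    # Model of an unordered reduction: here a pairwise tree, one of the many
    # association orders a 'fast'/'reassoc' lowering is free to pick.
    if len(vec) == 1:
        return vec[0]
    mid = len(vec) // 2
    return fadd_reduce_unordered(vec[:mid]) + fadd_reduce_unordered(vec[mid:])

v = [1e16, 1.0, -1e16, 1.0]
print(fadd_reduce_ordered(0.0, v))  # prints 1.0
print(fadd_reduce_unordered(v))     # prints 0.0 -- a different association order
```

The divergence comes from rounding: 1e16 + 1.0 rounds back to 1e16 in double precision, so the order in which the cancelling terms meet changes the answer.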

Do we have any targets with instructions that can actually use the start value? TBH I'd be tempted to suggest we just make the initial extractelement/fadd/insertelement pattern a manual extra stage and avoid having that argument entirely.

ARM SVE has the FADDA instruction for strict fadd reductions (see for example test/MC/AArch64/SVE/fadda.s). This instruction takes an explicit start-value operand. The reduction intrinsics were originally introduced for SVE where we modelled the fadd/fmul reductions with this instruction in mind.

Just to clarify, is this what you are suggesting regarding extract/fadd/insert?

  %first = extractelement <4 x float> %input, i32 0
  %first.new = fadd float %start, %first
  %input.new = insertelement <4 x float> %input, float %first.new, i32 0
  %red = call float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(<4 x float> %input.new)

My only reservation here is that LLVM might obfuscate this code so that CodeGen couldn't easily match the extract/fadd/insert pattern, and would thus emit an extra fadd instruction. This could happen, for example, if the loop were rotated/pipelined so that the next iteration's values are loaded and the first 'fadd' is done before the next iteration.
In such a case, having the extra operand would be more descriptive.
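The equivalence between the two forms can be sketched in plain Python (illustrative only, hypothetical helper names): folding the start value into lane 0 and then running a start-less ordered reduction performs exactly the same sequence of fadds as the explicit start-value form, so the results are bit-identical whenever the pattern survives to CodeGen intact:

```python
def reduce_ordered(start, vec):
    # Ordered reduction with an explicit start value:
    # (((start + v0) + v1) + ...), strictly left to right.
    acc = start
    for x in vec:
        acc = acc + x
    return acc

def reduce_ordered_no_start(vec):
    # Start-less ordered reduction: lane 0 seeds the accumulator.
    acc = vec[0]
    for x in vec[1:]:
        acc = acc + x
    return acc

start, v = 2.5, [1.0, 2.0, 3.0, 4.0]
folded = [start + v[0]] + v[1:]  # models the extract/fadd/insert pattern
# Same fadds in the same order, so the results match exactly.
assert reduce_ordered(start, v) == reduce_ordered_no_start(folded)
```

The risk described above is precisely that optimizations may rewrite the code so that the `folded` vector is no longer recognizable as "start value fadd'ed into lane 0".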

Yes, that was the IR I had in mind, but you're right in that it's probably useful for chained fadd reductions as well as the SVE-specific instruction. If we're getting rid of the fast-math 'undef' special case and we expect an 'identity' start value (fadd = 0.0f, fmul = 1.0f) that we can optimize away, then I've no objections.

Further efforts:
----------------------------
Here is a non-exhaustive list of items that I think work towards making the intrinsics non-experimental:
  • Adding SelectionDAG legalization for the _STRICT reduction SDNodes. After some great work from Nikita in D58015, unordered reductions are now legalized/expanded in SelectionDAG, so if we add expansion in SelectionDAG for strict reductions as well, this would make the ExpandReductionsPass redundant.
  • Better enforcing the constraints of the intrinsics (see https://reviews.llvm.org/D60260 ).
  • I think we'll also want to be able to overload the result operand based on the vector element type for those intrinsics that have the constraint that the result type must match the vector element type, e.g. dropping the redundant 'i32' in:

    i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32> %a) => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> %a)
since i32 is implied by <4 x i32>. This would have the added benefit that LLVM would automatically check that the operands match.

Won't this cause issues with overflow? Isn't the point of an add (or mul, ...) reduction of, say, <64 x i8> giving a larger (i32 or i64) result so we don't lose anything? I agree it doesn't make sense for bitop reductions, though.

Sorry - I forgot to add: which raises the question - should we be considering signed/unsigned add/mul and possibly saturating reductions?
The current intrinsics explicitly specify that:
   "The return type matches the element-type of the vector input"

This was done to avoid having explicit signed/unsigned add reductions, reasoning that zero- and sign-extension can be done on the input values to the reduction. We had a bit of debate on this internally, and it would come down to similar reasons as for the extra 'start value' operand to fadd reductions. I think we'd welcome the signed/unsigned variants as they would be more descriptive and would safeguard the code from transformations that make it difficult to fold the sign/zero extend into the operation during CodeGen. The downside however is that for signed/unsigned add reductions it would mean that both operations are the same when the result type equals the element type.

An alternative would be that we limit the existing add/mul cases to the same result type (along with and/or/xor/smax/smin/umax/umin) and we add sadd/uadd/smul/umul extending reductions as well.

Saturating vector reductions sound sensible, but are there any targets that implement these at the moment?
X86/SSE has the v8i16 HADDS/HSUBS horizontal signed saturation instructions, and X86/XOP has extend+horizontal-add/sub instructions (https://en.wikipedia.org/wiki/XOP_instruction_set).


Re: [llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Chris Lattner via llvm-dev
Hi,

On 4/5/19 10:47 AM, Simon Pilgrim via llvm-dev wrote:
On 05/04/2019 09:37, Simon Pilgrim via llvm-dev wrote:
On 04/04/2019 14:11, Sander De Smalen wrote:
Proposed change:

----------------------------

In this RFC I propose changing the intrinsics for llvm.experimental.vector.reduce.fadd and llvm.experimental.vector.reduce.fmul (see options A and B). I also propose renaming the 'accumulator' operand to 'start value', because for fmul this is the start value of the reduction rather than a value into which the fmul reduction is accumulated.

Note that the LLVM-VP proposal also changes the way reductions are handled in IR (https://reviews.llvm.org/D57504). This could be an opportunity to avoid the "v2" suffix issue: LLVM-VP moves the intrinsic to the "llvm.vp.*" namespace and we can fix the reduction semantics in the process.

Btw, if you are at EuroLLVM. There is a BoF at 2pm today on LLVM-VP.

 

[Option A] Always using the start value operand in the reduction (https://reviews.llvm.org/D60261)

 

  declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %vec)

 

This means that if the start value is 'undef', the result will be undef and all code creating such a reduction will need to ensure it has a sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When using 'fast' or ‘reassoc’ on the call it will be implemented using an unordered reduction, otherwise it will be implemented with an ordered reduction. Note that a new intrinsic is required to capture the new semantics. In this proposal the intrinsic is prefixed with a 'v2' for the time being, with the expectation this will be dropped when we remove 'experimental' from the reduction intrinsics in the future.

 

[Option B] Having separate ordered and unordered intrinsics (https://reviews.llvm.org/D60262).

 

  declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value, <4 x float> %vec)

  declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)

 

This will mean that the behaviour is explicit from the intrinsic and the use of 'fast' or ‘reassoc’ on the call has no effect on how that intrinsic is lowered. The ordered reduction intrinsic will take a scalar start-value operand, where the unordered reduction intrinsic will only take a vector operand.

 

Both options auto-upgrade the IR to use the new (version of the) intrinsics. I'm personally slightly in favour of [Option B], because it better aligns with the definition of the SelectionDAG nodes and is more explicit in its semantics. We also avoid having to use an artificial 'v2' like prefix to denote the new behaviour of the intrinsic.

Do we have any targets with instructions that can actually use the start value? TBH I'd be tempted to suggest we just make the initial extractelement/fadd/insertelement pattern a manual extra stage and avoid having that argument entirely.

NEC SX-Aurora has reduction instructions that take in a start value in a scalar register. We are hoping to upstream the backend: http://lists.llvm.org/pipermail/llvm-dev/2019-April/131580.html

Further efforts:

----------------------------

Here is a non-exhaustive list of items I think work towards making the intrinsics non-experimental:

  • Adding SelectionDAG legalization for the  _STRICT reduction SDNodes. After some great work from Nikita in D58015, unordered reductions are now legalized/expanded in SelectionDAG, so if we add expansion in SelectionDAG for strict reductions this would make the ExpandReductionsPass redundant.
  • Better enforcing the constraints of the intrinsics (see https://reviews.llvm.org/D60260 ).
  • I think we'll also want to be able to overload the result operand based on the vector element type for the intrinsics having the constraint that the result type must match the vector element type. e.g. dropping the redundant 'i32' in:
     
    i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32> %a) => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> %a)

since i32 is implied by <4 x i32>. This would have the added benefit that LLVM would automatically check for the operands to match.

Won't this cause issues with overflow? Isn't the point of an add (or mul, ...) reduction of, say, <64 x i8> giving a larger (i32 or i64) result so we don't lose anything? I agree it doesn't make sense for bitop reductions, though.

Sorry - I forgot to add: which raises the question - should we be considering signed/unsigned add/mul and possibly saturating reductions?

-- 

Simon Moll
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : [hidden email]
Fax. +49 (0)681 302-3065  : http://compilers.cs.uni-saarland.de/people/moll


Re: [llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Chris Lattner via llvm-dev

> On 7 Apr 2019, at 14:56, Simon Pilgrim via llvm-dev <[hidden email]> wrote:
>
> On 05/04/2019 16:26, Sander De Smalen wrote:
>> Hi Simon,
>>
>> Thanks for your feedback! See my comments inline.
>>
>>> On 5 Apr 2019, at 09:47, Simon Pilgrim via llvm-dev <[hidden email]> wrote:
>>>
>>>
>>> On 05/04/2019 09:37, Simon Pilgrim via llvm-dev wrote:
>>>> On 04/04/2019 14:11, Sander De Smalen wrote:
>>>>> Proposed change:
>>>>> ----------------------------
>>>>> In this RFC I propose changing the intrinsics for llvm.experimental.vector.reduce.fadd and llvm.experimental.vector.reduce.fmul (see options A and B). I also propose renaming the 'accumulator' operand to 'start value', because for fmul this is the start value of the reduction rather than a value into which the fmul reduction is accumulated.
>>>>>  
>>>>> [Option A] Always using the start value operand in the reduction (https://reviews.llvm.org/D60261)
>>>>>  
>>>>>   declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>>>  
>>>>> This means that if the start value is 'undef', the result will be undef and all code creating such a reduction will need to ensure it has a sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When using 'fast' or ‘reassoc’ on the call it will be implemented using an unordered reduction, otherwise it will be implemented with an ordered reduction. Note that a new intrinsic is required to capture the new semantics. In this proposal the intrinsic is prefixed with a 'v2' for the time being, with the expectation this will be dropped when we remove 'experimental' from the reduction intrinsics in the future.
>>>>>  
>>>>> [Option B] Having separate ordered and unordered intrinsics (https://reviews.llvm.org/D60262).
>>>>>  
>>>>>   declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>>>   declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)
>>>>>  
>>>>> This will mean that the behaviour is explicit from the intrinsic and the use of 'fast' or ‘reassoc’ on the call has no effect on how that intrinsic is lowered. The ordered reduction intrinsic will take a scalar start-value operand, where the unordered reduction intrinsic will only take a vector operand.
>>>>>  
>>>>> Both options auto-upgrade the IR to use the new (version of the) intrinsics. I'm personally slightly in favour of [Option B], because it better aligns with the definition of the SelectionDAG nodes and is more explicit in its semantics. We also avoid having to use an artificial 'v2' like prefix to denote the new behaviour of the intrinsic.
>>>> Do we have any targets with instructions that can actually use the start value? TBH I'd be tempted to suggest we just make the initial extractelement/fadd/insertelement pattern a manual extra stage and avoid having that argument entirely.
>>>>
>> ARM SVE has the FADDA instruction for strict fadd reductions (see for example test/MC/AArch64/SVE/fadda.s). This instruction takes an explicit start-value operand. The reduction intrinsics were originally introduced for SVE where we modelled the fadd/fmul reductions with this instruction in mind.
>>
>> Just to clarify, is this what you are suggesting regarding extract/fadd/insert?
>>
>>   %first = extractelement <4 x float> %input, i32 0
>>   %first.new = fadd float %start, %first
>>   %input.new = insertelement <4 x float> %input, float %first.new, i32 0
>>   %red = call float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(<4 x float> %input.new)
>>
>> My only reservation here is that LLVM might obfuscate this code so that CodeGen couldn't easily match the extract/fadd/insert pattern, and would thus emit an extra fadd instruction. This could happen, for example, if the loop were rotated/pipelined so that the next iteration's values are loaded and the first 'fadd' is done before the next iteration. In such a case, having the extra operand would be more descriptive.
> Yes, that was the IR I had in mind, but you're right in that it's probably useful for chained fadd reductions as well as the SVE-specific instruction. If we're getting rid of the fast-math 'undef' special case and we expect an 'identity' start value (fadd = 0.0f, fmul = 1.0f) that we can optimize away, then I've no objections.
Correct. A common use-case for these reduction intrinsics is to work inside vectorized loops, where chaining happens through the reduction value's PHI node (i.e. the scalar reduction value from one iteration will be the input to the next vector iteration). Indeed, if this intrinsic is scalarized, the identity value can be optimized away.
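That chaining pattern can be sketched in plain Python (illustrative only, hypothetical helper name): the scalar reduction value plays the role of the PHI node, carrying the ordered partial sum out of one vector iteration and into the next, where it becomes the start value of the next reduction call:

```python
def ordered_fadd_reduce(start, vec):
    # Model of the ordered reduction intrinsic with an explicit start value.
    acc = start
    for x in vec:
        acc = acc + x
    return acc

data = [float(i) for i in range(16)]
VF = 4  # vector width of the (hypothetical) vectorized loop

# 'red' models the PHI node: the reduction result of one vector iteration
# feeds in as the start value of the next.
red = 0.0
for i in range(0, len(data), VF):
    red = ordered_fadd_reduce(red, data[i:i + VF])

print(red)  # prints 120.0, the strictly ordered sum of 0..15
```

Because each call consumes the previous result as its start value, the overall evaluation order across the whole loop remains strictly sequential, which is what the ordered semantics require.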

>
>>>>> Further efforts:
>>>>> ----------------------------
>>>>> Here is a non-exhaustive list of items I think work towards making the intrinsics non-experimental:
>>>>>
>>>>> • Adding SelectionDAG legalization for the  _STRICT reduction SDNodes. After some great work from Nikita in D58015, unordered reductions are now legalized/expanded in SelectionDAG, so if we add expansion in SelectionDAG for strict reductions this would make the ExpandReductionsPass redundant.
>>>>> • Better enforcing the constraints of the intrinsics (see https://reviews.llvm.org/D60260 ).
>>>>>
>>>>> • I think we'll also want to be able to overload the result operand based on the vector element type for the intrinsics having the constraint that the result type must match the vector element type. e.g. dropping the redundant 'i32' in:
>>>>>   i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32> %a) => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> %a)
>>>>> since i32 is implied by <4 x i32>. This would have the added benefit that LLVM would automatically check for the operands to match.
>>>>>
>>>> Won't this cause issues with overflow? Isn't the point of an add (or mul, ...) reduction of, say, <64 x i8> giving a larger (i32 or i64) result so we don't lose anything? I agree it doesn't make sense for bitop reductions, though.
>>>>
>>> Sorry - I forgot to add: which raises the question - should we be considering signed/unsigned add/mul and possibly saturating reductions?
>> The current intrinsics explicitly specify that:
>>    "The return type matches the element-type of the vector input"
>>
>> This was done to avoid having explicit signed/unsigned add reductions, reasoning that zero- and sign-extension can be done on the input values to the reduction. We had a bit of debate on this internally, and it would come down to similar reasons as for the extra 'start value' operand to fadd reductions. I think we'd welcome the signed/unsigned variants as they would be more descriptive and would safeguard the code from transformations that make it difficult to fold the sign/zero extend into the operation during CodeGen. The downside however is that for signed/unsigned add reductions it would mean that both operations are the same when the result type equals the element type.
> An alternative would be that we limit the existing add/mul cases to the same result type (along with and/or/xor/smax/smin/umax/umin) and we add sadd/uadd/smul/umul extending reductions as well.
If we add signed/unsigned variants as separate intrinsics alongside the existing 'add' and 'mul', then we have the benefit that LLVM can canonicalise to use the existing intrinsics for cases where the result- and element types are the same. The intrinsics would then be:

  i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> %input)      ; reduce elements into i32 result
  i64 @llvm.experimental.vector.reduce.sadd.i64.v4i32(<4 x i32> %input) ; sign extend input elements and reduce into i64 result
  i64 @llvm.experimental.vector.reduce.uadd.i64.v4i32(<4 x i32> %input) ; zero extend input elements and reduce into i64 result

(note that for the first intrinsic, I've left out the extra '.i32' to emphasise the constraint that the element type must match the result type, although the future work to overload this operand is mentioned above)
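A small Python model of the proposed extending reductions (illustrative only, interpreting the same raw bytes both ways) shows why signedness must be part of the operation once the result type is wider than the element type:

```python
def sext_i8(b):
    # Interpret a raw byte (0..255) as a signed i8 value (-128..127).
    return b - 256 if b >= 128 else b

def reduce_sadd_i64_from_i8(raw_bytes):
    # Model of a sign-extending add reduction into a wide result.
    return sum(sext_i8(b) for b in raw_bytes)

def reduce_uadd_i64_from_i8(raw_bytes):
    # Model of a zero-extending add reduction into a wide result.
    return sum(raw_bytes)

raw = [0xFF, 0x01, 0x80, 0x7F]  # same bytes, two interpretations
print(reduce_sadd_i64_from_i8(raw))  # prints -1   (-1 + 1 + -128 + 127)
print(reduce_uadd_i64_from_i8(raw))  # prints 511  (255 + 1 + 128 + 127)
```

When the result type equals the element type, both interpretations produce the same wrapped result, which is why the generic 'add' suffices for that case and can serve as the canonical form.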

>
>> Saturating vector reductions sound sensible, but are there any targets that implement these at the moment?
> X86/SSE has the v8i16 HADDS/HSUBS horizontal signed saturation instructions, and X86/XOP has extend+horizontal-add/sub instructions (https://en.wikipedia.org/wiki/XOP_instruction_set).
Since these intrinsics would be ordered, I guess they would also require a start value for the same reasons as the ordered fadd/fmul reductions. The signedness of the operation matters, so there is little benefit in keeping a generic 'saturating.add', leaving us with:

 i64 @llvm.experimental.vector.reduce.saturating.sadd.i64.v4i32(i64 %start, <4 x i32> %input)
          ; sign extend input elements and ordered signed reduction into i64 result, saturating to min_val(i64) or max_val(i64)

 i64 @llvm.experimental.vector.reduce.saturating.uadd.i64.v4i32(i64 %start, <4 x i32> %input)
          ; zero extend input elements and ordered unsigned reduction into i64 result, saturating to max_val(i64)
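A Python sketch of the saturating signed variant (illustrative only, not tied to any target instruction) makes the semantics concrete, and also shows why such reductions are naturally ordered: saturation makes integer addition non-associative, since a partial sum that clamps cannot later "un-clamp":

```python
I64_MIN, I64_MAX = -(1 << 63), (1 << 63) - 1

def sat_i64(v):
    # Clamp to the signed 64-bit range instead of wrapping.
    return max(I64_MIN, min(I64_MAX, v))

def reduce_saturating_sadd(start, vec):
    # Model of an ordered signed saturating add reduction: every partial
    # sum saturates to [I64_MIN, I64_MAX] rather than wrapping around.
    acc = start
    for x in vec:
        acc = sat_i64(acc + x)
    return acc

big = (1 << 31) - 1  # an INT32_MAX element
# The first add overflows and clamps to I64_MAX; the -5 then pulls the
# result down, which a wrapping (or reordered) reduction would not match.
print(reduce_saturating_sadd(I64_MAX - 1, [big, -5]))  # prints I64_MAX - 5
```

Reordering the two elements here would give I64_MAX - 1 + big - 5 clamped differently, so, unlike plain wrapping add, the element order is observable.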


Re: [llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Chris Lattner via llvm-dev

> On 8 Apr 2019, at 11:37, Simon Moll <[hidden email]> wrote:
>
> Hi,
>
> On 4/5/19 10:47 AM, Simon Pilgrim via llvm-dev wrote:
>> On 05/04/2019 09:37, Simon Pilgrim via llvm-dev wrote:
>>> On 04/04/2019 14:11, Sander De Smalen wrote:
>>>> Proposed change:
>>>> ----------------------------
>>>> In this RFC I propose changing the intrinsics for llvm.experimental.vector.reduce.fadd and llvm.experimental.vector.reduce.fmul (see options A and B). I also propose renaming the 'accumulator' operand to 'start value', because for fmul this is the start value of the reduction rather than a value into which the fmul reduction is accumulated.
> Note that the LLVM-VP proposal also changes the way reductions are handled in IR (https://reviews.llvm.org/D57504). This could be an opportunity to avoid the "v2" suffix issue: LLVM-VP moves the intrinsic to the "llvm.vp.*" namespace and we can fix the reduction semantics in the process.
Thanks for pointing that out, Simon. I think for now we should keep this proposal separate from LLVM-VP, as they serve different purposes and have different scope. But yes, we can easily rename the intrinsics again when the VP proposal lands.

>
> Btw, if you are at EuroLLVM. There is a BoF at 2pm today on LLVM-VP.
>
>>>>  
>>>> [Option A] Always using the start value operand in the reduction (https://reviews.llvm.org/D60261)
>>>>  
>>>>   declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>>  
>>>> This means that if the start value is 'undef', the result will be undef and all code creating such a reduction will need to ensure it has a sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When using 'fast' or ‘reassoc’ on the call it will be implemented using an unordered reduction, otherwise it will be implemented with an ordered reduction. Note that a new intrinsic is required to capture the new semantics. In this proposal the intrinsic is prefixed with a 'v2' for the time being, with the expectation this will be dropped when we remove 'experimental' from the reduction intrinsics in the future.
>>>>  
>>>> [Option B] Having separate ordered and unordered intrinsics (https://reviews.llvm.org/D60262).
>>>>  
>>>>   declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>>   declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)
>>>>  
>>>> This will mean that the behaviour is explicit from the intrinsic and the use of 'fast' or ‘reassoc’ on the call has no effect on how that intrinsic is lowered. The ordered reduction intrinsic will take a scalar start-value operand, where the unordered reduction intrinsic will only take a vector operand.
>>>>  
>>>> Both options auto-upgrade the IR to use the new (version of the) intrinsics. I'm personally slightly in favour of [Option B], because it better aligns with the definition of the SelectionDAG nodes and is more explicit in its semantics. We also avoid having to use an artificial 'v2' like prefix to denote the new behaviour of the intrinsic.
>>> Do we have any targets with instructions that can actually use the start value? TBH I'd be tempted to suggest we just make the initial extractelement/fadd/insertelement pattern a manual extra stage and avoid having that argument entirely.
>>>
> NEC SX-Aurora has reduction instructions that take in a start value in a scalar register. We are hoping to upstream the backend: http://lists.llvm.org/pipermail/llvm-dev/2019-April/131580.html
Great. Combined with the case for chaining ordered reductions (often inside vectorized loops) and two architectures (ARM SVE and SX-Aurora) taking a scalar start register, I think this is enough of an argument to keep the explicit operand for the ordered reductions.

>>>
>>>> Further efforts:
>>>> ----------------------------
>>>> Here is a non-exhaustive list of items I think work towards making the intrinsics non-experimental:
>>>>
>>>> • Adding SelectionDAG legalization for the  _STRICT reduction SDNodes. After some great work from Nikita in D58015, unordered reductions are now legalized/expanded in SelectionDAG, so if we add expansion in SelectionDAG for strict reductions this would make the ExpandReductionsPass redundant.
>>>> • Better enforcing the constraints of the intrinsics (see https://reviews.llvm.org/D60260 ).
>>>>
>>>> • I think we'll also want to be able to overload the result operand based on the vector element type for the intrinsics having the constraint that the result type must match the vector element type. e.g. dropping the redundant 'i32' in:
>>>>   i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32> %a) => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> %a)
>>>> since i32 is implied by <4 x i32>. This would have the added benefit that LLVM would automatically check for the operands to match.
>>>>
>>> Won't this cause issues with overflow? Isn't the point of an add (or mul, ...) reduction of, say, <64 x i8> giving a larger (i32 or i64) result so we don't lose anything? I agree it doesn't make sense for bitop reductions, though.
>>>
>> Sorry - I forgot to add: which raises the question - should we be considering signed/unsigned add/mul and possibly saturating reductions?
>>
>>


Re: [llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Chris Lattner via llvm-dev
I’m fine with the direction this is going, but let’s keep renaming to a minimum. They’ve been experimental long enough now that we should be able to jump straight to a final form after all the feedback.

Amara

On Apr 10, 2019, at 5:59 AM, Sander De Smalen via llvm-dev <[hidden email]> wrote:


On 8 Apr 2019, at 11:37, Simon Moll <[hidden email]> wrote:

Hi,

On 4/5/19 10:47 AM, Simon Pilgrim via llvm-dev wrote:
On 05/04/2019 09:37, Simon Pilgrim via llvm-dev wrote:
On 04/04/2019 14:11, Sander De Smalen wrote:
Proposed change:
----------------------------
In this RFC I propose changing the intrinsics for llvm.experimental.vector.reduce.fadd and llvm.experimental.vector.reduce.fmul (see options A and B). I also propose renaming the 'accumulator' operand to 'start value', because for fmul this is the start value of the reduction rather than a value into which the fmul reduction is accumulated.
Note that the LLVM-VP proposal also changes the way reductions are handled in IR (https://reviews.llvm.org/D57504). This could be an opportunity to avoid the "v2" suffix issue: LLVM-VP moves the intrinsic to the "llvm.vp.*" namespace and we can fix the reduction semantics in the process.
Thanks for pointing that out, Simon. I think for now we should keep this proposal separate from LLVM-VP, as they serve different purposes and have different scope. But yes, we can easily rename the intrinsics again when the VP proposal lands.


Btw, if you are at EuroLLVM. There is a BoF at 2pm today on LLVM-VP.


[Option A] Always using the start value operand in the reduction (https://reviews.llvm.org/D60261)

 declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %vec)

This means that if the start value is 'undef', the result will be undef and all code creating such a reduction will need to ensure it has a sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When using 'fast' or ‘reassoc’ on the call it will be implemented using an unordered reduction, otherwise it will be implemented with an ordered reduction. Note that a new intrinsic is required to capture the new semantics. In this proposal the intrinsic is prefixed with a 'v2' for the time being, with the expectation this will be dropped when we remove 'experimental' from the reduction intrinsics in the future.

[Option B] Having separate ordered and unordered intrinsics (https://reviews.llvm.org/D60262).

 declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
 declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)

This will mean that the behaviour is explicit from the intrinsic and the use of 'fast' or ‘reassoc’ on the call has no effect on how that intrinsic is lowered. The ordered reduction intrinsic will take a scalar start-value operand, where the unordered reduction intrinsic will only take a vector operand.

Both options auto-upgrade the IR to use the new (version of the) intrinsics. I'm personally slightly in favour of [Option B], because it better aligns with the definition of the SelectionDAG nodes and is more explicit in its semantics. We also avoid having to use an artificial 'v2' like prefix to denote the new behaviour of the intrinsic.
Do we have any targets with instructions that can actually use the start value? TBH I'd be tempted to suggest we just make the initial extractelement/fadd/insertelement pattern a manual extra stage and avoid having that argument entirely.

NEC SX-Aurora has reduction instructions that take in a start value in a scalar register. We are hoping to upstream the backend: http://lists.llvm.org/pipermail/llvm-dev/2019-April/131580.html
Great. Combined with the case for chaining ordered reductions (often inside vectorized loops) and two architectures (ARM SVE and SX-Aurora) taking a scalar start register, I think this is enough of an argument to keep the explicit operand for the ordered reductions.


Further efforts:
----------------------------
Here is a non-exhaustive list of items I think work towards making the intrinsics non-experimental:

• Adding SelectionDAG legalization for the  _STRICT reduction SDNodes. After some great work from Nikita in D58015, unordered reductions are now legalized/expanded in SelectionDAG, so if we add expansion in SelectionDAG for strict reductions this would make the ExpandReductionsPass redundant.
• Better enforcing the constraints of the intrinsics (see https://reviews.llvm.org/D60260 ).

• I think we'll also want to be able to overload the result operand based on the vector element type for the intrinsics having the constraint that the result type must match the vector element type. e.g. dropping the redundant 'i32' in:
 i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32> %a) => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> %a)
since i32 is implied by <4 x i32>. This would have the added benefit that LLVM would automatically check for the operands to match.

Won't this cause issues with overflow? Isn't the point of an add (or mul, ...) reduction of, say, <64 x i8> giving a larger (i32 or i64) result so we don't lose anything? I agree it doesn't make sense for bitop reductions, though.

Sorry - I forgot to add: which raises the question - should we be considering signed/unsigned add/mul and possibly saturating reductions?




Re: [llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Chris Lattner via llvm-dev
Hello again,

I've been meaning to follow up on this thread for the last couple of weeks; my apologies for the delay.

To summarise the feedback on the proposal for vector.reduce.fadd/fmul:

There seems to be consensus to keep the explicit start value to better accommodate chained reductions (as opposed to generating IR that performs the reduction of the first element using an extract/fadd/insert pattern). An important use-case for these reductions is inside vectorized loops, where chaining happens through the reduction value's PHI node (i.e. the scalar reduction value from one iteration is the input to the next iteration). This intrinsic would also naturally match the reduction instructions of ARM SVE and NEC SX-Aurora.

For Option A (https://reviews.llvm.org/D60261), there is an argument that code creating or operating on these intrinsics can treat ordered and unordered reductions the same (in that they have the same arguments). Fast-math flags determine whether or not the intrinsic needs to be evaluated in strict order. Codegen for non-strict reductions should be able to fold away the identity value.

For Option B (https://reviews.llvm.org/D60262), David made the argument that making the reduction-order explicit (as opposed to deducing this from fast-math flags) would ensure the ordering is always as expected, even when FMF on the call sites are dropped for some reason.



Is it correct that I sensed a slight preference for Option A, i.e. renaming the intrinsics and keeping the same signature, but dropping the special-cased behaviour for the identity value with non-strict reductions? Regarding David's argument: although the extra expressiveness would be nice to have, LLVM normally depends on FMF being propagated correctly to produce faster code, so this should also be sufficient for reductions.

If we go for Option A, I suggest we drop the 'experimental' prefix from experimental.vector.reduce.fadd/fmul to avoid having to add an awkward '.v2' suffix to the new intrinsic. Once we implement all the suggestions from this proposal (possibly including the one mentioned below), I can't think of other features we would want to add, other than predication (which would be covered by the LLVM-VP proposal and thus require another renaming) or possibly 'constrained' variants, which I assume would be separate intrinsics. So we might as well drop the 'experimental' prefix.

Finally, do we want to remove the restriction that the result type must always match the vector-element type? A wider result type would then allow the reduction to be performed in the wider type.

Thanks,

Sander

> On 10 Apr 2019, at 18:56, Amara Emerson <[hidden email]> wrote:
>
> I’m fine with the direction this is going, but let’s keep renaming to a minimum. They’ve been experimental long enough now that we should be able to now jump to a final form after all the feedback.
>
> Amara
>
>> On Apr 10, 2019, at 5:59 AM, Sander De Smalen via llvm-dev <[hidden email]> wrote:
>>
>>>
>>> On 8 Apr 2019, at 11:37, Simon Moll <[hidden email]> wrote:
>>>
>>> Hi,
>>>
>>> On 4/5/19 10:47 AM, Simon Pilgrim via llvm-dev wrote:
>>>> On 05/04/2019 09:37, Simon Pilgrim via llvm-dev wrote:
>>>>> On 04/04/2019 14:11, Sander De Smalen wrote:
>>>>>> Proposed change:
>>>>>> ----------------------------
>>>>>> In this RFC I propose changing the intrinsics for llvm.experimental.vector.reduce.fadd and llvm.experimental.vector.reduce.fmul (see options A and B). I also propose renaming the 'accumulator' operand to 'start value' because for fmul this is the start value of the reduction, rather than a value to which the fmul reduction is accumulated into.
>>> Note that the LLVM-VP proposal also changes the way reductions are handled in IR (https://reviews.llvm.org/D57504). This could be an opportunity to avoid the "v2" suffix issue: LLVM-VP moves the intrinsic to the "llvm.vp.*" namespace and we can fix the reduction semantics in the progress.
>> Thanks for pointing out Simon. I think for now we should keep this proposal separate from LLVM-VP as they serve different purposes and have different scope. But yes we can easily rename the intrinsics again when the VP proposal lands.
>>
>>>
>>> Btw, if you are at EuroLLVM. There is a BoF at 2pm today on LLVM-VP.
>>>
>>>>>>
>>>>>> [Option A] Always using the start value operand in the reduction (https://reviews.llvm.org/D60261)
>>>>>>
>>>>>>  declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>>>>
>>>>>> This means that if the start value is 'undef', the result will be undef and all code creating such a reduction will need to ensure it has a sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When using 'fast' or ‘reassoc’ on the call it will be implemented using an unordered reduction, otherwise it will be implemented with an ordered reduction. Note that a new intrinsic is required to capture the new semantics. In this proposal the intrinsic is prefixed with a 'v2' for the time being, with the expectation this will be dropped when we remove 'experimental' from the reduction intrinsics in the future.
>>>>>>
>>>>>> [Option B] Having separate ordered and unordered intrinsics (https://reviews.llvm.org/D60262).
>>>>>>
>>>>>>  declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>>>>  declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)
>>>>>>
>>>>>> This will mean that the behaviour is explicit from the intrinsic and the use of 'fast' or ‘reassoc’ on the call has no effect on how that intrinsic is lowered. The ordered reduction intrinsic will take a scalar start-value operand, where the unordered reduction intrinsic will only take a vector operand.
>>>>>>
>>>>>> Both options auto-upgrade the IR to use the new (version of the) intrinsics. I'm personally slightly in favour of [Option B], because it better aligns with the definition of the SelectionDAG nodes and is more explicit in its semantics. We also avoid having to use an artificial 'v2' like prefix to denote the new behaviour of the intrinsic.
>>>>> Do we have any targets with instructions that can actually use the start value? TBH I'd be tempted to suggest we just make the initial extractelement/fadd/insertelement pattern a manual extra stage and avoid having having that argument entirely.
>>>>>
>>> NEC SX-Aurora has reduction instructions that take in a start value in a scalar register. We are hoping to upstream the backend: http://lists.llvm.org/pipermail/llvm-dev/2019-April/131580.html
>> Great, I think combined with the argument for chaining of ordered reductions (often inside vectorized loops) and two architectures (ARM SVE and SX-Aurora) taking a scalar start register, this is enough of an argument to keep the explicit operand for the ordered reductions.
>>
>>>>>
>>>>>> Further efforts:
>>>>>> ----------------------------
>>>>>> Here a non-exhaustive list of items I think work towards making the intrinsics non-experimental:
>>>>>>
>>>>>> • Adding SelectionDAG legalization for the  _STRICT reduction SDNodes. After some great work from Nikita in D58015, unordered reductions are now legalized/expanded in SelectionDAG, so if we add expansion in SelectionDAG for strict reductions this would make the ExpandReductionsPass redundant.
>>>>>> • Better enforcing the constraints of the intrinsics (see https://reviews.llvm.org/D60260 ).
>>>>>>
>>>>>> • I think we'll also want to be able to overload the result operand based on the vector element type for the intrinsics having the constraint that the result type must match the vector element type. e.g. dropping the redundant 'i32' in:
>>>>>>  i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32> %a) => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> %a)
>>>>>> since i32 is implied by <4 x i32>. This would have the added benefit that LLVM would automatically check for the operands to match.
>>>>>>
>>>>> Won't this cause issues with overflow? Isn't the point  of an add (or mul....) reduction of say, <64 x i8> giving a larger (i32 or i64) result so we don't lose anything? I agree for bitop reductions it doesn't make sense though.
>>>>>
>>>> Sorry - I forgot to add: which asks the question - should we be considering signed/unsigned add/mul and possibly saturation reductions?
>>>>
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>>
>>>> [hidden email]
>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>> --
>>>
>>> Simon Moll
>>> Researcher / PhD Student
>>>
>>> Compiler Design Lab (Prof. Hack)
>>> Saarland University, Computer Science
>>> Building E1.3, Room 4.31
>>>
>>> Tel. +49 (0)681 302-57521 :
>>> [hidden email]
>>>
>>> Fax. +49 (0)681 302-3065  :
>>> http://compilers.cs.uni-saarland.de/people/moll
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> [hidden email]
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Chris Lattner via llvm-dev
Hello.

Thanks for working on this.

Are we talking only about the floating point versions of these, or the integer ones as well?

For the integer ones, there are a number of new MVE instructions that reduce from, for example, i16s into an i32, or 'long' versions that accumulate into an i64.

For example, the VADDVA.U16 Rda, Qm instruction accumulates into a 32-bit register.
The VADDLVA.U16 RdaLo, RdaHi, Qm instruction accumulates into a pair of 32-bit registers (so a 64-bit value).

Thanks,
Dave

P.S. There are two different people on this thread named "David Greene"/"David Green". Sorry in advance for the confusion.



_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Chris Lattner via llvm-dev
In reply to this post by Chris Lattner via llvm-dev

On 5/16/19 5:53 AM, Sander De Smalen via llvm-dev wrote:

> Hello again,
>
> I've been meaning to follow up on this thread for the last couple of weeks, my apologies for the delay.
>
> To summarise the feedback on the proposal for vector.reduce.fadd/fmul:
>
> There seems to be consensus to keep the explicit start value to better accommodate chained reductions (as opposed to generating IR that performs the reduction of the first element using extract/fadd/insert pattern). An important use-case for these reductions is to work inside vectorized loops, where chaining happens through the reduction value's PHI node (i.e. the scalar reduction value from one iteration will be the input to the next iteration). This intrinsic would also naturally match reduction instructions of ARM SVE and NEC SX-aurora.
>
> For Option A (https://reviews.llvm.org/D60261), there is an argument that code creating or operating on these intrinsics can treat ordered and unordered reductions the same (in that they have the same arguments). Fast-math flags determine whether or not the intrinsic needs to be evaluated in strict order. Codegen for non-strict reductions should be able to fold away the identity-value.
>
> For Option B (https://reviews.llvm.org/D60262), David made the argument that making the reduction-order explicit (as opposed to deducing this from fast-math flags) would ensure the ordering is always as expected, even when FMF on the call sites are dropped for some reason.

>
>
> Is it correct that I sensed a slight preference for Option A? i.e. Renaming the intrinsics and keeping the same signature, but dropping the special-cased behaviour for the identity-value with non-strict reductions. For David's argument, I think that although the extra expressiveness would be nice to have, LLVM normally depends on the FMF being propagated correctly to produce faster code so this should also be sufficient for reductions.
From your summary, Option A sounds better to me, but I'll defer to those
actively involved in this area.
> If we go for Option A, I suggest we drop the 'experimental' prefix from experimental.vector.reduce.fadd/fmul to avoid having to add an awkward '.v2' suffix to the new intrinsic. When we implement all the suggestions from this proposal (possibly including the one mentioned below), I wouldn't really know what other features we could add other than predication (which would be covered by the LLVM-VP proposal and thus require another renaming), or possibly adding 'constrained' variants which I assume would have separate intrinsics. So we might as well drop the 'experimental' prefix.
I'd suggest separating out the 'experimental' renaming proposal. You don't
need it right now; you can simply change the IR semantics and
forward-serialize old bitcode if desired. Removing the 'experimental'
prefix carries support obligations, so I would recommend doing that as a
separate proposal if you want to.

_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Chris Lattner via llvm-dev
In reply to this post by Chris Lattner via llvm-dev
I am fine with either option, really.

I think we do want to remove the result type restriction.

I think I'm fine with removing 'experimental', but do we need more time
with the new versions to be sure?

                           -David

Sander De Smalen <[hidden email]> writes:

> Hello again,
>
> I've been meaning to follow up on this thread for the last couple of weeks, my apologies for the delay.
>
> To summarise the feedback on the proposal for vector.reduce.fadd/fmul:
>
> There seems to be consensus to keep the explicit start value to better
> accommodate chained reductions (as opposed to generating IR that
> performs the reduction of the first element using extract/fadd/insert
> pattern). An important use-case for these reductions is to work inside
> vectorized loops, where chaining happens through the reduction value's
> PHI node (i.e. the scalar reduction value from one iteration will be
> the input to the next iteration). This intrinsic would also naturally
> match reduction instructions of ARM SVE and NEC SX-aurora.
>
> For Option A (https://reviews.llvm.org/D60261), there is an argument
> that code creating or operating on these intrinsics can treat ordered
> and unordered reductions the same (in that they have the same
> arguments). Fast-math flags determine whether or not the intrinsic
> needs to be evaluated in strict order. Codegen for non-strict
> reductions should be able to fold away the identity-value.
>
> For Option B (https://reviews.llvm.org/D60262), David made the
> argument that making the reduction-order explicit (as opposed to
> deducing this from fast-math flags) would ensure the ordering is
> always as expected, even when FMF on the call sites are dropped for
> some reason.

>
>
> Is it correct that I sensed a slight preference for Option A?
> i.e. Renaming the intrinsics and keeping the same signature, but
> dropping the special-cased behaviour for the identity-value with
> non-strict reductions. For David's argument, I think that although the
> extra expressiveness would be nice to have, LLVM normally depends on
> the FMF being propagated correctly to produce faster code so this
> should also be sufficient for reductions.
>
> If we go for Option A, I suggest we drop the 'experimental' prefix
> from experimental.vector.reduce.fadd/fmul to avoid having to add an
> awkward '.v2' suffix to the new intrinsic. When we implement all the
> suggestions from this proposal (possibly including the one mentioned
> below), I wouldn't really know what other features we could add other
> than predication (which would be covered by the LLVM-VP proposal and
> thus require another renaming), or possibly adding 'constrained'
> variants which I assume would have separate intrinsics. So we might as
> well drop the 'experimental' prefix.
>
> Finally, do we want to remove the restriction that the result type
> must always match the vector-element type? A wider result type would
> then allow the reduction to be performed in the wider type.
>
> Thanks,
>
> Sander
>
>> On 10 Apr 2019, at 18:56, Amara Emerson <[hidden email]> wrote:
>>
>> I’m fine with the direction this is going, but let’s keep renaming
>> to a minimum. They’ve been experimental long enough now that we
>> should be able to now jump to a final form after all the feedback.
>>
>> Amara
>>
>>> On Apr 10, 2019, at 5:59 AM, Sander De Smalen via llvm-dev <[hidden email]> wrote:
>>>
>>>>
>>>> On 8 Apr 2019, at 11:37, Simon Moll <[hidden email]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 4/5/19 10:47 AM, Simon Pilgrim via llvm-dev wrote:
>>>>> On 05/04/2019 09:37, Simon Pilgrim via llvm-dev wrote:
>>>>>> On 04/04/2019 14:11, Sander De Smalen wrote:
>>>>>>> Proposed change:
>>>>>>> ----------------------------
>>>>>>> In this RFC I propose changing the intrinsics for
>>>>>>> llvm.experimental.vector.reduce.fadd and
>>>>>>> llvm.experimental.vector.reduce.fmul (see options A and B). I
>>>>>>> also propose renaming the 'accumulator' operand to 'start
>>>>>>> value' because for fmul this is the start value of the
>>>>>>> reduction, rather than a value to which the fmul reduction is
>>>>>>> accumulated into.
>>>> Note that the LLVM-VP proposal also changes the way reductions are
>>>> handled in IR (https://reviews.llvm.org/D57504). This could be an
>>>> opportunity to avoid the "v2" suffix issue: LLVM-VP moves the
>>>> intrinsic to the "llvm.vp.*" namespace and we can fix the
>>>> reduction semantics in the process.
>>> Thanks for pointing that out, Simon. I think for now we should keep this
>>> proposal separate from LLVM-VP as they serve different purposes and
>>> have different scope. But yes we can easily rename the intrinsics
>>> again when the VP proposal lands.
>>>
>>>>
>>>> Btw, if you are at EuroLLVM. There is a BoF at 2pm today on LLVM-VP.
>>>>
>>>>>>>
>>>>>>> [Option A] Always using the start value operand in the reduction (https://reviews.llvm.org/D60261)
>>>>>>>
>>>>>>>  declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>>>>>
>>>>>>> This means that if the start value is 'undef', the result will
>>>>>>> be undef and all code creating such a reduction will need to
>>>>>>> ensure it has a sensible start value (e.g. 0.0 for fadd, 1.0
>>>>>>> for fmul). When using 'fast' or ‘reassoc’ on the call it will
>>>>>>> be implemented using an unordered reduction, otherwise it will
>>>>>>> be implemented with an ordered reduction. Note that a new
>>>>>>> intrinsic is required to capture the new semantics. In this
>>>>>>> proposal the intrinsic is prefixed with a 'v2' for the time
>>>>>>> being, with the expectation this will be dropped when we remove
>>>>>>> 'experimental' from the reduction intrinsics in the future.
>>>>>>>
>>>>>>> [Option B] Having separate ordered and unordered intrinsics (https://reviews.llvm.org/D60262).
>>>>>>>
>>>>>>>  declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>>>>>  declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)
>>>>>>>
>>>>>>> This will mean that the behaviour is explicit from the
>>>>>>> intrinsic and the use of 'fast' or ‘reassoc’ on the call has no
>>>>>>> effect on how that intrinsic is lowered. The ordered reduction
>>>>>>> intrinsic will take a scalar start-value operand, where the
>>>>>>> unordered reduction intrinsic will only take a vector operand.
>>>>>>>
>>>>>>> Both options auto-upgrade the IR to use the new (version of
>>>>>>> the) intrinsics. I'm personally slightly in favour of [Option
>>>>>>> B], because it better aligns with the definition of the
>>>>>>> SelectionDAG nodes and is more explicit in its semantics. We
>>>>>>> also avoid having to use an artificial 'v2' like prefix to
>>>>>>> denote the new behaviour of the intrinsic.
>>>>>> Do we have any targets with instructions that can actually use
>>>>>> the start value? TBH I'd be tempted to suggest we just make the
>>>>>> initial extractelement/fadd/insertelement pattern a manual extra
>>>>>> stage and avoid having that argument entirely.
>>>>>>
>>>> NEC SX-Aurora has reduction instructions that take in a start
>>>> value in a scalar register. We are hoping to upstream the backend:
>>>> http://lists.llvm.org/pipermail/llvm-dev/2019-April/131580.html
>>> Great, I think combined with the argument for chaining of ordered
>>> reductions (often inside vectorized loops) and two architectures
>>> (ARM SVE and SX-Aurora) taking a scalar start register, this is
>>> enough of an argument to keep the explicit operand for the ordered
>>> reductions.
>>>
>>>>>>
>>>>>>> Further efforts:
>>>>>>> ----------------------------
>>>>>>> Here is a non-exhaustive list of items that I think work towards making the intrinsics non-experimental:
>>>>>>>
>>>>>>> • Adding SelectionDAG legalization for the _STRICT reduction
>>>>>>> SDNodes. After some great work from Nikita in D58015, unordered
>>>>>>> reductions are now legalized/expanded in SelectionDAG, so if we
>>>>>>> add expansion in SelectionDAG for strict reductions this would
>>>>>>> make the ExpandReductionsPass redundant.
>>>>>>> • Better enforcing the constraints of the intrinsics (see https://reviews.llvm.org/D60260 ).
>>>>>>>
>>>>>>> • I think we'll also want to be able to overload the result
>>>>>>> operand based on the vector element type for the intrinsics
>>>>>>> having the constraint that the result type must match the
>>>>>>> vector element type. e.g. dropping the redundant 'i32' in:
>>>>>>>  i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32>
>>>>>>> %a) => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32>
>>>>>>> %a)
>>>>>>> since i32 is implied by <4 x i32>. This would have the added
>>>>>>> benefit that LLVM would automatically check for the operands to
>>>>>>> match.
>>>>>>>
>>>>>> Won't this cause issues with overflow? Isn't the point of an add
>>>>>> (or mul....) reduction of say, <64 x i8> giving a larger (i32 or
>>>>>> i64) result so we don't lose anything? I agree for bitop
>>>>>> reductions it doesn't make sense though.
>>>>>>
>>>>> Sorry - I forgot to add: which asks the question - should we be
>>>>> considering signed/unsigned add/mul and possibly saturation
>>>>> reductions?
>>>>>
>>>>>
>>>> --
>>>>
>>>> Simon Moll
>>>> Researcher / PhD Student
>>>>
>>>> Compiler Design Lab (Prof. Hack)
>>>> Saarland University, Computer Science
>>>> Building E1.3, Room 4.31
>>>>
>>>> Tel. +49 (0)681 302-57521 :
>>>> [hidden email]
>>>>
>>>> Fax. +49 (0)681 302-3065  :
>>>> http://compilers.cs.uni-saarland.de/people/moll
>>>

Re: [llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Chris Lattner via llvm-dev
In reply to this post by Chris Lattner via llvm-dev
Hi David,

Thanks for bringing that up! Integer ones were also suggested earlier in this thread; I just thought it would be good to clear up the changes to FP intrinsics first.

Yes, similar to the suggestion to allow a wider result type for the fadd/fmul reductions, we could define explicit intrinsics to perform the (integer promotion and) reduction in a wider type. In the IR, this would mean adding support for explicit sign- and zero-extending variants:

  i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> %input)
      ; unchanged - reduce elements into i32 result

  i64 @llvm.experimental.vector.reduce.sadd.i64.v4i32(<4 x i32> %input)
      ; new intrinsic - sign extend input elements and reduce into i64 result

  i64 @llvm.experimental.vector.reduce.uadd.i64.v4i32(<4 x i32> %input)
      ; new intrinsic - zero extend input elements and reduce into i64 result

Both MVE and SVE have instructions that match this behaviour. By having these explicit intrinsics, matching them becomes more straightforward, and we avoid the possibility of the zero/sign-extension being obscured by the time it reaches CodeGen. LLVM can canonicalise to the regular 'vector.reduce.add' intrinsic for sadd/uadd cases where the result and element types are the same.
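For illustration, a sketch of the difference (the 'sadd' spelling below is the proposed intrinsic, not an existing one, and the overloaded name of the widened reduction is likewise illustrative):

```llvm
; Today the widening reduction must be expressed as extend-then-reduce,
; and the extension can be obscured before it reaches CodeGen:
%ext = sext <4 x i32> %input to <4 x i64>
%res = call i64 @llvm.experimental.vector.reduce.add.v4i64(<4 x i64> %ext)

; Proposed: one intrinsic that keeps the sign-extension explicit and maps
; directly onto widening reduction instructions (e.g. MVE VADDLVA, SVE SADDV):
%res2 = call i64 @llvm.experimental.vector.reduce.sadd.i64.v4i32(<4 x i32> %input)
```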

Thanks,

Sander

> On 16 May 2019, at 17:36, David Green <[hidden email]> wrote:
>
> Hello.
>
> Thanks for working on this.
>
> Are we talking only about the floating point versions of these, or the integer ones as well?
>
> For the integer ones, there are a number of new MVE instructions that reduce from, for example i16's into an i32. Or long versions that accumulate into an i64.
>
> For example the VADDVA.U16 Rda, Qm instruction will accumulate into a 32bit register.
> The VADDLVA.U16  RdaLo, RdaHi, Qm instruction will accumulate into a pair of 32bit registers (so a 64bit value).
>
> Thanks,
> Dave
>
> P.S. There are two different people on this thread that are named "David Greene"/"David Green". Sorry in advance for the confusion.
>
>
>
> From: llvm-dev <[hidden email]> on behalf of Sander De Smalen via llvm-dev <[hidden email]>
> Sent: 16 May 2019 13:53
> To: [hidden email]
> Cc: David Greene; nd
> Subject: Re: [llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics
>  
> Hello again,
>
> I've been meaning to follow up on this thread for the last couple of weeks, my apologies for the delay.
>
> To summarise the feedback on the proposal for vector.reduce.fadd/fmul:
>
> There seems to be consensus to keep the explicit start value to better accommodate chained reductions (as opposed to generating IR that performs the reduction of the first element using extract/fadd/insert pattern). An important use-case for these reductions is to work inside vectorized loops, where chaining happens through the reduction value's PHI node (i.e. the scalar reduction value from one iteration will be the input to the next iteration). This intrinsic would also naturally match reduction instructions of ARM SVE and NEC SX-aurora.
>
> For Option A (https://reviews.llvm.org/D60261), there is an argument that code creating or operating on these intrinsics can treat ordered and unordered reductions the same (in that they have the same arguments). Fast-math flags determine whether or not the intrinsic needs to be evaluated in strict order. Codegen for non-strict reductions should be able to fold away the identity-value.
>
> For Option B (https://reviews.llvm.org/D60262), David made the argument that making the reduction-order explicit (as opposed to deducing this from fast-math flags) would ensure the ordering is always as expected, even when FMF on the call sites are dropped for some reason.
>
>
>
> Is it correct that I sensed a slight preference for Option A? i.e. Renaming the intrinsics and keeping the same signature, but dropping the special-cased behaviour for the identity-value with non-strict reductions. For David's argument, I think that although the extra expressiveness would be nice to have, LLVM normally depends on the FMF being propagated correctly to produce faster code so this should also be sufficient for reductions.
>
> If we go for Option A, I suggest we drop the 'experimental' prefix from experimental.vector.reduce.fadd/fmul to avoid having to add an awkward '.v2' suffix to the new intrinsic. When we implement all the suggestions from this proposal (possibly including the one mentioned below), I wouldn't really know what other features we could add other than predication (which would be covered by the LLVM-VP proposal and thus require another renaming), or possibly adding 'constrained' variants which I assume would have separate intrinsics. So we might as well drop the 'experimental' prefix.
>
> Finally, do we want to remove the restriction that the result type must always match the vector-element type? A wider result type would then allow the reduction to be performed in the wider type.
>
> Thanks,
>
> Sander
>
>> On 10 Apr 2019, at 18:56, Amara Emerson <[hidden email]> wrote:
>>
>> I’m fine with the direction this is going, but let’s keep renaming to a minimum. They’ve been experimental long enough now that we should be able to now jump to a final form after all the feedback.
>>
>> Amara
>>
>>> On Apr 10, 2019, at 5:59 AM, Sander De Smalen via llvm-dev <[hidden email]> wrote:
>>>
>>>>
>>>> On 8 Apr 2019, at 11:37, Simon Moll <[hidden email]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 4/5/19 10:47 AM, Simon Pilgrim via llvm-dev wrote:
>>>>> On 05/04/2019 09:37, Simon Pilgrim via llvm-dev wrote:
>>>>>> On 04/04/2019 14:11, Sander De Smalen wrote:
>>>>>>> Proposed change:
>>>>>>> ----------------------------
>>>>>>> In this RFC I propose changing the intrinsics for llvm.experimental.vector.reduce.fadd and llvm.experimental.vector.reduce.fmul (see options A and B). I also propose renaming the 'accumulator' operand to 'start value' because for fmul this is the start value of the reduction, rather than a value to which the fmul reduction is accumulated into.
>>>> Note that the LLVM-VP proposal also changes the way reductions are handled in IR (https://reviews.llvm.org/D57504). This could be an opportunity to avoid the "v2" suffix issue: LLVM-VP moves the intrinsic to the "llvm.vp.*" namespace and we can fix the reduction semantics in the process.
>>> Thanks for pointing that out, Simon. I think for now we should keep this proposal separate from LLVM-VP as they serve different purposes and have different scope. But yes we can easily rename the intrinsics again when the VP proposal lands.
>>>
>>>>
>>>> Btw, if you are at EuroLLVM. There is a BoF at 2pm today on LLVM-VP.
>>>>
>>>>>>>
>>>>>>> [Option A] Always using the start value operand in the reduction (https://reviews.llvm.org/D60261)
>>>>>>>
>>>>>>>   declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>>>>>
>>>>>>> This means that if the start value is 'undef', the result will be undef and all code creating such a reduction will need to ensure it has a sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When using 'fast' or ‘reassoc’ on the call it will be implemented using an unordered reduction, otherwise it will be implemented with an ordered reduction. Note that a new intrinsic is required to capture the new semantics. In this proposal the intrinsic is prefixed with a 'v2' for the time being, with the expectation this will be dropped when we remove 'experimental' from the reduction intrinsics in the future.
>>>>>>>
>>>>>>> [Option B] Having separate ordered and unordered intrinsics (https://reviews.llvm.org/D60262).
>>>>>>>
>>>>>>>   declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>>>>>   declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)
>>>>>>>
>>>>>>> This will mean that the behaviour is explicit from the intrinsic and the use of 'fast' or ‘reassoc’ on the call has no effect on how that intrinsic is lowered. The ordered reduction intrinsic will take a scalar start-value operand, where the unordered reduction intrinsic will only take a vector operand.
>>>>>>>
>>>>>>> Both options auto-upgrade the IR to use the new (version of the) intrinsics. I'm personally slightly in favour of [Option B], because it better aligns with the definition of the SelectionDAG nodes and is more explicit in its semantics. We also avoid having to use an artificial 'v2' like prefix to denote the new behaviour of the intrinsic.
>>>>>> Do we have any targets with instructions that can actually use the start value? TBH I'd be tempted to suggest we just make the initial extractelement/fadd/insertelement pattern a manual extra stage and avoid having that argument entirely.
>>>>>>
>>>> NEC SX-Aurora has reduction instructions that take in a start value in a scalar register. We are hoping to upstream the backend: http://lists.llvm.org/pipermail/llvm-dev/2019-April/131580.html
>>> Great, I think combined with the argument for chaining of ordered reductions (often inside vectorized loops) and two architectures (ARM SVE and SX-Aurora) taking a scalar start register, this is enough of an argument to keep the explicit operand for the ordered reductions.
>>>
>>>>>>
>>>>>>> Further efforts:
>>>>>>> ----------------------------
>>>>>>> Here is a non-exhaustive list of items that I think work towards making the intrinsics non-experimental:
>>>>>>>
>>>>>>>    • Adding SelectionDAG legalization for the  _STRICT reduction SDNodes. After some great work from Nikita in D58015, unordered reductions are now legalized/expanded in SelectionDAG, so if we add expansion in SelectionDAG for strict reductions this would make the ExpandReductionsPass redundant.
>>>>>>>    • Better enforcing the constraints of the intrinsics (see https://reviews.llvm.org/D60260 ).
>>>>>>>
>>>>>>>    • I think we'll also want to be able to overload the result operand based on the vector element type for the intrinsics having the constraint that the result type must match the vector element type. e.g. dropping the redundant 'i32' in:
>>>>>>>   i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32> %a) => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> %a)
>>>>>>> since i32 is implied by <4 x i32>. This would have the added benefit that LLVM would automatically check for the operands to match.
>>>>>>>
>>>>>> Won't this cause issues with overflow? Isn't the point  of an add (or mul....) reduction of say, <64 x i8> giving a larger (i32 or i64) result so we don't lose anything? I agree for bitop reductions it doesn't make sense though.
>>>>>>
>>>>> Sorry - I forgot to add: which asks the question - should we be considering signed/unsigned add/mul and possibly saturation reductions?
>>>>>
>>>>>
>>>
>>
>


Re: [llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Chris Lattner via llvm-dev
Hello

An accumulating version would be matched with something like this?
%reduce = call i64 @llvm.experimental.vector.reduce.uadd.i64.v4i32(<4 x i32> %a)
%sum = add i64 %reduce, %sum.phi

i.e. we don't use an explicit first accumulator parameter like for floats. That sounds like it should work fine to me.

And is it worth keeping reduce.add.v4i32, if it's equivalent to a reduce.uadd.i32.v4i32?

We don't quite yet have a target in tree that will do this, but hopefully that won't be far off. I'm happy to help out with the work here.
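For the same-type case, canonicalisation could fold the hypothetical 'uadd' form into the existing intrinsic (both spellings below are illustrative, per the overloaded naming discussed above):

```llvm
; With a result type equal to the element type, no extension is performed...
%r1 = call i32 @llvm.experimental.vector.reduce.uadd.i32.v4i32(<4 x i32> %a)
; ...so it canonicalises to the plain add reduction:
%r2 = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> %a)
```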

Dave



From: Sander De Smalen
Sent: 17 May 2019 16:36
To: David Green
Cc: [hidden email]; David Greene; nd
Subject: Re: [llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics
 
Hi David,

Thanks for bringing that up! Integer ones were also suggested earlier in this thread; I just thought it would be good to clear up the changes to FP intrinsics first.

Yes, similar to the suggestion to allow a wider result type for the fadd/fmul reductions, we could define explicit intrinsics to perform the (integer promotion and) reduction in a wider type. In the IR, this would mean adding support for explicit sign- and zero-extending variants:

  i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> %input)
      ; unchanged - reduce elements into i32 result

  i64 @llvm.experimental.vector.reduce.sadd.i64.v4i32(<4 x i32> %input)
      ; new intrinsic - sign extend input elements and reduce into i64 result

  i64 @llvm.experimental.vector.reduce.uadd.i64.v4i32(<4 x i32> %input)
      ; new intrinsic - zero extend input elements and reduce into i64 result

Both MVE and SVE have instructions that match this behaviour. By having these explicit intrinsics, matching them becomes more straightforward, and we avoid the possibility of the zero/sign-extension being obscured by the time it reaches CodeGen. LLVM can canonicalise to the regular 'vector.reduce.add' intrinsic for sadd/uadd cases where the result and element types are the same.

Thanks,

Sander

> On 16 May 2019, at 17:36, David Green <[hidden email]> wrote:
>
> Hello.
>
> Thanks for working on this.
>
> Are we talking only about the floating point versions of these, or the integer ones as well?
>
> For the integer ones, there are a number of new MVE instructions that reduce from, for example i16's into an i32. Or long versions that accumulate into an i64.
>
> For example the VADDVA.U16 Rda, Qm instruction will accumulate into a 32bit register.
> The VADDLVA.U16  RdaLo, RdaHi, Qm instruction will accumulate into a pair of 32bit registers (so a 64bit value).
>
> Thanks,
> Dave
>
> P.S. There are two different people on this thread that are named "David Greene"/"David Green". Sorry in advance for the confusion.
>
>
>
> From: llvm-dev <[hidden email]> on behalf of Sander De Smalen via llvm-dev <[hidden email]>
> Sent: 16 May 2019 13:53
> To: [hidden email]
> Cc: David Greene; nd
> Subject: Re: [llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

> Hello again,
>
> I've been meaning to follow up on this thread for the last couple of weeks, my apologies for the delay.
>
> To summarise the feedback on the proposal for vector.reduce.fadd/fmul:
>
> There seems to be consensus to keep the explicit start value to better accommodate chained reductions (as opposed to generating IR that performs the reduction of the first element using extract/fadd/insert pattern). An important use-case for these reductions is to work inside vectorized loops, where chaining happens through the reduction value's PHI node (i.e. the scalar reduction value from one iteration will be the input to the next iteration). This intrinsic would also naturally match reduction instructions of ARM SVE and NEC SX-aurora.
>
> For Option A (https://reviews.llvm.org/D60261), there is an argument that code creating or operating on these intrinsics can treat ordered and unordered reductions the same (in that they have the same arguments). Fast-math flags determine whether or not the intrinsic needs to be evaluated in strict order. Codegen for non-strict reductions should be able to fold away the identity-value.
>
> For Option B (https://reviews.llvm.org/D60262), David made the argument that making the reduction-order explicit (as opposed to deducing this from fast-math flags) would ensure the ordering is always as expected, even when FMF on the call sites are dropped for some reason.
>
>
>
> Is it correct that I sensed a slight preference for Option A? i.e. Renaming the intrinsics and keeping the same signature, but dropping the special-cased behaviour for the identity-value with non-strict reductions. For David's argument, I think that although the extra expressiveness would be nice to have, LLVM normally depends on the FMF being propagated correctly to produce faster code so this should also be sufficient for reductions.
>
> If we go for Option A, I suggest we drop the 'experimental' prefix from experimental.vector.reduce.fadd/fmul to avoid having to add an awkward '.v2' suffix to the new intrinsic. When we implement all the suggestions from this proposal (possibly including the one mentioned below), I wouldn't really know what other features we could add other than predication (which would be covered by the LLVM-VP proposal and thus require another renaming), or possibly adding 'constrained' variants which I assume would have separate intrinsics. So we might as well drop the 'experimental' prefix.
>
> Finally, do we want to remove the restriction that the result type must always match the vector-element type? A wider result type would then allow the reduction to be performed in the wider type.
>
> Thanks,
>
> Sander
>
>> On 10 Apr 2019, at 18:56, Amara Emerson <[hidden email]> wrote:
>>
>> I’m fine with the direction this is going, but let’s keep renaming to a minimum. They’ve been experimental long enough now that we should be able to now jump to a final form after all the feedback.
>>
>> Amara
>>
>>> On Apr 10, 2019, at 5:59 AM, Sander De Smalen via llvm-dev <[hidden email]> wrote:
>>>
>>>>
>>>> On 8 Apr 2019, at 11:37, Simon Moll <[hidden email]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 4/5/19 10:47 AM, Simon Pilgrim via llvm-dev wrote:
>>>>> On 05/04/2019 09:37, Simon Pilgrim via llvm-dev wrote:
>>>>>> On 04/04/2019 14:11, Sander De Smalen wrote:
>>>>>>> Proposed change:
>>>>>>> ----------------------------
>>>>>>> In this RFC I propose changing the intrinsics for llvm.experimental.vector.reduce.fadd and llvm.experimental.vector.reduce.fmul (see options A and B). I also propose renaming the 'accumulator' operand to 'start value' because for fmul this is the start value of the reduction, rather than a value to which the fmul reduction is accumulated into.
>>>> Note that the LLVM-VP proposal also changes the way reductions are handled in IR (https://reviews.llvm.org/D57504). This could be an opportunity to avoid the "v2" suffix issue: LLVM-VP moves the intrinsic to the "llvm.vp.*" namespace and we can fix the reduction semantics in the process.
>>> Thanks for pointing that out, Simon. I think for now we should keep this proposal separate from LLVM-VP as they serve different purposes and have different scope. But yes we can easily rename the intrinsics again when the VP proposal lands.
>>>
>>>>
>>>> Btw, if you are at EuroLLVM. There is a BoF at 2pm today on LLVM-VP.
>>>>
>>>>>>>
>>>>>>> [Option A] Always using the start value operand in the reduction (https://reviews.llvm.org/D60261)
>>>>>>>
>>>>>>>   declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>>>>>
>>>>>>> This means that if the start value is 'undef', the result will be undef and all code creating such a reduction will need to ensure it has a sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When using 'fast' or ‘reassoc’ on the call it will be implemented using an unordered reduction, otherwise it will be implemented with an ordered reduction. Note that a new intrinsic is required to capture the new semantics. In this proposal the intrinsic is prefixed with a 'v2' for the time being, with the expectation this will be dropped when we remove 'experimental' from the reduction intrinsics in the future.
>>>>>>>
>>>>>>> [Option B] Having separate ordered and unordered intrinsics (https://reviews.llvm.org/D60262).
>>>>>>>
>>>>>>>   declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>>>>>>>   declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)
>>>>>>>
>>>>>>> This will mean that the behaviour is explicit from the intrinsic and the use of 'fast' or ‘reassoc’ on the call has no effect on how that intrinsic is lowered. The ordered reduction intrinsic will take a scalar start-value operand, where the unordered reduction intrinsic will only take a vector operand.
>>>>>>>
>>>>>>> Both options auto-upgrade the IR to use the new (version of the) intrinsics. I'm personally slightly in favour of [Option B], because it better aligns with the definition of the SelectionDAG nodes and is more explicit in its semantics. We also avoid having to use an artificial 'v2' like prefix to denote the new behaviour of the intrinsic.
>>>>>> Do we have any targets with instructions that can actually use the start value? TBH I'd be tempted to suggest we just make the initial extractelement/fadd/insertelement pattern a manual extra stage and avoid having that argument entirely.
>>>>>>
>>>> NEC SX-Aurora has reduction instructions that take in a start value in a scalar register. We are hoping to upstream the backend: http://lists.llvm.org/pipermail/llvm-dev/2019-April/131580.html
>>> Great, I think combined with the argument for chaining of ordered reductions (often inside vectorized loops) and two architectures (ARM SVE and SX-Aurora) taking a scalar start register, this is enough of an argument to keep the explicit operand for the ordered reductions.
>>>
>>>>>>
>>>>>>> Further efforts:
>>>>>>> ----------------------------
>>>>>>> Here is a non-exhaustive list of items that I think work towards making the intrinsics non-experimental:
>>>>>>>
>>>>>>>    • Adding SelectionDAG legalization for the  _STRICT reduction SDNodes. After some great work from Nikita in D58015, unordered reductions are now legalized/expanded in SelectionDAG, so if we add expansion in SelectionDAG for strict reductions this would make the ExpandReductionsPass redundant.
>>>>>>>    • Better enforcing the constraints of the intrinsics (see https://reviews.llvm.org/D60260 ).
>>>>>>>
>>>>>>>    • I think we'll also want to be able to overload the result operand based on the vector element type for the intrinsics having the constraint that the result type must match the vector element type. e.g. dropping the redundant 'i32' in:
>>>>>>>   i32 @llvm.experimental.vector.reduce.and.i32.v4i32(<4 x i32> %a) => i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> %a)
>>>>>>> since i32 is implied by <4 x i32>. This would have the added benefit that LLVM would automatically check for the operands to match.
>>>>>>>
>>>>>> Won't this cause issues with overflow? Isn't the point  of an add (or mul....) reduction of say, <64 x i8> giving a larger (i32 or i64) result so we don't lose anything? I agree for bitop reductions it doesn't make sense though.
>>>>>>
>>>>> Sorry - I forgot to add: which asks the question - should we be considering signed/unsigned add/mul and possibly saturation reductions?
>>>>>
>>>>>
>>>
>>
>
