[llvm-dev] [RFC] Intrinsics for Hardware Loops


[llvm-dev] [RFC] Intrinsics for Hardware Loops

Amara Emerson via llvm-dev
Hi,

Arm have recently announced the v8.1-M architecture specification for
our next generation microcontrollers. The architecture includes
vector extensions (MVE) and support for low-overhead branches (LoB),
which can be thought of as a style of hardware loop. Hardware loops
aren't new to LLVM; other backends (at least Hexagon and PPC that I
know of) also include support. These implementations insert the loop
controlling instructions at the MachineInstr level and I'd like to
propose that we add intrinsics to support this notion at the IR
level, primarily to be able to use scalar evolution to understand the
loops instead of having to implement a machine-level analysis for
each target.

I've posted an RFC with a prototype implementation in
https://reviews.llvm.org/D62132. It contains intrinsics that are
currently Arm specific, but I hope they're general enough to be used
by all targets. The Armv8.1-M architecture supports do-while and
while loops but, for conciseness, I'd like to focus on just while
loops here. There are two parts to this RFC: (1) the intrinsics
and (2) a prototype implementation in the Arm backend to enable
tail-predicated machine loops.
   
1. LLVM IR Intrinsics
   
In the following definitions, I use the term 'element' to describe
the work performed by an IR loop that has not been vectorized or
unrolled by the compiler. This should be equivalent to the loop at
the source level.
   
void @llvm.arm.set.loop.iterations(i32)
- Takes a single operand: the number of iterations to be executed.
   
i32 @llvm.arm.set.loop.elements(i32, i32)
- Takes two operands:
  - The total number of elements to be processed by the loop.
  - The maximum number of elements processed in one iteration of
    the IR loop body.
- Returns the number of iterations to be executed.
   
<X x i1> @llvm.arm.get.active.mask.X(i32)
- Takes as an operand the number of elements that still need
  processing.
- Where 'X' denotes the vectorization factor, returns a vector of i1
  indicating which vector lanes are active for the current loop
  iteration.
   
i32 @llvm.arm.loop.end(i32, i32)
- Takes two operands:
  - The number of elements that still need processing.
  - The maximum number of elements processed in one iteration of the
    IR loop body.
   
The following gives an illustration of their intended usage:
   
entry:
  %0 = call i32 @llvm.arm.set.loop.elements(i32 %N, i32 4)
  %1 = icmp ne i32 %0, 0
  br i1 %1, label %vector.ph, label %for.loopexit
   
vector.ph:
  br label %vector.body
   
vector.body:
  %elts = phi i32 [ %N, %vector.ph ], [ %elts.rem, %vector.body ]
  %active = call <4 x i1> @llvm.arm.get.active.mask(i32 %elts, i32 4)
  %load = tail call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %addr, i32 4, <4 x i1> %active, <4 x i32> undef)
  tail call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %load, <4 x i32>* %addr.1, i32 4, <4 x i1> %active)
  %elts.rem = call i32 @llvm.arm.loop.end(i32 %elts, i32 4)
  %cmp = icmp sgt i32 %elts.rem, 0
  br i1 %cmp, label %vector.body, label %for.loopexit
   
for.loopexit:
  ret void
   
As the example shows, control flow is still ultimately performed
through the icmp and br pair. Nothing connects the intrinsics to a
given loop, and there is no requirement that a set.loop.* call be
paired with a loop.end call.
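
To make the intended data flow concrete, here is a toy Python model of the IR example above. It is only a sketch and not LLVM code. The return value of loop.end is not specified in the text, so the model assumes it returns the element count left after the iteration (possibly negative, which would explain the signed compare). The helper names are hypothetical.

```python
# Toy model of the proposed intrinsics; names mirror the RFC but this is
# an illustration, not the implementation in D62132.

def set_loop_elements(total, per_iter):
    """Return the iteration count: ceil(total / per_iter)."""
    return (total + per_iter - 1) // per_iter

def get_active_mask(remaining, vf):
    """Lane i is active while i < remaining."""
    return [i < remaining for i in range(vf)]

def loop_end(remaining, per_iter):
    """ASSUMPTION: returns the elements left after this iteration."""
    return remaining - per_iter

def masked_copy(src, vf=4):
    """Mirror the IR example: copy src using masked loads and stores."""
    n = len(src)
    dst = [None] * n
    iterations = 0
    if set_loop_elements(n, vf) != 0:       # entry: guard on trip count
        elts, base = n, 0
        while True:                         # vector.body
            active = get_active_mask(elts, vf)
            for lane, on in enumerate(active):
                if on:                      # masked load + masked store
                    dst[base + lane] = src[base + lane]
            iterations += 1
            base += vf
            elts = loop_end(elts, vf)
            if not elts > 0:                # icmp sgt + br
                break
    return dst, iterations
```

With ten elements and a vectorization factor of four, the model runs three iterations and the last one has only two active lanes.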
   
2. Low-overhead loops in the Arm backend
   
Disclaimer: The prototype is barebones and reuses parts of NEON and
I'm currently targeting the Cortex-A72 which does not support this
feature! opt and llc build and the provided test case doesn't cause a
crash...
   
The low-overhead branch extension can be combined with MVE to
generate vectorized loops in which the epilogue is executed within
the predicated vector body. The proposal is for this to be supported
through a series of passes:
1) An IR LoopPass to identify suitable loops and insert the intrinsics
   proposed above.
2) DAGToDAG ISel, which maps the intrinsics, almost 1-1, to pseudo
   instructions.
3) A final MachineFunctionPass to expand the pseudo instructions.
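
As a rough illustration of what "the epilogue is executed within the predicated vector body" means, the Python sketch below compares a conventional vector body plus scalar epilogue with a single tail-predicated loop. Both functions are hypothetical models written for this comparison; they say nothing about how the passes above are implemented.

```python
def add_with_epilogue(a, b, vf=4):
    """Conventional shape: full-width vector body, then a scalar remainder."""
    n = len(a)
    out = [0] * n
    main = n - n % vf
    for base in range(0, main, vf):         # vector body, all lanes active
        out[base:base + vf] = [x + y for x, y in
                               zip(a[base:base + vf], b[base:base + vf])]
    for i in range(main, n):                # scalar epilogue
        out[i] = a[i] + b[i]
    return out

def add_tail_predicated(a, b, vf=4):
    """Tail-predicated shape: one loop, inactive lanes masked off."""
    n = len(a)
    out = [0] * n
    remaining, base = n, 0
    while remaining > 0:
        for lane in range(vf):
            if lane < remaining:            # lane predicate
                out[base + lane] = a[base + lane] + b[base + lane]
        base += vf
        remaining -= vf
    return out
```

Both produce the same result for any length; the second simply needs no separate remainder loop.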
   
To help enable the lowering of an i1 vector, the VPR register has
been added. This is a status register that contains the P0 predicate
and is also used to model the implicit predicates of tail-predicated
loops.
   
There are two main reasons why pseudo instructions are used instead
of generating MIs directly during ISel:
1) They give us a chance to inspect the whole loop later and
   confirm that it's a good idea to generate such a loop. This is
   trivial for scalar loops, but not really applicable for
   tail-predicated loops.
2) They allow us to separate the decrementing of the loop counter from
   the instruction that branches back, which should help us recover if
   LR gets spilt between these two pseudo ops.
   
For Armv8.1-M, the while.setup intrinsic is used to generate the wls
and wlstp instructions, while loop.end generates the le and letp
instructions. The active.mask can just be removed because the lane
predication is handled implicitly.
   
I'm not sure of the vectorizer's limitations in generating vector
instructions that operate across lanes, such as reductions, when
generating a predicated loop, but this needs to be considered.

I'd welcome any feedback here or on Phabricator, and I'd especially like
to know if this would be useful to current targets.

cheers,

Sam Parker

Compilation Tools Engineer | Arm



_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [RFC] Intrinsics for Hardware Loops

Amara Emerson via llvm-dev


On 5/20/19 6:00 AM, Sam Parker via llvm-dev wrote:
Hi,

Arm have recently announced the v8.1-M architecture specification for
our  next generation microcontrollers. The architecture includes
vector extensions (MVE) and support for low-overhead branches (LoB),
which can be thought of a style of hardware loop. Hardware loops
aren't new to LLVM, other backends (at least Hexagon and PPC that I
know of) also include support. These implementations insert the loop
controlling instructions at the MachineInstr level and I'd like to
propose that we add intrinsics to support this notion at the IR
level;



The PPC implementation also recognizes loops at the IR level (in lib/Target/PowerPC/PPCCTRLoops.cpp) and then matches the relevant combinations of intrinsics and conditional branches during SDAG ISel. The intrinsics that PPC uses are:

  def int_ppc_mtctr : Intrinsic<[], [llvm_anyint_ty], []>;
  def int_ppc_is_decremented_ctr_nonzero :
    Intrinsic<[llvm_i1_ty], [], [IntrNoDuplicate]>;

This proposal actually sounds very similar to what PPC currently does for counter-based loops. This solution tends to work well, in part because we can use SCEV to analyze loops at the IR level and generate trip-count expressions.
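
For readers unfamiliar with counter-based loops, the following Python sketch models the two PPC intrinsics named above. The behaviour is inferred from the intrinsic names only (set a count register, then decrement and test it); the class and function names are hypothetical and this is not taken from the PPC backend.

```python
class CounterRegister:
    """Toy stand-in for a hardware count register (hypothetical model)."""

    def __init__(self):
        self.value = 0

    def mtctr(self, trip_count):
        """Load the trip count into the counter."""
        self.value = trip_count

    def is_decremented_ctr_nonzero(self):
        """Decrement the counter and report whether it is still nonzero."""
        self.value -= 1
        return self.value != 0

def run_counted_loop(trip_count, body):
    """Run body trip_count times; assumes trip_count >= 1 (do-while shape)."""
    ctr = CounterRegister()
    ctr.mtctr(trip_count)
    while True:
        body()
        if not ctr.is_decremented_ctr_nonzero():
            break
```

The trip count passed to mtctr is the kind of expression SCEV can provide at the IR level.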


primarily to be able to use scalar evolution to understand the
loops instead of having to implement a machine-level analysis for
each target.

There are two main reasons why pseudo instructions are used instead
of generating MIs directly during ISel:
1) They gives us a chance of later inspecting the whole loop and
   confirm that it's a good idea to generate such a loop. This is
   trivial for scalar loops, but not really applicable for
   tail-predicated loops.


Is the idea that you'll be able to fall back to using regular branch instructions for generating the loops? Are you doing this before or after register allocation?


 -Hal


-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Re: [llvm-dev] [RFC] Intrinsics for Hardware Loops

Amara Emerson via llvm-dev

This seems like a generally reasonable approach. I have some hesitation about the potential separation of the control flow and the intrinsics (i.e. can we ever confuse which loop they apply to?), but the basic notion seems reasonable. Particularly so as Hal points out that we already have something like this in PPC. I'd suggest framing this as being an IR assist to backends rather than a canonical form or anything expected to be used by frontends though.


A couple of random comments; there's no coherent message here, just a collection of thoughts.


1) Your "loop.end" intrinsic is very confusingly named. I think you definitely need something different there name-wise. Also, you fail to specify what the return value is.

2) Your get.active.mask.X is a generally useful construct, but I think it can be represented via bitmath and a bitcast right?  (i.e. does it have to be an intrinsic?)

3) There seems to be a good amount of overlap with the SVE ideas. I'm not suggesting it needs to be reconciled, just pointing out many of the issues are common. (The more I see discussion of these topics, the more unsettled it all feels. Trying out a couple of experimental designs, and iterating until one wins is feeling more and more like the right approach.)
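
The "bitmath and a bitcast" idea in point 2 can be sketched in Python as follows. This is one possible reading of the suggestion, not something stated in the thread: build an integer with the low bits set, then reinterpret it as lanes. The function names are hypothetical, and mapping lane 0 to bit 0 is an assumption.

```python
def active_mask_bits(remaining, vf):
    """Integer whose low min(remaining, vf) bits are set (the 'bitmath')."""
    active = max(0, min(remaining, vf))
    return (1 << active) - 1

def bits_to_lanes(bits, vf):
    """Model of bitcasting an iVF value to <VF x i1>; lane 0 = bit 0 (assumed)."""
    return [bool((bits >> lane) & 1) for lane in range(vf)]
```

For example, two remaining elements at a vectorization factor of four give the bits 0b0011, i.e. lanes 0 and 1 active.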


Philip





Re: [llvm-dev] [RFC] Intrinsics for Hardware Loops

Amara Emerson via llvm-dev
Hi Philip,

Yes, these constructs should really only be used by the compiler and probably always very late in the pipeline. To address your other points:

1) Agreed. loop.end has now been renamed to 'loop.decrement'. I've also added 'loop.decrement.reg' which operates upon the updated loop counter, instead of some opaque system register.
2) It could be handled by normal IR; the vectorizer currently emits the equivalent when folding the epilogue into the loop body. The reason why we need an intrinsic is to work around the limitations of basic block isel. In our new architecture, the lane predication is implicit iff we can generate the hardware loop - but that doesn't prevent other instructions, predicated on something other than the loop index, from being generated too. At ISel we can't guarantee whether a predicate is loop index based or otherwise, so it has to be explicit coming into ISel.
3) The main difference here is the same as (2). As I understand it, SVE has a bank of predicate registers that are explicitly accessed, whereas MVE has a status register that is used implicitly.
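
The distinction in (1) between a counter held in opaque state and a counter passed around as an ordinary value can be sketched in Python. The signatures and return conventions below are assumptions made for illustration; the message does not spell them out.

```python
class HiddenCounterLoop:
    """Counter lives in opaque state, like 'some opaque system register'."""

    def __init__(self, iterations):
        self._counter = iterations          # models set.loop.iterations

    def loop_decrement(self, step=1):
        # ASSUMPTION: returns True while the loop should keep running.
        self._counter -= step
        return self._counter > 0

def loop_decrement_reg(counter, step=1):
    """Counter is an explicit value: take it in, return the updated value."""
    return counter - step

def count_hidden(iterations):
    """Count trips of a do-while loop driven by the hidden counter."""
    loop, trips = HiddenCounterLoop(iterations), 0
    while True:
        trips += 1
        if not loop.loop_decrement():
            break
    return trips

def count_explicit(iterations):
    """Same loop, but the counter is visible as a normal value."""
    counter, trips = iterations, 0
    while True:
        trips += 1
        counter = loop_decrement_reg(counter)
        if counter <= 0:
            break
    return trips
```

In the explicit form the updated counter is an ordinary value that other code can inspect, which is the property the 'reg' variant is described as adding.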

Sam Parker


Re: [llvm-dev] [RFC] Intrinsics for Hardware Loops

Amara Emerson via llvm-dev

I'll just note that I'm generally very skeptical of the argument in (2). Not actively objecting, but every time this general line of thought comes up, I find the reasoning unconvincing.


On 5/30/19 5:19 AM, Sam Parker wrote:
Hi Philip,

Yes, these constructs should really only be used by the compiler and probably always very late in the pipeline. To address your other points:

1) Agreed. loop.end has now renamed to 'loop.decrement'. I've also added 'loop.decrement.reg' which operates upon the updated loop counter, instead of some opaque system register.
2) It could be handled by normal IR, the vectorizer currently splits out the equivalent when folding the epilogue into the loop body. The reason why we need an intrinsic is to work around the limitations of basic block isel. In our new architecture, the lane predication is implicit iff we can generate the hardware loop - but that doesn't prevent other instructions, predicated on something other than the loop index, from being generated too. At ISel we can't guarantee whether a predicate is loop index based or otherwise, so it has to be explicit coming into ISel.
3) The main difference here is the same as (2). As I understand SVE, has bank of predicate registers that are explicitly accessed, whereas MVE has a status register that is used implicitly.

Sam Parker

Compilation Tools Engineer | Arm

. . . . . . . . . . . . . . . . . . . . . . . . . . .

Arm.com


From: Philip Reames [hidden email]
Sent: 28 May 2019 19:00
To: Sam Parker; [hidden email]
Cc: nd
Subject: Re: [llvm-dev] [RFC] Intrinsics for Hardware Loops
 

This seems like a generally reasonable approach.  I have some hesitation about the potential separation of the control flow and the intrinsics (i.e. can we every confuse which loop they apply to?), but the basic notion seems reasonable.  Particularly so as Hal points out that we already have something like this in PPC.   I'd suggest framing this as being an IR assist to backends rather than a canonical form or anything expected to be used by frontends though.


A couple of random comments; there's no coherent message here, just a collection of thoughts.


1) Your "loop.end" intrinsic is very confusingly named.  I think you definitely need something different there name wise.  Also, you fail to specify what the return value is.

2) Your get.active.mask.X is a generally useful construct, but I think it can be represented via bitmath and a bitcast right?  (i.e. does it have to be an intrinsic?)

3) There seems to be a good amount of overlap with the SVE ideas.  I'm not suggesting it needs to be reconciled, just pointing out many of the issues are common.  (The more I see discussion of these topics, there more unsettled it all feels.  Trying out a couple of experimental designs, and iterating until one wins is feeling more and more like the right approach.)


Philip





_______________________________________________
LLVM Developers mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
