[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths


Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
On Mon, Jul 30, 2018 at 1:12 PM, Renato Golin via llvm-dev <[hidden email]> wrote:
The worry here is not within each instruction but across instructions.
SVE (and I think RISC-V) allow register size to be dynamically set.

For example, on the same machine, it may be 256 for one process and
512 for another (for example, to save power).

But the change is via a system register, so in theory, anyone can
write an inline asm in the beginning of a function and change the
vector length to whatever they want.

Worse still, people can do that inside loops, or in a tail loop,
thinking it's a good idea (or this is a Cray machine :).

AFAIK, the interface for changing the register length will not be
exposed programmatically, so in theory, we should not worry about it.
Any inline asm hack can be considered out of scope / user error.

However, Hal's concern seems to be that, in the event of anyone
planning to add it to their APIs, we need to make sure the proposed
semantics can cope with it (do we need to update the predicates again?
what will vscale mean, then and when?).

If not, we may have to enforce that this will not come to pass in its
current form. In this case, changing it later will require *a lot*
more effort than doing it now.

So, it would be good to get a clear response from the two fronts (SVE
and RISC-V) about the future intention to expose that or not.

Some characteristics of how I believe RISC-V vectors will or could end up:

- the user's data is stored only in normal C "arrays" (which of course can mean a pointer into the middle of some arbitrary chunk of memory)

- vector register types will be used only within a loop in a single user-written function. There is no way to pass a vector variable from one function to another -- there is no effect on ABI.

- there will be some vector intrinsic functions such as transcendentals. They will use a different, private ABI used only by the compiler and implemented only in the runtime library. They will probably use the alternate link register (x5 instead of x1) and will be totally not miscible with normal functions.

- even within a single function, different loops may have different maximum vector length, depending on how many vector registers are required and of what element types (all vectors in a given loop have the same number of elements).

- the active vector length can change from iteration to iteration of a loop. In particular, it can be less on the final iteration to deal with tails.

- the active vector length is set at the head of each iteration of a loop by the program telling the hardware how many elements are left (possibly thousands or millions) and the hardware saying "you can have 17 this time"

- (maybe) the active vector length can become shorter during execution of a loop iteration as a side effect of a vector load or store getting a protection error and loading/storing only up to the protection boundary. In this case an actual trap will be taken only if the first element of the vector causes the problem. Different micro-architectures might handle this differently. It should be a rare event. An interrupt or task switch during execution of a vector loop may cause the active vector length to become zero for that iteration.
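The stripmining behaviour described in the points above can be sketched in C. This is a hedged illustration, not real RISC-V intrinsics: `setvl` is a hypothetical stand-in for the hardware's "set vector length" request, and the inner scalar loop stands in for a single vector instruction.

```c
#include <stddef.h>

/* Hypothetical stand-in for the hardware's "set vector length" operation:
 * the program says how many elements are left and the hardware grants some
 * number between 1 and its current maximum (capped at 17 here, echoing the
 * "you can have 17 this time" example above). */
static size_t setvl(size_t remaining) {
    const size_t hw_max = 17;               /* assumed per-loop maximum */
    return remaining < hw_max ? remaining : hw_max;
}

/* Stripmined vector loop: no remainder loop is needed, because the final
 * iteration simply receives a shorter active vector length. */
void vadd(float *dst, const float *a, const float *b, size_t n) {
    while (n > 0) {
        size_t vl = setvl(n);               /* head of each iteration */
        for (size_t i = 0; i < vl; i++)     /* stands in for one vector op */
            dst[i] = a[i] + b[i];
        dst += vl; a += vl; b += vl;        /* advance by the granted VL */
        n -= vl;
    }
}
```

Note that the loop is correct for any value `setvl` returns, which is why a shortened grant mid-loop (the protection-fault case above) costs only wasted work, not correctness.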


So, this is quite different in detail from ARM's SVE, but it should be able to use the same type system. The main difference is probably that SVE seems to intend to allow passing vector types from one function to another -- but their vector length is fixed for any given processor (or process?). RISC-V loops may need to query the active vector length at the end of each loop iteration. That's a different instruction that needs to be emitted, but it has no effect on the type system.

From the point of view of the type system, I think RISC-V is a subset of SVE, as there is no need to pass vectors between functions and no effect on the ABI.



_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
In reply to this post by Nicholas Krause via llvm-dev
Renato Golin <[hidden email]> writes:

> On Mon, 30 Jul 2018 at 20:57, David A. Greene via llvm-dev
> <[hidden email]> wrote:
>> I'm not sure exactly how the SVE proposal would address this kind of
>> operation.
>
> SVE uses predication. The physical number of lanes doesn't have to
> change to have the same effect (alignment, tails).

Right.  My wording was poor.  The current proposal doesn't directly
support a more dynamic vscale target but I believe it could be simply
extended to do so.

>> I think it would be unlikely for anyone to need to change the vector
>> length during evaluation of an in-register expression.
>
> The worry here is not within each instruction but across instructions.
> SVE (and I think RISC-V) allow register size to be dynamically set.

I wasn't talking about within an instruction but rather across
instructions in the same expression tree.  Something like this would be
weird:

A = load with VL
B = load with VL
C = A + B           # VL implicit
VL = <something>
D = ~C              # VL implicit
store D

Here and beyond, read "VL" as "vscale with minimum element count 1."

The points where VL would be changed are limited and I think would
require limited, straightforward additions on top of this proposal.

> For example, on the same machine, it may be 256 for one process and
> 512 for another (for example, to save power).

Sure.

> But the change is via a system register, so in theory, anyone can
> write an inline asm in the beginning of a function and change the
> vector length to whatever they want.
>
> Worse still, people can do that inside loops, or in a tail loop,
> thinking it's a good idea (or this is a Cray machine :).
>
> AFAIK, the interface for changing the register length will not be
> exposed programmatically, so in theory, we should not worry about it.
> Any inline asm hack can be considered out of scope / user error.

That's right.  This proposal doesn't expose a way to change vscale, but
I don't think it precludes a later addition to do so.

> However, Hal's concern seems to be that, in the event of anyone
> planning to add it to their APIs, we need to make sure the proposed
> semantics can cope with it (do we need to update the predicates again?
> what will vscale mean, then and when?).

I don't see why predicate values would be affected at all.  If a machine
with variable vector length has predicates, then typically the resulting
operation would operate on the bitwise AND of the predicate and a
conceptual all 1's predicate of length VL.
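As a rough model of that composition (the function name and the 64-bit mask width are assumptions for illustration, not any real target's API), the effective predicate on a machine with both masking and an active vector length might look like:

```c
#include <stdint.h>

/* Conceptual model: a lane executes only if its predicate bit is set AND
 * its index is below the active vector length VL. This is the bitwise AND
 * of the predicate with an all-1s predicate of length VL. */
static uint64_t effective_mask(uint64_t pred, unsigned vl) {
    uint64_t vl_mask = (vl >= 64) ? ~0ull : ((1ull << vl) - 1);
    return pred & vl_mask;
}
```

One consequence, noted again later in the thread, is that shortening VL leaves the predicate values themselves untouched; nothing needs to be regenerated.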

As I understand it, vscale is the runtime multiple of some minimal,
guaranteed vector length.  For SVE that minimum is whatever gives a bit
width of 128.  My guess is that for a machine with a more dynamic vector
length, the minimum would be 1.  vscale would then be the vector length
and would change accordingly if the vector length is changed.

Changing vscale would be no different than changing any other value in
the program.  The dataflow determines its possible values at various
program points.  vscale is an extra (implicit) operand to all vector
operations with scalable type.

> If not, we may have to enforce that this will not come to pass in its
> current form.

Why?  If a user does asm or some other such trick to change what vscale
means, that's on the user.  If a machine has a VL that changes
iteration-to-iteration, typically the compiler would be responsible for
controlling it.

If the vendor provides some target intrinsics to let the user write
low-level vector code that changes vscale in a high-level language, then
the vendor would be responsible for adding the necessary bits to the
frontend and LLVM.  I would not recommend a vendor try to do this.  :)
It wouldn't necessarily be hard to do, but it would be wasted work IMO
because it would be better to improve the vectorizer that already
exists.

> In this case, changing it later will require *a lot* more effort than
> doing it now.

I don't see why.  Anyone adding ability to change vscale would need to
add intrinsics and specify their semantics.  That shouldn't change
anything about this proposal and any such additions shouldn't be
hampered by this proposal.

Another way to think of vscale/vector length is as a different kind of
predicate.  Right now LLVM uses select to track predicate application.
It uses a "top-down" approach in that the root of an expression tree (a
select) applies the predicate and presumably everything under it
operates under that predicate.  It also uses intrinsics for certain
operations (loads, stores, etc.) that absolutely must be predicated no
matter what for safety reasons.  So it's sort of a hybrid approach, with
predicate application at the root, certain leaves and maybe even on
interior nodes (FP operations come to mind).

To my knowledge, there's nothing in LLVM that checks to make sure these
predicate applications are all consistent with one another.  Someone
could do a load with predicate 0011 and then a "select div" with
predicate 1111, likely resulting in a runtime fault but nothing in LLVM
would assert on the predicate mismatch.

Predicates could also be applied only at the leaves and propagated up
the tree.  IIRC, Dan Gohman proposed something like this years back when
the topic of predication came up.  He called it "applymask" but
unfortunately the Google is failing to find it.  

I *could* imagine using select to also convey application of vector
length but that seems odd and unnecessarily complex.

If vector length were applied at the leaves, it would take a bit of work
to get it through instruction selection.  Target opcodes would be one
way to do it.  I think it would be straightforward to walk the DAG and
change generic opcodes to target opcodes when necessary.

I don't think we should worry about taking IR with dynamic changes to VL
and trying to generate good code for any random target from it.  Such IR
is very clearly tied to a specific kind of target and we shouldn't
bother pretending otherwise.  The vectorizer should be aware of the
target's capabilities and generate code accordingly.

                        -David

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
On Tue, 31 Jul 2018 at 03:53, David A. Greene <[hidden email]> wrote:
> I wasn't talking about within an instruction but rather across
> instructions in the same expression tree.  Something like this would be
> weird:

Yes, that's what I was referring to as "not in the API" and therefore "user error".


> The points where VL would be changed are limited and I think would
> require limited, straightforward additions on top of this proposal.

Indeed. I have a limited view on the spec and even more so on hardware
implementations, but it is my understanding that there is no attempt
to change VL mid-loop.

If we can assume VL will be "the same" (not constant) throughout every
self-contained sub-graph (from scalar|memory->vector to
vector->scalar|memory), then we should encode in the IR spec that this
is a hard requirement.

This seems consistent with your explanation of the Cray VL change as
well as Bruce's description of RISC-V (both seem very similar to me),
where VL can change between two loop iterations but not within the
same iteration.

We will still have to be careful with access safety (alias, loop
dependencies, etc), but that shouldn't be different than if VL was
required to be constant throughout the program.


> That's right.  This proposal doesn't expose a way to change vscale, but
> I don't think it precludes a later addition to do so.

That was my point about this change being harder to do later than now.

I think no one wants to do that now, so we're all happy to pay the
price later, because that will likely never come.


> I don't see why predicate values would be affected at all.  If a machine
> with variable vector length has predicates, then typically the resulting
> operation would operate on the bitwise AND of the predicate and a
> conceptual all 1's predicate of length VL.

I think the problem is that SVE is fully predicated and Cray (RISC-V?)
is not, so mixing the two could lead into weird predication
situations.

So, if a high-level optimisation pass assumes full predication and
changes the loop accordingly, and another pass assumes no predication
and adds VL changes (say, loop tails), then we may end up with
incompatible IR that will be hard to select down in ISel.

Given that SVE has both predication and vscale change, this could
happen in practice. It wouldn't be necessarily wrong, but it would
have to be a conscious decision.


> Changing vscale would be no different than changing any other value in
> the program.  The dataflow determines its possible values at various
> program points.  vscale is an extra (implicit) operand to all vector
> operations with scalable type.

It is, but IIGIR, changing vscale and predicating are similar
transformations to achieve similar goals, but will not be
represented the same way in IR.

Also, they're not always interchangeable, so that complicates the IR
matching in ISel as well as potential matching in optimisation passes.


> Why?  If a user does asm or some other such trick to change what vscale
> means, that's on the user.  If a machine has a VL that changes
> iteration-to-iteration, typically the compiler would be responsible for
> controlling it.

Not asm, sorry. Inline asm is "user error".

I meant: make sure adding an IR visible change in VL (say, an
intrinsic or instruction), within a self-contained block, becomes an
IR error.


> If the vendor provides some target intrinsics to let the user write
> low-level vector code that changes vscale in a high-level language, then
> the vendor would be responsible for adding the necessary bits to the
> frontend and LLVM.  I would not recommend a vendor try to do this.  :)

Not recommending by making it an explicit error. :)

It may sound harsh, but given we're taking some pretty liberal design
choices right now, which could have long lasting impact on the
stability and quality of LLVM's code generation, I'd say we need to be
as conservative as possible.


> I don't see why.  Anyone adding ability to change vscale would need to
> add intrinsics and specify their semantics.  That shouldn't change
> anything about this proposal and any such additions shouldn't be
> hampered by this proposal.

I don't think it would be hard to do, but it could have consequences
to the rest of the optimisation and code generation pipeline.

I do not claim to have a clear vision on any of this, but as I said
above, it will pay off long term if we start conservative.


> I don't think we should worry about taking IR with dynamic changes to VL
> and trying to generate good code for any random target from it.  Such IR
> is very clearly tied to a specific kind of target and we shouldn't
> bother pretending otherwise.

We're preaching for the same goals. :)

But we're trying to represent slightly different techniques
(predication, vscale change) which need to be tied down to only
exactly what they do.

Being conservative and explicit on the semantics is, IMHO, the easiest
path to get it right. We can surely expand later.


--
cheers,
--renato

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
On Tue, Jul 31, 2018 at 9:13 PM, Renato Golin via llvm-dev <[hidden email]> wrote:
Indeed. I have a limited view on the spec and even more so on hardware
implementations, but it is my understanding that there is no attempt
to change VL mid-loop.

If we can assume VL will be "the same" (not constant) throughout every
self-contained sub-graph (from scalar|memory->vector to
vector->scalar|memory), then we should encode in the IR spec that this
is a hard requirement.

I don't see any harm in (very occasionally) making the VL shorter somewhere within an iteration of a loop. Some work that was already done will be wasted, but that's not a correctness problem. Making the VL longer mid-iteration would of course be very bad.

The important thing is that the various source and destination pointers are updated by the correct amount at the end of the loop.
 
This seems consistent with your explanation of the Cray VL change as
well as Bruce's description of RISC-V (both seem very similar to me),
where VL can change between two loop iterations but not within the
same iteration.

I'm not sure whether it will end up being possible or not, but I did describe two situations where at least some RISC-V implementations might want to change VL within an iteration:

1) a memory protection problem on some trailing part of a vector load or store, causing that iteration to operate only on the accessible part, and the next iteration to start from the first address in the non-accessible part (and actually take a fault)

2) an interrupt/task switch in the middle of a loop iteration. Some implementations may want to save/restore only the vector configuration, not the values of the vector registers.

> I don't see why predicate values would be affected at all.  If a machine
> with variable vector length has predicates, then typically the resulting
> operation would operate on the bitwise AND of the predicate and a
> conceptual all 1's predicate of length VL.

I think the problem is that SVE is fully predicated and Cray (RISC-V?)
is not, so mixing the two could lead into weird predication
situations.

The current RISC-V proposal has a 2-bit field in each vector instruction, with the values indicating:

- it's actually scalar
- vector operation with no predication
- vector operation, masked by the predicate register
- vector operation, masked by the inverse of the predicate register



Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
On Tue, 31 Jul 2018 at 13:48, Bruce Hoult <[hidden email]> wrote:
> I don't see any harm in (very occasionally) making the VL shorter somewhere within an iteration of a loop. Some work that was already done will be wasted, but that's not a correctness problem. Making the VL longer mid-iteration would of course be very bad.
> The important thing is that the various source and destination pointers are updated by the correct amount at the end of the loop.

If this is orthogonal to the IR representation, i.e. it doesn't need
current instructions to *know* about it, but the sequence of IR
instructions will represent it, then it should be fine.


> I'm not sure whether it will end up being possible or not, but I did describe two situations where at least some RISC-V implementations might want to change VL within an iteration:

Apologies, I may have misinterpreted them.


> 1) a memory protection problem on some trailing part of a vector load or store, causing that iteration to operate only on the accessible part, and the next iteration to start from the first address in the non-accessible part (and actually take a fault)

SVE deals with those problems with predication and FFR
(first-fault-register), not by changing the VL, but I imagine they're
semantically similar.


> 2) an interrupt/task switch in the middle of a loop iteration. Some implementations may want to save/restore only the vector configuration, not the values of the vector registers.

I assume the architecture will have to continue the program in the
same state it was in when the interrupt occurred. How it does so
shouldn't concern code generation.


--
cheers,
--renato

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
In reply to this post by Nicholas Krause via llvm-dev
Renato Golin <[hidden email]> writes:

>> The points where VL would be changed are limited and I think would
>> require limited, straightforward additions on top of this proposal.
>
> Indeed. I have a limited view on the spec and even more so on hardware
> implementations, but it is my understanding that there is no attempt
> to change VL mid-loop.

What does "mid-loop" mean?  On traditional vector architectures it was
very common to change VL for the last loop iteration.  Otherwise you had
to have a remainder loop.  It was much better to change VL.
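The trade-off described here can be sketched in C (`VLEN` and both function names are illustrative, not from any real target): with a fixed vector width the compiler must emit a scalar remainder loop, whereas changing VL for the last iteration folds the tail into the main loop.

```c
#include <stddef.h>

#define VLEN 8  /* assumed fixed hardware vector width, for illustration */

/* Fixed-width vectorization: full vectors in the main loop, then a scalar
 * remainder loop for the final n % VLEN elements. */
void scale_fixed(float *x, float s, size_t n) {
    size_t i = 0;
    for (; i + VLEN <= n; i += VLEN)       /* full-width vector iterations */
        for (size_t j = 0; j < VLEN; j++)  /* stands in for one vector op */
            x[i + j] *= s;
    for (; i < n; i++)                     /* remainder loop */
        x[i] *= s;
}

/* Traditional vector-machine style: the final iteration just runs with a
 * shorter VL, so no remainder loop exists. */
void scale_vl(float *x, float s, size_t n) {
    for (size_t i = 0; i < n; ) {
        size_t vl = (n - i < VLEN) ? n - i : VLEN;  /* "set VL" per trip */
        for (size_t j = 0; j < vl; j++)
            x[i + j] *= s;
        i += vl;
    }
}
```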

> If we can assume VL will be "the same" (not constant) throughout every
> self-contained sub-graph (from scalar|memory->vector to
> vector->scalar|memory), there we should encode it in the IR spec that
> this is a hard requirement.
>
> This seems consistent with your explanation of the Cray VL change as
> well as Bruce's description of RISC-V (both seem very similar to me),
> where VL can change between two loop iterations but not within the
> same iteration.

Ok, I think I am starting to grasp what you are saying.  If a value
flows from memory or some scalar computation to vector and then back to
memory or scalar, VL should only ever be set at the start of the vector
computation until it finishes and the value is deposited in memory or
otherwise extracted.  I think this is ok, but note that any vector
functions called may change VL for the duration of the call.  The change
would not be visible to the caller.

Just thinking this through, a case where one might want to change VL
mid-stream is something like a half-length set of operations that feeds
a vector concat and then a full length set of operations following.  But
again I think this would be a strange way to do things.  If someone
really wants to do this they can predicate away the upper bits of the
half-length operations and maintain the same VL throughout the
computation.  If predication isn't available then they've got more
serious problems vectorizing code.  :)

> We will still have to be careful with access safety (alias, loop
> dependencies, etc), but that shouldn't be different than if VL was
> required to be constant throughout the program.

Yep.

>> That's right.  This proposal doesn't expose a way to change vscale, but
>> I don't think it precludes a later addition to do so.
>
> That was my point about this change being harder to do later than now.

I guess I don't see why it would be any harder later.

> I think no one wants to do that now, so we're all happy to pay the
> price later, because that will likely never come.

I am not so sure about that.  Power requirements may very well drive
more dynamic vector lengths.  Even today some AVX 512 implementations
falter if there are "too many" 512-bit operations.  Scaling back SIMD
width statically is very common today and doing so dynamically seems
like an obvious extension.  I don't know of any efforts to do this so
it's all speculative at this point.  But the industry has done it in the
past and we have a curious pattern of reinventing things we did before.

>> I don't see why predicate values would be affected at all.  If a machine
>> with variable vector length has predicates, then typically the resulting
>> operation would operate on the bitwise AND of the predicate and a
>> conceptual all 1's predicate of length VL.
>
> I think the problem is that SVE is fully predicated and Cray (RISC-V?)
> is not, so mixing the two could lead into weird predication
> situations.

Cray vector ISAs were fully predicated and also used a vector length.
It didn't cause us any serious issues.  In many ways having an
adjustable VL and predication makes things easier because you don't have
to regenerate predicates to switch to a shorter VL.

> So, if a high level optimisation pass assumes full predication and
> change the loop accordingly, and another pass assumes no predication
> and adds VL changes (say, loop tails), then we may end up with
> incompatible IR that will be hard to select down in ISel.
>
> Given that SVE has both predication and vscale change, this could
> happen in practice. It wouldn't be necessarily wrong, but it would
> have to be a conscious decision.

It seems strange to me for an optimizer to operate in such a way.  The
optimizer should be fully aware of the target's capabilities and use
them accordingly.  But let's say this happens.  Pass 1 vectorizes the
loop with predication (for a conditional loop body) and creates a
remainder loop, which would also need to be predicated.  Note that such
a remainder loop is not necessary with full predication support but for
the sake of argument let's say pass 1 is not too smart.

Pass 2 comes along and says, "hey, I have the ability to change VL so we
don't need a remainder loop."  It rewrites the main loop to use dynamic
VL and removes the remainder loop.  During that rewrite, pass 2 would
have to maintain predication.  It can use the very same predicate values
pass 1 generated.  There is no need to adjust them because the VL is
applied "on top of" the predicates.

Pass 2 effectively rewrites the code to what the vectorizer should have
emitted in the first place.  I'm not seeing how ISel is any more
difficult.  SVE has an implicit vscale operand on every instruction and
ARM seems to have no difficulty selecting instructions for it.  Changing
the value of vscale shouldn't impact ISel at all.  The same instructions
are selected.

>> Changing vscale would be no different than changing any other value in
>> the program.  The dataflow determines its possible values at various
>> program points.  vscale is an extra (implicit) operand to all vector
>> operations with scalable type.
>
> It is, but IIGIR, changing vscale and predicating are similar
> transformations to achieve the similar goals, but will not be
> represented the same way in IR.

They probably will not be represented the same way, though I think they
could be (but probably shouldn't be).

> Also, they're not always interchangeable, so that complicates the IR
> matching in ISel as well as potential matching in optimisation passes.

I'm not sure it does but I haven't worked something all the way through.

>> Why?  If a user does asm or some other such trick to change what vscale
>> means, that's on the user.  If a machine has a VL that changes
>> iteration-to-iteration, typically the compiler would be responsible for
>> controlling it.
>
> Not asm, sorry. Inline as is "user error".

Ok.

> I meant: make sure adding an IR visible change in VL (say, an
> intrinsic or instruction), within a self-contained block, becomes an
> IR error.

What do you mean by "self-contained block?"  Assuming I understood it
correctly, the restriction you described at the top seems reasonable for
now.

>> If the vendor provides some target intrinsics to let the user write
>> low-level vector code that changes vscale in a high-level language, then
>> the vendor would be responsible for adding the necessary bits to the
>> frontend and LLVM.  I would not recommend a vendor try to do this.  :)
>
> Not recommending by making it an explicit error. :)
>
> It may sound harsh, but given we're taking some pretty liberal design
> choices right now, which could have long lasting impact on the
> stability and quality of LLVM's code generation, I'd say we need to be
> as conservative as possible.

Ok, but would the optimizer be prevented from introducing VL changes?

>> I don't see why.  Anyone adding ability to change vscale would need to
>> add intrinsics and specify their semantics.  That shouldn't change
>> anything about this proposal and any such additions shouldn't be
>> hampered by this proposal.
>
> I don't think it would be hard to do, but it could have consequences
> to the rest of the optimisation and code generation pipeline.

It could.  I don't think any of us has a clear idea of what those might
be.

> I do not claim to have a clear vision on any of this, but as I said
> above, it will pay off long term if we start conservative.

Being conservative is fine, but we should have a clear understanding of
exactly what that means.  I would not want to prohibit all VL changes
now and forever, because I see that as unnecessarily restrictive and
possibly damaging to supporting future architectures.

If we don't want to provide intrinsics for changing VL right now, I'm
all in favor.  There would be no reason to add error checks because
there would be no way within the IR to change VL.

But I don't want to preclude adding such intrinsics in the future.

>> I don't think we should worry about taking IR with dynamic changes to VL
>> and trying to generate good code for any random target from it.  Such IR
>> is very clearly tied to a specific kind of target and we shouldn't
>> bother pretending otherwise.
>
> We're preaching for the same goals. :)

Good!  :)

> But we're trying to represent slightly different techniques
> (predication, vscale change) which need to be tied down to only
> exactly what they do.

Wouldn't intrinsics to change vscale do exactly that?

> Being conservative and explicit on the semantics is, IMHO, the easiest
> path to get it right. We can surely expand later.

I'm all for being explicit.  I think we're basically on the same page,
though there are a few things noted above where I need a little more
clarity.

                               -David

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
In reply to this post by Nicholas Krause via llvm-dev
Hi all,

I'm starting to feel like a broken record, but too much
of the discussion has been unclear on this point and I think it causes a
fair amount of confusion, so I feel obligated to state it again as
clearly as I can: There are TWO independent notions of vector length
in this space! Namely:

1. How large are the machine's vector registers?
2. How many elements of a vector register are processed by an instruction?

This RFC addresses only the former, with the vscale concept. We have
been and still are discussing the latter in this email thread too,
sometimes under names such as "VL" or "active vector length", but
unfortunately also often as just plain "vector length". I think this
is very unfortunate: having two intermingled discussions about
different things which share a name is very confusing, especially
since I believe there is no need to discuss them together.

The active vector length can't be larger than the number of elements
in a vector register, but apart from that they are entirely separate
and whether an architecture has fixed- or variable-size registers is
completely orthogonal to whether it has a VL register. All
combinations make sense and exist in real architectures:

- SSE, NEON, etc. have fixed-size vector registers (e.g. 128 bit)
without any active vector length mechanism
- Classical Cray-style vector processors have fixed-size vector
registers (e.g., Cray-1 had 64x64bit) and an active vector length
mechanism
- SVE has variable-size vector registers and no active vector length
mechanism (loops are instead controlled by predication)
- The vector extension for RISC-V has variable-size vector registers
and an active vector length mechanism
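
For readers who prefer code, the distinction can be sketched in plain
C. This is a scalar model only; `W` and `vl` are illustrative names I
am introducing here, not anything from a real intrinsic API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model of a strip-mined vector loop.  W stands for
   the number of elements a vector register holds (fixed for SSE/NEON, a
   runtime multiple -- vscale -- for SVE/RISC-V).  vl stands for the
   active vector length: how many of those W lanes one operation actually
   processes this iteration.  The two vary independently, which is the
   point of the four-way classification above. */
static void saxpy_stripmined(size_t n, size_t W, float a,
                             const float *x, float *y) {
    size_t i = 0;
    while (i < n) {
        size_t vl = (n - i < W) ? (n - i) : W;   /* active length <= W */
        for (size_t lane = 0; lane < vl; ++lane) /* one "vector" operation */
            y[i + lane] += a * x[i + lane];
        i += vl;
    }
}
```

An architecture with no active-length mechanism pins vl == W and handles
the tail with predication or a scalar epilogue; an architecture with
fixed-size registers pins W at compile time.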

More importantly, the two mechanisms are *used* very differently and
place very different demands on a compiler. Therefore, any discussion
that conflates these two concerns is doomed from the start IMHO. I
have written a bit about these differences, but since I know many
people here only have so much time, I moved this to an "appendix"
after the end of this email and will now go straight to addressing
Hal's second concern with this distinction in mind.

On 30 July 2018 at 21:10, Hal Finkel <[hidden email]> wrote:

>
> On 07/30/2018 05:34 AM, Chandler Carruth wrote:
>
> I strongly suspect that there remains widespread concern with the direction
> of this, I know I have them.
>
> I don't think that many of the people who have that concern have had time to
> come back to this RFC and make progress on it, likely because of other
> commitments or simply the amount of churn around SVE related patches and
> such. That is at least why I haven't had time to return to this RFC and try
> to write more detailed feedback.
>
> Certainly, I would want to see pretty clear and considered support for this
> change to the IR type system from Hal, Chris, Eric and/or other long time
> maintainers of core LLVM IR components before it moves forward, and I don't
> see that in this thread.
>
>
> At a high level, I'm happy with this approach. I think it will be important
> for LLVM to support runtime-determined vector lengths - I see the
> customizability and power-efficiency constraints that motivate these designs
> continuing to increase in importance. I'm still undecided on whether this
> makes vector code nicer even for fixed-vector-length architectures, but some
> of the design decisions that it forces, such as having explicit intrinsics
> for reductions and other horizontal operations, seem like the right
> direction regardless. I have two questions:
>
> 1.
>
> This is a proposal for how to deal with querying the size of scalable types
> for
>> analysis of IR. While it has not been implemented in full,
>
>
> Is this still true? The details here need to all work out, obviously, and we
> should make sure that any issues are identified.
>
> 2. I know that there has been some discussion around support for changing
> the vector length during program execution (e.g., to account for some
> (proposed?) RISC-V feature), perhaps even during the execution of a single
> function. I'm very concerned about this idea because it is not at all clear
> to me how to limit information transfer contaminated with the vector size
> from propagating between different regions. As a result, I'm concerned about
> trying to add this on later, and so if this is part of the plan, I think
> that we need to think through the details up front because it could have a
> major impact on the design.

Yes, changing vscale during program execution is necessary to some
degree for the RISC-V vector extension. Yes, doing this at arbitrary
program points is indeed extremely challenging for a compiler to
support, for the reason you describe. This is why I proposed a
tradeoff, which Graham incorporated into this RFC: vscale can only
change at function boundaries and is fixed between function entry and
exit. This restriction is OK (not ideal, but good enough IMO) for
RISC-V and it makes the problem much more manageable because most code
in LLVM operates only within one function at a time, so it never has
to encounter vscale changes. I also think this is the most we'll ever
be able to support -- the problem you describe isn't going away, and I
don't know of any major use cases that would require us to tackle this
difficult problem in its entirety. However, I might be unaware of
something people want to do with SVE that doesn't fit into this mould.

Despite also being relevant for RISC-V and being discussed extensively
in this thread, the active vector length is basically just a very
minor twist on predication, and therefore doesn't interact at all with
the type system changes proposed here. Like predication, it can just
be modelled by regular data flow between IR operations (as David
already said). As with predication, a smaller *active* vector length
(~= a mask with few elements enabled) doesn't mean vectors suddenly
have fewer elements, just that more of them are masked out while doing
calculations. While there's an interesting design space for how to
best represent this predication in the IR, it has entirely different
challenges and constraints than vscale changes. If anything, the
"active vector length" discussion has more in common with past
discussions about making predication for *fixed-length* vectors more
of a first-class citizen in LLVM IR.
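
A minimal sketch of that "just data flow" view, in illustrative C
rather than LLVM IR: the mask is an ordinary value that one operation
produces and another consumes as an extra operand, and an active vector
length vl can always be encoded as the predicate `lane < vl`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The predicate is plain data flowing between operations.  Lanes with
   mask[lane] == false are left untouched rather than computed; the
   vector still has W elements throughout. */
static void masked_add(size_t W, const bool *mask,
                       const int *a, const int *b, int *out) {
    for (size_t lane = 0; lane < W; ++lane)
        if (mask[lane])
            out[lane] = a[lane] + b[lane];
}
```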

So I think this RFC as-is solves the problem of changing vector
register sizes about as well as it can and needs to be solved, and in
a way that is entirely satisfactory for RISC-V (again, I can't speak
for SVE, I don't know the use cases there). While more work is needed
to deal with another aspect of the RISC-V vector architecture (the VL
register), that can and should be a separate discussion, the results
of which won't invalidate anything decided in this RFC.


Cheers,
Robin


## Appendix

The active vector length or VL register is a tool for loop control,
ensuring the vectorized loop does not run too far while still
maximizing use of the vector unit. As such, it is recomputed
frequently (at minimum once per loop iteration, possibly even multiple
times within an iteration, as Bruce explained) and can be seen as a
particular kind of
predication. It applies to a particular operation, prevents it from
having unwanted side effects, and operates on a subset of a larger
vector. As SVE illustrates, one can use plain old masks in precisely
the same way to solve the same problem, constructing and maintaining
masks that enable "the first n elements" where n would be the active
vector length in a different architecture. Creating a special VL
register for this purpose is just an architectural accommodation for
this style of predication. While it may have significant impact on the
microarchitecture and suggest a different mental model to programmers,
it's basically just predication from a compiler's perspective.
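
That mask-based loop control can be sketched as follows (a C model;
`while_lt` only mimics the role of SVE's WHILELT instruction and is not
a real intrinsic, and `W` stands in for the runtime register capacity):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { W = 4 };  /* stand-in for the (possibly runtime) register capacity */

/* Build the "first n elements" predicate for this iteration: a lane is
   enabled while i + lane < n -- the job a VL register would otherwise do. */
static void while_lt(size_t i, size_t n, bool mask[W]) {
    for (size_t lane = 0; lane < W; ++lane)
        mask[lane] = i + lane < n;
}

/* Loop control purely via predication: the mask is recomputed every
   iteration and enables exactly the elements an active-length register
   would have enabled. */
static void double_all(size_t n, int *a) {
    bool mask[W];
    for (size_t i = 0; i < n; i += W) {
        while_lt(i, n, mask);
        for (size_t lane = 0; lane < W; ++lane)
            if (mask[lane])
                a[i + lane] *= 2;
    }
}
```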

The vector register size, on the other hand, is not something you
change just like that. Changing it is, at best, like deciding to
switch from AVX exclusively (i.e., no xmm registers) to SSE
exclusively. It changes fundamental properties of your register file
and vector unit. While you can easily compile one part of your
application one way and another part differently if they don't
interact directly, once you try to do this e.g. in the middle of a
vectorized code region, it gets difficult even conceptually, to say
nothing of the compiler implementation. Furthermore, in the AVX->SSE
case you know how the vector length changes, and that might also apply
to SVE -- e.g. you could halve vscale and split all your existing
N-element vectors into two (N/2)-element vectors each -- but on
RISC-V, you probably won't be able to control the vector register size
directly, so you can't even do that much.

These differences also impact how to approach the two concepts in a
compiler. The active vector length -- very much like a mask -- is just
a piece of data that is computed and then used in various vector
operations as an extra operand, as David suggested in a recent email.
For this reason, I agree with his assessment that the active vector
length is "just data flow" and doesn't interact with the type system
changes discussed in this RFC.

vscale, on the other hand, is not easily handled as "just a piece of
data". The size of vector registers impacts many things besides
individual operations that are explicit in IR, and as such many parts
of the compiler have to be acutely aware of what it is and where it
might change. To give just one example, if you increase the size of
vector registers in the middle of a function, you need to reserve more
stack space for spilling -- if you just reserve stack space in the
prologue using the *initial* register size, you won't have enough
space to spill the larger vector values later on. There are myriad more
problems like this if you sit down and sift through IR transformations
and the CodeGen infrastructure (as I have been doing for RISC-V over
the last year). A change of vscale is best considered to be a massive
barrier to all code that is even remotely vector-related. In Hal's
terms, you really want to prevent anything contaminated with the
vector register size from crossing over the point where you change the
vector register size.
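
To make the spill example concrete, here is a sketch with invented
numbers (I assume one register occupies vscale * 16 bytes, mirroring
SVE's vscale x 128-bit registers; the helper names are mine):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: one scalable vector register occupies
   vscale * 16 bytes (vscale x 128 bits, as in SVE). */
static size_t vec_reg_bytes(size_t vscale) { return vscale * 16; }

/* The prologue reserves a spill slot sized with the entry-time vscale.
   If vscale could grow mid-function, the slot would no longer hold a
   register -- one of many reasons a mid-function change would act as a
   barrier for the whole backend. */
static int spill_slot_still_fits(size_t vscale_at_entry, size_t vscale_now) {
    size_t slot_bytes = vec_reg_bytes(vscale_at_entry); /* fixed at prologue */
    return vec_reg_bytes(vscale_now) <= slot_bytes;
}
```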

Like Hal, I am very skeptical about whether, and how, such a barrier could be
added to IR. And I've spent a lot of time trying to come up with a
solution as part of my RISC-V work. That is why my RFC back in April
proposed a trade-off, which has been incorporated by Graham into this
RFC: vscale can change between functions, but does not change within a
function. As an analogy, consider how LLVM supports different
subtargets (each with different registers, instructions and legal
types) on a per-function basis but doesn't allow e.g. making a
register class completely unavailable at a certain point in a
function.

> Thanks again,
> Hal
>
>
>
> Put differently: I don't think silence is assent here. You really need some
> clear signal of consensus.
>
> On Mon, Jul 30, 2018 at 2:23 AM Graham Hunter <[hidden email]> wrote:
>>
>> Hi,
>>
>> Are there any objections to going ahead with this? If not, we'll try to
>> get the patches reviewed and committed after the 7.0 branch occurs.
>>
>> -Graham
>>
>> > On 2 Jul 2018, at 10:53, Graham Hunter <[hidden email]> wrote:
>> >
>> > Hi,
>> >
>> > I've updated the RFC slightly based on the discussion within the thread,
>> > reposted below. Let me know if I've missed anything or if more clarification
>> > is needed.
>> >
>> > Thanks,
>> >
>> > -Graham
>> >
>> > =============================================================
>> > Supporting SIMD instruction sets with variable vector lengths
>> > =============================================================
>> >
>> > In this RFC we propose extending LLVM IR to support code-generation for
>> > variable
>> > length vector architectures like Arm's SVE or RISC-V's 'V' extension.
>> > Our
>> > approach is backwards compatible and should be as non-intrusive as
>> > possible; the
>> > only change needed in other backends is how size is queried on vector
>> > types, and
>> > it only requires a change in which function is called. We have created a
>> > set of
>> > proof-of-concept patches to represent a simple vectorized loop in IR and
>> > generate SVE instructions from that IR. These patches (listed in section
>> > 7 of
>> > this RFC) can be found on Phabricator and are intended to illustrate the
>> > scope
>> > of changes required by the general approach described in this RFC.
>> >
>> > ==========
>> > Background
>> > ==========
>> >
>> > *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension
>> > for
>> > AArch64 which is intended to scale with hardware such that the same
>> > binary
>> > running on a processor with longer vector registers can take advantage
>> > of the
>> > increased compute power without recompilation.
>> >
>> > As the vector length is no longer a compile-time known value, the way in
>> > which
>> > the LLVM vectorizer generates code requires modifications such that
>> > certain
>> > values are now runtime evaluated expressions instead of compile-time
>> > constants.
>> >
>> > Documentation for SVE can be found at
>> >
>> > https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>> >
>> > ========
>> > Contents
>> > ========
>> >
>> > The rest of this RFC covers the following topics:
>> >
>> > 1. Types -- a proposal to extend VectorType to be able to represent
>> > vectors that
>> >   have a length which is a runtime-determined multiple of a known base
>> > length.
>> >
>> > 2. Size Queries - how to reason about the size of types for which the
>> > size isn't
>> >   fully known at compile time.
>> >
>> > 3. Representing the runtime multiple of vector length in IR for use in
>> > address
>> >   calculations and induction variable comparisons.
>> >
>> > 4. Generating 'constant' values in IR for vectors with a
>> > runtime-determined
>> >   number of elements.
>> >
>> > 5. An explanation of splitting/concatenating scalable vectors.
>> >
>> > 6. A brief note on code generation of these new operations for AArch64.
>> >
>> > 7. An example of C code and matching IR using the proposed extensions.
>> >
>> > 8. A list of patches demonstrating the changes required to emit SVE
>> > instructions
>> >   for a loop that has already been vectorized using the extensions
>> > described
>> >   in this RFC.
>> >
>> > ========
>> > 1. Types
>> > ========
>> >
>> > To represent a vector of unknown length a boolean `Scalable` property
>> > has been
>> > added to the `VectorType` class, which indicates that the number of
>> > elements in
>> > the vector is a runtime-determined integer multiple of the `NumElements`
>> > field.
>> > Most code that deals with vectors doesn't need to know the exact length,
>> > but
>> > does need to know relative lengths -- e.g. get a vector with the same
>> > number of
>> > elements but a different element type, or with half or double the number
>> > of
>> > elements.
>> >
>> > In order to allow code to transparently support scalable vectors, we
>> > introduce
>> > an `ElementCount` class with two members:
>> >
>> > - `unsigned Min`: the minimum number of elements.
>> > - `bool Scalable`: is the element count an unknown multiple of `Min`?
>> >
>> > For non-scalable vectors (``Scalable=false``) the scale is considered to
>> > be
>> > equal to one and thus `Min` represents the exact number of elements in
>> > the
>> > vector.
>> >
>> > The intent for code working with vectors is to use convenience methods
>> > and avoid
>> > directly dealing with the number of elements. If needed, calling
>> > `getElementCount` on a vector type instead of `getVectorNumElements` can
>> > be used
>> > to obtain the (potentially scalable) number of elements. Overloaded
>> > division and
>> > multiplication operators allow an ElementCount instance to be used in
>> > much the
>> > same manner as an integer for most cases.
>> >
>> > This mixture of compile-time and runtime quantities allows us to reason
>> > about the
>> > relationship between different scalable vector types without knowing
>> > their
>> > exact length.
>> >
>> > The runtime multiple is not expected to change during program execution
>> > for SVE,
>> > but it is possible. The model of scalable vectors presented in this RFC
>> > assumes
>> > that the multiple will be constant within a function but not necessarily
>> > across
>> > functions. As suggested in the recent RISC-V RFC, a new function
>> > attribute to
>> > inherit the multiple across function calls will allow for function calls
>> > with
>> > vector arguments/return values and inlining/outlining optimizations.
>> >
>> > IR Textual Form
>> > ---------------
>> >
>> > The textual form for a scalable vector is:
>> >
>> > ``<scalable <n> x <type>>``
>> >
>> > where `type` is the scalar type of each element, `n` is the minimum
>> > number of
>> > elements, and the string literal `scalable` indicates that the total
>> > number of
>> > elements is an unknown multiple of `n`; `scalable` is just an arbitrary
>> > choice
>> > for indicating that the vector is scalable, and could be substituted by
>> > another.
>> > For fixed-length vectors, the `scalable` is omitted, so there is no
>> > change in
>> > the format for existing vectors.
>> >
>> > Scalable vectors with the same `Min` value have the same number of
>> > elements, and
>> > the same number of bytes if `Min * sizeof(type)` is the same (assuming
>> > they are
>> > used within the same function):
>> >
>> > ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of
>> >  elements.
>> >
>> > ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number
>> > of
>> >  bytes.
>> >
>> > IR Bitcode Form
>> > ---------------
>> >
>> > To serialize scalable vectors to bitcode, a new boolean field is added
>> > to the
>> > type record. If the field is not present the type will default to a
>> > fixed-length
>> > vector type, preserving backwards compatibility.
>> >
>> > Alternatives Considered
>> > -----------------------
>> >
>> > We did consider one main alternative -- a dedicated target type, like
>> > the
>> > x86_mmx type.
>> >
>> > A dedicated target type would either need to extend all existing passes
>> > that
>> > work with vectors to recognize the new type, or to duplicate all that
>> > code
>> > in order to get reasonable code generation and autovectorization.
>> >
>> > This hasn't been done for the x86_mmx type, and so it is only capable of
>> > providing support for C-level intrinsics instead of being used and
>> > recognized by
>> > passes inside llvm.
>> >
>> > Although our current solution will need to change some of the code that
>> > creates
>> > new VectorTypes, much of that code doesn't need to care about whether
>> > the types
>> > are scalable or not -- they can use preexisting methods like
>> > `getHalfElementsVectorType`. If the code is a little more complex,
>> > `ElementCount` structs can be used instead of an `unsigned` value to
>> > represent
>> > the number of elements.
>> >
>> > ===============
>> > 2. Size Queries
>> > ===============
>> >
>> > This is a proposal for how to deal with querying the size of scalable
>> > types for
>> > analysis of IR. While it has not been implemented in full, the general
>> > approach
>> > works well for calculating offsets into structures with scalable types
>> > in a
>> > modified version of ComputeValueVTs in our downstream compiler.
>> >
>> > For current IR types that have a known size, all query functions return
>> > a single
>> > integer constant. For scalable types a second integer is needed to
>> > indicate the
>> > number of bytes/bits which need to be scaled by the runtime multiple to
>> > obtain
>> > the actual length.
>> >
>> > For primitive types, `getPrimitiveSizeInBits()` will function as it does
>> > today,
>> > except that it will no longer return a size for vector types (it will
>> > return 0,
>> > as it does for other derived types). The majority of calls to this
>> > function are
>> > already for scalar rather than vector types.
>> >
>> > For derived types, a function `getScalableSizePairInBits()` will be
>> > added, which
>> > returns a pair of integers (one to indicate unscaled bits, the other for
>> > bits
>> > that need to be scaled by the runtime multiple). For backends that do
>> > not need
>> > to deal with scalable types the existing methods will suffice, but a
>> > debug-only
>> > assert will be added to them to ensure they aren't used on scalable
>> > types.
>> >
>> > Similar functionality will be added to DataLayout.
>> >
>> > Comparisons between sizes will use the following methods, assuming that
>> > X and
>> > Y are non-zero integers and the form is { unscaled, scaled }.
>> >
>> > { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.
>> >
>> > { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across
>> >                         functions that inherit vector length. Cannot be
>> >                         compared across non-inheriting functions.
>> >
>> > { X, 0 } > { 0, Y }: Cannot return true.
>> >
>> > { X, 0 } = { 0, Y }: Cannot return true.
>> >
>> > { X, 0 } < { 0, Y }: Can return true.
>> >
>> > { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common
>> >                             terms and try the above comparisons; it
>> >                             may not be possible to get a good answer.
>> >
>> > It's worth noting that we don't expect the last case (mixed scaled and
>> > unscaled sizes) to occur. Richard Sandiford's proposed C extensions
>> > (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html)
>> > explicitly
>> > prohibit mixing fixed-size types into sizeless structs.
>> >
>> > I don't know if we need a 'maybe' or 'unknown' result for cases
>> > comparing scaled
>> > vs. unscaled; I believe the gcc implementation of SVE allows for such
>> > results, but that supports a generic polynomial length representation.
>> >
>> > My current intention is to rely on functions that clone or copy values
>> > to
>> > check whether they are being used to copy scalable vectors across
>> > function
>> > boundaries without the inherit vlen attribute and raise an error there
>> > instead
>> > of requiring the Function that a type size came from to be passed in
>> > for each comparison. If
>> > there's a strong preference for moving the check to the size comparison
>> > function
>> > let me know; I will be starting work on patches for this later in the
>> > year if
>> > there are no major problems with the idea.
>> >
>> > Future Work
>> > -----------
>> >
>> > Since we cannot determine the exact size of a scalable vector, the
>> > existing logic for alias detection won't work when multiple accesses
>> > share a common base pointer with different offsets.
>> >
>> > However, SVE's predication will mean that a dynamic 'safe' vector length
>> > can be determined at runtime, so after initial support has been added we
>> > can work on vectorizing loops using runtime predication to avoid
>> > aliasing
>> > problems.
>> >
>> > Alternatives Considered
>> > -----------------------
>> >
>> > Marking scalable vectors as unsized doesn't work well, as many parts of
>> > llvm dealing with loads and stores assert that 'isSized()' returns true
>> > and make use of the size when calculating offsets.
>> >
>> > We have considered introducing multiple helper functions instead of
>> > using direct size queries, but that doesn't cover all cases. It may
>> > still be a good idea to introduce them to make the purpose in a given
>> > case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'.
>> >
>> > ========================================
>> > 3. Representing Vector Length at Runtime
>> > ========================================
>> >
>> > With a scalable vector type defined, we now need a way to represent the
>> > runtime
>> > length in IR in order to generate addresses for consecutive vectors in
>> > memory
>> > and determine how many elements have been processed in an iteration of a
>> > loop.
>> >
>> > We have added an experimental `vscale` intrinsic to represent the
>> > runtime
>> > multiple. Multiplying the result of this intrinsic by the minimum number
>> > of
>> > elements in a vector gives the total number of elements in a scalable
>> > vector.
>> >
>> > Fixed-Length Code
>> > -----------------
>> >
>> > Assuming a vector type of <4 x <ty>>
>> > ``
>> > vector.body:
>> >  %index = phi i64 [ %index.next, %vector.body ], [ 0,
>> > %vector.body.preheader ]
>> >  ;; <loop body>
>> >  ;; Increment induction var
>> >  %index.next = add i64 %index, 4
>> >  ;; <check and branch>
>> > ``
>> > Scalable Equivalent
>> > -------------------
>> >
>> > Assuming a vector type of <scalable 4 x <ty>>
>> > ``
>> > vector.body:
>> >  %index = phi i64 [ %index.next, %vector.body ], [ 0,
>> > %vector.body.preheader ]
>> >  ;; <loop body>
>> >  ;; Increment induction var
>> >  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>> >  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>> >  ;; <check and branch>
>> > ``
>> > ===========================
>> > 4. Generating Vector Values
>> > ===========================
>> > For constant vector values, we cannot specify all the elements as we can
>> > for
>> > fixed-length vectors; fortunately only a small number of easily
>> > synthesized
>> > patterns are required for autovectorization. The `zeroinitializer`
>> > constant
>> > can be used in the same manner as fixed-length vectors for a constant
>> > zero
>> > splat. This can then be combined with `insertelement` and
>> > `shufflevector`
>> > to create arbitrary value splats in the same manner as fixed-length
>> > vectors.
>> >
>> > For constants consisting of a sequence of values, an experimental
>> > `stepvector`
>> > intrinsic has been added to represent a simple constant of the form
>> > `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the
>> > new
>> > start can be added, and changing the step requires multiplying by a
>> > splat.
>> >
>> > Fixed-Length Code
>> > -----------------
>> > ``
>> >  ;; Splat a value
>> >  %insert = insertelement <4 x i32> undef, i32 %value, i32 0
>> >  %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32>
>> > zeroinitializer
>> >  ;; Add a constant sequence
>> >  %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
>> > ``
>> > Scalable Equivalent
>> > -------------------
>> > ``
>> >  ;; Splat a value
>> >  %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
>> >  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32>
>> > undef, <scalable 4 x i32> zeroinitializer
>> >  ;; Splat offset + stride (the same in this case)
>> >  %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0
>> >  %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x
>> > i32> undef, <scalable 4 x i32> zeroinitializer
>> >  ;; Create sequence for scalable vector
>> >  %stepvector = call <scalable 4 x i32>
>> > @llvm.experimental.vector.stepvector.nxv4i32()
>> >  %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>> >  %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
>> >  ;; Add the runtime-generated sequence
>> >  %add = add <scalable 4 x i32> %splat, %addoffset
>> > ``
>> > Future Work
>> > -----------
>> >
>> > Intrinsics cannot currently be used for constant folding. Our downstream
>> > compiler (using Constants instead of intrinsics) relies quite heavily on
>> > this
>> > for good code generation, so we will need to find new ways to recognize
>> > and
>> > fold these values.
>> >
>> > ===========================================
>> > 5. Splitting and Combining Scalable Vectors
>> > ===========================================
>> >
>> > Splitting and combining scalable vectors in IR is done in the same
>> > manner as
>> > for fixed-length vectors, but with a non-constant mask for the
>> > shufflevector.
>> >
>> > The following is an example of splitting a <scalable 4 x double> into
>> > two
>> > separate <scalable 2 x double> values.
>> >
>> > ``
>> >  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>> >  ;; Stepvector generates the element ids for first subvector
>> >  %sv1 = call <scalable 2 x i64>
>> > @llvm.experimental.vector.stepvector.nxv2i64()
>> >  ;; Add vscale * 2 to get the starting element for the second subvector
>> >  %ec = mul i64 %vscale64, 2
>> >  %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0
>> >  %ec.splat = shufflevector <scalable 2 x i64> %ec.ins, <scalable 2 x i64>
>> > undef, <scalable 2 x i32> zeroinitializer
>> >  %sv2 = add <scalable 2 x i64> %ec.splat, %sv1
>> >  ;; Perform the extracts
>> >  %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x double>
>> > undef, <scalable 2 x i64> %sv1
>> >  %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x double>
>> > undef, <scalable 2 x i64> %sv2
>> > ``
>> >
>> > ==================
>> > 6. Code Generation
>> > ==================
>> >
>> > IR splats will be converted to an experimental splatvector intrinsic in
>> > SelectionDAGBuilder.
>> >
>> > All three intrinsics are custom lowered and legalized in the AArch64
>> > backend.
>> >
>> > Two new AArch64ISD nodes have been added to represent the same concepts
>> > at the SelectionDAG level, while splatvector maps onto the existing
>> > AArch64ISD::DUP.
>> >
>> > GlobalISel
>> > ----------
>> >
>> > Since GlobalISel was enabled by default on AArch64, it was necessary to
>> > add
>> > scalable vector support to the LowLevelType implementation. A single bit
>> > was
>> > added to the raw_data representation for vectors and vectors of
>> > pointers.
>> >
>> > In addition, types that only exist in destination patterns are planted
>> > in
>> > the enumeration of available types for generated code. While this may
>> > not be
>> > necessary in future, generating an all-true 'ptrue' value was necessary
>> > to
>> > convert a predicated instruction into an unpredicated one.
>> >
>> > ==========
>> > 7. Example
>> > ==========
>> >
>> > The following example shows a simple C loop which assigns the array
>> > index to
>> > the array elements matching that index. The IR shows how vscale and
>> > stepvector
>> > are used to create the needed values and to advance the index variable
>> > in the
>> > loop.
>> >
>> > C Code
>> > ------
>> >
>> > ``
>> > void IdentityArrayInit(int *a, int count) {
>> >  for (int i = 0; i < count; ++i)
>> >    a[i] = i;
>> > }
>> > ``
>> >
>> > Scalable IR Vector Body
>> > -----------------------
>> >
>> > ``
>> > vector.body.preheader:
>> >  ;; Other setup
>> >  ;; Stepvector used to create initial identity vector
>> >  %stepvector = call <scalable 4 x i32>
>> > @llvm.experimental.vector.stepvector.nxv4i32()
>> >  br label %vector.body
>> >
>> > vector.body:
>> >  %index = phi i64 [ %index.next, %vector.body ], [ 0,
>> > %vector.body.preheader ]
>> >  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>> >
>> >           ;; stepvector used for index identity on entry to loop body ;;
>> >  %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
>> >                                     [ %stepvector,
>> > %vector.body.preheader ]
>> >  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>> >  %vscale32 = trunc i64 %vscale64 to i32
>> >  %1 = add i64 %0, mul (i64 %vscale64, i64 4)
>> >
>> >           ;; vscale splat used to increment identity vector ;;
>> >  %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32
>> > %vscale32, i32 4), i32 0
>> >  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32>
>> > undef, <scalable 4 x i32> zeroinitializer
>> >  %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
>> >  %2 = getelementptr inbounds i32, i32* %a, i64 %0
>> >  %3 = bitcast i32* %2 to <scalable 4 x i32>*
>> >  store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4
>> >
>> >           ;; vscale used to increment loop index
>> >  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>> >  %4 = icmp eq i64 %index.next, %n.vec
>> >  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
>> > ``
>> >
>> > ==========
>> > 8. Patches
>> > ==========
>> >
>> > List of patches:
>> >
>> > 1. Extend VectorType: https://reviews.llvm.org/D32530
>> > 2. Vector element type Tablegen constraint:
>> > https://reviews.llvm.org/D47768
>> > 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
>> > 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
>> > 5. SVE Calling Convention: https://reviews.llvm.org/D47771
>> > 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
>> > 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
>> > 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
>> > 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
>> > 10. Initial store patterns: https://reviews.llvm.org/D47776
>> > 11. Initial addition patterns: https://reviews.llvm.org/D47777
>> > 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
>> > 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
>> > 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780
>> >
>>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
In reply to this post by Nicholas Krause via llvm-dev
Hi Chandler,

> On 30 Jul 2018, at 11:34, Chandler Carruth <[hidden email]> wrote:
>
> I strongly suspect that there remains widespread concern with the direction of this, I know I have them.
>
> I don't think that many of the people who have that concern have had time to come back to this RFC and make progress on it, likely because of other commitments or simply the amount of churn around SVE related patches and such. That is at least why I haven't had time to return to this RFC and try to write more detailed feedback.

The core IR patches (listed in the RFC) haven't really changed much and are ready for review. I appreciate that Sander has pushed a lot of SVE-related patches for MC through recently though.

> Certainly, I would want to see pretty clear and considered support for this change to the IR type system from Hal, Chris, Eric and/or other long time maintainers of core LLVM IR components before it moves forward, and I don't see that in this thread.
>
> Put differently: I don't think silence is assent here. You really need some clear signal of consensus.

Understood. Thankfully there seems to be more interest in this now. I guess people will be busy with the release in the near future, but I can work on responding to all the new messages now. I'll try to log in to IRC during the evenings (UK time) if that would help.

-Graham

> On Mon, Jul 30, 2018 at 2:23 AM Graham Hunter <[hidden email]> wrote:
> Hi,
>
> Are there any objections to going ahead with this? If not, we'll try to get the patches reviewed and committed after the 7.0 branch occurs.
>
> -Graham
>
> > On 2 Jul 2018, at 10:53, Graham Hunter <[hidden email]> wrote:
> >
> > Hi,
> >
> > I've updated the RFC slightly based on the discussion within the thread, reposted below. Let me know if I've missed anything or if more clarification is needed.
> >
> > Thanks,
> >
> > -Graham
> >
> > =============================================================
> > Supporting SIMD instruction sets with variable vector lengths
> > =============================================================
> >
> > In this RFC we propose extending LLVM IR to support code-generation for variable
> > length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our
> > approach is backwards compatible and should be as non-intrusive as possible; the
> > only change needed in other backends is how size is queried on vector types, and
> > it only requires a change in which function is called. We have created a set of
> > proof-of-concept patches to represent a simple vectorized loop in IR and
> > generate SVE instructions from that IR. These patches (listed in section 8 of
> > this RFC) can be found on Phabricator and are intended to illustrate the scope
> > of changes required by the general approach described in this RFC.
> >
> > ==========
> > Background
> > ==========
> >
> > *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for
> > AArch64 which is intended to scale with hardware such that the same binary
> > running on a processor with longer vector registers can take advantage of the
> > increased compute power without recompilation.
> >
> > As the vector length is no longer a compile-time known value, the way in which
> > the LLVM vectorizer generates code requires modifications such that certain
> > values are now runtime evaluated expressions instead of compile-time constants.
> >
> > Documentation for SVE can be found at
> > https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
> >
> > ========
> > Contents
> > ========
> >
> > The rest of this RFC covers the following topics:
> >
> > 1. Types -- a proposal to extend VectorType to be able to represent vectors that
> >   have a length which is a runtime-determined multiple of a known base length.
> >
> > 2. Size Queries - how to reason about the size of types for which the size isn't
> >   fully known at compile time.
> >
> > 3. Representing the runtime multiple of vector length in IR for use in address
> >   calculations and induction variable comparisons.
> >
> > 4. Generating 'constant' values in IR for vectors with a runtime-determined
> >   number of elements.
> >
> > 5. An explanation of splitting/concatenating scalable vectors.
> >
> > 6. A brief note on code generation of these new operations for AArch64.
> >
> > 7. An example of C code and matching IR using the proposed extensions.
> >
> > 8. A list of patches demonstrating the changes required to emit SVE instructions
> >   for a loop that has already been vectorized using the extensions described
> >   in this RFC.
> >
> > ========
> > 1. Types
> > ========
> >
> > To represent a vector of unknown length a boolean `Scalable` property has been
> > added to the `VectorType` class, which indicates that the number of elements in
> > the vector is a runtime-determined integer multiple of the `NumElements` field.
> > Most code that deals with vectors doesn't need to know the exact length, but
> > does need to know relative lengths -- e.g. get a vector with the same number of
> > elements but a different element type, or with half or double the number of
> > elements.
> >
> > In order to allow code to transparently support scalable vectors, we introduce
> > an `ElementCount` class with two members:
> >
> > - `unsigned Min`: the minimum number of elements.
> > - `bool Scalable`: is the element count an unknown multiple of `Min`?
> >
> > For non-scalable vectors (``Scalable=false``) the scale is considered to be
> > equal to one and thus `Min` represents the exact number of elements in the
> > vector.
> >
> > The intent for code working with vectors is to use convenience methods and avoid
> > directly dealing with the number of elements. If needed, calling
> > `getElementCount` on a vector type instead of `getVectorNumElements` can be used
> > to obtain the (potentially scalable) number of elements. Overloaded division and
> > multiplication operators allow an ElementCount instance to be used in much the
> > same manner as an integer for most cases.
> >
> > This mixture of compile-time and runtime quantities allows us to reason about the
> > relationship between different scalable vector types without knowing their
> > exact length.
> >
> > The runtime multiple is not expected to change during program execution for SVE,
> > but it is possible. The model of scalable vectors presented in this RFC assumes
> > that the multiple will be constant within a function but not necessarily across
> > functions. As suggested in the recent RISC-V RFC, a new function attribute to
> > inherit the multiple across function calls will allow for function calls with
> > vector arguments/return values and inlining/outlining optimizations.
> >
> > IR Textual Form
> > ---------------
> >
> > The textual form for a scalable vector is:
> >
> > ``<scalable <n> x <type>>``
> >
> > where `type` is the scalar type of each element, `n` is the minimum number of
> > elements, and the string literal `scalable` indicates that the total number of
> > elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
> > for indicating that the vector is scalable, and could be substituted by another.
> > For fixed-length vectors, the `scalable` is omitted, so there is no change in
> > the format for existing vectors.
> >
> > Scalable vectors with the same `Min` value have the same number of elements, and
> > the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
> > used within the same function):
> >
> > ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of
> >  elements.
> >
> > ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of
> >  bytes.
> >
> > IR Bitcode Form
> > ---------------
> >
> > To serialize scalable vectors to bitcode, a new boolean field is added to the
> > type record. If the field is not present the type will default to a fixed-length
> > vector type, preserving backwards compatibility.
> >
> > Alternatives Considered
> > -----------------------
> >
> > We did consider one main alternative -- a dedicated target type, like the
> > x86_mmx type.
> >
> > A dedicated target type would either need to extend all existing passes that
> > work with vectors to recognize the new type, or to duplicate all that code
> > in order to get reasonable code generation and autovectorization.
> >
> > This hasn't been done for the x86_mmx type, and so it is only capable of
> > providing support for C-level intrinsics instead of being used and recognized by
> > passes inside LLVM.
> >
> > Although our current solution will need to change some of the code that creates
> > new VectorTypes, much of that code doesn't need to care about whether the types
> > are scalable or not -- they can use preexisting methods like
> > `getHalfElementsVectorType`. If the code is a little more complex,
> > `ElementCount` structs can be used instead of an `unsigned` value to represent
> > the number of elements.
> >
> > ===============
> > 2. Size Queries
> > ===============
> >
> > This is a proposal for how to deal with querying the size of scalable types for
> > analysis of IR. While it has not been implemented in full, the general approach
> > works well for calculating offsets into structures with scalable types in a
> > modified version of ComputeValueVTs in our downstream compiler.
> >
> > For current IR types that have a known size, all query functions return a single
> > integer constant. For scalable types a second integer is needed to indicate the
> > number of bytes/bits which need to be scaled by the runtime multiple to obtain
> > the actual length.
> >
> > For primitive types, `getPrimitiveSizeInBits()` will function as it does today,
> > except that it will no longer return a size for vector types (it will return 0,
> > as it does for other derived types). The majority of calls to this function are
> > already for scalar rather than vector types.
> >
> > For derived types, a function `getScalableSizePairInBits()` will be added, which
> > returns a pair of integers (one to indicate unscaled bits, the other for bits
> > that need to be scaled by the runtime multiple). For backends that do not need
> > to deal with scalable types the existing methods will suffice, but a debug-only
> > assert will be added to them to ensure they aren't used on scalable types.
> >
> > Similar functionality will be added to DataLayout.
> >
> > Comparisons between sizes will use the following methods, assuming that X and
> > Y are non-zero integers and the form is of { unscaled, scaled }.
> >
> > { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.
> >
> > { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across
> >                         functions that inherit vector length. Cannot be
> >                         compared across non-inheriting functions.
> >
> > { X, 0 } > { 0, Y }: Cannot return true.
> >
> > { X, 0 } = { 0, Y }: Cannot return true.
> >
> > { X, 0 } < { 0, Y }: Can return true.
> >
> > { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common
> >                             terms and try the above comparisons; it
> >                             may not be possible to get a good answer.
> >
> > It's worth noting that we don't expect the last case (mixed scaled and
> > unscaled sizes) to occur. Richard Sandiford's proposed C extensions
> > (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly
> > prohibit mixing fixed-size types into sizeless structs.
> >
> > I don't know if we need a 'maybe' or 'unknown' result for cases comparing scaled
> > vs. unscaled; I believe the gcc implementation of SVE allows for such
> > results, but that supports a generic polynomial length representation.
> >
> > My current intention is to rely on functions that clone or copy values to
> > check whether they are being used to copy scalable vectors across function
> > boundaries without the inherit-vlen attribute, and to raise an error there
> > instead of requiring the Function that a type's size comes from to be passed
> > in for each size comparison. If there's a strong preference for moving the
> > check into the size comparison functions, let me know; I will start work on
> > patches for this later in the year if there are no major problems with the idea.
> >
> > Future Work
> > -----------
> >
> > Since we cannot determine the exact size of a scalable vector, the
> > existing logic for alias detection won't work when multiple accesses
> > share a common base pointer with different offsets.
> >
> > However, SVE's predication will mean that a dynamic 'safe' vector length
> > can be determined at runtime, so after initial support has been added we
> > can work on vectorizing loops using runtime predication to avoid aliasing
> > problems.
> >
> > Alternatives Considered
> > -----------------------
> >
> > Marking scalable vectors as unsized doesn't work well, as many parts of
> > LLVM dealing with loads and stores assert that 'isSized()' returns true
> > and make use of the size when calculating offsets.
> >
> > We have considered introducing multiple helper functions instead of
> > using direct size queries, but that doesn't cover all cases. It may
> > still be a good idea to introduce them to make the purpose in a given
> > case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'.
> >
> > ========================================
> > 3. Representing Vector Length at Runtime
> > ========================================
> >
> > With a scalable vector type defined, we now need a way to represent the runtime
> > length in IR in order to generate addresses for consecutive vectors in memory
> > and determine how many elements have been processed in an iteration of a loop.
> >
> > We have added an experimental `vscale` intrinsic to represent the runtime
> > multiple. Multiplying the result of this intrinsic by the minimum number of
> > elements in a vector gives the total number of elements in a scalable vector.
> >
> > Fixed-Length Code
> > -----------------
> >
> > Assuming a vector type of <4 x <ty>>
> > ``
> > vector.body:
> >  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
> >  ;; <loop body>
> >  ;; Increment induction var
> >  %index.next = add i64 %index, 4
> >  ;; <check and branch>
> > ``
> > Scalable Equivalent
> > -------------------
> >
> > Assuming a vector type of <scalable 4 x <ty>>
> > ``
> > vector.body:
> >  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
> >  ;; <loop body>
> >  ;; Increment induction var
> >  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
> >  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
> >  ;; <check and branch>
> > ``
> > ===========================
> > 4. Generating Vector Values
> > ===========================
> > For constant vector values, we cannot specify all the elements as we can for
> > fixed-length vectors; fortunately only a small number of easily synthesized
> > patterns are required for autovectorization. The `zeroinitializer` constant
> > can be used in the same manner as fixed-length vectors for a constant zero
> > splat. This can then be combined with `insertelement` and `shufflevector`
> > to create arbitrary value splats in the same manner as fixed-length vectors.
> >
> > For constants consisting of a sequence of values, an experimental `stepvector`
> > intrinsic has been added to represent a simple constant of the form
> > `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
> > start can be added, and changing the step requires multiplying by a splat.
> >
> > Fixed-Length Code
> > -----------------
> > ``
> >  ;; Splat a value
> >  %insert = insertelement <4 x i32> undef, i32 %value, i32 0
> >  %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
> >  ;; Add a constant sequence
> >  %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
> > ``
> > Scalable Equivalent
> > -------------------
> > ``
> >  ;; Splat a value
> >  %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
> >  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
> >  ;; Splat offset + stride (the same in this case)
> >  %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0
> >  %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
> >  ;; Create sequence for scalable vector
> >  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
> >  %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
> >  %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
> >  ;; Add the runtime-generated sequence
> >  %add = add <scalable 4 x i32> %splat, %addoffset
> > ``
> > Future Work
> > -----------
> >
> > Intrinsics cannot currently be used for constant folding. Our downstream
> > compiler (using Constants instead of intrinsics) relies quite heavily on this
> > for good code generation, so we will need to find new ways to recognize and
> > fold these values.
> >
> > ===========================================
> > 5. Splitting and Combining Scalable Vectors
> > ===========================================
> >
> > Splitting and combining scalable vectors in IR is done in the same manner as
> > for fixed-length vectors, but with a non-constant mask for the shufflevector.
> >
> > The following is an example of splitting a <scalable 4 x double> into two
> > separate <scalable 2 x double> values.
> >
> > ``
> >  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
> >  ;; Stepvector generates the element ids for first subvector
> >  %sv1 = call <scalable 2 x i64> @llvm.experimental.vector.stepvector.nxv2i64()
> >  ;; Add vscale * 2 to get the starting element for the second subvector
> >  %ec = mul i64 %vscale64, 2
> >  %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0
> >  %ec.splat = shufflevector <scalable 2 x i64> %ec.ins, <scalable 2 x i64> undef, <scalable 2 x i32> zeroinitializer
> >  %sv2 = add <scalable 2 x i64> %ec.splat, %sv1
> >  ;; Perform the extracts
> >  %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv1
> >  %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv2
> > ``
> >
> > ==================
> > 6. Code Generation
> > ==================
> >
> > IR splats will be converted to an experimental splatvector intrinsic in
> > SelectionDAGBuilder.
> >
> > All three intrinsics are custom lowered and legalized in the AArch64 backend.
> >
> > Two new AArch64ISD nodes have been added to represent the same concepts
> > at the SelectionDAG level, while splatvector maps onto the existing
> > AArch64ISD::DUP.
> >
> > GlobalISel
> > ----------
> >
> > Since GlobalISel was enabled by default on AArch64, it was necessary to add
> > scalable vector support to the LowLevelType implementation. A single bit was
> > added to the raw_data representation for vectors and vectors of pointers.
> >
> > In addition, types that only exist in destination patterns are planted in
> > the enumeration of available types for generated code. While this may not be
> > necessary in future, generating an all-true 'ptrue' value was necessary to
> > convert a predicated instruction into an unpredicated one.
> >
> > ==========
> > 7. Example
> > ==========
> >
> > The following example shows a simple C loop which assigns the array index to
> > the array elements matching that index. The IR shows how vscale and stepvector
> > are used to create the needed values and to advance the index variable in the
> > loop.
> >
> > C Code
> > ------
> >
> > ``
> > void IdentityArrayInit(int *a, int count) {
> >  for (int i = 0; i < count; ++i)
> >    a[i] = i;
> > }
> > ``
> >
> > Scalable IR Vector Body
> > -----------------------
> >
> > ``
> > vector.body.preheader:
> >  ;; Other setup
> >  ;; Stepvector used to create initial identity vector
> >  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
> >  br label %vector.body
> >
> > vector.body:
> >  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
> >  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
> >
> >           ;; stepvector used for index identity on entry to loop body ;;
> >  %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
> >                                     [ %stepvector, %vector.body.preheader ]
> >  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
> >  %vscale32 = trunc i64 %vscale64 to i32
> >  %1 = add i64 %0, mul (i64 %vscale64, i64 4)
> >
> >           ;; vscale splat used to increment identity vector ;;
> >  %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 %vscale32, i32 4), i32 0
> >  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
> >  %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
> >  %2 = getelementptr inbounds i32, i32* %a, i64 %0
> >  %3 = bitcast i32* %2 to <scalable 4 x i32>*
> >  store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4
> >
> >           ;; vscale used to increment loop index
> >  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
> >  %4 = icmp eq i64 %index.next, %n.vec
> >  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
> > ``
> >
> > ==========
> > 8. Patches
> > ==========
> >
> > List of patches:
> >
> > 1. Extend VectorType: https://reviews.llvm.org/D32530
> > 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
> > 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
> > 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
> > 5. SVE Calling Convention: https://reviews.llvm.org/D47771
> > 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
> > 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
> > 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
> > 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
> > 10. Initial store patterns: https://reviews.llvm.org/D47776
> > 11. Initial addition patterns: https://reviews.llvm.org/D47777
> > 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
> > 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
> > 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780
> >
>


Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
In reply to this post by Nicholas Krause via llvm-dev
Hi David,

> On 30 Jul 2018, at 18:37, David Greene via llvm-dev <[hidden email]> wrote:
>
> Chandler Carruth wrote:
>
>> I strongly suspect that there remains widespread concern with the
>> direction of this, I know I have them.
>>
>> I don't think that many of the people who have that concern have had
>> time to come back to this RFC and make progress on it, likely because
>> of other commitments or simply the amount of churn around SVE related
>> patches and such. That is at least why I haven't had time to return to
>> this RFC and try to write more detailed feedback.
>
> We believe ARM SVE will be an important architecture going forward.  As
> such, it's important to us that these questions and concerns get posted
> and discussed, whatever the outcome may be.  If there are objections,
> alternative proposals would be helpful.

Yes, pointing out alternatives we've missed would be helpful.

> I see a lot of SVE patches on Phab that are described as "not for
> review."  I don't know how helpful that is.  It would be more helpful to
> have actual patches intended for review/commit.  It is difficult to know
> which is which in Phab.  Could patches not intended for review either be
> removed if not needed, or their subjects updated to indicate they are
> not for review but for discussion purposes so that it's easier to filter
> search results?

All 14 patches listed at the bottom of the RFC are ready for an initial round of review, so I'll change the descriptions tomorrow to indicate that. I'll check to see if I have any older ones lying around and abandon them if so.

-Graham

>
>                                  -David


Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
In reply to this post by Nicholas Krause via llvm-dev
Hi David,

Let me put the last two comments up:

> > But we're trying to represent slightly different techniques
> > (predication, vscale change) which need to be tied down to only
> > exactly what they do.
>
> Wouldn't intrinsics to change vscale do exactly that?

You're right. I've been using the same overloaded term and this is
probably what caused the confusion.

In some cases, predicating and shortening the vectors are semantically
equivalent. In this case, the IR should also be equivalent.
Instructions/intrinsics that handle predication could be used by the
backend to simply change VL instead, as long as it's guaranteed that
the semantics are identical. There are no problems here.

In other cases, for example widening or splitting the vector, or cases
we haven't thought of yet, the semantics are not the same, and having
them in IR would be bad. I think we're all in agreements on that.

All I'm asking is that we make a list of what we want to happen and
disallow everything else explicitly, until someone comes with a strong
case for it. Makes sense?


> I'm all for being explicit.  I think we're basically on the same page,
> though there are a few things noted above where I need a little more
> clarity.

Yup, I think we are. :)



> What does "mid-loop" mean?  On traditional vector architectures it was
> very common to change VL for the last loop iteration.  Otherwise you had
> to have a remainder loop.  It was much better to change VL.

You got it below...


> Ok, I think I am starting to grasp what you are saying.  If a value
> flows from memory or some scalar computation to vector and then back to
> memory or scalar, VL should only ever be set at the start of the vector
> computation until it finishes and the value is deposited in memory or
> otherwise extracted.  I think this is ok, but note that any vector
> functions called may change VL for the duration of the call.  The change
> would not be visible to the caller.

If a function is called and changes the vector length, does it restore it on return?


> I am not so sure about that.  Power requirements may very well drive
> more dynamic vector lengths.  Even today some AVX 512 implementations
> falter if there are "too many" 512-bit operations.  Scaling back SIMD
> width statically is very common today and doing so dynamically seems
> like an obvious extension.  I don't know of any efforts to do this so
> it's all speculative at this point.  But the industry has done it in the
> past and we have a curious pattern of reinventing things we did before.

Right, so it's not as clear cut as I hoped. But we can start
implementing the basic idea and then expand as we go. I think trying
to hash out all potential scenarios now will drive us crazy.


> It seems strange to me for an optimizer to operate in such a way.  The
> optimizer should be fully aware of the target's capabilities and use
> them accordingly.

Mid-end optimisers tend to be fairly agnostic. And when not, they
usually ask "is this supported" instead of "which one is better".


> ARM seems to have no difficulty selecting instructions for it.  Changing
> the value of vscale shouldn't impact ISel at all.  The same instructions
> are selected.

I may very well be getting lost in too many floating future ideas, atm. :)


> > It is, but IIGIR, changing vscale and predicating are similar
> > transformations to achieve the similar goals, but will not be
> > represented the same way in IR.
>
> They probably will not be represented the same way, though I think they
> could be (but probably shouldn't be).

Maybe in the simple cases (like last iteration) they should be?


> Ok, but would be optimizer be prevented from introducing VL changes?

In the case where they're represented in similar ways in IR, it
wouldn't need to.

Otherwise, we'd have to teach the two methods to IR optimisers that
are virtually identical in semantics. It'd be left for the back end to
implement the last iteration notation as a predicate fill or a vscale
change.


> Being conservative is fine, but we should have a clear understanding of
> exactly what that means.  I would not want to prohibit all VL changes
> now and forever, because I see that as unnecessarily restrictive and
> possibly damaging to supporting future architectures.
>
> If we don't want to provide intrinsics for changing VL right now, I'm
> all in favor.  There would be no reason to add error checks because
> there would be no way within the IR to change VL.

Right, I think we're converging.

How about we don't forbid changes in vscale, but we find a common
notation for all the cases where predicating and changing vscale would
be semantically identical, and implement those in the same way.

Later on, if there are additional cases where changes in vscale would
be beneficial, we can discuss them independently.

Makes sense?

--
cheers,
--renato

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
In reply to this post by Nicholas Krause via llvm-dev
Hi Robin,

On Tue, 31 Jul 2018 at 19:03, Robin Kruppe <[hidden email]> wrote:
> 1. How large are the machine's vector registers?

This is the only one I'm talking about. :)


> Like Hal, I am very skeptical how, if at all, such a barrier could be
> added to IR. And I've spent a lot of time trying to come up with a
> solution as part of my RISC-V work. That is why my RFC back in April
> proposed a trade-off, which has been incorporated by Graham into this
> RFC: vscale can change between functions, but does not change within a
> function. As an analogy, consider how LLVM supports different
> subtargets (each with different registers, instructions and legal
> types) on a per-function basis but doesn't allow e.g. making a
> register class completely unavailable at a certain point in a
> function.

Cray seems to use changes in vscale the way we use predication for the
last loop iteration, while RISC-V uses them to give resources away to
different functions.

In the former case, they may want to change the vscale inside the same
function in the last iteration, but given that this is semantically
equivalent to shortening predicates, it could be a back-end decision
and not an IR one. We could have the same notation for both target
behaviours and not have to worry about the boundaries.

In the latter case, it's clear that functions are hard boundaries.
Provided, of course, that you either inline all functions called before
vectorisation or, only if there is a scalable vector PCS ABI, make sure
that all of them use the same length.

I haven't thought long enough about the latter, and that's why I was
proposing we take a conservative approach and restrict to what we can
actually reasonably do now.

I think this is what you and Graham are trying to do, right?

cheers,
--renato

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
In reply to this post by Nicholas Krause via llvm-dev
Renato Golin via llvm-dev <[hidden email]> writes:

> Hi David,
>
> Let me put the last two comments up:
>
>> > But we're trying to represent slightly different techniques
>> > (predication, vscale change) which need to be tied down to only
>> > exactly what they do.
>>
>> Wouldn't intrinsics to change vscale do exactly that?
>
> You're right. I've been using the same overloaded term and this is
> probably what caused the confusion.

Me too.  Thanks Robin for clarifying this for all of us!  I'll try to
follow this terminology:

VL/active vector length - The software notion of how many elements to
                          operate on; a special case of predication

vscale - The hardware notion of how big a vector register is

TL;DR - Changing VL in a function doesn't affect anything about this
        proposal, but changing vscale might.  Changing VL shouldn't
        impact things like ISel at all but changing vscale might.
        Changing vscale is (much) more difficult than changing VL.

> In some cases, predicating and shortening the vectors are semantically
> equivalent. In this case, the IR should also be equivalent.
> Instructions/intrinsics that handle predication could be used by the
> backend to simply change VL instead, as long as it's guaranteed that
> the semantics are identical. There are no problems here.

Right.  Changing VL is no problem.  I think even reducing vscale is ok
from an IR perspective, if a little strange.

> In other cases, for example widening or splitting the vector, or cases
> we haven't thought of yet, the semantics are not the same, and having
> them in IR would be bad. I think we're all in agreements on that.

You mean going from a shorter active vector length to a longer active
vector length?  Or smaller vscale to larger vscale?  The latter would be
bad.  The former seems ok if the dataflow is captured and the vectorizer
generates correct code to account for it.  Presumably it would if it is
the thing changing the active vector length.

> All I'm asking is that we make a list of what we want to happen and
> disallow everything else explicitly, until someone comes with a strong
> case for it. Makes sense?

Yes.

>> Ok, I think I am starting to grasp what you are saying.  If a value
>> flows from memory or some scalar computation to vector and then back to
>> memory or scalar, VL should only ever be set at the start of the vector
>> computation until it finishes and the value is deposited in memory or
>> otherwise extracted.  I think this is ok, but note that any vector
>> functions called may change VL for the duration of the call.  The change
>> would not be visible to the caller.
>
> If a function is called and changes the length, does it restore back on return?

If a function changes VL, it would typically restore it before return.
This would be an ABI guarantee just like any other callee-save register.

If a function changes vscale, I don't know.  The RISC-V people seem to
have thought the most about this.  I have no point of reference here.

> Right, so it's not as clear cut as I hoped. But we can start
> implementing the basic idea and then expand as we go. I think trying
> to hash out all potential scenarios now will drive us crazy.

Sure.

>> It seems strange to me for an optimizer to operate in such a way.  The
>> optimizer should be fully aware of the target's capabilities and use
>> them accordingly.
>
> Mid-end optimisers tend to be fairly agnostic. And when not, they
> usually ask "is this supported" instead of "which one is better".

Yes, the "is this supported" question is common.  Isn't the whole point
of VPlan to get the "which one is better" question answered for
vectorization?  That would be necessarily tied to the target.  The
questions asked can be agnostic, like the target-agnostic bits of
codegen use, but the answers would be target-specific.

>> ARM seems to have no difficulty selecting instructions for it.  Changing
>> the value of vscale shouldn't impact ISel at all.  The same instructions
>> are selected.
>
> I may very well be getting lost in too many floating future ideas, atm. :)

Given our clearer terminology, my statement above is maybe not correct.
Changing vscale *would* impact the IR and codegen (stack allocation,
etc.).  Changing VL would not, other than adding some Instructions to
capture the semantics.  I suspect neither would change ISel (I know VL
would not) but as you say I don't think we need concern ourselves with
changing vscale right now, unless others have a dire need to support it.

>> > It is, but IIGIR, changing vscale and predicating are similar
>> > transformations to achieve the similar goals, but will not be
>> > represented the same way in IR.
>>
>> They probably will not be represented the same way, though I think they
>> could be (but probably shouldn't be).
>
> Maybe in the simple cases (like last iteration) they should be?

Perhaps changing VL could be modeled the same way but I have a feeling
it will be awkward.  Changing vscale is something totally different and
likely should be represented differently if allowed at all.

>> Ok, but would the optimizer be prevented from introducing VL changes?
>
> In the case where they're represented in similar ways in IR, it
> wouldn't need to.

It would have to generate IR code to effect the software change in VL
somehow, by altering predicates or by using special intrinsics or some
other way.

> Otherwise, we'd have to teach the two methods to IR optimisers that
> are virtually identical in semantics. It'd be left for the back end to
> implement the last iteration notation as a predicate fill or a vscale
> change.

I suspect that is too late.  The vectorizer needs to account for the
choice and pick the most profitable course.  That's one of the reasons I
think modeling VL changes like predicates is maybe unnecessarily
complex.  If VL is modeled as "just another predicate" then there's no
guarantee that ISel will honor the choices the vectorizer made to use VL
over predication.  If it's modeled explicitly, ISel should have an
easier time generating the code the vectorizer expects.

VL changes aren't always on the last iteration.  The Cray X1 had an
instruction (I would have to dust off old manuals to remember the
mnemonic) with somewhat strange semantics to get the desired VL for an
iteration.  Code would look something like this:

loop top:
  vl = getvl N      #  N contains the number of iterations left
  <do computation>
  N = N - vl
  branch N > 0, loop top

The "getvl" instruction would usually return the full hardware vector
register length (MAXVL), except on the 2nd-to-last iteration if N was
larger than MAXVL but less than 2*MAXVL it would return something like
<N % 2 == 0 ? N/2 : N/2 + 1>, so in the range (0, MAXVL).  The last
iteration would then run at the same VL or one less depending on whether
N was odd or even.  So the last two iterations would often run at less
than MAXVL and often at different VLs from each other.

And no, I don't know why the hardware operated this way.  :)

>> Being conservative is fine, but we should have a clear understanding of
>> exactly what that means.  I would not want to prohibit all VL changes
>> now and forever, because I see that as unnecessarily restrictive and
>> possibly damaging to supporting future architectures.
>>
>> If we don't want to provide intrinsics for changing VL right now, I'm
>> all in favor.  There would be no reason to add error checks because
>> there would be no way within the IR to change VL.
>
> Right, I think we're converging.

Agreed.

> How about we don't forbid changes in vscale, but we find a common
> notation for all the cases where predicating and changing vscale would
> be semantically identical, and implement those in the same way.
>
> Later on, if there are additional cases where changes in vscale would
> be beneficial, we can discuss them independently.
>
> Makes sense?

Again trying to use the VL/vscale terminology:

Changing vscale - no IR support currently and less likely in the future
Changing VL     - no IR support currently but more likely in the future

The second seems like a straightforward extension to me.  There will be
some questions about how to represent VL semantics in IR but those don't
impact the proposal under discussion at all.

The first seems much harder, at least within a function.  It may or may
not impact the proposal under discussion.  It sounds like the RISC-V
people have some use cases so those should probably be the focal point
of this discussion.

                           -David

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
On Tue, 31 Jul 2018 at 20:10, David A. Greene <[hidden email]> wrote:
> Me too.  Thanks Robin for clarifying this for all of us!  I'll try to
> follow this terminology:

+1


> TL;DR - Changing VL in a function doesn't affect anything about this
>         proposal, but changing vscale might.  Changing VL shouldn't
>         impact things like ISel at all but changing vscale might.
>         Changing vscale is (much) more difficult than changing VL.

Absolutely agreed. :)


> Right.  Changing VL is no problem.  I think even reducing vscale is ok
> from an IR perspective, if a little strange.

Yup.


> You mean going from a shorter active vector length to a longer active
> vector length?  Or smaller vscale to larger vscale?  The latter would be
> bad.

The latter. Bad indeed.


> If a function changes vscale, I don't know.  The RISC-V people seem to
> have thought the most about this.  I have no point of reference here.

I think the consensus is that this would be bad. So we should maybe
encode it as an error.


> Yes, the "is this supported" question is common.  Isn't the whole point
> of VPlan to get the "which one is better" question answered for
> vectorization?

Yes, but the cost is high. We can afford that in the vectoriser, as it's
a heavy pass and we're conscious of the cost, but we shouldn't make all
the other passes "that smart".


> Changing vscale *would* impact the IR and codegen (stack allocation,
> etc.).  Changing VL would not, other than adding some Instructions to
> capture the semantics.  I suspect neither would change ISel (I know VL
> would not) but as you say I don't think we need concern ourselves with
> changing vscale right now, unless others have a dire need to support it.

Perfect! :)


> Perhaps changing VL could be modeled the same way but I have a feeling
> it will be awkward.  Changing vscale is something totally different and
> likely should be represented differently if allowed at all.

Right, I was talking about vscale.

It would be awkward, but if this is the only thing the hardware
supports (i.e. no predication), then it's up to the back-end to lower
it as it sees fit.

In IR, we would still see it as predication.


> Again trying to use the VL/vscale terminology:
>
> Changing vscale - no IR support currently and less likely in the future
> Changing VL     - no IR support currently but more likely in the future

SGTM.


> The second seems like a straightforward extension to me.  There will be
> some questions about how to represent VL semantics in IR but those don't
> impact the proposal under discussion at all.

Should be equivalent to predication, I imagine.


> The first seems much harder, at least within a function.

And it would require exposing the instruction to change it in IR.


>  It may or may not impact the proposal under discussion.

As per Robin's email, it doesn't. Functions are vscale boundaries in
their current proposal.

--
cheers,
--renato

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
In reply to this post by Nicholas Krause via llvm-dev
On 31 July 2018 at 21:10, David A. Greene via llvm-dev
<[hidden email]> wrote:

> Renato Golin via llvm-dev <[hidden email]> writes:
>
>> Hi David,
>>
>> Let me put the last two comments up:
>>
>>> > But we're trying to represent slightly different techniques
>>> > (predication, vscale change) which need to be tied down to only
>>> > exactly what they do.
>>>
>>> Wouldn't intrinsics to change vscale do exactly that?
>>
>> You're right. I've been using the same overloaded term and this is
>> probably what caused the confusion.
>
> Me too.  Thanks Robin for clarifying this for all of us!  I'll try to
> follow this terminology:
>
> VL/active vector length - The software notion of how many elements to
>                           operate on; a special case of predication
>
> vscale - The hardware notion of how big a vector register is
>
> TL;DR - Changing VL in a function doesn't affect anything about this
>         proposal, but changing vscale might.  Changing VL shouldn't
>         impact things like ISel at all but changing vscale might.
>         Changing vscale is (much) more difficult than changing VL.

Great, seems like we're all in violent agreement that VL changes are a
non-issue for the discussion at hand.

>> In some cases, predicating and shortening the vectors are semantically
>> equivalent. In this case, the IR should also be equivalent.
>> Instructions/intrinsics that handle predication could be used by the
>> backend to simply change VL instead, as long as it's guaranteed that
>> the semantics are identical. There are no problems here.
>
> Right.  Changing VL is no problem.  I think even reducing vscale is ok
> from an IR perspective, if a little strange.
>
>> In other cases, for example widening or splitting the vector, or cases
>> we haven't thought of yet, the semantics are not the same, and having
>> them in IR would be bad. I think we're all in agreements on that.
>
> You mean going from a shorter active vector length to a longer active
> vector length?  Or smaller vscale to larger vscale?  The latter would be
> bad.  The former seems ok if the dataflow is captured and the vectorizer
> generates correct code to account for it.  Presumably it would if it is
> the thing changing the active vector length.
>
>> All I'm asking is that we make a list of what we want to happen and
>> disallow everything else explicitly, until someone comes with a strong
>> case for it. Makes sense?
>
> Yes.
>
>>> Ok, I think I am starting to grasp what you are saying.  If a value
>>> flows from memory or some scalar computation to vector and then back to
>>> memory or scalar, VL should only ever be set at the start of the vector
>>> computation until it finishes and the value is deposited in memory or
>>> otherwise extracted.  I think this is ok, but note that any vector
>>> functions called may change VL for the duration of the call.  The change
>>> would not be visible to the caller.
>>
>> If a function is called and changes the length, does it restore back on return?
>
> If a function changes VL, it would typically restore it before return.
> This would be an ABI guarantee just like any other callee-save register.
>
> If a function changes vscale, I don't know.  The RISC-V people seem to
> have thought the most about this.  I have no point of reference here.
>
>> Right, so it's not as clear cut as I hoped. But we can start
>> implementing the basic idea and then expand as we go. I think trying
>> to hash out all potential scenarios now will drive us crazy.
>
> Sure.
>
>>> It seems strange to me for an optimizer to operate in such a way.  The
>>> optimizer should be fully aware of the target's capabilities and use
>>> them accordingly.
>>
>> Mid-end optimisers tend to be fairly agnostic. And when not, they
>> usually ask "is this supported" instead of "which one is better".
>
> Yes, the "is this supported" question is common.  Isn't the whole point
> of VPlan to get the "which one is better" question answered for
> vectorization?  That would be necessarily tied to the target.  The
> questions asked can be agnostic, like the target-agnostic bits of
> codegen use, but the answers would be target-specific.

Just like the old loop vectorizer, VPlan will need a cost model that
is based on properties of the target, exposed to the optimizer in the
form of e.g. TargetLowering hooks. But we should try really hard to
avoid having a hard distinction between e.g. predication- and VL-based
loops in the VPlan representation. Duplicating or triplicating
vectorization logic would be really bad, and there are a lot of
similarities that we can exploit to avoid that. For a simple example,
SVE and RVV both want the same basic loop skeleton: strip-mining with
predication of the loop body derived from the induction variable.
Hopefully we can have a 99% unified VPlan pipeline and most
differences can be delegated to the final VPlan->IR step and the
respective backends.

+ Diego, Florian and others that have been discussing this previously

>>> ARM seems to have no difficulty selecting instructions for it.  Changing
>>> the value of vscale shouldn't impact ISel at all.  The same instructions
>>> are selected.
>>
>> I may very well be getting lost in too many floating future ideas, atm. :)
>
> Given our clearer terminology, my statement above is maybe not correct.
> Changing vscale *would* impact the IR and codegen (stack allocation,
> etc.).  Changing VL would not, other than adding some Instructions to
> capture the semantics.  I suspect neither would change ISel (I know VL
> would not) but as you say I don't think we need concern ourselves with
> changing vscale right now, unless others have a dire need to support it.
>
>>> > It is, but IIGIR, changing vscale and predicating are similar
>>> > transformations to achieve the similar goals, but will not be
>>> > represented the same way in IR.
>>>
>>> They probably will not be represented the same way, though I think they
>>> could be (but probably shouldn't be).
>>
>> Maybe in the simple cases (like last iteration) they should be?
>
> Perhaps changing VL could be modeled the same way but I have a feeling
> it will be awkward.  Changing vscale is something totally different and
> likely should be represented differently if allowed at all.
>
>>> Ok, but would the optimizer be prevented from introducing VL changes?
>>
>> In the case where they're represented in similar ways in IR, it
>> wouldn't need to.
>
> It would have to generate IR code to effect the software change in VL
> somehow, by altering predicates or by using special intrinsics or some
> other way.
>
>> Otherwise, we'd have to teach the two methods to IR optimisers that
>> are virtually identical in semantics. It'd be left for the back end to
>> implement the last iteration notation as a predicate fill or a vscale
>> change.
>
> I suspect that is too late.  The vectorizer needs to account for the
> choice and pick the most profitable course.  That's one of the reasons I
> think modeling VL changes like predicates is maybe unnecessarily
> complex.  If VL is modeled as "just another predicate" then there's no
> guarantee that ISel will honor the choices the vectorizer made to use VL
> over predication.  If it's modeled explicitly, ISel should have an
> easier time generating the code the vectorizer expects.
>
> VL changes aren't always on the last iteration.  The Cray X1 had an
> instruction (I would have to dust off old manuals to remember the
> mnemonic) with somewhat strange semantics to get the desired VL for an
> iteration.  Code would look something like this:
>
> loop top:
>   vl = getvl N      #  N contains the number of iterations left
>   <do computation>
>   N = N - vl
>   branch N > 0, loop top
>
> The "getvl" instruction would usually return the full hardware vector
> register length (MAXVL), except on the 2nd-to-last iteration if N was
> larger than MAXVL but less than 2*MAXVL it would return something like
> <N % 2 == 0 ? N/2 : N/2 + 1>, so in the range (0, MAXVL).  The last
> iteration would then run at the same VL or one less depending on whether
> N was odd or even.  So the last two iterations would often run at less
> than MAXVL and often at different VLs from each other.

FWIW this is exactly how the RISC-V vector unit works --
unsurprisingly, since it owes a lot to Cray-style processors :)

> And no, I don't know why the hardware operated this way.  :)
>
>>> Being conservative is fine, but we should have a clear understanding of
>>> exactly what that means.  I would not want to prohibit all VL changes
>>> now and forever, because I see that as unnecessarily restrictive and
>>> possibly damaging to supporting future architectures.
>>>
>>> If we don't want to provide intrinsics for changing VL right now, I'm
>>> all in favor.  There would be no reason to add error checks because
>>> there would be no way within the IR to change VL.
>>
>> Right, I think we're converging.
>
> Agreed.

+1, there is no need to deal with VL at all at this point. I would
even say there isn't any concept of VL in IR at all at this time.

At some point in the future I will propose something in this space to
support RISC-V vectors, but we'll cross that bridge when we come to
it.

>> How about we don't forbid changes in vscale, but we find a common
>> notation for all the cases where predicating and changing vscale would
>> be semantically identical, and implement those in the same way.
>>
>> Later on, if there are additional cases where changes in vscale would
>> be beneficial, we can discuss them independently.
>>
>> Makes sense?
>
> Again trying to use the VL/vscale terminology:
>
> Changing vscale - no IR support currently and less likely in the future
> Changing VL     - no IR support currently but more likely in the future
>
> The second seems like a straightforward extension to me.  There will be
> some questions about how to represent VL semantics in IR but those don't
> impact the proposal under discussion at all.
>
> The first seems much harder, at least within a function.  It may or may
> not impact the proposal under discussion.  It sounds like the RISC-V
> people have some use cases so those should probably be the focal point
> of this discussion.

Yes, for RISC-V we definitely need vscale to vary a bit, but are fine
with limiting that to function boundaries. The use case is *not*
"changing how large vectors are" in the middle of a loop or something
like that, which we all agree is very dubious at best. The RISC-V
vector unit is just very configurable (number of registers, vector
element sizes, etc.) and this configuration can impact how large the
vector registers are. For any given vectorized loop nest we want to
configure the vector unit to suit that piece of code and run the loop
with whatever register size that configuration yields. And when that
loop is done, we stop using the vector unit entirely and disable it,
so that the next loop can use it differently, possibly with a
different register size. For IR modeling purposes, I propose to
enlarge "loop nest" to "function" but the same principle applies, it
just means all vectorized loops in the function will have to share a
configuration.

Without getting too far into the details, does this make sense as a use case?


Cheers,
Robin

>                            -David

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
Robin Kruppe <[hidden email]> writes:

>> Yes, the "is this supported" question is common.  Isn't the whole point
>> of VPlan to get the "which one is better" question answered for
>> vectorization?  That would be necessarily tied to the target.  The
> questions asked can be agnostic, like the target-agnostic bits of
>> codegen use, but the answers would be target-specific.
>
> Just like the old loop vectorizer, VPlan will need a cost model that
> is based on properties of the target, exposed to the optimizer in the
> form of e.g. TargetLowering hooks. But we should try really hard to
> avoid having a hard distinction between e.g. predication- and VL-based
> loops in the VPlan representation. Duplicating or triplicating
> vectorization logic would be really bad, and there are a lot of
> similarities that we can exploit to avoid that. For a simple example,
> SVE and RVV both want the same basic loop skeleton: strip-mining with
> predication of the loop body derived from the induction variable.
> Hopefully we can have a 99% unified VPlan pipeline and most
> differences can be delegated to the final VPlan->IR step and the
> respective backends.
>
> + Diego, Florian and others that have been discussing this previously

If VL and predication are represented the same way, how does VPlan
distinguish between the two?  How does it cost code generation just
using predication vs. code generation using a combination of predication
and VL?

Assuming it can do that, do you envision vector codegen would emit
different IR for VL+predication (say, using intrinsics to set VL) vs. a
strictly predication-only-based plan?  If not, how does the LLVM backend
know to emit code to manipulate VL in the former case?

I don't need answers to these questions right now as VL is a separate
issue and I don't want this thread to get bogged down in it.  But these
are questions that will come up if/when we tackle VL.

> At some point in the future I will propose something in this space to
> support RISC-V vectors, but we'll cross that bridge when we come to
> it.

Sounds good.

> Yes, for RISC-V we definitely need vscale to vary a bit, but are fine
> with limiting that to function boundaries. The use case is *not*
> "changing how large vectors are" in the middle of a loop or something
> like that, which we all agree is very dubious at best. The RISC-V
> vector unit is just very configurable (number of registers, vector
> element sizes, etc.) and this configuration can impact how large the
> vector registers are. For any given vectorized loop nest we want to
> configure the vector unit to suit that piece of code and run the loop
> with whatever register size that configuration yields. And when that
> loop is done, we stop using the vector unit entirely and disable it,
> so that the next loop can use it differently, possibly with a
> different register size. For IR modeling purposes, I propose to
> enlarge "loop nest" to "function" but the same principle applies, it
> just means all vectorized loops in the function will have to share a
> configuration.
>
> Without getting too far into the details, does this make sense as a
> use case?

I think so.  If changing vscale has some important advantage (saving
power?), I wonder how the compiler will deal with very large functions.
I have seen some truly massive Fortran subroutines with hundreds of loop
nests in them, possibly with very different iteration counts for each
one.

                           -David

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev

On 07/31/2018 04:32 PM, David A. Greene via llvm-dev wrote:

> Robin Kruppe <[hidden email]> writes:
>
>>> Yes, the "is this supported" question is common.  Isn't the whole point
>>> of VPlan to get the "which one is better" question answered for
>>> vectorization?  That would be necessarily tied to the target.  The
>>> questions asked can be agnostic, like the target-agnostic bits of
>>> codegen use, but the answers would be target-specific.
>> Just like the old loop vectorizer, VPlan will need a cost model that
>> is based on properties of the target, exposed to the optimizer in the
>> form of e.g. TargetLowering hooks. But we should try really hard to
>> avoid having a hard distinction between e.g. predication- and VL-based
>> loops in the VPlan representation. Duplicating or triplicating
>> vectorization logic would be really bad, and there are a lot of
>> similarities that we can exploit to avoid that. For a simple example,
>> SVE and RVV both want the same basic loop skeleton: strip-mining with
>> predication of the loop body derived from the induction variable.
>> Hopefully we can have a 99% unified VPlan pipeline and most
>> differences can be delegated to the final VPlan->IR step and the
>> respective backends.
>>
>> + Diego, Florian and others that have been discussing this previously
> If VL and predication are represented the same way, how does VPlan
> distinguish between the two?  How does it cost code generation just
> using predication vs. code generation using a combination of predication
> and VL?
>
> Assuming it can do that, do you envision vector codegen would emit
> different IR for VL+predication (say, using intrinsics to set VL) vs. a
> strictly predication-only-based plan?  If not, how does the LLVM backend
> know to emit code to manipulate VL in the former case?
>
> I don't need answers to these questions right now as VL is a separate
> issue and I don't want this thread to get bogged down in it.  But these
> are questions that will come up if/when we tackle VL.
>
>> At some point in the future I will propose something in this space to
>> support RISC-V vectors, but we'll cross that bridge when we come to
>> it.
> Sounds good.
>
>> Yes, for RISC-V we definitely need vscale to vary a bit, but are fine
>> with limiting that to function boundaries. The use case is *not*
>> "changing how large vectors are" in the middle of a loop or something
>> like that, which we all agree is very dubious at best. The RISC-V
>> vector unit is just very configurable (number of registers, vector
>> element sizes, etc.) and this configuration can impact how large the
>> vector registers are. For any given vectorized loop nest we want to
>> configure the vector unit to suit that piece of code and run the loop
>> with whatever register size that configuration yields. And when that
>> loop is done, we stop using the vector unit entirely and disable it,
>> so that the next loop can use it differently, possibly with a
>> different register size. For IR modeling purposes, I propose to
>> enlarge "loop nest" to "function" but the same principle applies, it
>> just means all vectorized loops in the function will have to share a
>> configuration.
>>
>> Without getting too far into the details, does this make sense as a
>> use case?
> I think so.  If changing vscale has some important advantage (saving
> power?), I wonder how the compiler will deal with very large functions.
> I have seen some truly massive Fortran subroutines with hundreds of loop
> nests in them, possibly with very different iteration counts for each
> one.

I have two concerns:

1. If we change vscale in the middle of a function, then we have no way
to introduce a dependence, or barrier, at the point where the change is
made. Transformations, GVN/PRE/etc. for example, can move code around
the place where the change is made and I suspect that we'll have no good
options to prevent it (this could include whole subloops, although we
might not do that today). In some sense, if you make vscale dynamic,
you've introduced dependent types into LLVM's type system, but you've
done it in an implicit manner. It's not clear to me that works. If we
need dependent types, then an explicit dependence seems better. (e.g.,
<scalable <n> x %vscale_var x <type>>)

2. How would the function-call boundary work? Does the function itself
have intrinsics that change the vscale? If so, then it's not clear that
the function-call boundary makes sense unless you prevent inlining. If
you prevent inlining, when does that decision get made? Will the
vectorizer need to outline loops? If so, outlining can have a real cost
that's difficult to model. How do return types work?

To other thoughts:

 1. I can definitely see the use cases for changing vscale dynamically,
and so I do suspect that we'll want that support.

 2. LLVM does not have loops as first-class constructs. We only have SSA
(and, thus, dominance), and when specifying restrictions on placement of
things in function bodies, we need to do so in terms of these constructs
that we have (which don't include loops).

Thanks again,
Hal

>
>                            -David
> _______________________________________________
> LLVM Developers mailing list
> [hidden email]
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
On Tue, 31 Jul 2018 at 23:46, Hal Finkel via llvm-dev
<[hidden email]> wrote:
> In some sense, if you make vscale dynamic,
> you've introduced dependent types into LLVM's type system, but you've
> done it in an implicit manner. It's not clear to me that works. If we
> need dependent types, then an explicit dependence seems better. (e.g.,
> <scalable <n> x %vscale_var x <type>>)

That's a shift from the current proposal and I think we can think
about it after the current changes. For now, both SVE and RISC-V are
proposing function boundaries for changes in vscale.


> 2. How would the function-call boundary work? Does the function itself
> have intrinsics that change the vscale?

Functions may not know what their vscale is until they're actually
executed. They could even have different vscales for different call
sites.

AFAIK, it's not up to the compiled program (ie via a function
attribute or an inline asm call) to change the vscale, but the
kernel/hardware can impose dynamic restrictions on the process. But,
for now, only at (binary object) function boundaries.

I don't know how that works at the kernel level (how to detect those
boundaries? instrument every branch?) but this is what I understood
from the current discussion.


> If so, then it's not clear that
> the function-call boundary makes sense unless you prevent inlining. If
> you prevent inlining, when does that decision get made? Will the
> vectorizer need to outline loops? If so, outlining can have a real cost
> that's difficult to model. How do return types work?

The dynamic nature is not part of the program, so inlining can happen
as always. Given that the vectors are agnostic of size and work
regardless of what the kernel provides (within safety boundaries), the
code generation shouldn't change too much.

We may have to create artefacts to restrict the maximum vscale (for
safety), but others are better equipped to answer that question.


>  1. I can definitely see the use cases for changing vscale dynamically,
> and so I do suspect that we'll want that support.

At a process/function level, yes. Within the same self-contained
sub-graph, I don't know.


>  2. LLVM does not have loops as first-class constructs. We only have SSA
> (and, thus, dominance), and when specifying restrictions on placement of
> things in function bodies, we need to do so in terms of these constructs
> that we have (which don't include loops).

That's why I was trying to define the "self-contained sub-graph" above
(there must be a better term for that). It has to do with data
dependencies (scalar|memory -> vector -> scalar|memory), ie. make sure
side-effects don't leak out.

A loop iteration is usually such a block, but not all are and not all
such blocks are loops.

Changing vscale inside a function, but outside of those blocks would
be "fine", as long as we made sure code movement respects those
boundaries and that context would be restored correctly on exceptions.
But that's not part of the current proposal.

Changing vscale inside one of those blocks would be madness. :)

cheers,
--renato
Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev

On 08/01/2018 06:15 AM, Renato Golin wrote:

> On Tue, 31 Jul 2018 at 23:46, Hal Finkel via llvm-dev
> <[hidden email]> wrote:
>> In some sense, if you make vscale dynamic,
>> you've introduced dependent types into LLVM's type system, but you've
>> done it in an implicit manner. It's not clear to me that works. If we
>> need dependent types, then an explicit dependence seems better. (e.g.,
>> <scalable <n> x %vscale_var x <type>>)
> That's a shift from the current proposal and I think we can think
> about it after the current changes. For now, both SVE and RISC-V are
> proposing function boundaries for changes in vscale.

I understand. I'm afraid that the function-boundary idea doesn't work
reasonably.

>
>
>> 2. How would the function-call boundary work? Does the function itself
>> have intrinsics that change the vscale?
> Functions may not know what their vscale is until they're actually
> executed. They could even have different vscales for different call
> sites.
>
> AFAIK, it's not up to the compiled program (ie via a function
> attribute or an inline asm call) to change the vscale, but the
> kernel/hardware can impose dynamic restrictions on the process. But,
> for now, only at (binary object) function boundaries.

I'm not sure if that's better or worse than the compiler putting in code
to indicate that the vscale might change. How do vector function
arguments work if vscale gets larger? or smaller?

So, if I have some vectorized code, and we figure out that some of it is
cold, so we outline it, and then the kernel decides to decrease vscale
for that function, now I have broken the application? Storing a vector
argument in memory in that function now doesn't store as much data as it
would have in the caller?

>
> I don't know how that works at the kernel level (how to detect those
> boundaries? instrument every branch?) but this is what I understood
> from the current discussion.

Can we find out?

>
>
>> If so, then it's not clear that
>> the function-call boundary makes sense unless you prevent inlining. If
>> you prevent inlining, when does that decision get made? Will the
>> vectorizer need to outline loops? If so, outlining can have a real cost
>> that's difficult to model. How do return types work?
> The dynamic nature is not part of the program, so inlining can happen
> as always. Given that the vectors are agnostic of size and work
> regardless of what the kernel provides (within safety boundaries), the
> code generation shouldn't change too much.
>
> We may have to create artefacts to restrict the maximum vscale (for
> safety), but others are better equipped to answer that question.
>
>
>>  1. I can definitely see the use cases for changing vscale dynamically,
>> and so I do suspect that we'll want that support.
> At a process/function level, yes. Within the same self-contained
> sub-graph, I don't know.
>
>
>>  2. LLVM does not have loops as first-class constructs. We only have SSA
>> (and, thus, dominance), and when specifying restrictions on placement of
>> things in function bodies, we need to do so in terms of these constructs
>> that we have (which don't include loops).
> That's why I was trying to define the "self-contained sub-graph" above
> (there must be a better term for that). It has to do with data
> dependencies (scalar|memory -> vector -> scalar|memory), ie. make sure
> side-effects don't leak out.
>
> A loop iteration is usually such a block, but not all are and not all
> such blocks are loops.
>
> Changing vscale inside a function, but outside of those blocks would
> be "fine", as long as we made sure code movement respects those
> boundaries and that context would be restored correctly on exceptions.
> But that's not part of the current proposal.

But I don't know how to implement that restriction without major changes
to the code base. Such a restriction doesn't follow from use/def chains,
and if we need a restriction that involves looking for non-SSA
dependencies (e.g., memory dependencies), then I think that we need
something different than the current proposal. Explicitly dependent
types might work, something like intrinsics might work, etc.

Thanks again,
Hal

>
> Changing vscale inside one of those blocks would be madness. :)
>
> cheers,
> --renato

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev
Hi Hal,

> On 30 Jul 2018, at 20:10, Hal Finkel <[hidden email]> wrote:
>
>
> On 07/30/2018 05:34 AM, Chandler Carruth wrote:
>> I strongly suspect that there remains widespread concern with the direction of this, I know I have them.
>>
>> I don't think that many of the people who have that concern have had time to come back to this RFC and make progress on it, likely because of other commitments or simply the amount of churn around SVE related patches and such. That is at least why I haven't had time to return to this RFC and try to write more detailed feedback.
>>
>> Certainly, I would want to see pretty clear and considered support for this change to the IR type system from Hal, Chris, Eric and/or other long time maintainers of core LLVM IR components before it moves forward, and I don't see that in this thread.
>
> At a high level, I'm happy with this approach. I think it will be important for LLVM to support runtime-determined vector lengths - I see the customizability and power-efficiency constraints that motivate these designs continuing to increase in importance. I'm still undecided on whether this makes vector code nicer even for fixed-vector-length architectures, but some of the design decisions that it forces, such as having explicit intrinsics for reductions and other horizontal operations, seem like the right direction regardless.

Thanks, that's good to hear.

> 1.
>> This is a proposal for how to deal with querying the size of scalable types for
>> > analysis of IR. While it has not been implemented in full,
>
> Is this still true? The details here need to all work out, obviously, and we should make sure that any issues are identified.

Yes. I had hoped to get some more comments on the basic approach before progressing with the implementation, but if it makes more sense to have the implementation available to discuss then I'll start creating patches.

> 2. I know that there has been some discussion around support for changing the vector length during program execution (e.g., to account for some (proposed?) RISC-V feature), perhaps even during the execution of a single function. I'm very concerned about this idea because it is not at all clear to me how to limit information transfer contaminated with the vector size from propagating between different regions. As a result, I'm concerned about trying to add this on later, and so if this is part of the plan, I think that we need to think through the details up front because it could have a major impact on the design.

I think Robin's email yesterday covered it fairly nicely; this RFC proposes that the hardware length of vectors will be consistent throughout an entire function, so we don't need to limit information inside a function, just between them. For SVE, h/w vector length will likely be consistent across the whole program as well (assuming the programmer doesn't make a prctl call to the kernel to change it) so we could drop that limit too, but I thought it best to come up with a unified approach that would work for both architectures. The 'inherits_vscale' attribute would allow us to continue optimizing across functions for SVE where desired.

Modelling the dynamic vector length for RVV is something for Robin (or others) to tackle later, but it can be thought of (at a high level) as an implicit predicate on all operations.

-Graham

>
> Thanks again,
> Hal
>
>>
>> Put differently: I don't think silence is assent here. You really need some clear signal of consensus.
>>
>> On Mon, Jul 30, 2018 at 2:23 AM Graham Hunter <[hidden email]> wrote:
>> Hi,
>>
>> Are there any objections to going ahead with this? If not, we'll try to get the patches reviewed and committed after the 7.0 branch occurs.
>>
>> -Graham
>>
>> > On 2 Jul 2018, at 10:53, Graham Hunter <[hidden email]> wrote:
>> >
>> > Hi,
>> >
>> > I've updated the RFC slightly based on the discussion within the thread, reposted below. Let me know if I've missed anything or if more clarification is needed.
>> >
>> > Thanks,
>> >
>> > -Graham
>> >
>> > =============================================================
>> > Supporting SIMD instruction sets with variable vector lengths
>> > =============================================================
>> >
>> > In this RFC we propose extending LLVM IR to support code-generation for variable
>> > length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our
>> > approach is backwards compatible and should be as non-intrusive as possible; the
>> > only change needed in other backends is how size is queried on vector types, and
>> > it only requires a change in which function is called. We have created a set of
>> > proof-of-concept patches to represent a simple vectorized loop in IR and
>> > generate SVE instructions from that IR. These patches (listed in section 7 of
>> > this RFC) can be found on Phabricator and are intended to illustrate the scope
>> > of changes required by the general approach described in this RFC.
>> >
>> > ==========
>> > Background
>> > ==========
>> >
>> > *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for
>> > AArch64 which is intended to scale with hardware such that the same binary
>> > running on a processor with longer vector registers can take advantage of the
>> > increased compute power without recompilation.
>> >
>> > As the vector length is no longer a compile-time known value, the way in which
>> > the LLVM vectorizer generates code requires modifications such that certain
>> > values are now runtime evaluated expressions instead of compile-time constants.
>> >
>> > Documentation for SVE can be found at
>> > https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>> >
>> > ========
>> > Contents
>> > ========
>> >
>> > The rest of this RFC covers the following topics:
>> >
>> > 1. Types -- a proposal to extend VectorType to be able to represent vectors that
>> >   have a length which is a runtime-determined multiple of a known base length.
>> >
>> > 2. Size Queries - how to reason about the size of types for which the size isn't
>> >   fully known at compile time.
>> >
>> > 3. Representing the runtime multiple of vector length in IR for use in address
>> >   calculations and induction variable comparisons.
>> >
>> > 4. Generating 'constant' values in IR for vectors with a runtime-determined
>> >   number of elements.
>> >
>> > 5. An explanation of splitting/concatenating scalable vectors.
>> >
>> > 6. A brief note on code generation of these new operations for AArch64.
>> >
>> > 7. An example of C code and matching IR using the proposed extensions.
>> >
>> > 8. A list of patches demonstrating the changes required to emit SVE instructions
>> >   for a loop that has already been vectorized using the extensions described
>> >   in this RFC.
>> >
>> > ========
>> > 1. Types
>> > ========
>> >
>> > To represent a vector of unknown length a boolean `Scalable` property has been
>> > added to the `VectorType` class, which indicates that the number of elements in
>> > the vector is a runtime-determined integer multiple of the `NumElements` field.
>> > Most code that deals with vectors doesn't need to know the exact length, but
>> > does need to know relative lengths -- e.g. get a vector with the same number of
>> > elements but a different element type, or with half or double the number of
>> > elements.
>> >
>> > In order to allow code to transparently support scalable vectors, we introduce
>> > an `ElementCount` class with two members:
>> >
>> > - `unsigned Min`: the minimum number of elements.
>> > - `bool Scalable`: is the element count an unknown multiple of `Min`?
>> >
>> > For non-scalable vectors (``Scalable=false``) the scale is considered to be
>> > equal to one and thus `Min` represents the exact number of elements in the
>> > vector.
>> >
>> > The intent for code working with vectors is to use convenience methods and avoid
>> > directly dealing with the number of elements. If needed, `getElementCount` can
>> > be called on a vector type instead of `getVectorNumElements` to obtain the
>> > (potentially scalable) number of elements. Overloaded division and
>> > multiplication operators allow an ElementCount instance to be used in much the
>> > same manner as an integer for most cases.
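[Editorial note: a small Python model of the proposed `ElementCount` pair may make the arithmetic concrete. This is a hypothetical sketch for illustration only, not LLVM's C++ implementation; the `min`/`scalable` fields and the overloaded multiplication/division mirror the description above.]

```python
from dataclasses import dataclass

# Illustrative model of the proposed ElementCount class; the real class is
# C++ inside LLVM, so names and behavior here are approximations.
@dataclass(frozen=True)
class ElementCount:
    min: int        # minimum number of elements
    scalable: bool  # True if the real count is an unknown multiple of `min`

    def __mul__(self, k: int) -> "ElementCount":
        return ElementCount(self.min * k, self.scalable)

    def __floordiv__(self, k: int) -> "ElementCount":
        assert self.min % k == 0, "cannot split below the base length"
        return ElementCount(self.min // k, self.scalable)

# <scalable 4 x i32> has at least 4 elements; halving it models what a
# helper like getHalfElementsVectorType would do on a scalable type.
ec = ElementCount(4, scalable=True)
print(ec // 2)  # ElementCount(min=2, scalable=True)
```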
>> >
>> > This mixture of compile-time and runtime quantities allows us to reason about the
>> > relationship between different scalable vector types without knowing their
>> > exact length.
>> >
>> > The runtime multiple is not expected to change during program execution for SVE,
>> > but it is possible. The model of scalable vectors presented in this RFC assumes
>> > that the multiple will be constant within a function but not necessarily across
>> > functions. As suggested in the recent RISC-V RFC, a new function attribute to
>> > inherit the multiple across function calls will allow for function calls with
>> > vector arguments/return values and inlining/outlining optimizations.
>> >
>> > IR Textual Form
>> > ---------------
>> >
>> > The textual form for a scalable vector is:
>> >
>> > ``<scalable <n> x <type>>``
>> >
>> > where `type` is the scalar type of each element, `n` is the minimum number of
>> > elements, and the string literal `scalable` indicates that the total number of
>> > elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
>> > for indicating that the vector is scalable, and could be substituted by another.
>> > For fixed-length vectors, the `scalable` is omitted, so there is no change in
>> > the format for existing vectors.
>> >
>> > Scalable vectors with the same `Min` value have the same number of elements, and
>> > the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
>> > used within the same function):
>> >
>> > ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of
>> >  elements.
>> >
>> > ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of
>> >  bytes.
>> >
>> > IR Bitcode Form
>> > ---------------
>> >
>> > To serialize scalable vectors to bitcode, a new boolean field is added to the
>> > type record. If the field is not present the type will default to a fixed-length
>> > vector type, preserving backwards compatibility.
>> >
>> > Alternatives Considered
>> > -----------------------
>> >
>> > We did consider one main alternative -- a dedicated target type, like the
>> > x86_mmx type.
>> >
>> > A dedicated target type would either need to extend all existing passes that
>> > work with vectors to recognize the new type, or to duplicate all that code
>> > in order to get reasonable code generation and autovectorization.
>> >
>> > This hasn't been done for the x86_mmx type, and so it is only capable of
>> > providing support for C-level intrinsics instead of being used and recognized by
>> > passes inside llvm.
>> >
>> > Although our current solution will need to change some of the code that creates
>> > new VectorTypes, much of that code doesn't need to care about whether the types
>> > are scalable or not -- they can use preexisting methods like
>> > `getHalfElementsVectorType`. If the code is a little more complex,
>> > `ElementCount` structs can be used instead of an `unsigned` value to represent
>> > the number of elements.
>> >
>> > ===============
>> > 2. Size Queries
>> > ===============
>> >
>> > This is a proposal for how to deal with querying the size of scalable types for
>> > analysis of IR. While it has not been implemented in full, the general approach
>> > works well for calculating offsets into structures with scalable types in a
>> > modified version of ComputeValueVTs in our downstream compiler.
>> >
>> > For current IR types that have a known size, all query functions return a single
>> > integer constant. For scalable types a second integer is needed to indicate the
>> > number of bytes/bits which need to be scaled by the runtime multiple to obtain
>> > the actual length.
>> >
>> > For primitive types, `getPrimitiveSizeInBits()` will function as it does today,
>> > except that it will no longer return a size for vector types (it will return 0,
>> > as it does for other derived types). The majority of calls to this function are
>> > already for scalar rather than vector types.
>> >
>> > For derived types, a function `getScalableSizePairInBits()` will be added, which
>> > returns a pair of integers (one to indicate unscaled bits, the other for bits
>> > that need to be scaled by the runtime multiple). For backends that do not need
>> > to deal with scalable types the existing methods will suffice, but a debug-only
>> > assert will be added to them to ensure they aren't used on scalable types.
>> >
>> > Similar functionality will be added to DataLayout.
>> >
>> > Comparisons between sizes will use the following methods, assuming that X and
>> > Y are non-zero integers and the form is of { unscaled, scaled }.
>> >
>> > { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.
>> >
>> > { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across
>> >                         functions that inherit vector length. Cannot be
>> >                         compared across non-inheriting functions.
>> >
>> > { X, 0 } > { 0, Y }: Cannot return true.
>> >
>> > { X, 0 } = { 0, Y }: Cannot return true.
>> >
>> > { X, 0 } < { 0, Y }: Can return true.
>> >
>> > { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common
>> >                             terms and try the above comparisons; it
>> >                             may not be possible to get a good answer.
>> >
>> > It's worth noting that we don't expect the last case (mixed scaled and
>> > unscaled sizes) to occur. Richard Sandiford's proposed C extensions
>> > (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly
>> > prohibit mixing fixed-size types into a sizeless struct.
>> >
>> > I don't know if we need a 'maybe' or 'unknown' result for cases comparing scaled
>> > vs. unscaled; I believe the gcc implementation of SVE allows for such
>> > results, but that supports a generic polynomial length representation.
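[Editorial note: as a sketch of how these comparison rules could be encoded, the following Python model (hypothetical, not the proposed LLVM API) compares `{unscaled, scaled}` pairs. It returns `True`/`False` when the ordering holds for every runtime multiple (vscale >= 1) and `None` when the answer depends on vscale, matching the "can return true" cases above.]

```python
def size_lt(a, b):
    """Is size pair a = (unscaled, scaled) strictly smaller than b?

    Returns True/False when the answer is the same for every runtime
    multiple (vscale >= 1), or None when it depends on vscale.
    """
    # Subtract common terms first, as described above.
    u = a[0] - b[0]
    s = a[1] - b[1]
    if u == 0 and s == 0:
        return False      # equal sizes
    if u <= 0 and s <= 0:
        return True       # smaller for every possible vscale
    if u >= 0 and s >= 0:
        return False      # never smaller
    return None           # mixed signs: depends on the runtime multiple

print(size_lt((128, 0), (256, 0)))  # True: normal unscaled comparison
print(size_lt((0, 64), (0, 128)))   # True: scaled vs scaled, same function
print(size_lt((128, 0), (0, 128)))  # None: { X, 0 } < { 0, Y } *can* be true
```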
>> >
>> > My current intention is to rely on functions that clone or copy values to
>> > check whether they are being used to copy scalable vectors across function
>> > boundaries without the inherit-vlen attribute, and raise an error there instead
>> > of requiring the Function a given type size came from to be passed in for each
>> > comparison. If there's a strong preference for moving the check into the size
>> > comparison function, let me know; I will start work on patches for this later
>> > in the year if there are no major problems with the idea.
>> >
>> > Future Work
>> > -----------
>> >
>> > Since we cannot determine the exact size of a scalable vector, the
>> > existing logic for alias detection won't work when multiple accesses
>> > share a common base pointer with different offsets.
>> >
>> > However, SVE's predication will mean that a dynamic 'safe' vector length
>> > can be determined at runtime, so after initial support has been added we
>> > can work on vectorizing loops using runtime predication to avoid aliasing
>> > problems.
>> >
>> > Alternatives Considered
>> > -----------------------
>> >
>> > Marking scalable vectors as unsized doesn't work well, as many parts of
>> > llvm dealing with loads and stores assert that 'isSized()' returns true
>> > and make use of the size when calculating offsets.
>> >
>> > We have considered introducing multiple helper functions instead of
>> > using direct size queries, but that doesn't cover all cases. It may
>> > still be a good idea to introduce them to make the purpose in a given
>> > case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'.
>> >
>> > ========================================
>> > 3. Representing Vector Length at Runtime
>> > ========================================
>> >
>> > With a scalable vector type defined, we now need a way to represent the runtime
>> > length in IR in order to generate addresses for consecutive vectors in memory
>> > and determine how many elements have been processed in an iteration of a loop.
>> >
>> > We have added an experimental `vscale` intrinsic to represent the runtime
>> > multiple. Multiplying the result of this intrinsic by the minimum number of
>> > elements in a vector gives the total number of elements in a scalable vector.
>> >
>> > Fixed-Length Code
>> > -----------------
>> >
>> > Assuming a vector type of <4 x <ty>>
>> > ``
>> > vector.body:
>> >  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>> >  ;; <loop body>
>> >  ;; Increment induction var
>> >  %index.next = add i64 %index, 4
>> >  ;; <check and branch>
>> > ``
>> > Scalable Equivalent
>> > -------------------
>> >
>> > Assuming a vector type of <scalable 4 x <ty>>
>> > ``
>> > vector.body:
>> >  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>> >  ;; <loop body>
>> >  ;; Increment induction var
>> >  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>> >  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>> >  ;; <check and branch>
>> > ``
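[Editorial note: a scalar Python model of the scalable induction variable above, purely for illustration; `vscale` stands for the runtime multiple returned by the intrinsic. Each iteration advances the index by `vscale * 4` elements rather than the fixed 4, so the same strip-mined loop adapts to any register width.]

```python
def iterations(trip_count, vscale):
    # Model of the scalable loop skeleton:
    #   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
    index, iters = 0, 0
    while index < trip_count:
        index += vscale * 4
        iters += 1
    return iters

# The same binary strip-mines correctly for any register width:
print(iterations(1024, 1))  # 256 iterations (128-bit registers, 4 x i32)
print(iterations(1024, 4))  # 64 iterations (512-bit registers, 16 x i32)
```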
>> > ===========================
>> > 4. Generating Vector Values
>> > ===========================
>> > For constant vector values, we cannot specify all the elements as we can for
>> > fixed-length vectors; fortunately only a small number of easily synthesized
>> > patterns are required for autovectorization. The `zeroinitializer` constant
>> > can be used in the same manner as fixed-length vectors for a constant zero
>> > splat. This can then be combined with `insertelement` and `shufflevector`
>> > to create arbitrary value splats in the same manner as fixed-length vectors.
>> >
>> > For constants consisting of a sequence of values, an experimental `stepvector`
>> > intrinsic has been added to represent a simple constant of the form
>> > `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
>> > start can be added, and changing the step requires multiplying by a splat.
>> >
>> > Fixed-Length Code
>> > -----------------
>> > ``
>> >  ;; Splat a value
>> >  %insert = insertelement <4 x i32> undef, i32 %value, i32 0
>> >  %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
>> >  ;; Add a constant sequence
>> >  %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
>> > ``
>> > Scalable Equivalent
>> > -------------------
>> > ``
>> >  ;; Splat a value
>> >  %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
>> >  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>> >  ;; Splat offset + stride (the same in this case)
>> >  %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0
>> >  %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>> >  ;; Create sequence for scalable vector
>> >  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>> >  %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>> >  %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
>> >  ;; Add the runtime-generated sequence
>> >  %add = add <scalable 4 x i32> %splat, %addoffset
>> > ``
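[Editorial note: a scalar model of the sequence construction above, with Python used purely for illustration. `stepvector` gives `<0, 1, ..., n-1>`; multiplying by a splat changes the stride, and adding a splat changes the start, as described in section 4.]

```python
def stepvector(n):
    return list(range(n))   # <0, 1, 2, ..., n-1>

def splat(value, n):
    return [value] * n      # models the insertelement + shufflevector splat

def sequence(offset, stride, n):
    # %mulbystride = mul stepvector, splat(stride)
    mulbystride = [a * b for a, b in zip(stepvector(n), splat(stride, n))]
    # %addoffset = add %mulbystride, splat(offset)
    return [a + b for a, b in zip(mulbystride, splat(offset, n))]

# With offset = stride = 2 and n = vscale * 4 elements (here vscale = 1),
# this reproduces the fixed-length constant <i32 2, i32 4, i32 6, i32 8>.
print(sequence(2, 2, 4))  # [2, 4, 6, 8]
```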
>> > Future Work
>> > -----------
>> >
>> > Intrinsics cannot currently be used for constant folding. Our downstream
>> > compiler (using Constants instead of intrinsics) relies quite heavily on this
>> > for good code generation, so we will need to find new ways to recognize and
>> > fold these values.
>> >
>> > ===========================================
>> > 5. Splitting and Combining Scalable Vectors
>> > ===========================================
>> >
>> > Splitting and combining scalable vectors in IR is done in the same manner as
>> > for fixed-length vectors, but with a non-constant mask for the shufflevector.
>> >
>> > The following is an example of splitting a <scalable 4 x double> into two
>> > separate <scalable 2 x double> values.
>> >
>> > ``
>> >  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>> >  ;; Stepvector generates the element ids for first subvector
>> >  %sv1 = call <scalable 2 x i64> @llvm.experimental.vector.stepvector.nxv2i64()
>> >  ;; Add vscale * 2 to get the starting element for the second subvector
>> >  %ec = mul i64 %vscale64, 2
>> >  %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0
>> >  %ec.splat = shufflevector <scalable 2 x i64> %ec.ins, <scalable 2 x i64> undef, <scalable 2 x i32> zeroinitializer
>> >  %sv2 = add <scalable 2 x i64> %ec.splat, %sv1
>> >  ;; Perform the extracts
>> >  %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv1
>> >  %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv2
>> > ``
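(Editorial aside, not part of the quoted RFC: the two shufflevectors gather elements by the index vectors %sv1 and %sv2. A scalar model, with vscale pinned to a concrete value for illustration — so a <scalable 4 x double> input has 4 * vscale elements:)

```python
# Scalar model of splitting a scalable vector into two halves (illustrative).
def split_scalable(vec, vscale):
    # %sv1: stepvector gives element ids 0 .. 2*vscale-1 for the low half
    sv1 = list(range(2 * vscale))
    # %sv2: sv1 + splat(vscale * 2) gives the element ids of the high half
    ec = vscale * 2
    sv2 = [i + ec for i in sv1]
    # The shufflevectors gather those elements from the input
    res1 = [vec[i] for i in sv1]
    res2 = [vec[i] for i in sv2]
    return res1, res2
```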
>> >
>> > ==================
>> > 6. Code Generation
>> > ==================
>> >
>> > IR splats will be converted to an experimental splatvector intrinsic in
>> > SelectionDAGBuilder.
>> >
>> > All three intrinsics are custom lowered and legalized in the AArch64 backend.
>> >
>> > Two new AArch64ISD nodes have been added to represent the same concepts
>> > at the SelectionDAG level, while splatvector maps onto the existing
>> > AArch64ISD::DUP.
>> >
>> > GlobalISel
>> > ----------
>> >
>> > Since GlobalISel was enabled by default on AArch64, it was necessary to add
>> > scalable vector support to the LowLevelType implementation. A single bit was
>> > added to the raw_data representation for vectors and vectors of pointers.
>> >
>> > In addition, types that only exist in destination patterns are planted in
>> > the enumeration of available types for generated code. While this may not be
>> > necessary in future, generating an all-true 'ptrue' value was necessary to
>> > convert a predicated instruction into an unpredicated one.
>> >
>> > ==========
>> > 7. Example
>> > ==========
>> >
>> > The following example shows a simple C loop which assigns the array index to
>> > the array elements matching that index. The IR shows how vscale and stepvector
>> > are used to create the needed values and to advance the index variable in the
>> > loop.
>> >
>> > C Code
>> > ------
>> >
>> > ``
>> > void IdentityArrayInit(int *a, int count) {
>> >  for (int i = 0; i < count; ++i)
>> >    a[i] = i;
>> > }
>> > ``
>> >
>> > Scalable IR Vector Body
>> > -----------------------
>> >
>> > ``
>> > vector.body.preheader:
>> >  ;; Other setup
>> >  ;; Stepvector used to create initial identity vector
>> >  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>> >  br label %vector.body
>> >
>> > vector.body:
>> >  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>> >  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>> >
>> >           ;; stepvector used for index identity on entry to loop body ;;
>> >  %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
>> >                                     [ %stepvector, %vector.body.preheader ]
>> >  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>> >  %vscale32 = trunc i64 %vscale64 to i32
>> >  %vscale64.x4 = mul i64 %vscale64, 4
>> >  %1 = add i64 %0, %vscale64.x4
>> >
>> >           ;; vscale splat used to increment identity vector ;;
>> >  %vscale32.x4 = mul i32 %vscale32, 4
>> >  %insert = insertelement <scalable 4 x i32> undef, i32 %vscale32.x4, i32 0
>> >  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>> >  %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
>> >  %2 = getelementptr inbounds i32, i32* %a, i64 %0
>> >  %3 = bitcast i32* %2 to <scalable 4 x i32>*
>> >  store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4
>> >
>> >           ;; vscale used to increment loop index
>> >  %index.next = add i64 %index, %vscale64.x4
>> >  %4 = icmp eq i64 %index.next, %n.vec
>> >  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
>> > ``
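(Editorial aside, not part of the quoted RFC: the vector body above strip-mines the loop by the runtime vector length. A scalar model of that strategy, where `vl` stands in for vscale * 4 and, as in the IR, the trip count `%n.vec` is assumed to be a multiple of `vl`:)

```python
# Scalar model of the vectorized IdentityArrayInit loop (illustrative).
def identity_array_init(count, vl):
    a = [0] * count
    vec_ind = list(range(vl))                  # %stepvector: initial identity vector
    index = 0
    while index != count:                      # compare against %n.vec
        for lane, val in enumerate(vec_ind):
            a[index + lane] = val              # vector store of %vec.ind7
        vec_ind = [v + vl for v in vec_ind]    # %step.add8: add splat(vl)
        index += vl                            # %index.next: add vscale * 4
    return a
```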
>> >
>> > ==========
>> > 8. Patches
>> > ==========
>> >
>> > List of patches:
>> >
>> > 1. Extend VectorType: https://reviews.llvm.org/D32530
>> > 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
>> > 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
>> > 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
>> > 5. SVE Calling Convention: https://reviews.llvm.org/D47771
>> > 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
>> > 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
>> > 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
>> > 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
>> > 10. Initial store patterns: https://reviews.llvm.org/D47776
>> > 11. Initial addition patterns: https://reviews.llvm.org/D47777
>> > 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
>> > 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
>> > 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780
>> >
>>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>

_______________________________________________
LLVM Developers mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Nicholas Krause via llvm-dev

On 08/01/2018 02:00 PM, Graham Hunter wrote:

> Hi Hal,
>
>> On 30 Jul 2018, at 20:10, Hal Finkel <[hidden email]> wrote:
>>
>>
>> On 07/30/2018 05:34 AM, Chandler Carruth wrote:
>>> I strongly suspect that there remains widespread concern with the direction of this, I know I have them.
>>>
>>> I don't think that many of the people who have that concern have had time to come back to this RFC and make progress on it, likely because of other commitments or simply the amount of churn around SVE related patches and such. That is at least why I haven't had time to return to this RFC and try to write more detailed feedback.
>>>
>>> Certainly, I would want to see pretty clear and considered support for this change to the IR type system from Hal, Chris, Eric and/or other long time maintainers of core LLVM IR components before it moves forward, and I don't see that in this thread.
>> At a high level, I'm happy with this approach. I think it will be important for LLVM to support runtime-determined vector lengths - I see the customizability and power-efficiency constraints that motivate these designs continuing to increase in importance. I'm still undecided on whether this makes vector code nicer even for fixed-vector-length architectures, but some of the design decisions that it forces, such as having explicit intrinsics for reductions and other horizontal operations, seem like the right direction regardless.
> Thanks, that's good to hear.
>
>> 1.
>>> This is a proposal for how to deal with querying the size of scalable types for
>>>> analysis of IR. While it has not been implemented in full,
>> Is this still true? The details here need to all work out, obviously, and we should make sure that any issues are identified.
> Yes. I had hoped to get some more comments on the basic approach before progressing with the implementation, but if it makes more sense to have the implementation available to discuss then I'll start creating patches.

At least on this point, I think that we'll want to have the
implementation to help make sure there aren't important details we're
overlooking.

>
>> 2. I know that there has been some discussion around support for changing the vector length during program execution (e.g., to account for some (proposed?) RISC-V feature), perhaps even during the execution of a single function. I'm very concerned about this idea because it is not at all clear to me how to limit information transfer contaminated with the vector size from propagating between different regions. As a result, I'm concerned about trying to add this on later, and so if this is part of the plan, I think that we need to think through the details up front because it could have a major impact on the design.
> I think Robin's email yesterday covered it fairly nicely; this RFC proposes that the hardware length of vectors will be consistent throughout an entire function, so we don't need to limit information inside a function, just between them. For SVE, h/w vector length will likely be consistent across the whole program as well (assuming the programmer doesn't make a prctl call to the kernel to change it) so we could drop that limit too, but I thought it best to come up with a unified approach that would work for both architectures. The 'inherits_vscale' attribute would allow us to continue optimizing across functions for SVE where desired.

I think that this will likely work, although I think we want to invert
the sense of the attribute. vscale should be inherited by default, and
some attribute can say that this isn't so. That same attribute, I
imagine, will also forbid scalable vector function arguments and return
values on those functions. If we don't have inherited vscale as the
default, we place an implicit contract on any IR transformation that
performs outlining: it must either scan for certain kinds of vector
operations and add the special attribute, or always add the attribute.
That just becomes another special case, only manifesting on certain
platforms, and it's best to avoid.

>
> Modelling the dynamic vector length for RVV is something for Robin (or others) to tackle later, but it can be thought of (at a high level) as an implicit predicate on all operations.

My point is that, while there may be some sense in which the details can
be worked out later, we need to have a good-enough understanding of how
this will work now in order to make sure that we're not making design
decisions now that make handling the dynamic vscale in a reasonable way
later more difficult.

Thanks again,
Hal

>
> -Graham
>
>> Thanks again,
>> Hal
>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>>

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
